I am calculating lags from a data frame from which I need to carry out regressions per day.
I used the below code to get the parse the dates by combining the date and input time (timestamp) for the data and the root specification from my pc.
df_fp =pd.DataFrame(pd.read_csv(root+'/GEM10_FullPeriod_LIQ.csv',parse_dates=[['refDate','inputtime']]))
I set up the index using this below code
df_fp.set_index('refDate_inputtime', inplace=True)
The snippet of data parsed
I am supposed to create 6 lags of each variable to start the manipulations to I use the below code
df_fp_lag = df_fp.copy()
trailing_window_size = 6
for window in range(1, trailing_window_size + 1):
shifted = df_fp.shift(window)
shifted.columns = [x + "_lag" + str(window) for x in df_fp.columns]
df_fp_lag = pd.concat((df_fp_lag, shifted), axis=1)
df_fp_lag = df_fp_lag.dropna()
df_fp_lag
When I run the above code I get the below dataframe. It has 222208 rows × 14 columns
Data with the lagged variables
I could not manage to develop a for loop to extract data for each date through iterations. So I just tried to query the above for a single date and the code used is
df_fp_lagd = df_fp_lag.query("refDate_inputtime >= '2008-01-02 08:00:00' and refDate_inputtime <='2008-01-02 17:30:00'")
So from the above data, I opted to manually create new columns to calculate the difference between Real Value of a variable from its lag values. I could not figure out a for loop since I am totally new to Python.
The new formed columns from the differencing variable and its lag values up to 6 lags
Doing so I get this error message.
> <ipython-input-47-0ff0924bacc0>:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I am able to run the OLS regression for the single day (first day) with the below code.
import statsmodels.api as sm
from statsmodels.formula.api import ols
xyxyxy = ols('c_spread ~ relative_spread_lag1+c2_spread+c3_spread+c4_spread+c5_spread', df_fp_lagd).fit()
print(xyxyxyx.summary())
I Appreciate help on how I can be able to develop a for loop to iterate over the many days and run regressions for each day and store the daily data separately so that it can be as well imported and manipulated in other software.
Thank you.
Link to the dataset used in this question
What I have tried:
df_fp_lag = df_fp.copy()
trailing_window_size = 6
for window in range(1, trailing_window_size + 1):
shifted = df_fp.shift(window)
shifted.columns = [x + "_lag" + str(window) for x in df_fp.columns]
df_fp_lag = pd.concat((df_fp_lag, shifted), axis=1)
df_fp_lag = df_fp_lag.dropna()
df_fp_lag
xyxyxy = ols('c_spread ~ relative_spread_lag1+c2_spread+c3_spread+c4_spread+c5_spread', df_fp_lagd).fit()
print(xyxyxyx.summary())