Time Series Data Munging — Lagging Variables that are Distributed Across Multiple Groups

If you lag 2 steps, the first two rows will be null etc.

For this example, the index is already unique but you may want to check this for other data to avoid unintended consequences.

2.

Lag one variable across multiple groups — using “unstack” methodThis method is slightly more involved because there are several groups, but manageable because only one variable needs to be lagged.

Overall, we should be aware that we want to index the data first, then unstack to separate the groups before applying the lag function.

Failure to do this will actually do the wrong thing to your data.

For example, it may shift the last value of the previous group onto the first value of the next group, thus mixing data across groups.

Imagine if the groups were states and the variable was population over time in years.

If you make such a mistake, it will be like assuming that the population of Pennsylvania in its latter years, influences that of Rhode Island in its earlier years because these states are next to each other, alphabetically.

Which makes no sense.

This can be a huge pitfall and will lead to completely wrong analysis.

So, set the index to be your dates and groups.

Then pull out the groups using unstack and then shift the columns just like in the previous approach.

For more information on stacking, unstacking tricks see my other article here.

Now stack back the groups into the rows.

Notice how you can keep the null values and then do whatever you want to them (replace, drop etc.

)I chose to keep the null/missing values and rearrange the data for illustrative purposesdf.

reset_index().

sort_values("group")OK, if you were following closely, you will notice that this method could also works for many variables across many groups, but I wanted to save this scenario for a different technique, using groupby instead in the next section.

So, you could stop reading here and use the method above.

But you are curious, so we proceed…3.

Lag multiple variables distributed across multiple groups, simultaneously — using “groupby” methodThis method relies on the pandas groupby function combined with all the tricks we have learned in the previous methods.

So I will not spend time explaining the concepts we have already covered.

The key steps are grouping the data using the “group” column; making a function that sorts, indexes on date and shifts the data in each group, then iterating over the groupby object to apply the function to each group.

Finally, we use handy list comprehension and then concatenate this list back to its original format.

Some cool concepts used in this approach are:Assign: I recently discovered is the “assign” method available to dataframes.

It cleans up your code, allowing you to generate a new column and assign new values to it without having to use an intermediate dataframe.

It automatically returns a copy of the data.

Method-chaining: The function also uses parenthesis in the return statement in order to allow method chaining, another way to write clean, legible code.

Group the data:grouped_df = df_combo.

groupby(["group"])Make Function with assign and method-chaining:Apply function to each group using list comprehension then concatenate vertically:dflist = [lag_by_group(g, grouped_df.

get_group(g)) for g in grouped_df.

groups.

keys()]pd.

concat(dflist, axis=0).

reset_index())There you go!.Three different approaches for lagging your time series.

There may be more out there e.

g.

you could just shift the data by x steps then delete the first x steps of each group , but I think the above solutions are instead more robust and flexible, allowing you to then do whatever you want with the null values.

Let me know what you think below.

Happy Coding!.

. More details

Leave a Reply