How to Interpolate Time Series Data in Python Pandas

Jessica Walkenhorst, Jun 11

Time Series Interpolation for Pandas: Eating Bamboo Now, Eating Bamboo Later (Photo by Jonathan Meyer on Unsplash)

Note: Pandas version 0.20.1 (May 2017) changed the grouping API.

This post reflects the functionality of the updated version.

Anyone working with data knows that real-world data is often patchy and cleaning it takes up a considerable amount of your time (80/20 rule anyone?).

Having recently moved from Pandas to Pyspark, I was used to the conveniences that Pandas offers and that Pyspark sometimes lacks due to its distributed nature.

One of the features I have learned to particularly appreciate is the straightforward way of interpolating (or in-filling) time series data that Pandas provides.

This post demonstrates this capability in a straightforward and easily understandable way, using the example of sensor read data collected in a set of houses.

The full notebook for this post can be found in my GitHub.

Preparing the Data and Initial Visualization

First, we generate a pandas data frame df0 with some test data.

We create a mock data set containing two houses and use a sin and a cos function to generate some sensor read data for a set of dates.

To generate the missing values, we randomly drop half of the entries.

    import random

    import numpy as np
    import pandas as pd

    # 31 daily timestamps (15 Jan to 14 Feb 2018), repeated once per house
    dates = pd.date_range(start='1/15/2018', end='02/14/2018', freq='D')

    data = {'datetime': dates.append(dates),
            'house': ['house1' for i in range(31)] + ['house2' for i in range(31)],
            'readvalue': [0.5 + 0.5*np.sin(2*np.pi/30*i) for i in range(31)]
                         + [0.5 + 0.5*np.cos(2*np.pi/30*i) for i in range(31)]}

    # The column names must match the keys of data
    df0 = pd.DataFrame(data, columns=['datetime', 'house', 'readvalue'])

    # Randomly drop half the reads
    random.seed(42)
    df0 = df0.drop(random.sample(range(df0.shape[0]), k=int(df0.shape[0]/2)))

This is what the resulting table looks like:

[Table: Raw read data with missing values]

The plot below shows the generated data: a sin and a cos function, both with plenty of missing data points.

We will now look at three different methods of interpolating the missing read values: forward-filling, backward-filling and interpolating.

Remember that it is crucial to choose an appropriate interpolation method for each task.

Special considerations are required particularly for forecasting tasks, where we need to consider if we will have the data for the interpolation when we do the forecasting.

For example, if you need to interpolate data to forecast the weather then you cannot interpolate the weather of today using the weather of tomorrow since it is still unknown (logical, isn’t it?).
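The difference between using past and future information can be illustrated with a minimal sketch. The series below is hypothetical toy data (not the sensor data from this post), with NaN marking days without a read:

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; NaN marks days without a read
s = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2018-01-15", periods=5, freq="D"),
)

forward = s.ffill()   # each gap takes the last value known before that day
backward = s.bfill()  # each gap takes the next value, i.e. data from the future

comparison = pd.DataFrame({"raw": s, "ffill": forward, "bfill": backward})
print(comparison)
```

Only the forward-filled column could have been computed on the day itself. The backward-filled column needs the read from 18 January to fill 16 and 17 January, and it leaves the last day empty because no later read exists.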

Interpolation

To interpolate the data, we can make use of the groupby() function followed by resample().

However, first we need to convert the read dates to datetime format and set them as the index of our dataframe:

    df = df0.copy()
    df['datetime'] = pd.to_datetime(df['datetime'])
    df.index = df['datetime']
    del df['datetime']

Since we want to interpolate for each house separately, we need to group our data by 'house' before we can use the resample() function with the option 'D' to resample the data to a daily frequency.

The next step is then to use mean-filling, forward-filling or backward-filling to determine how the newly generated grid is supposed to be filled.
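As a rough sketch of how grouping, resampling and filling fit together, the example below uses a small hypothetical data frame called toy rather than the data from this post. Two choices here are assumptions of the sketch, not the post's own code: the value column is selected before resampling so that the grouping key does not end up in the result, and ffill() is used for forward filling because, as far as I know, pad() was deprecated in later pandas releases in favor of ffill().

```python
import pandas as pd

# Hypothetical toy data: two houses with gaps in their daily reads
toy = pd.DataFrame(
    {
        "house": ["house1", "house1", "house2", "house2"],
        "readvalue": [1.0, 4.0, 10.0, 30.0],
    },
    index=pd.to_datetime(
        ["2018-01-15", "2018-01-18", "2018-01-15", "2018-01-17"]
    ),
)
toy.index.name = "datetime"


def per_house_daily(frame):
    """Return a fresh daily resampler of the read values, one group per house."""
    return frame.groupby("house")["readvalue"].resample("D")


grid = per_house_daily(toy).mean()         # daily grid, gaps stay NaN
filled_fwd = per_house_daily(toy).ffill()  # gaps take the previous read
filled_bwd = per_house_daily(toy).bfill()  # gaps take the next read

print(pd.DataFrame({"mean": grid, "ffill": filled_fwd, "bfill": filled_bwd}))
```

Each house gets its own daily grid running from its first to its last read, so filling never carries a value from one house into another.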

mean()

Since we are strictly upsampling, using the mean() method fills all missing read values with NaNs:

    df.groupby('house').resample('D').mean().head(4)

[Table: Filling using mean()]

pad(): forward filling

Using pad() instead of mean() forward-fills the NaNs:

    df_pad = (df.groupby('house')
                .resample('D')
                .pad()
                .drop('house', axis=1))
    df_pad.head(4)

[Table: Filling using pad()]

bfill(): backward filling

Using bfill() instead of mean() backward-fills the NaNs:

    df_bfill = (df.groupby('house')
                  .resample('D')
                  .bfill()
                  .drop('house', axis=1))
    df_bfill.head(4)

[Table: Filling using bfill()]

interpolate(): interpolating

If we want to interpolate the missing values, we need to do this in two steps.

First, we generate the underlying data grid by using mean().

This generates the grid with NaNs as values.

Afterwards, we fill the NaNs with interpolated values by calling the interpolate() method on the read value column:

    df_interpol = (df.groupby('house')
                     .resample('D')
                     .mean())
    df_interpol['readvalue'] = df_interpol['readvalue'].interpolate()
    df_interpol.head(4)

[Table: Filling using interpolate()]

Visualizing the Results

Finally, we can visualize the three different filling methods to get a better idea of their results.

The opaque dots show the raw data, the transparent dots show the interpolated values.

In the top figure, the gaps have been filled with the last known value; in the middle figure, with the next known value; and in the bottom figure, the gaps have been linearly interpolated.

Note the kinks in the interpolated lines, which result from the linearity of the interpolation.

Depending on the task, we could use higher-order methods to avoid these kinks, but this would be going too far for this post.

Original data (dark) and interpolated data (light), interpolated using (top) forward filling, (middle) backward filling and (bottom) interpolation.
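To make the linear behaviour concrete, here is a minimal sketch of the two-step interpolation on hypothetical toy data. The second step is grouped by house, which is an assumption of this sketch and differs from the post, where interpolate() is called on the whole column.

```python
import pandas as pd

# Hypothetical toy data: two houses with gaps in their daily reads
toy = pd.DataFrame(
    {
        "house": ["house1", "house1", "house2", "house2"],
        "readvalue": [1.0, 4.0, 10.0, 30.0],
    },
    index=pd.to_datetime(
        ["2018-01-15", "2018-01-18", "2018-01-15", "2018-01-17"]
    ),
)
toy.index.name = "datetime"

# Step 1: build the daily grid per house; days without a read become NaN
grid = toy.groupby("house")["readvalue"].resample("D").mean()

# Step 2: fill the NaNs on a straight line between neighbouring reads,
# handling each house separately
interpolated = grid.groupby(level="house").transform(lambda s: s.interpolate())

print(interpolated)
```

For house1 the two missing days between the reads 1.0 and 4.0 become 2.0 and 3.0. The filled values lie on a straight line between the known points, which is where the kinks at the known points come from.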

Summary

In this post we have seen how we can use Python's Pandas module to interpolate time series data using either backfill, forward fill or interpolation methods.

Originally published at https://walkenho.github.io on January 14, 2019.
