Machine Learning Data Prep Tips for Time Series Models

Jen Underwood (@idigdata), Jan 27. Image credit: solarseven/Shutterstock.com.

Time series is one of the most popular, profitable, and powerful types of predictions used today.

These models predict future values of data based on history and trends.

You can use statistical or machine learning methods to analyze time series to identify patterns such as seasonality, unusual events and relationships with input variables.

Common use cases for time series models include forecasts for sales, product demand by SKU, predictive maintenance, staffing, inventory, and many other applications.

Time series analysis assumes that there are signals in the data that can at least be partially accounted for by a change in time or other independent variables.

Example independent variables include season, weather, weekends, holidays, planned events, work schedules, or macroeconomic factors such as GDP, the unemployment rate, or stock market valuations.

Time series modeling is one of the more complex types of machine learning.

You should start with simple models and build in more complexity over time.

Regularly spaced time intervals such as minute, day, week, or month may behave quite differently across scenarios, products, and so on.

You will likely also need to balance lagging variables due to the cause/effect patterns of the real world.

Your past data may not look like your future data.

For example, unless your model has seen a stock market crash like the one in 2008, it cannot predict that there will be another one.

This is a classic limitation of machine learning.

However, in time series the local behavior you are estimating generally changes much faster.

Structuring Time Series Input Data

Time series projects use date-time partitioning.

Unlike other types of machine learning projects, time series projects produce different types of models which forecast multiple future predictions instead of an individual prediction for each row.

Your input framework may consist of a Forecast Point (defining the time at which a prediction is made), a Feature Derivation Window (a rolling window used to create features), and a Forecast Window (a rolling window of future values to predict).
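As a rough illustration of those three concepts, here is a minimal pandas sketch; the dates and window lengths below are invented for illustration, not from the article:

```python
import pandas as pd

# Hypothetical setup: all dates and window sizes are illustrative only.
forecast_point = pd.Timestamp("2019-01-27")  # the time the prediction is made

# Feature Derivation Window: 28 days of history ending at the forecast point,
# from which lags and rolling statistics would later be computed
fdw_start = forecast_point - pd.Timedelta(days=28)

# Forecast Window: predict days 1 through 7 after the forecast point
forecast_window = pd.date_range(forecast_point + pd.Timedelta(days=1), periods=7)
```

In a real project these windows would be set per use case (for example, a longer derivation window for strongly seasonal series).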

During data prep, derived time series features such as lags and rolling statistics will be used as input features to train the models.

Depending on your tools, you might need to do this manually, or your automated machine learning platform might do it for you, automatically creating two hundred or more potential time features.
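If you are preparing lag and rolling-statistic features by hand, a minimal pandas sketch might look like the following; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical daily sales series (values are made up for illustration)
sales = pd.DataFrame({
    "date": pd.date_range("2019-01-01", periods=10, freq="D"),
    "units": [12, 15, 14, 20, 22, 19, 25, 24, 30, 28],
})

# Lag features: yesterday's and last week's values
sales["lag_1"] = sales["units"].shift(1)
sales["lag_7"] = sales["units"].shift(7)

# Rolling statistics over a 3-day feature derivation window, shifted by 1
# so each window only uses past values (avoiding target leakage)
sales["roll_mean_3"] = sales["units"].shift(1).rolling(3).mean()
sales["roll_max_3"] = sales["units"].shift(1).rolling(3).max()
```

The `shift(1)` before `rolling(...)` is the important detail: without it, each row's rolling statistic would include the very value you are trying to predict.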

Automated time series partitions are derived features created from discovered patterns that span rows.

For my analytics audience, this concept is similar to creating dimensional time calendars for reporting.

However, time series features might be spans of time that are driven by patterns versus your calendar.

Not Too Much Data, Not Too Little Data

Unlike other machine learning modeling techniques, more data doesn’t mean better performance for your time series models.

If you use data from too long ago, your model might learn trends that are no longer relevant.

Using more recent data is often better than using more data, since it avoids diluting new patterns.

Also keep in mind that you still need enough data.

Trying to reliably predict Black Friday sales, for instance, will require more than two weeks of prior data.

You’d probably want to feed in two or three prior years of data.
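Trimming the training window to recent history is a one-liner in pandas; this sketch assumes a daily dataset with a `date` column (the six-year span and three-year cutoff are illustrative):

```python
import pandas as pd

# Hypothetical six years of daily history (sales values are illustrative)
df = pd.DataFrame({"date": pd.date_range("2014-01-01", periods=6 * 365, freq="D")})
df["sales"] = range(len(df))

# Keep roughly the three most recent years rather than the full history,
# so stale trends from the early years don't dilute newer patterns
cutoff = df["date"].max() - pd.DateOffset(years=3)
recent = df[df["date"] > cutoff]
```

The right cutoff is a judgment call: long enough to cover the seasonal events you care about (e.g. two or three Black Fridays), short enough to drop obsolete behavior.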

Split into Multiple Projects

Another technique that improves model accuracy is building multiple data prep and machine learning projects based on unique patterns of behavior found in your data.

You can usually use a visualization tool to find obvious groups and splits over time.

In a retail use case, you might initially review sales over years for all departments to find seasonal patterns with expected sales peaks during holiday months.

Then you’ll review differences between seasonal departments such as TVs, videos, and toys and non-seasonal departments such as grocery staples (dairy, cheese, and eggs) or snack foods.

Getting even more granular, you might opt to build SKU-level models to maximize accuracy.

Unlike seasonal departments in our analysis, grocery items did not decline in year over year sales.

Those non-seasonal items show steady sales performance over time.

They also do not seem to be influenced by promotions.

Thus, stop wasting advertising budget on those necessities.

Delving a little deeper, you can visually see two natural groups of departments: seasonal and weekly.

To improve forecast accuracy with a time series model, you would create at least two different time series projects for the two groups of departments.

In other cases, you might end up creating different data prep projects for new items versus existing items, promotional items, discontinued items, etc.

If you used one dataset to predict all the items ignoring the pattern differences, you’d probably get an unreliable forecast.
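Mechanically, the split can be as simple as grouping on a behavior label and writing out one dataset per group; this is a toy sketch where the department names and group assignments are invented:

```python
import pandas as pd

# Hypothetical department sales with a hand-assigned behavior group
df = pd.DataFrame({
    "dept": ["TVs", "Toys", "Dairy", "Snacks"],
    "group": ["seasonal", "seasonal", "weekly", "weekly"],
    "sales": [500, 300, 200, 150],
})

# One dataset (and later, one modeling project) per behavioral group
projects = {name: g.drop(columns="group") for name, g in df.groupby("group")}
```

In practice the group labels would come from the visual review described above (or from clustering), not from a hard-coded column.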

Measuring Model Performance

To examine time series model performance, you’ll use backtesting techniques and measure forecast improvements over baseline models.
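A rolling-origin backtest can be sketched in a few lines; this toy example scores a naive last-value baseline (any real model would be evaluated the same way and compared against this baseline), and the series values are invented:

```python
import pandas as pd

# Hypothetical daily sales series
y = pd.Series([10, 12, 11, 13, 15, 14, 16, 18, 17, 19], name="sales")

def backtest(series, n_folds=3, horizon=1):
    """Rolling-origin backtest: for each fold, 'train' on everything before
    the origin and score a one-step-ahead forecast against the actual value."""
    errors = []
    for fold in range(n_folds):
        origin = len(series) - (n_folds - fold) * horizon
        train, actual = series.iloc[:origin], series.iloc[origin]
        forecast = train.iloc[-1]        # naive baseline: last observed value
        errors.append(abs(forecast - actual))
    return sum(errors) / len(errors)     # mean absolute error over folds

baseline_mae = backtest(y)
```

A candidate model earns its keep only if its backtested error beats this baseline across folds, not just on one lucky split.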

Since seasonal purchases are highly variable, time sensitive, and a top revenue generator in our retail example, demand forecast improvement for stock planning decisions can make a massive positive impact on the bottom line.

[Chart: Seasonal]

In contrast, non-seasonal forecasts will have less impact due to their stability over time.

[Chart: Non-Seasonal]

Enhancing Input Data with External Data Sources

One of the most important things in time series is looking at how your predictions perform over time and continuing to enhance your input data with new features and external data.

By seeing where your model makes mistakes, you might find a fascinating pattern from a business process or event that was omitted in your initially prepared data.
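Enriching input data with an external event calendar is typically a date join; this pandas sketch assumes made-up towing volumes and game dates (foreshadowing the towing example below, where the spike arrives the morning after a game, hence the one-day shift):

```python
import pandas as pd

# Hypothetical external event calendar (game dates are invented)
games = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-06", "2019-01-13"]),
    "home_game": 1,
})

# Hypothetical daily towing-call counts
tows = pd.DataFrame({
    "date": pd.date_range("2019-01-05", periods=10, freq="D"),
    "tow_calls": [5, 30, 6, 4, 5, 7, 6, 28, 5, 6],
})

# Left-join the event flag, then shift it by one day: the effect shows up
# the morning after the game, not on game day itself
enriched = tows.merge(games, on="date", how="left").fillna({"home_game": 0})
enriched["game_yesterday"] = enriched["home_game"].shift(1, fill_value=0)
```

Aligning an external variable to the day its effect actually lands is often the difference between a feature that helps and one that confuses the model.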

This is where the beauty of the human mind shines brightly.

A fun real-world example of this tip was a towing company learning that hometown team football game schedules needed to be included for reliably predicting required car towing staff.

Apparently, beer-drinking football fans were smart enough to find another ride home.

The next morning, they called towing companies.

That significant variable was found by reviewing where the largest errors in forecasts were happening.

Only a human was able to decipher local football games as the missing data prep ingredient.

Originally published on January 27, 2019.
