Cold Start Energy Predictions

Denis Vorotyntsev, Jan 10

About three months ago, I participated in “Power Laws: Cold Start Energy Forecasting”, a competition organized by Schneider Electric on the DrivenData platform.

The aim was to predict the electricity consumption of several buildings based on previous consumption and additional factors such as temperature, holiday information, etc.

An interesting part of the challenge was that only a little consumption information (2 days or less) was available for some buildings, so our models had to generalize well to new buildings.

As a result, I took 4th place among about 1,300 participants.

This post will describe my solution and main takeaways from this competition.

The final leaderboard of the competition

Task, Data and Metric

Forecasting the global energy consumption of a building can play a pivotal role in building operations.

It provides an initial check for facility managers and building automation systems to mark any discrepancy between expected and actual energy use.

Accurate energy consumption forecasts are also used by facility managers, utility companies, and building commissioning projects to implement energy-saving policies and optimize the operations of chillers, boilers, and energy storage systems.

Usually, forecasting algorithms use historical information to compute their forecast.

Most of the time, the bigger the historical data set, the more accurate the forecast will be.

The goal of this challenge was to build an algorithm that provides an accurate forecast from the very start of building instrumentation.

The organizers provided a training data set consisting of 758 buildings with known electricity consumption over 672 hours (509,376 data points in total).

Also, we had information about the building type (based on the surface area), its holidays, and temperatures.

If a building is a flat or an apartment, Saturday and Sunday are probably holidays; if it is a shopping mall, it works on weekends without holidays.

To get yourself familiar with the data, you might check the organizer’s EDA.

The test dataset consisted of 625 new buildings that were not present in the training dataset.

We had information about the consumption of the “test” buildings only for the last 24–372 hours.

Three time horizons for predictions were distinguished.

For each building, the goal was one of the following:

- to forecast the consumption for each hour of a day (24 predictions);
- to forecast the consumption for each day of a week (7 predictions);
- to forecast the consumption for each week for two weeks (2 predictions).

An illustration of the train/test split and what we had to predict

The competition metric was similar to MAPE: NMAE (normalized mean absolute error), which made each prediction (hourly, daily, or weekly) equally important; the absolute errors are normalized by the mean of the true values within each consumption series.

The competition metric
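As a rough illustration, here is a minimal sketch of how such a normalized MAE could be computed for one consumption series, assuming errors are divided by the mean of the true values of that series (the organizers' exact weighting across horizons may differ):

```python
import numpy as np

def nmae(y_true, y_pred):
    """Normalized MAE for one consumption series: absolute errors are divided
    by the mean of the true values, so buildings with very different
    consumption levels contribute comparably to the score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_pred - y_true)) / np.mean(y_true)

# Identical relative errors give identical NMAE regardless of scale.
print(nmae([10, 20, 30], [11, 22, 33]))        # 0.1
print(nmae([100, 200, 300], [110, 220, 330]))  # 0.1
```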

Final Solution

During this competition, I tested a lot of ideas, but most of them did not give me any significant improvement in score. It would take too long to describe everything I tried; therefore, in this part I will focus only on the main components of my final solution.

Data Preprocessing

Data preprocessing consisted of three main steps: filling missing values, removing constant values, and data scaling.

First, I filled missing temperature values with the hourly mean for the given building Id and hour.

After this, the number of missing values decreased from 45% to 2%.

The remaining 2% of missing values I filled with the average monthly temperature for the given hour.
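A minimal pandas sketch of this two-step imputation, assuming a dataframe with columns `series_id`, `timestamp`, and `temperature` (the column names are mine, not the competition's exact schema):

```python
import pandas as pd

def fill_missing_temperature(df: pd.DataFrame) -> pd.DataFrame:
    """Two-step imputation of the temperature column.

    Assumes df has columns: series_id, timestamp (datetime64), temperature.
    """
    hour = df["timestamp"].dt.hour
    month = df["timestamp"].dt.month

    # Step 1: hourly mean for the given building Id and hour of day.
    per_building_hour = df.groupby([df["series_id"], hour])["temperature"].transform("mean")
    df["temperature"] = df["temperature"].fillna(per_building_hour)

    # Step 2: fall back to the average temperature for the given month and hour.
    per_month_hour = df.groupby([month, hour])["temperature"].transform("mean")
    df["temperature"] = df["temperature"].fillna(per_month_hour)
    return df
```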

Second, I cleaned all time series with a constant target.

There were several examples in the training data set with a long period of constant consumption.

Those examples looked like values that were originally missing and that the organizers had filled with median values (the constant values were close to the median, but not exactly equal to it).

If I had kept those values, my predictions would have been biased, so I deleted all data points for which the consumption had been constant over the previous 6 hours.
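A sketch of this filtering step, assuming one row per hour per building, sorted by time, with a `consumption` column (again, the names are assumptions):

```python
import pandas as pd

def drop_constant_consumption(df: pd.DataFrame, window: int = 6) -> pd.DataFrame:
    """Drop rows whose consumption has been constant over the last `window` hours.

    Assumes df is sorted by series_id and timestamp and has a 'consumption' column.
    """
    rolling_nunique = (
        df.groupby("series_id")["consumption"]
          .rolling(window, min_periods=window)
          .apply(lambda x: x.nunique(), raw=False)
          .reset_index(level=0, drop=True)
    )
    # Keep rows where the window is incomplete (NaN) or the values vary.
    mask = rolling_nunique.isna() | (rolling_nunique > 1)
    return df[mask]
```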

One of the most important parts of my solution, I think, was scaling each building's consumption independently.

When you are preprocessing data for neural nets or linear regression, a common approach is to use all of the training data for target scaling. That was not the case in this competition because of the metric.

I normalized the values of each building with minimum and maximum values of the given building.

This made it possible to compare buildings with different levels of electricity consumption; thus, my models started to generalize better.
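A minimal sketch of this per-building min-max scaling (column names are assumed); at prediction time the scaled output is mapped back with pred * (max - min) + min for the corresponding building:

```python
import pandas as pd

def scale_target_per_building(df: pd.DataFrame):
    """Min-max scale consumption within each building (series_id) independently."""
    grp = df.groupby("series_id")["consumption"]
    t_min = grp.transform("min")
    t_max = grp.transform("max")
    # Series with constant consumption were removed earlier, so t_max > t_min.
    df["consumption_scaled"] = (df["consumption"] - t_min) / (t_max - t_min)

    # Keep per-building min/max to invert the scaling at inference time.
    scalers = df.groupby("series_id")["consumption"].agg(["min", "max"])
    return df, scalers
```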

I made a toy example to illustrate the idea.

Suppose we have two buildings, which are shown in the picture below.

They have a similar nature of consumption (1st plot), and the only difference is the level of consumption.

If we scale targets with all the available data (2nd plot), the picture does not change too much.

Our predictions are based on only one feature — an hour of the day.

The model will have a high bias (in fact, the optimal prediction in such a case would be the average consumption for the given hour).

But if we scale each time series independently, they will look identical (3rd plot), and our models will find the underlying pattern more easily.

I also tried different strategies for target preprocessing before normalization.

This approach showed significant improvements in recent machine learning competitions (the magic 1/3 power in IDAO 2018) when there are outliers in the training data.

However, none of them gave a score even close to simple min-max scaling within a building Id:

- target raised to the power 1/2, 1/3, or 1/4 + min-max scaling or mean-std scaling;
- logarithm of the target + min-max scaling or mean-std scaling.
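For reference, a minimal sketch of one of these rejected variants (a power transform followed by min-max scaling within one building):

```python
import numpy as np

def transform_target(consumption, power=1/3):
    """Power transform (e.g. the 'magic' 1/3 degree), then min-max scale."""
    t = np.power(np.asarray(consumption, dtype=float), power)
    return (t - t.min()) / (t.max() - t.min())
```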

Validation Strategy

I used a holdout validation set (20%) for checking the scores of my models and 10-fold cross-validation (stratified on building Id) for optimizing the hyperparameters of the pipeline.

I’ve noticed that most time series in this competition were stationary, and scores of time-series split and shuffled folds were pretty much the same.

Therefore, I decided to use 10 shuffled folds instead of the common time-series split, because it allowed me to use different splits afterwards, which made my predictions more stable.

Now I use a simple rule for time-series tasks: if the time series is stationary, it is ok to use KFold for hyperparameter optimization (but not for the evaluation of the whole pipeline).
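A sketch of such a shuffled, building-stratified split with scikit-learn; the exact splitting scheme used in the competition may have differed:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def make_folds(df: pd.DataFrame, n_splits: int = 10, seed: int = 42):
    """Shuffled K-fold split, stratified on building Id so that every
    building contributes rows to each fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(df, df["series_id"]))
```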

Examples of the training data. It is clearly seen that the properties of the time series do not change much over time and there is no trend in the data, so we can use KFold instead of a time-series split for validation.

Feature Engineering

I used different strategies for feature engineering, but the most important ones, I think, were the following.

Features based on the timestamp: year, day of the year, month of the year, week of the year, day of the week.

These features were treated as categorical in the NN, and I also added them as numerical features via a sine-cosine transformation (a code sketch is given below).

It is a well-known approach for dealing with cyclic features.

The idea behind this approach is simple: we want to inject prior information about the process into our model, i.e. that the end of one cycle is the beginning of the next.

Thus, after applying this mapping, the distance between 00:00 and 23:00 will become smaller.

I recommend checking these posts for code examples: Encoding Cyclical Features for Deep Learning, Feature Engineering — Handling Cyclical Features.
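As a quick illustration, here is a minimal sketch of the sine-cosine mapping (not the exact code from those posts):

```python
import numpy as np
import pandas as pd

def encode_cyclic(df: pd.DataFrame, column: str, period: int) -> pd.DataFrame:
    """Map a cyclic feature onto the unit circle so that the end of a cycle
    is close to its beginning (e.g. hour 23 ends up close to hour 0)."""
    angle = 2 * np.pi * df[column] / period
    df[f"{column}_sin"] = np.sin(angle)
    df[f"{column}_cos"] = np.cos(angle)
    return df

# Example: encode the hour of the day.
df = pd.DataFrame({"hour": range(24)})
df = encode_cyclic(df, "hour", period=24)
```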

I transformed temperatures from degrees Celsius to Kelvin. Then I added the temperature of the current hour, the temperature of the next hour, and the absolute and relative differences between the temperatures of the current and the next hour as new features.
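A rough pandas sketch of these temperature features, assuming one row per hour per building and column names of my own choosing:

```python
import pandas as pd

def add_temperature_features(df: pd.DataFrame) -> pd.DataFrame:
    """Next-hour temperature and hour-to-hour differences, computed per building."""
    df["temperature_kelvin"] = df["temperature"] + 273.15
    df["temperature_next"] = df.groupby("series_id")["temperature_kelvin"].shift(-1)
    df["temp_diff_abs"] = df["temperature_next"] - df["temperature_kelvin"]
    df["temp_diff_rel"] = df["temp_diff_abs"] / df["temperature_kelvin"]
    return df
```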

Features based on the type of building: the log of the maximum and minimum consumption, the building surface category, days off, and the building's regime of work (the is_{weekday}_dayoff flags combined into a single category).
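A sketch of these building-level features, with the per-weekday day-off flags combined into one work-regime category (the column names are assumptions, not the competition's exact schema):

```python
import numpy as np
import pandas as pd

def add_building_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-building consumption aggregates and a combined work-regime category."""
    grp = df.groupby("series_id")["consumption"]
    df["log_max_consumption"] = np.log1p(grp.transform("max"))
    df["log_min_consumption"] = np.log1p(grp.transform("min"))

    # Combine is_monday_dayoff ... is_sunday_dayoff into a single categorical
    # code, e.g. "0000011" for a building that is off only on weekends.
    dayoff_cols = [f"is_{d}_dayoff" for d in
                   ["monday", "tuesday", "wednesday", "thursday",
                    "friday", "saturday", "sunday"]]
    df["work_regime"] = df[dayoff_cols].astype(int).astype(str).agg("".join, axis=1)
    return df
```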

According to the competition rules, the use of additional data was prohibited.

But I tested the following idea anyway, out of curiosity.

I used an open database of temperature observations to determine the building location.

We knew the temperature at each hour for a given building, so I searched the database for similar temperature patterns.

For each building, I found the station whose temperature observations were closest to the building's temperatures (the sum of absolute differences between the actual values and the station observations) and used information about that station (latitude, longitude, country, and the nearest city) as additional features.

However, it decreased the score both on validation and leaderboard.

I think it was due to a lot of missing values in the temperature data.

I gave up this great idea.

Models

I used a feed-forward neural net as my final model, with the architecture shown below.

I used embedding layers for all categorical features.

It is a commonly used approach for dealing with categories in NN.

Embeddings map similar categories close to each other (see this paper for a gentle introduction).

In this competition, it was extremely beneficial because of the large number of buildings in the dataset.

I trained the NN with the following parameters: 100 epochs, batch size 1024, Adam optimizer, and early stopping with a patience of 3 epochs.
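A hedged Keras sketch of this kind of architecture: one embedding per categorical feature, concatenated with the numeric inputs and passed through a few dense layers. The layer sizes, output activation, and loss below are my assumptions, not the exact configuration used in the competition:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(cat_cardinalities: dict, n_numeric: int, emb_dim: int = 8) -> keras.Model:
    """Feed-forward NN with embedding layers for all categorical features."""
    cat_inputs, cat_embeddings = [], []
    for name, cardinality in cat_cardinalities.items():
        inp = keras.Input(shape=(1,), name=name)
        emb = layers.Embedding(cardinality, min(emb_dim, cardinality))(inp)
        cat_inputs.append(inp)
        cat_embeddings.append(layers.Flatten()(emb))

    num_input = keras.Input(shape=(n_numeric,), name="numeric")
    x = layers.Concatenate()(cat_embeddings + [num_input])
    x = layers.Dense(256, activation="relu")(x)  # layer sizes are assumptions
    x = layers.Dense(128, activation="relu")(x)
    # Sigmoid because the target was min-max scaled to [0, 1]; the loss is assumed.
    out = layers.Dense(1, activation="sigmoid")(x)

    model = keras.Model(cat_inputs + [num_input], out)
    model.compile(optimizer="adam", loss="mae")
    return model

# Training setup matching the description: 100 epochs, batch size 1024,
# Adam optimizer, early stopping with a patience of 3 epochs.
# model.fit(train_inputs, y_train, validation_data=(val_inputs, y_val),
#           epochs=100, batch_size=1024,
#           callbacks=[keras.callbacks.EarlyStopping(patience=3)])
```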

I tested different losses for training NN in Keras.

I had a feeling that MAPE and MSE losses could give different high-scoring predictions, which I could then blend together to achieve a higher score.

But those approaches did not work well enough to include in the final pipeline.

I also tried the well-known gradient boosting methods (XGBoost and LightGBM), trained both on the raw data and on a representation of the data produced by the NN (the output of an inner layer after several epochs of training), but they were not able to show a decent result.

The scores of such models were not good enough even to be included in the final ensemble.

My final submission was just a simple average of 30 models (10 Folds * 3 slightly different architectures).

Stacking did not improve the score much.

Final Results

For most of the competition, I held first place with a huge gap over second place.

But in the last two days, the participants in 2nd and 4th place teamed up, and a little shake-up made all the difference.

Shake-up be like

Anyway, I spent 1.5 months solving an interesting task, which helped me organize a pipeline for similar tasks and competitions.

Part of my final solution was used at Junction 2018, where we won the “Signaling Heroes” challenge in the Smart Cloud track (ODS.AI team).

More details
