Predicting Stock Prices with Echo State Networks

Matthew Stewart, PhD Researcher

Mar 18

People have tried and failed to reliably predict the seemingly chaotic nature of the stock market for decades.

Do neural networks hold the key?

“There (is) order and even great beauty in what looks like total chaos. If we look closely enough at the randomness around us, patterns will start to emerge.” ― Aaron Sorkin

The Motivation for Time Series Prediction

The stock market is typically viewed as a chaotic time series, and advanced stochastic methods are often applied by companies to try and make reasonably accurate predictions so that they can get the upper hand and make money.

This is essentially the idea behind much of investment banking, especially for market traders.

I do not claim to know much about the stock market (I am, after all, a scientist and not an investment banker), but I do know a reasonable amount about machine learning and stochastic methods.

One of the greatest problems in this area is trying to accurately predict chaotic time series in a reliable manner.

The idea of predicting the dynamics of chaotic systems is somewhat counterintuitive given that something chaotic, by definition, does not behave in a predictable manner.

The study of time series was around before the introduction of the stock market but saw a marked increase in its popularity as individuals tried to leverage the stock market in order to ‘beat the system’ and become wealthy.

In order to do this, people had to develop reliable methods of estimating market trends based on prior information.

First, let us talk about some properties of time series that make them easy to analyze so that we can appreciate why time series analysis can get pretty tough when we look at the stock market.

Time Series Properties

One of the most important properties that a time series can have is that it is stationary.

A time series is said to be stationary if its statistical properties such as mean and variance remain constant over time.

But why is it important?

Most models actually work on the assumption that the time series is stationary.

Intuitively, we can say that if a time series has behaved in a particular way over time, there is a high probability that it will behave the same way in the future.

Also, the theories related to stationary series are more mature and easier to implement as compared to non-stationary series.

Stationarity is defined using a very strict criterion.

However, for practical purposes we can assume the series to be stationary if it has constant statistical properties over time, i.e. the following:

1. Constant mean. This should be intuitive since if the mean is changing then the time series can be seen to be moving, as seen by contrasting the two below figures.

2. Constant variance. This property is known as homoscedasticity. The following figure depicts a stationary vs non-stationary example that violates this property.

3. An autocorrelation that does not depend on time. In the below figures you will notice the spread becomes closer as the time increases. Hence, the covariance is not constant with time for the non-stationary case.

Why do I care about the ‘stationarity’ of a time series?

The reason I took up this section first is that most classical time series models assume stationarity, so unless your time series is stationary, you cannot reliably apply them.

In cases where the criteria for a stationary time series are violated, the first requisite is to transform the time series to make it stationary, and then try stochastic models to predict this time series.

There are multiple ways of making a series stationary.

Common ones include detrending and differencing.
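To make differencing concrete, here is a minimal sketch on a toy series. The `difference` helper and the toy data are illustrative assumptions, not code from this article.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# A toy series with a linear trend: its mean drifts upward, so it is not stationary.
trended = [2.0 * t + 5.0 for t in range(10)]

# First-order differencing removes the linear trend and leaves a constant series.
differenced = difference(trended)
print(differenced)
```

Differencing once removes a linear trend; a seasonal pattern can be handled the same way by setting `lag` to the length of the season.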

This may seem a little stupid to those of you who are not familiar with time series analysis.

However, it is a little more complicated than it first appears (isn’t it always?).

It just turns out that the best way to deal with a time series is to first ‘stationarize’ it, and decouple it into several different characteristics such as a linear trend, separate time series with different seasonal qualities, and then add them back together at the end.

An example of decoupling a time series into multiple series with desirable properties.

For anyone familiar with Fourier transforms, this is a close analogy.

What a Fourier transform does is separate out the different frequency components of a time series and transform them into the frequency domain so that they can be represented more simply.

These can then be manipulated or analyzed more easily before being transformed back into the time domain.

How do I test for stationarity?

It might not always be obvious from visual observations whether a time series is stationary or not.

So, more formally, we can check stationarity using the following:

Plotting Rolling Statistics: We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I mean that at any instant ‘t’, we’ll take the average/variance of the last year, i.e. the last 12 months. But again this is more of a visual technique.

Dickey-Fuller Test: This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the time series is non-stationary. The test results comprise a Test Statistic and some Critical Values for different confidence levels. If the ‘Test Statistic’ is less than the ‘Critical Value’, we can reject the null hypothesis and say that the series is stationary.

Refer to this article for details.
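Both checks can be sketched with NumPy alone. The code below is a simplified illustration under stated assumptions: the function names are hypothetical, the test is the basic (non-augmented) Dickey-Fuller regression with a constant term, and the critical value of -2.86 is the approximate large-sample 5% value for that case. In practice you would more likely use a library routine such as `adfuller` from statsmodels.

```python
import numpy as np

def rolling_mean_and_variance(series, window=12):
    """Mean and variance over every window of `window` consecutive points."""
    series = np.asarray(series, dtype=float)
    starts = range(len(series) - window + 1)
    means = np.array([series[i:i + window].mean() for i in starts])
    variances = np.array([series[i:i + window].var() for i in starts])
    return means, variances

def dickey_fuller_statistic(series):
    """t-statistic of gamma in the regression diff(y)[t] = c + gamma * y[t-1] + error."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    coef, _, _, _ = np.linalg.lstsq(X, dy, rcond=None)
    residuals = dy - X @ coef
    sigma2 = residuals @ residuals / (len(dy) - X.shape[1])
    covariance = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(covariance[1, 1])

rng = np.random.default_rng(0)
white_noise = rng.normal(size=500)      # stationary by construction
random_walk = np.cumsum(white_noise)    # non-stationary by construction

means, variances = rolling_mean_and_variance(white_noise, window=12)

CRITICAL_VALUE_5PCT = -2.86  # assumption: approximate asymptotic value, constant-only case
noise_stat = dickey_fuller_statistic(white_noise)
walk_stat = dickey_fuller_statistic(random_walk)
print("white noise:", round(noise_stat, 2), "reject null:", noise_stat < CRITICAL_VALUE_5PCT)
print("random walk:", round(walk_stat, 2), "reject null:", walk_stat < CRITICAL_VALUE_5PCT)
```

For white noise the statistic is strongly negative, so the null hypothesis of non-stationarity is rejected. For a random walk the test will usually fail to reject it.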

Now that we know a bit more about time series, we can look at the traditional ways people study time series, how they develop their models, and why they are inadequate for studying the stock market.

Basic Methods for Time Series Prediction

The most basic methods are so simple that I think most people could have come up with them without taking a class on time series analysis.

The simplest model that is of some use is the moving average.

Essentially, the moving average takes the last t values and takes the average of these as the prediction for the next point.

The moving average is surprisingly accurate and the robustness to outliers and short-term fluctuations can be controlled by altering the number of previous points used in the averaging process.
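A moving average forecast fits in a few lines. This is a minimal sketch; the function name and the toy prices are assumptions for illustration.

```python
def moving_average_forecast(series, window):
    """Predict the next point as the mean of the last `window` observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window

prices = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]

# A short window reacts quickly to recent moves; a long window is smoother
# and less sensitive to outliers and short-term fluctuations.
short_forecast = moving_average_forecast(prices, window=2)
long_forecast = moving_average_forecast(prices, window=6)
print(short_forecast, long_forecast)
```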

More complex procedures then proceed naturally from this, such as exponential smoothing.

This is similar to the moving average except it is a weighted procedure that puts a higher importance on the most recent data points.

The particular weighting function used in exponential smoothing is (no surprise) an exponential function, but the procedure can be weighted using different methods.
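As a rough sketch, simple exponential smoothing can be written as a running update in which `alpha` controls how quickly old observations are forgotten. The function name and the initialization at the first observation are assumptions (a common choice, but not the only one).

```python
def exponential_smoothing(series, alpha):
    """Smooth a series; the last smoothed value is the one-step-ahead forecast."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]  # assumption: initialize at the first observation
    smoothed = [level]
    for value in series[1:]:
        level = alpha * value + (1.0 - alpha) * level
        smoothed.append(level)
    return smoothed

alpha = 0.5
smoothed = exponential_smoothing([10.0, 20.0, 20.0], alpha)

# Unrolling the update shows the exponential weighting: an observation that is
# k steps old receives weight alpha * (1 - alpha) ** k.
weights = [alpha * (1.0 - alpha) ** k for k in range(4)]
print(smoothed, weights)
```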

These methods are fine for relatively consistent and periodic time series, but they struggle with series that combine seasonality with a persistent linear trend, or that exhibit any substantial randomness or chaotic behavior.

For example, if I have a weekly oscillation and I am using a moving average model that averages the data from the last week, I will completely miss this oscillatory behavior with my model.

One very popular method for analyzing time series with different levels of autocorrelation (e.g. a weekly trend combined with a monthly and yearly trend) is called Holt’s linear model.

Holt extended simple exponential smoothing to allow forecasting of data with a trend.

It is nothing more than exponential smoothing applied to both level (the average value in the series) and trend.

To express this in mathematical notation we now need three equations: one for level, one for the trend and one to combine the level and trend to get the expected forecast ŷ.
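In their standard textbook form (stated here as an assumption, since the article's own equations did not survive), the three equations are: level l[t] = alpha * y[t] + (1 - alpha) * (l[t-1] + b[t-1]), trend b[t] = beta * (l[t] - l[t-1]) + (1 - beta) * b[t-1], and forecast ŷ[t+h] = l[t] + h * b[t]. A minimal sketch follows; the function name and the initialization are illustrative choices.

```python
def holt_linear_forecast(series, alpha, beta, horizon=1):
    """Forecast `horizon` steps ahead with Holt's linear trend method."""
    # Assumption: initialize the level at the first point and the trend
    # at the first difference (one common choice among several).
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Level equation: smooth the observation against the previous projection.
        level = alpha * value + (1.0 - alpha) * (previous_level + trend)
        # Trend equation: smooth the change in level against the previous trend.
        trend = beta * (level - previous_level) + (1.0 - beta) * trend
    # Forecast equation: extend the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]

# On a perfectly linear series the method recovers the line exactly.
forecasts = holt_linear_forecast([3.0, 5.0, 7.0, 9.0, 11.0], alpha=0.5, beta=0.5, horizon=3)
print(forecasts)
```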

The other most popular technique for this is using ARIMA.

This stands for autoregressive integrated moving average.

As you can probably guess, it incorporates the moving averages as well as autoregressive features (looking at correlations between subsequent timesteps).

The ARIMA model follows a specific methodology.

Essentially, we take the original data and decouple the time series into stationary and non-stationary components.

We can then study charts known as autocorrelation or partial autocorrelation plots, which look at how strongly correlated a specific value is compared to its predecessors.

From this, we can determine how to build the ARIMA model to make the predictions.
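The quantity shown in an autocorrelation plot can be computed directly. The sketch below uses the usual sample autocorrelation estimate; the function name and the toy periodic series are assumptions for illustration.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((x - mean) ** 2 for x in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

# A toy series that repeats every 4 steps.
periodic = [1.0, 0.0, -1.0, 0.0] * 10

# Strong positive correlation at the period (lag 4) and strong negative
# correlation half a period away (lag 2).
acf = {lag: autocorrelation(periodic, lag) for lag in (0, 2, 4)}
print(acf)
```

Peaks at particular lags in such a plot are what guide the choice of the autoregressive and moving average orders of an ARIMA model.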

All of these methods rely on a stationary time series that has some kind of autocorrelation and/or periodicity.

This is a feature that is inherently not present in the stock market.

There are indeed times when the stock market oscillates, and these are studied in great detail in any economics class at university.

These are the Kitchin cycle (3–5 years periodicity), Juglar cycle (7–11 years periodicity), Kuznets swing (15–25 years periodicity) and Kondratiev wave (45–60 years periodicity — although this one is still debated by economists).

However, stocks of individual companies generally do not follow this trend; some win and some lose more than others.

They are affected by political, economic, and social factors that are essentially random and chaotic from the point of view of a time series model.

In addition, these waves are not understood to a degree of accuracy that one can make useful predictions about the future of economic markets based on their existence — which makes sense because otherwise, everyone would do it.

How about neural networks?

Neural networks seem to work for just about anything that involves non-linear feature spaces.

In fact, recurrent neural networks can and have been used to predict the stock market.

However, there are several challenges facing recurrent neural networks (RNNs) with regard to predicting stock prices, most notably the vanishing gradients problem associated with RNNs, as well as very noisy predictions.

A comprehensive walkthrough showing how to implement a type of RNN called an LSTM to predict stock prices can be found here.

By far the most important problem for RNNs is the vanishing gradient problem.

This issue comes from the fact that very deep neural networks that are optimized by a procedure called backpropagation use derivatives between each layer in order to ‘learn’.

These derivatives can be relatively small or relatively large.

If my network has 100 hidden layers, and I multiply a small number by itself 100 times, the value essentially disappears.
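The arithmetic is easy to check. This small sketch multiplies a per-layer factor by itself 100 times; the factors 0.5, 1.0 and 1.5 are arbitrary illustrative values, not measurements from a real network.

```python
def repeated_product(factor, depth):
    """Multiply `factor` by itself `depth` times, as a chain of derivatives would."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product

vanishing = repeated_product(0.5, 100)   # shrinks towards zero
stable = repeated_product(1.0, 100)      # stays at one
exploding = repeated_product(1.5, 100)   # grows without bound
print(vanishing, stable, exploding)
```

Only factors very close to one survive a chain of 100 multiplications.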

That is a problem: my network cannot learn anything if all my gradients are zero, so what can I do?

There are 3 solutions to this:

1. Gradient clipping

2. Special RNNs with leaky units, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)

3. Echo state RNNs

Gradient clipping stops our gradients from getting too big, but it does not stop them from getting too small, and we are still losing information by doing this, so it is not an ideal approach.

RNNs with leaky units are fine and are the standard technique used by most individuals and companies using RNNs for commercial purposes.

These algorithms adapt all connections (input, recurrent, output) by some version of gradient descent.

This renders these algorithms slow, and what is maybe even more cumbersome, makes the learning process prone to become disrupted by bifurcations; convergence cannot be guaranteed.

As a consequence, RNNs were rarely fielded in practical engineering applications.

This is where echo state networks come in.

Echo state networks are a relatively new invention. An echo state network is essentially a recurrent neural network with a sparsely connected hidden layer, called a ‘reservoir’, which works surprisingly well in the presence of chaotic time series.

In an echo state network, we only have to train the output weights of the network. This speeds up training, often provides better predictions, and addresses many of the problems with time series analysis that we have discussed above.

ESN training, by contrast to other methods, is fast, does not suffer from bifurcations, and is easy to implement.

On a number of benchmark tasks, ESNs have starkly outperformed all other methods of nonlinear dynamical modeling.

The echo state network is part of a category of computational science known as reservoir computing, and we will delve into it in more detail in the next section.

Echo State Networks

So we have made the case that the standard methods struggle to handle chaotic time series, which, unfortunately, just so happens to be how we model the stock market.

An approach to avoid this difficulty is to fix the recurrent and input weights and learn only the output weights: the Echo State Network (ESN).

The hidden units form a ‘reservoir’ of temporal features that capture different aspects of the input history.

The mathematical justification behind the ESN is rather involved, so I will try to avoid it for the sake of this article.

Instead, we will discuss the concept behind the ESN and look at how it can be implemented relatively simply using Python.

The description in the original paper outlining why it is called an ‘echo’ network is:

“The unifying theme throughout all these variations is to use a fixed RNN as a random nonlinear excitable medium, whose high-dimensional dynamical “echo” response to a driving input (and/or output feedback) is used as a non-orthogonal signal basis to reconstruct the desired output by a linear combination, minimizing some error criteria.”

An ESN takes an arbitrary length sequence input vector (u), (1) maps it into a high-dimensional feature space (i.e. the recurrent reservoir state h), and (2) applies a linear predictor (linear regression) to find ŷ.

Schematic diagram of an echo state network.

We essentially train only the output weights, which drastically speeds up the training.

This is the great advantage of reservoir computing.
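To show how little is actually trained, here is a minimal NumPy sketch of the idea. It is not the pyESN implementation and not the code from this article: the function names, the tanh update, the ridge-regression readout and all parameter values are assumptions chosen for illustration. The task is a toy one, predicting the next value of a sine wave.

```python
import numpy as np

def make_reservoir(n_inputs, n_reservoir, spectral_radius, sparsity, rng):
    """Random input and recurrent weights; they are fixed and never trained."""
    W_in = rng.uniform(-1.0, 1.0, size=(n_reservoir, n_inputs))
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    W[rng.random(W.shape) < sparsity] = 0.0  # zero out a proportion of the links
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))  # rescale
    return W_in, W

def run_reservoir(W_in, W, inputs):
    """Drive the reservoir with the inputs and record the state h at each step."""
    h = np.zeros(W.shape[0])
    states = np.empty((len(inputs), W.shape[0]))
    for t, u in enumerate(inputs):
        h = np.tanh(W_in @ u + W @ h)
        states[t] = h
    return states

def fit_readout(states, targets, ridge=1e-4):
    """The only training step: ridge regression from states to targets."""
    gram = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(gram, states.T @ targets)

rng = np.random.default_rng(0)
signal = np.sin(np.arange(600) * 0.1)
inputs, targets = signal[:-1, None], signal[1:, None]  # predict one step ahead

W_in, W = make_reservoir(n_inputs=1, n_reservoir=100,
                         spectral_radius=0.9, sparsity=0.8, rng=rng)
states = run_reservoir(W_in, W, inputs)

washout, train_end = 50, 400  # discard the initial transient, then train
W_out = fit_readout(states[washout:train_end], targets[washout:train_end])

predictions = states[train_end:] @ W_out
test_mse = float(np.mean((predictions - targets[train_end:]) ** 2))
print("held-out MSE:", test_mse)
```

Training reduces to solving a single linear system for `W_out`, which is why fitting is so much faster than backpropagating through a recurrent network.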

By setting and fixing the input and recurrent weights to represent a rich history, we obtain:

Recurrent states as dynamical systems near to stability: stability means the Jacobians are all close to one (no vanishing or exploding gradients).

Leaky hidden units that partially remember the previous state: they avoid exploding/vanishing gradients, whilst at the same time having no need of training.

Echo state property

In order for the ESN principle to work, the reservoir must have the echo state property (ESP), which relates asymptotic properties of the excited reservoir dynamics to the driving signal.

Intuitively, the ESP states that the reservoir will asymptotically wash out any information from initial conditions.

The ESP is guaranteed for additive-sigmoid neuron reservoirs, if the reservoir weight matrix (and the leaking rates) satisfy certain algebraic conditions in terms of singular values.
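The washing-out of initial conditions can be seen numerically. In this sketch (an illustration under assumptions, with hypothetical names and sizes) the reservoir matrix is scaled so that its largest singular value is 0.9, which is one sufficient condition of this kind for a tanh reservoir. Two copies of the reservoir start from different random states, receive the same input, and end up in the same state.

```python
import numpy as np

rng = np.random.default_rng(1)
n_reservoir = 50

W = rng.normal(size=(n_reservoir, n_reservoir))
W *= 0.9 / np.linalg.norm(W, 2)  # scale the largest singular value to 0.9
W_in = rng.uniform(-1.0, 1.0, size=n_reservoir)

inputs = np.sin(np.arange(300) * 0.2)  # the shared driving signal

# Two different, arbitrary initial reservoir states.
h_a = rng.uniform(-1.0, 1.0, size=n_reservoir)
h_b = rng.uniform(-1.0, 1.0, size=n_reservoir)

gaps = []
for u in inputs:
    h_a = np.tanh(W @ h_a + W_in * u)
    h_b = np.tanh(W @ h_b + W_in * u)
    gaps.append(float(np.linalg.norm(h_a - h_b)))

print("gap after first step:", gaps[0])
print("gap after last step:", gaps[-1])
```

Because tanh never stretches distances and the matrix shrinks them by at least a factor of 0.9 per step, the gap between the two trajectories decays geometrically, and the state ends up depending only on the input history.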

Training

So you might be wondering, how do we pick the values for the hidden state in the first place?

The input and recurrent weights are initialized randomly and then are fixed.

So, we are not training them.

How should we fix them to optimize the prediction?

The training is very easy and fast, but there are hyperparameters to choose: those that govern the random generation of the weights, the number of reservoir nodes, the sparsity of the reservoir connections, and the spectral radius.

Unfortunately, no systematic method exists to optimize the hyperparameters, and so this is typically done using a validation set.

Standard (shuffled) cross-validation is not appropriate for time series data due to the inherent autocorrelation present in the data.

To recap:

The network nodes each have distinct dynamical behavior.

Time delays of the signal may occur along the network links.

The hidden part of the network has recurrent connections.

The input and internal weights are fixed and randomly chosen.

Only the output weights are adjusted during the training.

Coding an Echo State Network

Now for the moment you’ve all been waiting for, how do you actually code these mysterious networks?

We use a Python library which is available from this GitHub repository.

The library is called pyESN.

In order to install this library, you must clone the repository and put the pyESN.py file in your current Jupyter Notebook folder.

Then when you are in the Python 3 notebook you can simply call import pyESN.

Overview of the pyESN library for the RC implementation

You call the RC as:

For a brief explanation of the parameters:

n_inputs: number of input dimensions

n_outputs: number of output dimensions

n_reservoir: number of reservoir neurons

random_state: seed for the random generator

sparsity: the proportion of recurrent weights set to zero

spectral_radius: spectral radius of the recurrent weight matrix

noise: noise added to each neuron (regularization)

Predicting Amazon Stock Prices

I will now go over an example of using echo state networks to predict future Amazon stock prices.

First, we import all of the necessary libraries and also import our data (which in this case was scraped from the internet).

We then use the ESN from the pyESN library to employ an RC network.

The task here is to predict two days ahead by using the previous 1500 points and do that for 100 future points (check the figure below).

So, in the end you will have a 100 time step prediction with prediction-window = 2.

We will use this as the validation set.
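The sliding-window scheme described here can be sketched independently of any particular model. The code below is not the code used in this article: the data is a synthetic random walk rather than Amazon prices, and `naive_forecaster` is a hypothetical stand-in (it simply repeats the last observed value) that marks where a fitted echo state network would go.

```python
import numpy as np

def mean_squared_error(actual, predicted):
    """Average squared difference between two equally long sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

def naive_forecaster(history, steps):
    """Hypothetical stand-in model: repeat the last observed value."""
    return np.full(steps, history[-1])

def rolling_forecast(data, train_len, window, total, forecaster):
    """Predict `total` points, `window` at a time, from the previous `train_len` points."""
    predictions = np.zeros(total)
    for start in range(0, total, window):
        steps = min(window, total - start)
        history = data[start:start + train_len]  # the most recent train_len points
        predictions[start:start + steps] = forecaster(history, steps)
    return predictions

# Assumption: a synthetic random walk standing in for a price series.
rng = np.random.default_rng(0)
data = 100.0 + np.cumsum(rng.normal(size=1600))

train_len, window, total = 1500, 2, 100
predictions = rolling_forecast(data, train_len, window, total, naive_forecaster)
validation = data[train_len:train_len + total]
error = mean_squared_error(validation, predictions)
print("validation MSE:", error)
```

Each pass of the loop slides the training window forward by `window` points, so every prediction is made from data that precedes it in time.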

First, we create our echo state network implementation using some reasonable values and specify our training and validation length.

We then create functions to calculate the mean squared error as well as to run an echo state network for specific input arguments of the spectral radius, noise, and the window length.

Now we can simply run one function and obtain our prediction, and then we can plot this to see how well we did.

The above code produces the following plot.

And if we zoom in on this plot, we can see just how impressive this prediction actually is.

Not bad, right?

The only caveat is that it seems to work well for short time periods (on the order of one or two days) with reasonable accuracy, but the errors become increasingly large as that estimate is extrapolated further.

The above model was made with a prediction window of two days, meaning that we are only ever predicting 2 days into the future at any given time.

We can illustrate this by increasing the window length.

For a window length of 10 days, the prediction is still surprisingly accurate, although it is noticeably noisier than the two-day prediction window.

The error caused by extrapolating can be reduced by selecting appropriate hyperparameters for the given problem.

This is entirely task-dependent and must be done by iteratively testing on the validation set to find the optimal subset of hyperparameters.

Fortunately, because far fewer weights have to be trained in an ESN, this does not take as long as for a traditional RNN.

Final Word

The ability of the echo state network to analyze chaotic time series makes it an interesting tool for financial forecasting where the data is highly nonlinear and chaotic.

But we can do more with these networks than predict the stock market.

We can also:

Forecast the weather

Control complex dynamical systems

Perform pattern recognition

Expect to hear a lot more about these kinds of networks in the future, especially as people move into the development of DeepESN models which are able to work in a much higher dimensional latent space with temporal features that are able to tackle some of the most difficult time series problems.

If you are interested in these networks, there are more articles and research papers discussing them that are freely accessible.

Echo state network (Scholarpedia): Echo state networks (ESN) provide an architecture and supervised learning principle for recurrent neural networks… (www.scholarpedia.org)

Deep Echo State Network (DeepESN): A Brief Survey: The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining an… (arxiv.org)