# Variational Autoencoder In Finance

Dimensionality Reduction of Financial Time Series and Index Construction

Marie Imokoyende, Apr 14

This article explores the use of a variational autoencoder to reduce the dimensions of financial time series with Keras and Python.

We will further detect similarities between financial instruments in different markets and will use the results obtained to construct a custom index.

Disclaimer: The research presented in this article comes from our Winter 2019 Term Project for the Deep Learning course at the University of Toronto School of Continuing Studies.

It was done in collaboration with Humberto Ribeiro de Souza.

The concepts and ideas are our own.

We are in no way representing our current or previous employers.

## Part 1: Dimensionality Reduction Using a Variational Autoencoder

In this section, we will discuss:

- Creating the geometric moving average dataset
- Augmenting the data with stochastic simulation
- Building the variational autoencoder model
- Obtaining the predictions

### Creating The Geometric Moving Average Dataset

In order to compare time series of various price ranges, we have chosen to compute geometric moving average time series of returns defined as:

[Equation: geometric moving average of returns over a window of d days]

We chose d=5, as it represents a typical trading week of 5 business days.
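As an illustration only, here is one way a geometric moving average of returns could be computed with pandas. The definition used here (the rolling geometric mean of daily gross returns over d days, minus 1) is an assumption, and the function name `geometric_moving_average` is hypothetical; the formula used in our notebook may differ in detail.

```python
import numpy as np
import pandas as pd

def geometric_moving_average(prices: pd.Series, d: int = 5) -> pd.Series:
    """Rolling geometric mean of daily gross returns over d days, minus 1.

    Assumed definition for illustration; not necessarily the article's formula.
    """
    gross = 1.0 + prices.pct_change()            # daily gross returns (first value is NaN)
    log_mean = np.log(gross).rolling(d).mean()   # average log gross return over the window
    return np.exp(log_mean) - 1.0                # back to an average per-day return

# Toy price series (made-up numbers)
prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 104.0, 105.0])
geo_ma = geometric_moving_average(prices, d=5).dropna()
```

Because returns are unit-free, curves computed this way are comparable across stocks trading at very different price levels.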

The dataset used in this article contains 423 geometric moving average time series for a period going from January 4th, 2016 to March 1st, 2019.

Readers can follow the steps described in the data treatment notebook to build their own dataset.

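It should be similar to this one:

[Figure: sample of the geometric moving average dataframe]

Results can be verified by plotting some sample stock price time series and their geometric moving average curves:

[Figure: sample stock price time series and their geometric moving average curves]

Then the dataframe just built can be divided into two time periods of equal length, transposing only the one for the first period.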

Period 1 goes from January 12th, 2016 to August 4th, 2017.

Period 2 goes from August 7th, 2017 to March 1st, 2019.

We will only use the period 1 data to obtain predictions.

```python
# Divide in two: period 1 is the first half of the rows
# (head() assumes the dataframe is sorted by ascending date)
geoMA_5d_stocks_p1 = geoMA_5d_stocks.head(int(len(geoMA_5d_stocks)/2))

# Transpose the dataframe for period 1
geoMA_5d_stocks_p1_T = geoMA_5d_stocks_p1.T
```

We transpose the dataframe so that each row will represent a time series for a given stock:

[Figure: transposed period 1 dataframe]

### Augmenting the data with stochastic simulation

We will use stochastic simulation to generate synthetic geometric moving average curves.

The objective is not to precisely model returns but to obtain curves with a behavior similar to real data.

By training the model with only simulated curves we can keep the real data to obtain the predictions.

The synthetic curves are generated using Geometric Brownian Motion.

We followed the steps below:

- Using the first period dataframe, select 100 tickers randomly
- For each ticker selected, calculate a vector of log returns such that:

  [Equation: log returns of the selected ticker's curve]

- Then, for each ticker selected, generate 100 paths such that:

  [Equation: Geometric Brownian Motion path]

Here is a sample of a simulated curve and a real curve:

[Figure: a simulated curve and a real curve]

We have expanded a dataset of 423 time series to 100*100 = 10,000 new time series similar (but not equal) to the stock dataset.

This will allow us to keep the actual stock dataset universe for predictions and not even have to use it for the validation.
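The simulation steps above can be sketched as follows. This is a minimal sketch, not the notebook code: it assumes a strictly positive input curve, estimates the drift and volatility from the curve's own log returns, and uses a time step of one observation. The function name `simulate_gbm_paths` and the toy `real_curve` are hypothetical.

```python
import numpy as np

def simulate_gbm_paths(series, n_paths=100, seed=0):
    """Simulate Geometric Brownian Motion paths that start at series[0].

    Assumption: mu and sigma are the mean and standard deviation of the
    log returns of the input series, with a time step of one observation.
    """
    rng = np.random.default_rng(seed)
    series = np.asarray(series, dtype=float)
    log_ret = np.diff(np.log(series))                  # vector of log returns
    mu, sigma = log_ret.mean(), log_ret.std(ddof=1)
    n_steps = len(series) - 1
    shocks = rng.standard_normal((n_paths, n_steps))   # independent N(0, 1) draws
    increments = mu + sigma * shocks                   # simulated log returns
    log_paths = np.cumsum(increments, axis=1)          # cumulative log return per path
    start = np.zeros((n_paths, 1))                     # every path starts at series[0]
    return series[0] * np.exp(np.hstack([start, log_paths]))

# Toy "real" curve with 388 observations (made-up data)
real_curve = 100.0 * np.exp(np.cumsum(np.random.default_rng(1).normal(0.0, 0.01, 388)))
sim_paths = simulate_gbm_paths(real_curve, n_paths=100)   # shape (100, 388)
```

Repeating this for 100 randomly selected tickers and stacking the results would give a 10,000-row matrix like the `sim_paths_matrix` used in the next step.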

Before building the VAE model, create the training and test sets (using an 80%-20% ratio):

```python
# Shuffle the generated curves
shuffled_array = np.random.permutation(sim_paths_matrix)

# Split the simulated time series into a training and test set
x_train = shuffled_array[0:8000]
x_test = shuffled_array[8000:]
```

Readers should also note that there is no need to remove the seasonality and trend of the time series before training the model.

### Building the Variational Autoencoder (VAE) Model

We will use a variational autoencoder to reduce the dimensions of a time series vector with 388 items to a two-dimensional point.

Autoencoders are unsupervised algorithms used to compress data.

They are built with an encoder, a decoder and a loss function to measure the information loss between the compressed and decompressed data representations.

Our goal is not to write yet another autoencoder article.

Readers who are not familiar with autoencoders can read more on the Keras Blog and the Auto-Encoding Variational Bayes paper by Diederik Kingma and Max Welling.

We will use a simple VAE architecture similar to the one described in the Keras blog.

The encoder model has:

- One input vector of length 388
- One intermediate layer of length 300 with a rectified linear unit (ReLU) activation function
- One two-dimensional encoded (latent) output

[Figure: Encoder model summary]

The decoder model has:

- One input vector of two dimensions (sampled from the latent variables)
- One intermediate layer of length 300 with a rectified linear unit (ReLU) activation function
- The decoded vector of length 388 with a sigmoid activation function

[Figure: Decoder model summary]

The code used to build and train the VAE model is adapted from variational_autoencoder.py on the Keras team GitHub.
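To make the architecture concrete, here is a NumPy-only sketch of one forward pass and the loss for a VAE with the layer sizes described above. It is not the Keras code from the notebook: the weights are random rather than trained, the helper names (`dense`, `vae_forward`, `vae_loss`) are hypothetical, and the inputs are assumed to be scaled to [0, 1], as the sigmoid output layer suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
original_dim, intermediate_dim, latent_dim = 388, 300, 2

def dense(n_in, n_out):
    # Small random weights and zero biases; a trained model would learn these
    return rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out)

def relu(a):
    return np.maximum(a, 0.0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W_h, b_h = dense(original_dim, intermediate_dim)    # encoder hidden layer
W_mu, b_mu = dense(intermediate_dim, latent_dim)    # latent mean
W_lv, b_lv = dense(intermediate_dim, latent_dim)    # latent log-variance
W_d, b_d = dense(latent_dim, intermediate_dim)      # decoder hidden layer
W_o, b_o = dense(intermediate_dim, original_dim)    # decoder output layer

def vae_forward(x):
    h = relu(x @ W_h + b_h)
    z_mean = h @ W_mu + b_mu
    z_log_var = h @ W_lv + b_lv
    # Reparameterization trick: z = mean + std * eps, with eps ~ N(0, I)
    eps = rng.standard_normal(z_mean.shape)
    z = z_mean + np.exp(0.5 * z_log_var) * eps
    x_hat = sigmoid(relu(z @ W_d + b_d) @ W_o + b_o)
    return z_mean, z_log_var, x_hat

def vae_loss(x, x_hat, z_mean, z_log_var):
    x_hat = np.clip(x_hat, 1e-7, 1.0 - 1e-7)        # avoid log(0)
    # Reconstruction term: binary cross-entropy summed over the 388 items
    bce = -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat), axis=1)
    # KL divergence between N(z_mean, exp(z_log_var)) and N(0, I)
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var), axis=1)
    return np.mean(bce + kl)

x = rng.uniform(0.0, 1.0, (4, original_dim))        # toy batch of 4 curves
z_mean, z_log_var, x_hat = vae_forward(x)
loss = vae_loss(x, x_hat, z_mean, z_log_var)
```

In this sketch, `z_mean` plays the role of the two-dimensional point that the encoder produces for each time series.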

After training, we plot the training and validation loss curves:

[Figure: training and validation loss curves]

### Obtaining the Predictions

We will only use the encoder to obtain the predictions.

We will use a matrix of real values including both the stock dataset and one or multiple time series of interest.

In our project, we tested a stock dataset against a front month futures contract listed in another country and in a different currency.

```python
# Obtaining the predictions:
encoded_p1 = encoder.predict(matrix_to_test_p1, batch_size=batch_size)

# Convert the predictions into a dataframe
encoded_p1_df = pd.DataFrame(data = encoded_p1, columns = ['x','y'], index = dataframe_to_test_p1.T.index)
```

We obtained the following results:

[Figure: results of the encoder predictions]

Before plotting the results, we have to:

- Calculate the distance between the futures contract point and all the other stocks in the dataframe
- Select the 50 points closest to the futures contract

```python
import scipy.spatial.distance

# Calculate the distances between the futures contract point and all other points in the stocks dataset
ref_point = encoded_p1_df.loc['Futures'].values
encoded_p1_df['Distance'] = scipy.spatial.distance.cdist([ref_point], encoded_p1_df, metric='euclidean')[0]

# Get the 50 closest points:
closest_points = encoded_p1_df.sort_values('Distance', ascending = True)
# We take head(51), because the Futures reference point is the first entry
closest_points_top50 = closest_points.head(51)[1:].copy()
closest_points_top50['Ticker'] = closest_points_top50.index
```

We can now plot the results obtained to visualize the closest 50 stocks:

[Figure: the 50 stocks closest to the futures contract]

We’ve done our analysis for a futures contract listed in another country.

However, it is possible to follow the same steps in Part 1 for stocks from the same exchange.
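The nearest-point selection described in this part can be tried on toy data without a trained encoder. The sketch below uses random two-dimensional points in place of encoder output and `np.linalg.norm` in place of scipy's `cdist`; the ticker names and the choice of 3 neighbours are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tickers = [f"STOCK_{i}" for i in range(10)] + ["Futures"]

# Random 2-D points standing in for the encoder output (assumption)
encoded = pd.DataFrame(rng.normal(size=(11, 2)), columns=["x", "y"], index=tickers)

# Euclidean distance from every point to the futures reference point
ref_point = encoded.loc["Futures", ["x", "y"]].to_numpy(dtype=float)
points = encoded[["x", "y"]].to_numpy()
encoded["Distance"] = np.linalg.norm(points - ref_point, axis=1)

# Drop the reference point itself, then keep the 3 closest stocks
closest = encoded.drop(index="Futures").nsmallest(3, "Distance")
```

Dropping the reference row by label avoids relying on it being the first entry after sorting.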

## Part 2: Index Construction

Let’s use the results obtained in Part 1 to create an index.

Due to the randomness of the VAE model, we will not obtain the same exact list of top 50 stocks on each run.

To get a fair representation of the closest 50 points, we will run the VAE model 10 times (re-initializing and retraining it on each run).

Then we will take the 50 closest points found on each run to create a dataframe, closest_points_df, of length 500.

Once the closest_points_df dataframe is built:

- Sort the points by distance
- Drop the duplicate tickers, keeping only the first occurrence

```python
# results_df holds the closest_points_df dataframe described above
sorted_by_dist = results_df.sort_values('Distance', ascending = True)
sorted_by_dist.drop_duplicates(subset='Ticker', keep='first', inplace = True)
```

After dropping the duplicates, we will only keep the 50 closest points.
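The aggregation across runs can be sketched on a small example. The three hand-written dataframes below are hypothetical stand-ins for the per-run results (the project uses 10 runs of 50 points each), and keeping 4 points instead of 50 is only to keep the example readable.

```python
import pandas as pd

# Hypothetical closest points from three runs (made-up tickers and distances)
run_results = [
    pd.DataFrame({"Ticker": ["A", "B", "C"], "Distance": [0.10, 0.30, 0.50]}),
    pd.DataFrame({"Ticker": ["B", "D", "A"], "Distance": [0.05, 0.20, 0.40]}),
    pd.DataFrame({"Ticker": ["C", "A", "E"], "Distance": [0.15, 0.25, 0.35]}),
]

# Stack all runs into one dataframe
closest_points_df = pd.concat(run_results, ignore_index=True)

# Sort by distance, then keep each ticker's smallest distance only
sorted_by_dist = (
    closest_points_df.sort_values("Distance", ascending=True)
    .drop_duplicates(subset="Ticker", keep="first")
)

# Keep the closest points (4 here, 50 in the project)
top_n = sorted_by_dist.head(4)
```

Sorting before dropping duplicates is what makes `keep='first'` retain the smallest distance observed for each ticker.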

### Compute the weights of each stock

In index construction, stock weights are calculated by using different methodologies such as market capitalization or stock prices.

Instead, we will calculate the weight of each stock such that the points closest to the futures contract point will get a higher weight than the ones further from it.

With non-anonymized stock data, it is important to filter the results obtained before computing the stock weights.

Outliers should be removed and the market capitalization range should be refined.
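As a worked example of this inverse-distance weighting, consider three made-up stocks at distances 0.1, 0.2 and 0.4 from the futures contract point. The tickers and distances are assumptions for illustration only.

```python
import pandas as pd

top = pd.DataFrame({"Ticker": ["A", "B", "C"], "Distance": [0.1, 0.2, 0.4]})

# Inverse distances are 10, 5 and 2.5, which sum to 17.5
inv = 1.0 / top["Distance"]

# Normalize so the weights sum to 1: about 0.571, 0.286 and 0.143
top["Weight"] = inv / inv.sum()
```

Halving a stock's distance doubles its unnormalized weight, so very small distances can dominate the index, which is one reason filtering outliers matters.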

```python
# Calculate the weights
top50 = sorted_by_dist.head(50).copy() # Keep the closest 50 points
top50['Weight'] = (1/top50['Distance'])/np.sum(1/top50['Distance'])
```

[Figure: Sample of weights calculated]

### Compute the number of shares of each stock

After computing the weights, we calculate the number of shares of each stock in our custom index.

We need to:

- Get the price of each stock on January 4th, 2016 (the first day of period 1)
- Define the net assets amount
- Compute the number of shares

```python
# Get the stock prices on January 4th 2016
jan4_2016_stockPrice = np.zeros(len(stock_data_top50.columns))
for i in range(len(jan4_2016_stockPrice)):
    if stock_data_top50.columns[i] == top50['Ticker'].iloc[i]:
        jan4_2016_stockPrice[i] = stock_data_top50[stock_data_top50.columns[i]].iloc[0]
top50['Price Jan4_2016'] = jan4_2016_stockPrice
```

We add a column for the stock prices on January 4th, 2016.

```python
# We compute the number of shares
net_assets = 10000000 # We chose net assets = 10 million (in the currency of the stock market)
numShares = np.zeros(len(stock_data_top50.columns))
for i in range(len(jan4_2016_stockPrice)):
    if stock_data_top50.columns[i] == top50['Ticker'].iloc[i]:
        numShares[i] = int(net_assets*top50['Weight'].iloc[i]/top50['Price Jan4_2016'].iloc[i])
top50['numShares'] = numShares
```

We add a column for the number of shares.

### Construct the index

To build the index, we will use the Laspeyres index computed as:

```python
stock_index = np.zeros(len(stock_data_top50))
for i in range(len(stock_data_top50)):
    sum_num = 0
    sum_denom = 0
    for j in range(len(stock_data_top50.columns)):
        sum_num = sum_num + stock_data_top50[stock_data_top50.columns[j]].iloc[i]*top50['numShares'].iloc[j]
        sum_denom = sum_denom + stock_data_top50[stock_data_top50.columns[j]].iloc[0]*top50['numShares'].iloc[j]
    stock_index[i] = sum_num /sum_denom

# We arbitrarily start the index at 100
stock_index_df = pd.DataFrame(stock_index*100, columns = ['stock_index'], index = stock_data_top50.index)
```

We plot the custom index obtained:

[Figure: custom index]

### Compare our custom index with the futures time series

We have to scale the futures price data in order to plot it in the same graph as our custom index.

To do so we have to:

- Calculate the daily percentage change of the futures price data
- Set S_0 = 100

```python
# Calculate the percentage change
futures_data_stock_data_pct_change = futures_data_stock_data.pct_change()
futures_data_stock_data_pct_change.dropna(inplace = True)

# Scale the time series
futures_theoretical = np.zeros(len(stock_index_df))
futures_theoretical[0] = stock_index_df['stock_index'].iloc[0]
for i in range(len(futures_theoretical)-1):
    futures_theoretical[i+1] = (1+futures_data_stock_data_pct_change.iloc[i])*futures_theoretical[i]
```

We now plot both curves in the same graph:

[Figure: custom index and scaled futures time series]

Our index has mostly the same trend as the reference futures time series except for the second half of 2018.
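For readers who prefer a vectorized form, the share count, index and scaling steps can be sketched on toy data as follows. The index level computed here is $I_t = 100 \cdot \sum_j P_{j,t} Q_j / \sum_j P_{j,0} Q_j$, where $P_{j,t}$ is the price of stock $j$ on day $t$ and $Q_j$ its fixed number of shares. All prices, weights and futures values below are made up, and the variable names are not the ones from our notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2016-01-04", periods=4, freq="B")
prices = pd.DataFrame(
    {"A": [10.0, 11.0, 12.0, 11.5], "B": [20.0, 19.0, 21.0, 22.0]}, index=dates
)
weights = pd.Series({"A": 0.75, "B": 0.25})   # toy weights that sum to 1
net_assets = 10_000_000

# Whole number of shares bought on the first day with each stock's allocation
num_shares = np.floor(net_assets * weights / prices.iloc[0])

# Laspeyres index: value of the fixed basket on each day relative to day 0
index_level = 100.0 * (prices @ num_shares) / (prices.iloc[0] @ num_shares)

# Rebase the futures so it starts at the same level as the index;
# this is equivalent to compounding its daily percentage changes
futures = pd.Series([5000.0, 5100.0, 5050.0, 5200.0], index=dates)
futures_rebased = index_level.iloc[0] * futures / futures.iloc[0]
```

With these toy numbers the index moves from 100 to 106.25, 116.25 and 113.75, while the rebased futures moves from 100 to 102, 101 and 104.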

Because we use anonymized data, we did not filter the stocks for outliers and market capitalization limits.

Furthermore, there was no re-balancing throughout the two time periods observed, and we ignored distributions.

It is possible for the custom index to beat the futures index if tickers are identified and outliers are removed.

We encourage our readers to take advantage of the free GPU instances available online to create their own indices.

It was a fun experiment for us and we discovered some interesting stock patterns.

Feel free to download the two notebooks available on GitHub:

- 3546 Deep Learning Project — Data Treatment.ipynb
- 3546 Deep Learning Project — VAE & Index Construction.ipynb

## Conclusion

The use of variational autoencoders can speed up the development of new indices in foreign stock markets, even if analysts are unfamiliar with them.

Furthermore, niche indices or portfolios could be created to match customers' interests.

While this method can be used to create ETFs, we believe that it can also create new investment possibilities for Direct Indexing and Robo-Advisor firms worldwide.