Trading Bitcoin for massive profit with deep RL

Essentially, we can use this technique to find the set of hyper-parameters that make our model the most profitable.

We are searching for a needle in a haystack and Bayesian optimization is our magnet.

Let’s get started!

Modifications

The first thing we are going to do, before optimizing our hyper-parameters, is make a couple improvements on the code we wrote in the last article.

If you do not yet have the code, you can grab it from my GitHub.

Recurrent Networks

The first change we need to make is to update our policy to use a recurrent, Long Short-Term Memory (LSTM) network, in place of our previous, Multi-Layer Perceptron (MLP) network.

Since recurrent networks are capable of maintaining internal state over time, we no longer need a sliding “look-back” window to capture the motion of the price action.

Instead, that motion is inherently captured by the recurrent nature of the network.

At each time step, the input from the data set is passed into the algorithm, along with the output from the last time step.

Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

This allows the LSTM to maintain an internal state that gets updated at each time step as the agent “remembers” and “forgets” specific data relationships.

Source: https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/

Here we update our PPO2 model to use the MlpLstmPolicy, to take advantage of its recurrent nature.

Stationary Data

It was also pointed out to me on the last article that our data is not stationary, and therefore, any machine learning model is going to have a hard time predicting future values.

The bottom line is that our time series contains an obvious trend and seasonality, which both impact our algorithm’s ability to predict the time series accurately.

We can fix this by using differencing and transformation techniques to produce a more normal distribution from our existing time series.

Differencing is the process of subtracting the previous time step’s value from the value at each time step, so the series records the change (return) between steps rather than the price level.

This has the desired result of removing the trend in our case; however, the data still has a clear seasonality to it.

We can attempt to remove that by taking the logarithm at each time step before differencing, which produces the final, stationary time series, shown below on the right.
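To make the log-then-difference step concrete, here is a minimal sketch using NumPy. The helper name log_difference and the sample prices are assumptions made up for illustration; they are not taken from the article's repository.

```python
import numpy as np

def log_difference(prices):
    """Return log(p[t]) - log(p[t-1]) for each consecutive pair of prices."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

# Hypothetical closing prices that double at every step (a strong upward trend).
closes = [100.0, 200.0, 400.0, 800.0]

plain_diff = np.diff(closes)        # still grows along with the price level
log_diff = log_difference(closes)   # constant: every step is the same relative move

print(plain_diff)
print(log_diff)
```

Plain differencing still leaves larger values at higher price levels, while the log difference treats every doubling identically. That is why taking the logarithm first tends to give a more stable series.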

We can verify the produced time series is stationary by running it through an Augmented Dickey-Fuller Test.

Doing this gives us a p-value of 0.00, allowing us to reject the test’s null hypothesis and confirm our time series is stationary.

Here we run the Augmented Dickey-Fuller Test on our transformed data set to ensure stationarity.

Now that we’ve got that out of the way, we are going to further update our observation space using a bit of feature engineering.

Feature Engineering

To further improve our model, we are going to be doing a bit of feature engineering.

Feature engineering is the process of using domain-specific knowledge to create additional input data that improves a machine learning model.

In our case, we are going to be adding some common, yet insightful technical indicators to our data set, as well as the output from the StatsModels SARIMAX prediction model.

The technical indicators should add some relevant, though lagging information to our data set, which will be complemented well by the forecasted data from our prediction model.

This combination of features should provide a nice balance of useful observations for our model to learn from.

Technical Analysis

To choose our set of technical indicators, we are going to compare the correlation of all 32 indicators (58 features) available in the ta library.

We can use pandas to find the correlation between each indicator of the same type (momentum, volume, trend, volatility), then select only the least correlated indicators from each type to use as features.

That way, we can get as much benefit out of these technical indicators as possible, without adding too much noise to our observation space.

Seaborn heatmap of technical indicator correlation on BTC data set.

It turns out that the volatility indicators are all highly correlated, as well as a couple of the momentum indicators.

When we remove all duplicate features (features with an absolute mean correlation > 0.5 within their group), we are left with 38 technical features to add to our observation space.

This is perfect, so we’ll create a utility method named add_indicators to add these features to our data frame, and call it within our environment’s initialization to avoid having to calculate these values on each time step.
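The correlation filter described above can be sketched with pandas roughly as follows. The function name least_correlated, the column names, and the toy data are all hypothetical; the only things taken from the text are the rule itself (mean absolute correlation within a group) and the 0.5 threshold.

```python
import pandas as pd

def least_correlated(features, threshold=0.5):
    """Keep columns whose mean absolute correlation with the others is <= threshold."""
    corr = features.corr().abs()
    others = len(corr.columns) - 1
    # Subtract each column's correlation with itself (1.0) before averaging.
    mean_corr = (corr.sum(axis=1) - 1.0) / others
    return [name for name in features.columns if mean_corr[name] <= threshold]

# Toy "volatility" group: a, b and c are linear copies of one another,
# while d is constructed to be uncorrelated with all three.
base = [-2.0, -1.0, 0.0, 1.0, 2.0]
volatility_group = pd.DataFrame({
    "indicator_a": base,
    "indicator_b": [2 * x + 10 for x in base],
    "indicator_c": [-3 * x for x in base],
    "indicator_d": [x * x - 2 for x in base],
})

kept = least_correlated(volatility_group)
print(kept)
```

On this toy group only indicator_d survives. Note that a mean-correlation rule like this drops every member of a tightly correlated cluster; keeping one representative per cluster would need a greedy pairwise filter instead.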

Here we initialize our environment, adding the indicators to our data frame before making it stationary.

Statistical Analysis

Next we need to add our prediction model.

We’ve chosen to use the Seasonal Auto Regressive Integrated Moving Average (SARIMA) model to provide price predictions because it can be calculated very quickly at each step, and it is decently accurate on our stationary data set.

As a bonus, it’s pretty simple to implement and it allows us to create a confidence interval for its future predictions, which is often much more insightful than a single value.

For example, our agent can learn to be more cautious trusting predictions when the confidence interval is wide and take more risk when the interval is narrow.

Here we add the SARIMAX predictions and confidence intervals to our observation space.

Now that we’ve updated our policy to use a more applicable, recurrent network and improved our observation space through contextual feature engineering, it’s time to optimize all of the things.

Reward Optimization

One might think our reward function from the previous article (i.e. rewarding incremental net worth gains) is the best we can do; however, further inspection shows this is far from the truth.

While our simple reward function from last time was able to profit, it produced volatile strategies that often led to stark losses in capital.

To improve on this, we are going to need to consider other metrics to reward, besides simply unrealized profit.

A simple improvement to this strategy, as mentioned by Sean O’Gordman in the comments of my last article, is to not only reward profits from holding BTC while it is increasing in price, but also reward profits from not holding BTC while it is decreasing in price.

For example, we could reward our agent for any incremental increase in net worth while it is holding a BTC/USD position, and again for the incremental decrease in value of BTC/USD while it is not holding any positions.
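A position-aware reward like this could look roughly like the sketch below. The function name and its arguments are hypothetical, and leaving the reward unclipped (so a missed rally while flat yields a negative reward) is an assumption rather than something the article specifies.

```python
def position_aware_reward(holding_btc, net_worth_change, btc_price_change):
    """Reward gains while holding BTC, and avoided losses while holding none."""
    if holding_btc:
        # In a position: reward the incremental change in net worth.
        return net_worth_change
    # Out of the market: reward the price drop that was avoided.
    # A price rise while flat therefore shows up as a negative reward.
    return -btc_price_change

print(position_aware_reward(True, 50.0, 30.0))    # holding during a rally
print(position_aware_reward(False, 0.0, -30.0))   # flat during a drop
```

If penalizing missed rallies turns out to be too harsh, the flat branch could be clipped at zero so the agent is only ever rewarded for sitting out a decline.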

While this strategy is great at rewarding increased returns, it fails to take into account the risk of producing those high returns.

Investors have long since discovered this flaw with simple profit measures, and have traditionally turned to risk-adjusted return metrics to account for it.

Volatility-Based Metrics

The most common risk-adjusted return metric is the Sharpe ratio.

This is a simple ratio of a portfolio’s excess returns to volatility, measured over a specific period of time.

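To maintain a high Sharpe ratio, an investment must have both high returns and low volatility (i.e. low risk).

The math for this goes as follows: the Sharpe ratio is the portfolio’s return minus the benchmark (risk-free) return, divided by the standard deviation of the portfolio’s excess returns.

This metric has stood the test of time; however, it too is flawed for our purposes, as it penalizes upside volatility.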

For Bitcoin, this can be problematic as upside volatility (wild upwards price movement) can often be quite profitable to be a part of.

This leads us to the first rewards metric we will be testing with our agents.

The Sortino ratio is very similar to the Sharpe ratio, except it only considers downside volatility as risk, rather than overall volatility.

As a result, this ratio does not penalize upside volatility.
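To see the difference between the two ratios in numbers, here is a small NumPy sketch. The function names, the zero benchmark, the use of the sample standard deviation, and the lack of annualization are all simplifying assumptions; a library implementation may differ in those details.

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of excess returns."""
    excess = np.asarray(returns, dtype=float) - risk_free
    return excess.mean() / excess.std(ddof=1)

def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation (returns below target only)."""
    excess = np.asarray(returns, dtype=float) - target
    downside = np.minimum(excess, 0.0)            # upside moves contribute zero risk
    downside_deviation = np.sqrt(np.mean(downside ** 2))
    return excess.mean() / downside_deviation

# Two hypothetical return series with identical losses; the second has one big upside spike.
calm = [0.01, -0.01, 0.02, -0.01]
spiky = [0.01, -0.01, 0.20, -0.01]

print(sharpe_ratio(calm), sharpe_ratio(spiky))
print(sortino_ratio(calm), sortino_ratio(spiky))
```

Both series have the same downside, so the Sortino ratio grows in direct proportion to the extra return from the spike. The Sharpe ratio improves far less, because the spike also inflates the standard deviation in its denominator.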

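Here’s the math: the Sortino ratio is the portfolio’s return minus the benchmark return, divided by the standard deviation of only the negative (downside) returns.

Additional Metrics

The second rewards metric that we will be testing on this data set will be the Calmar ratio.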

All of our metrics up to this point have failed to take into account drawdown.

Drawdown is the measure of a specific loss in value to a portfolio, from peak to trough.

Large drawdowns can be detrimental to successful trading strategies, as long periods of high returns can be quickly reversed by a sudden, large drawdown.

To encourage strategies that actively prevent large drawdowns, we can use a rewards metric that specifically accounts for these losses in capital, such as the Calmar ratio.

This ratio is identical to the Sharpe ratio, except that it uses maximum drawdown in place of the portfolio value’s standard deviation.
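As a rough illustration of drawdown and the Calmar ratio, consider the sketch below. The function names and net worth figures are made up, and using the total return over the period in the numerator is a simplification; the conventional definition uses an annualized return.

```python
import numpy as np

def max_drawdown(net_worths):
    """Largest peak-to-trough drop, as a positive fraction of the peak."""
    values = np.asarray(net_worths, dtype=float)
    running_peak = np.maximum.accumulate(values)
    drawdowns = (running_peak - values) / running_peak
    return drawdowns.max()

def calmar_ratio(net_worths):
    """Total return over the period divided by the maximum drawdown."""
    values = np.asarray(net_worths, dtype=float)
    total_return = values[-1] / values[0] - 1.0
    drawdown = max_drawdown(values)
    return np.inf if drawdown == 0 else total_return / drawdown

# Hypothetical account values: a rise to 120, a slide to 90, then a recovery to 150.
net_worths = [100.0, 120.0, 90.0, 150.0]

print(max_drawdown(net_worths))
print(calmar_ratio(net_worths))
```

Here the slide from 120 to 90 is a 25% drawdown against a 50% total return, so a strategy with the same return but a shallower dip would score higher.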

Our final metric, used heavily in the hedge fund industry, is the Omega ratio.

On paper, the Omega ratio should be better than both the Sortino and Calmar ratios at measuring risk vs. return, as it is able to account for the entirety of the risk over return distribution in a single metric.

To find it, we need to calculate the probability distributions of a portfolio moving above or below a specific benchmark, and then take the ratio of the two.

The higher the ratio, the higher the probability of upside potential over downside potential.

If this looks complicated, don’t worry.

It gets simpler in code.
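For a finite list of returns, the Omega ratio reduces to the sum of gains above the benchmark divided by the sum of losses below it. The sketch below assumes that discrete form, a hypothetical function name, and a benchmark (threshold) of zero.

```python
import numpy as np

def omega_ratio(returns, threshold=0.0):
    """Sum of gains above the threshold divided by the sum of losses below it."""
    excess = np.asarray(returns, dtype=float) - threshold
    gains = excess[excess > 0].sum()
    losses = -excess[excess < 0].sum()
    return np.inf if losses == 0 else gains / losses

# Hypothetical per-step returns: 5% of total gains against 3% of total losses.
returns = [0.03, -0.01, 0.02, -0.02]

print(omega_ratio(returns))
```

A value above 1 means the gains outweigh the losses relative to the chosen benchmark; raising the threshold makes the same series look less attractive.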

The Code

While writing the code for each of these rewards metrics sounds really fun, I have opted to use the empyrical library to calculate them instead.

Luckily enough, this library just happens to include the three rewards metrics we’ve defined above.

Getting a ratio at each time step is as simple as providing the list of returns and benchmark returns for a time period to the corresponding Empyrical function.

Here we set the reward at each time step based on our pre-defined reward function.

Now that we’ve decided how to measure a successful trading strategy, it’s time to figure out which of these metrics produces the most appealing results.

Let’s plug each of these reward functions into Optuna and use good old Bayesian optimization to find the best strategy for our data set.

The Toolset

Any great technician needs a great toolset.

Instead of re-inventing the wheel, we are going to take advantage of the pain and suffering of the programmers that have come before us.

For today’s job, our most important tool is going to be the optuna library, which implements Bayesian optimization using Tree-structured Parzen Estimators (TPEs).

TPEs are parallelizable, which allows us to take advantage of our GPU, dramatically decreasing our overall search time.

Let’s install it.

pip install optuna

Implementing Optuna

Optimizing hyper-parameters with Optuna is fairly simple.

First, we’ll need to create an optuna study, which is the parent container for all of our hyper-parameter trials.

A trial contains a specific configuration of hyper-parameters and its resulting cost from the objective function.

We can then call study.optimize() and pass in our objective function, and Optuna will use Bayesian optimization to find the configuration of hyper-parameters that produces the lowest cost.

In this case, our objective function consists of training and testing our PPO2 model on our Bitcoin trading environment.

The cost we return from our function is the average reward over the testing period, negated.

We need to negate the average reward because Optuna interprets lower return values as better trials.

The optimize function provides a trial object to our objective function, which we then use to specify each variable to optimize.

The optimize_ppo2() and optimize_envs() methods take in a trial object and return a dictionary of parameters to test.

The search space for each of our variables is defined by the specific suggest function we call on the trial, and the parameters we pass in to that function.

For example, trial.suggest_loguniform('n_steps', 16, 2048) will suggest a new float between 16 and 2048 from a log-uniform search space, so each doubling (16–32, 32–64, …, 1024–2048) gets equal weight.

Further, trial.suggest_uniform('cliprange', 0.1, 0.4) will suggest floats from a simple, linear search space, so every value between 0.1 and 0.4 gets equal weight.

We don’t use it here, but Optuna also provides a method for suggesting categorical variables: suggest_categorical('categorical', ['option_one', 'option_two']).
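To get a feel for how a log-uniform search space differs from a uniform one, here is a standard-library sketch. It is not Optuna code: the two sample_* helpers are hypothetical stand-ins that only mimic the underlying distributions, whereas Optuna's TPE sampler additionally steers later suggestions toward promising regions.

```python
import math
import random

def sample_loguniform(rng, low, high):
    """Draw a float whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_uniform(rng, low, high):
    """Draw a float uniformly between low and high."""
    return rng.uniform(low, high)

rng = random.Random(0)  # fixed seed so the run is repeatable
n_steps_draws = [sample_loguniform(rng, 16, 2048) for _ in range(10_000)]
cliprange_draws = [sample_uniform(rng, 0.1, 0.4) for _ in range(10_000)]

# Share of draws in the lowest doubling (16-32) versus the highest (1024-2048).
low_band = sum(16 <= x < 32 for x in n_steps_draws) / len(n_steps_draws)
high_band = sum(1024 <= x <= 2048 for x in n_steps_draws) / len(n_steps_draws)
print(round(low_band, 2), round(high_band, 2))
```

There are seven doublings between 16 and 2048, so each band should receive roughly one seventh of the draws even though the top band is 64 times wider than the bottom one. A plain uniform space would spend about half of its samples above 1024.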

Later, after running our optimization function overnight with a decent CPU/GPU combination, we can load up the study from the sqlite database we told Optuna to create.

The study keeps track of the best trial from its tests, which we can use to grab the best set of hyper-parameters for our environment.

We’ve revamped our model, improved our feature set, and optimized all of our hyper-parameters.

Now it’s time to see how our agents do with their new reward mechanisms.

I have trained an agent to optimize each of our four return metrics: simple profit, the Sortino ratio, the Calmar ratio, and the Omega ratio.

Let’s run each of these optimized agents on a test environment, which is initialized with price data they’ve not been trained on, and see how profitable they are.

Benchmarking

Before we look at the results, we need to know what a successful trading strategy looks like.

For this reason, we are going to benchmark against a couple common, yet effective strategies for trading Bitcoin profitably.

Believe it or not, one of the most effective strategies for trading BTC over the last ten years has been to simply buy and hold.

The other two strategies we will be testing use very simple, yet effective technical analysis to create buy and sell signals.

Buy and hold

The idea is to buy as much as possible and Hold On for Dear Life (HODL).

While this strategy is not particularly complex, it has seen very high success rates in the past.


RSI divergence

When the closing price continues to rise over consecutive periods while the RSI continues to drop, a negative trend reversal (sell) is signaled.

A positive trend reversal (buy) is signaled when the closing price drops over consecutive periods while the RSI rises.
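The divergence rule can be sketched as a simple check over the most recent readings. The function names, the three-step lookback, and the sample RSI values are assumptions for illustration; the RSI itself is taken as an input here rather than computed.

```python
def is_strictly_rising(values):
    """True when every value is higher than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))

def is_strictly_falling(values):
    """True when every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))

def rsi_divergence_signal(closes, rsi_values, lookback=3):
    """Return 'sell', 'buy', or None based on the last `lookback` readings."""
    recent_closes = closes[-lookback:]
    recent_rsi = rsi_values[-lookback:]
    if is_strictly_rising(recent_closes) and is_strictly_falling(recent_rsi):
        return "sell"   # price keeps climbing while momentum fades
    if is_strictly_falling(recent_closes) and is_strictly_rising(recent_rsi):
        return "buy"    # price keeps dropping while momentum recovers
    return None

# Hypothetical readings: price rises three steps in a row while RSI falls.
print(rsi_divergence_signal([100.0, 102.0, 105.0], [72.0, 68.0, 61.0]))
```

A longer lookback makes the signal rarer but less sensitive to single-candle noise.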


Simple Moving Average (SMA) Crossover

When the longer-term SMA crosses above the shorter-term SMA, a negative trend reversal (sell) is signaled.

A positive trend reversal (buy) is signaled when the shorter-term SMA crosses above the longer-term SMA.
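Here is a minimal sketch of the crossover rule using NumPy. The function names, the tiny window sizes (2 and 4), and the price series are assumptions chosen so the example is easy to follow by hand; a real benchmark would use much longer windows.

```python
import numpy as np

def sma(values, window):
    """Simple moving average; the first window-1 entries are NaN."""
    values = np.asarray(values, dtype=float)
    out = np.full(len(values), np.nan)
    cumsum = np.cumsum(np.insert(values, 0, 0.0))
    out[window - 1:] = (cumsum[window:] - cumsum[:-window]) / window
    return out

def sma_crossover_signals(closes, short_window=2, long_window=4):
    """Return (index, 'buy' | 'sell') wherever the short SMA crosses the long SMA."""
    short, long_ = sma(closes, short_window), sma(closes, long_window)
    signals = []
    # Start once both averages are defined for the current and previous step.
    for i in range(long_window, len(closes)):
        was_above = short[i - 1] > long_[i - 1]
        is_above = short[i] > long_[i]
        if is_above and not was_above:
            signals.append((i, "buy"))    # shorter-term SMA crossed above longer-term
        elif was_above and not is_above:
            signals.append((i, "sell"))   # longer-term SMA crossed back above shorter-term
    return signals

# Hypothetical closes: a dip, a rally, then a decline.
closes = [10, 9, 8, 7, 8, 10, 12, 11, 9, 7]
signals = sma_crossover_signals(closes)
print(signals)
```

On this series the rule buys at index 5, once the rally lifts the short average over the long one, and sells at index 8 as the decline pulls it back under.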

The purpose of testing against these simple benchmarks is to prove that our RL agents are actually creating alpha over the market.

If we can’t beat these simple benchmarks, then we are wasting countless hours of development time and GPU cycles, just to make a cool science project.

Let’s prove that this is not the case.

The Results

Let’s quickly move through the losers so we can get to the good stuff.

First, we’ve got the Omega strategy, which ends up being fairly useless trading against our data set.

Average net worth of Omega-based agents over 3500 hours of trading

Watching this agent trade, it was clear this reward mechanism produces strategies that over-trade and are not able to capitalize on market opportunities.

The Calmar-based strategies came in with a small improvement over the Omega-based strategies, but ultimately were very similar.

It’s starting to look like we’ve put in a ton of time and effort, just to make things worse…

Average net worth of Calmar-based agents over 3500 hours of trading

Then came the strategies based on our old friend, simple incremental profit.

While this reward mechanism didn’t prove to be too successful in our last article, all the modifications and optimizations we’ve done seem to have massively improved the success of the agents.

The average profit is just over 350% of the initial account balance, over a four-month test period.

If you are unaware of average market returns, these kinds of results would be absolutely insane.

Surely this is the best we can do with reinforcement learning… right?

Average net worth of Profit-based agents over 3500 hours of trading

Wrong.

Sortino, the OG, has stolen the show.

The average profit produced by agents based on the Sortino ratio was nearly 850%.

When I saw the success of these strategies, I had to quickly check to make sure there were no bugs.

After a thorough inspection, it is clear that the code is bug free and these agents really are very good at trading Bitcoin.

Average net worth of Sortino-based agents over 3500 hours of trading

Instead of over-trading and under-capitalizing, these agents seem to understand the importance of buying low and selling high, while minimizing risk of holding BTC.

If you don’t believe me, see for yourself.

One of our Sortino-based agents trading BTC/USD

Notice how the agent buys (green triangle) a bunch right before the massive price jump and then sells (red triangle) as soon as the price rises? The agent seems to have learned that it should take profits early, and not get caught in trades too long.

Regardless of what specific strategy the agents have learned, our trading bots have clearly learned to trade Bitcoin profitably.

Now, I am no fool.

I understand that the success in these tests may not generalize to live trading.

That being said, these results are far more impressive than any algorithmic trading strategies I’ve seen to date.

It is truly amazing considering these agents were given no prior knowledge of how to trade profitably, and instead learned to be massively successful through trial and error alone.

Lots and lots of trial and error.

Conclusion

In this article, we’ve optimized our Bitcoin trading RL agents to make even better decisions, and therefore, make a ton more money! It took quite a bit of work, but we’ve accomplished it by doing the following:

- Upgraded our existing model to use a recurrent, LSTM policy network with stationary data
- Engineered 40+ new features for our agent to learn from using domain-specific technical and statistical analysis
- Improved the agent’s reward system to account for risk, instead of simply profit
- Fine tuned the model’s hyper-parameters using Bayesian optimization
- Benchmarked against common trading strategies to ensure we are always beating the market

A highly profitable trading bot is great, in theory.

However, I’ve received quite a bit of feedback claiming these agents are simply learning to fit a curve, and therefore, would never be profitable trading on live data.

While our method of training/testing on separate data sets should address this issue, it is true that our model could be overfitting to this data set and might not generalize to new data very well.

That being said, I’ve got a feeling these agents are learning quite a bit more than simple curve fitting, and as a result, will be able to profit in live trading situations.

To experiment on this hypothesis, the next article will be focused on bringing these RL agents live into the wild.

We are first going to update our environment to support multiple other cryptocurrency pairs such as ETH/USD and LTC/USD, and then we’ll set our agents loose to trade these assets live on Coinbase Pro.

It’s going to be exciting and insightful, whether or not we make money, so you’re not going to want to miss it!

As an aside, there is still much that could be done to improve the performance of these agents; however, I only have so much time and I have already been working on this article for far too long to delay posting any longer.

If you’re interested, take what I’ve built and improve on it! If you can beat my results, send me what you’ve got and let’s talk.

Thanks for reading! As always, all of the code for this tutorial can be found on my GitHub.

Leave a comment below if you have any questions or feedback; I’d love to hear from you! I can also be reached on Twitter at @notadamking.







