Hyperparameter optimization in Python.
Part 1: Scikit-Optimize.
Jakub Czakon · Apr 24
In this blog series, I am comparing Python HPO libraries.
Before reading this post, I would highly advise that you read Part 0: Introduction, where I:
- talked about what HPO is,
- selected libraries to compare,
- selected the evaluation criteria,
- defined an example problem for the HPO.
Code for this blog post and other parts of the series is available on github while all the experiments with scripts, hyperparameters, charts, and results (that you can download) are available for you on Neptune.
Without further ado, let’s dive in, shall we?

Scikit-Optimize
A BSD-licensed project with almost 1300 stars, 255 forks and 40 contributors (4 main ones).
Even though the master branch hasn’t been updated for the past 5 months, there is a lot of activity, with recently opened PRs and issues.
I certainly hope that this project will be maintained.
Ease of setup and API
The API is just awesome.
It is so simple, that you can almost guess it without reading the docs.
Seriously, let me show you.
You define the search space, you define the objective function that you want to minimize (decorating it to keep the parameter names), and you run the optimization. That’s it.
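Roughly, the whole workflow looks like the minimal sketch below; the model, data, and search space here are illustrative placeholders, not the exact setup from this series:

```python
from skopt import forest_minimize
from skopt.space import Integer, Real
from skopt.utils import use_named_args
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 1. Define the search space.
SPACE = [
    Integer(1, 10, name='max_depth'),
    Real(0.01, 0.5, prior='log-uniform', name='learning_rate'),
]

# 2. Define the objective to minimize (decorated to keep parameter names).
@use_named_args(SPACE)
def objective(**params):
    model = GradientBoostingClassifier(random_state=42, **params)
    # skopt minimizes, so return the negative cross-validated AUC
    return -cross_val_score(model, X, y, scoring='roc_auc', cv=3).mean()

# 3. Run the optimization.
results = forest_minimize(objective, SPACE, n_calls=30, random_state=42)
```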
All the information you need, like the best parameters or the scores for each iteration, is kept in the results object.
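For instance (a sketch, continuing from the snippet above):

```python
print(results.x)          # best hyperparameter values found
print(results.fun)        # best (lowest) objective value
print(results.x_iters)    # every evaluated set of hyperparameters
print(results.func_vals)  # the corresponding objective values
```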
Go here for an example of a full script with some additional bells and whistles.
I give it a score of 10/10 for the super easy setup and intuitive API.
Score 10/10

Options, methods, and (hyper)hyperparameters
This part gets a bit technical and long at times.
Feel free to just skim through or skip it.
Search Space
When it comes to the hyperparameter search space, you can choose from three options:
- space.Real - float parameters are sampled uniformly or log-uniformly from the (a, b) range,
- space.Integer - integer parameters are sampled uniformly from the (a, b) range,
- space.Categorical - for categorical (text) parameters; a value will be sampled from a list of options. For example, you could pass ['gbdt', 'dart', 'goss'] if you are training LightGBM. A sketch of such a space is shown below.
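A minimal sketch of a mixed search space (the parameter names and ranges are illustrative):

```python
from skopt.space import Categorical, Integer, Real

SPACE = [
    Real(0.01, 0.5, prior='log-uniform', name='learning_rate'),
    Integer(2, 100, name='num_leaves'),
    Categorical(['gbdt', 'dart', 'goss'], name='boosting_type'),
]
```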
I couldn’t find any option to have nested search spaces that account for the situations where some combinations of hyperparameters are simply invalid.
It really comes in handy sometimes.
Optimization methods
There are four optimization algorithms to try.
dummy_minimize
You can just run a simple random search over the parameters.
Nothing fancy here but it is useful to have this option with the same API.
forest_minimize and gbrt_minimize
The idea behind this approach is to estimate the user-defined objective function with a random forest, extra trees, or gradient boosted trees regressor.
After each run of hyperparameters on the objective function, the algorithm needs to make an educated guess which set of hyperparameters is the most likely to improve the score and should be tried in the next run.
It is done by getting regressor predictions on many points (hyperparameter sets) and choosing the point that is the best guess based on the so-called acquisition function.
There are quite a few acquisition function options to choose from:
EI and PI: negative expected improvement and negative probability of improvement.
If you choose one of those you should tweak the xi parameter as well.
Basically, when your algorithm is looking for the next set of hyperparameters, you can decide how small an expected improvement you are willing to try on the actual objective function.
The higher the value, the bigger the improvement (or probability of improvement) your regressor expects.
LCB: Lower confidence bound.
In this case, you want to choose your next point carefully, limiting the downside risk.
You can decide how much risk you want to take at each run.
By making the kappa parameter small you lean toward exploitation of what you know, by making it larger you lean toward exploration of the search space.
There are also the EIps and PIps options, which take into account both the score produced by the objective function and the execution time, but I haven’t tried them.

gp_minimize
Instead of using tree regressors, the objective function is approximated by a Gaussian process.
From a user perspective, the added value of this method is that instead of deciding beforehand on one of the acquisition functions, you can let the algorithm select the best one of EI, PI, and LCB at every iteration.
Just set the acquisition function (acq_func) to gp_hedge and try it out.
One more thing to consider is the optimization method used at each iteration: sampling or lbfgs.
For both of them, the acquisition function is calculated over a randomly selected set of points (n_points) in the search space.
If you go with sampling, then the point with the lowest value is selected.
If you choose lbfgs, the algorithm will take some number (n_restarts_optimizer) of the best, randomly tried points, and will run the lbfgs optimization starting at each of them.
So basically the lbfgs method is just an improvement over the sampling method if you don’t care about the execution time.
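To make these knobs concrete, here is a minimal sketch of how they are passed; the objective and SPACE are assumed to be the ones defined earlier, and the particular values are illustrative:

```python
from skopt import forest_minimize, gp_minimize

# Tree-based surrogate with the LCB acquisition function; a small kappa
# leans toward exploitation, a large one toward exploration.
results_forest = forest_minimize(objective, SPACE, n_calls=30,
                                 acq_func='LCB', kappa=1.0)

# Gaussian-process surrogate that hedges between EI, PI, and LCB at every
# iteration, with the acquisition function optimized by lbfgs restarts
# over randomly sampled candidate points.
results_gp = gp_minimize(objective, SPACE, n_calls=30,
                         acq_func='gp_hedge', acq_optimizer='lbfgs',
                         n_points=10000, n_restarts_optimizer=5)
```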
Callbacks
I really like it when there is an easy option to pass callbacks.
For example, I can monitor my training by simply adding 3 lines of code, roughly as in the sketch below. Other things that you could use this option for are early stopping or saving results at every iteration.
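A minimal sketch of a monitoring callback; the print-based monitor here is just a stand-in for whatever experiment tracker you use:

```python
def monitor(res):
    # res is the current OptimizeResult; func_vals holds all scores so far
    print(f'run {len(res.func_vals)}: best score so far = {res.fun:.4f}')

results = forest_minimize(objective, SPACE, n_calls=30, callback=[monitor])
```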
Overall, there are a lot of options for tuning (hyper)hyperparameters and you can control the training with callbacks.
On the flip side, you can only search through a flat space and you need to deal with those forbidden combinations of parameters on your own.
That is why I give it 7/10.
Score 7/10

Documentation
Piece of art.
It’s extensive with a lot of examples, docstrings for all the functions and methods.
It took me just a few minutes to get into the groove of things and get things off the ground.
Go to the documentation webpage to see for yourself.
It could be a bit better, with more explanations in the docstrings, but the overall experience is just great.
I give it 9/10.
Score 9/10

Visualization
This is one of my favorite features of this library.
There are three plotting utilities in the skopt.plots module that I really love:
plot_convergence - it visualizes the progress of your optimization by showing the best-to-date result at each iteration. What is cool about it is that you can compare the progress of many strategies by simply passing a list of results objects or a list of (name, results) tuples.
plot_evaluations - this plot lets you see the evolution of the search.
For each hyperparameter, we see the histogram of explored values.
For each pair of hyperparameters, the scatter plot of sampled values is plotted with the evolution represented by color, from blue to yellow.
For example, when we look at the random search strategy we can see there is no evolution.
It is just randomly searched. But for the forest_minimize strategy we can clearly see that it converges to certain parts of the space, which it explores more heavily.
plot_objective - it lets you gain intuition into the score sensitivity with respect to hyperparameters.
You can decide which parts of the space may require more fine-grained search and which hyperparameters barely affect the score and can potentially be dropped from the search.
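A minimal sketch of how these three utilities are called, assuming results and results_random are OptimizeResult objects from two earlier runs (the names are placeholders):

```python
import matplotlib.pyplot as plt
from skopt.plots import plot_convergence, plot_evaluations, plot_objective

plot_convergence(('random search', results_random),
                 ('forest_minimize', results))
plot_evaluations(results)
plot_objective(results)
plt.show()
```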
Those are incredibly good.
There is nothing like this out there so even 10/10 feels a bit unfair.
Score 10/10

Note
I liked it so much that I’ve created a set of functions that help with conversion between different HPO libraries, so that you can use those visualizations for every lib.
I’ve put them in the neptune-contrib package and you can check how to use them here.
Persisting/Restarting
There are skopt.dump and skopt.load functions that deal with saving and loading the results object. You can restart training from the saved results via the x0 and y0 arguments, roughly as sketched below.
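A minimal sketch of persisting and warm-restarting, assuming objective, SPACE, and results are defined as in the earlier snippets:

```python
from skopt import dump, load

dump(results, 'results.pkl')        # save the results object
old_results = load('results.pkl')   # ...and load it back later

# Warm-start a new run with the previously evaluated points
results = forest_minimize(objective, SPACE, n_calls=10,
                          x0=old_results.x_iters,
                          y0=old_results.func_vals)
```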
Simple and works with no problems: 10/10.
Score 10/10

Speed and Parallelization
Every optimization function comes with the n_jobs parameter, which is passed to the base_estimator.
That means, even though the optimization runs go sequentially you can speed up each run by utilizing more resources.
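A minimal sketch (the number of calls is illustrative): the runs themselves stay sequential, but each surrogate-model fit can use all available cores.

```python
results = forest_minimize(objective, SPACE, n_calls=30, n_jobs=-1)
```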
I haven’t run a proper timing benchmark for all the optimization methods and n_jobs.
However, since I kept track of the total execution time for all experiments, I decided to present average times for everything I ran. Obviously, the random search method was the fastest, as it doesn’t need any calculations between the runs.
It was followed by the gradient boosted trees regressor and random forest methods.
Optimization via Gaussian process was the slowest by a large margin but I only tested the gp_hedge acquisition function, so that might have been the reason.
Because there is no option to distribute it on the run level, over a cluster of workers, I have to take a few points away.
6/10 feels fair to me.
Score 6/10

Experimental results
All the experiments are publicly available here.
Every experiment has a script attached to it.
For example, you can see the code for the best experiment here.
You can also download the experiment metadata to a pandas DataFrame with a short snippet. Let's take a look at the five best experiments: the forest_minimize method was the clear winner, but in order to get good results, it was crucial to tweak the (hyper)hyperparameters a bit.
For the LCB acquisition function, a lower value of kappa (exploitation) was better.
Let’s take a look at the evaluations plot for this experiment: it really exploited the low num_leaves subspace but was very exploratory for max_depth and feature_fraction.
It’s important to mention that those plots differed a lot from experiment to experiment.
It makes you wonder how easy it is to get stuck in a local minimum.
However, the best result was achieved with the EI acquisition function.
Again, tweaking the xi parameter was needed.
Looking at the objective plot of this experiment, I get the feeling that dropping some insensitive dimensions (subsample, max_depth) and running a more fine-grained search on the other hyperparameters could have produced a slightly better result.
Surprisingly, the results for gp_minimize were significantly worse when I used the lbfgs optimization of the acquisition function.
They couldn’t beat random search.
Changing the optimization to sampling got a better AUC but was still worse than forest_minimize and gbrt_minimize.
See for yourself here.
Overall, the highest score I could squeeze out was 0.8566, which was better than random search's 0.8464 by ~0.01. I will translate that to 10 points (0.01 * 100).
Score 10

Conclusions
All in all, I really like Scikit-Optimize.
It is a pleasure to use, gives you great and useful visualizations and a lot of options with strong documentation to guide you through it.
On the flip side, it is difficult, if not impossible, to parallelize it run-wise and distribute over a cluster of machines.
I think going forward, this is going to be more and more important and can make this library not suitable for many real-life applications.
Let’s take a look at the results for all criteria:
- Ease of setup and API: 10
- Options, methods, and (hyper)hyperparameters: 7
- Documentation: 9
- Visualization: 10
- Persisting/Restarting: 10
- Speed and Parallelization: 6
- Experimental results: 10
The total score of 62 seems pretty high, but there are still a few libraries to evaluate.
What’s next?
If you are interested in seeing the results of the next contender, stay tuned for Part 2: Hyperopt.