Beating State of the Art by Tuning Baselines

(Assuming they’ve adhered to the rules of the specific competition, of course.)

For academic research in machine learning, however, things work a little differently.

Generally, researchers build a new model for a specific, established task using the data for that task.

Some examples:

- Recommender systems using the Netflix Prize dataset
- Object segmentation using the Microsoft COCO dataset
- Speech recognition using the TIMIT corpus
- Natural language understanding using the GLUE Benchmark

They then compare their algorithm’s performance to a baseline that’s been run on the same dataset.

If it’s a task that’s been around a while, it’s common to compare your system to the results reported in another paper, but for newer tasks researchers generally run the baselines themselves.

A benchmark is a task that includes a dataset and a way to evaluate performance.

A baseline is a well-known algorithm that can be applied to the benchmark problem.

There’s no “winning” model in research, but papers that propose a new approach and then show that it’s an improvement over previously established methods tend to be the ones that get published.

(For researchers, hiring and promotion are based on the number and quality of papers published, so having your papers published is a good thing!)

Probably the biggest difference between these two communities is that there’s pressure on researchers to focus on new types of models.

If an older model is really effective in a Kaggle competition, there’s no reason for competitors not to use it… but it’s hard to get papers published if the main finding is “the thing we’ve been doing for ten years is still really good, actually”.

Why were the baselines not great to begin with?

“[R]unning experiments is hard and needs a large effort of experimentation to achieve reliable results.” (Rendle et al, p. 9)

To reiterate: running machine learning experiments is hard.

Rendle & co-authors point out two places where tuning can go wrong:

1) You can pretty easily miss the sweet spots when you’re tuning hyperparameters.

The boundaries of your search space might be too close together, or shifted too far in one direction.

You might not be using the right search grid.

The results you got on a smaller model might not actually scale up.

In short: tuning models well is hard, and easy to mess up.
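As a rough illustration of the search-space problem, here’s a minimal sketch using scikit-learn (my own example, not an experiment from the paper; the dataset, model, and grid values are made up). A narrow grid shifted toward one end of the range can report a “best” hyperparameter that’s really just the edge of whatever you happened to search, while a wider, log-spaced grid is more likely to bracket the sweet spot:

```python
# A minimal, hypothetical sketch of how grid boundaries can hide the sweet spot.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data purely for illustration.
X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# A narrow grid shifted toward large penalties can miss the best setting...
narrow_grid = {"alpha": [10.0, 100.0, 1000.0]}
# ...while a wider, log-spaced grid is more likely to bracket it.
wide_grid = {"alpha": np.logspace(-3, 3, 13)}

for name, grid in [("narrow", narrow_grid), ("wide", wide_grid)]:
    search = GridSearchCV(Ridge(), grid, cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print(name, "best alpha:", search.best_params_["alpha"],
          "CV score:", search.best_score_)
```

A useful rule of thumb: if the “best” value lands on the boundary of your grid, the grid itself was probably the problem.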

2) You might be missing small but important steps in your preprocessing or during learning.

Did you shuffle your training data? Do a z-transform? Are you using early stopping? All of these choices will affect your model’s results.
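To see how small these choices look in practice, here’s a generic scikit-learn sketch (my own illustration, not the paper’s setup) that makes all three decisions explicit; forgetting any one of them changes the baseline number you end up reporting:

```python
# A hypothetical sketch of three easy-to-miss steps: shuffling, a z-transform,
# and early stopping. Dataset and model are stand-ins, not the paper's setup.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Shuffle when you split -- easy to forget if the data arrives sorted by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

model = make_pipeline(
    StandardScaler(),                       # z-transform the features
    SGDClassifier(early_stopping=True,      # hold out 10% of the training data
                  validation_fraction=0.1,  # and stop when the validation
                  n_iter_no_change=5,       # score stops improving
                  random_state=0),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```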

Especially if you’re trying to reproduce someone else’s results, these steps can be hard to get right.

(Particularly if they’re not reported and the original authors didn’t release their code!)

Even if you do everything you’re supposed to and have a pretty solid baseline, it might be that the baseline algorithm could perform even better with additional tuning, particularly by someone very familiar with that method who knows all the tips & tricks.

It takes a lot of trial and error to learn what works best for a specific model, which is part of what makes tuning baselines so hard.

Even if the authors of a paper did genuinely give their best effort to tuning the baselines, someone with more experience or a slightly different approach might have gotten even better results.

(Sort of like a Kaggle competition: someone who’s very familiar with tuning XGBoost models is probably going to get a better result with them than someone who’s only used them a couple of times.)

And, while Kaggle-style competitions give you an incentive to invest a lot of time in tuning models, including your baselines (it doesn’t matter what tool you use if it performs well!), research venues don’t generally reward spending a lot of time tuning baselines.

In fact, if that tuning shows that the baseline outperforms the method you’ve come up with, it may actually make it harder to get a paper published describing your new model.

This is similar to the “file drawer problem”, where researchers aren’t as likely to report negative results.

So what can we do about this?

Rendle et al propose two things that can help here: standardizing benchmarks and incentivizing the machine learning community to invest time in tuning baselines on those benchmarks.

Don’t reinvent the wheel

“Reinventing the wheel” means re-doing work that’s already been done.

If there’s a well-established benchmark for a specific task with baselines that have already been well tuned, then you should compare against that benchmark! (As a bonus, if you’re working on an established task and can use reported results rather than re-running them, you save on both time and compute.)

Incentivize running (and tuning!) baselines

This one’s a little trickier: it’s extremely hard to change the incentive structure or value system of an established community.

The authors point out that machine learning competitions, whether on a platform like Kaggle or organized within a research community, are one way of doing that.

I would also add that, if you review machine learning papers, you should consider both novelty (is this method new?) and utility (does this paper benefit the community as a whole?) when writing your reviews.

TL;DR

Just because a modelling technique was proposed more recently doesn’t mean it’s necessarily going to outperform an older method (even if the results in the paper suggest that it can).

Tuning models takes time and expertise, but with careful tuning established baselines can perform extremely well and even beat state of the art.

Relying on standardized benchmarks with well-tuned baselines can help reduce replicated work and also lead to more reliable findings.

And, if you’re just looking for the best method for a specific project, you might be better served starting with better understood models.

If you’re interested in reading the whole paper, particularly the details about the baseline tuning for the different recommender systems, you can check it out on arXiv.
