Let’s assume that pre-training takes 400 days on one 1080Ti and work from there:

- Starting from pre-trained vectors / n-grams: maybe x10 faster?
- Not using a large softmax layer (even if it is linked to your embedding layer), but using a cosine loss or something inspired by this. By the way, these guys also start from FastText: x2 faster?
- A lighter model: x4 faster?
- Using an embedding bag layer that works well for the Russian language.

All in all, with all of these “optimizations” it seems feasible to pre-train / tune a transformer in a week or so. And it is real; the only problem is that the actual pre-trained model did not really seem to beat a model just initialized with FastText.
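To make the back-of-the-envelope estimate above concrete, here is a minimal sketch of the arithmetic. It assumes the guessed factors are independent and simply multiply, and that the x2 guess belongs to dropping the large softmax; both are my reading of the list, not something that was measured. The embedding bag has no factor attached in the text, so it is left out.

```python
# Back-of-the-envelope estimate built from the rough guesses in the text.
BASELINE_DAYS = 400  # assumed pre-training time on one 1080Ti

# Guessed speed-ups, not measurements. Attaching x2 to the softmax item
# is an assumption about how the original list was laid out.
speedups = {
    "pre-trained vectors / n-grams": 10,
    "no large softmax (cosine-style loss)": 2,
    "lighter model": 4,
}


def estimated_days(baseline_days, factors):
    """Divide the baseline by the product of the speed-up factors."""
    total = 1
    for factor in factors.values():
        total *= factor
    return baseline_days / total


days = estimated_days(BASELINE_DAYS, speedups)
print(f"combined speed-up: x{BASELINE_DAYS / days:.0f}, estimate: {days:.1f} days")
```

Under these assumptions the combined factor is x80, which lands at about 5 days and is consistent with the "a week or so" figure.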

Pre-training experiments

* We used a 2-GPU setup for each model, but in the end we found out that the newer version of the embedding bag was roughly 25% slower + due to large embedding bag size;
** Classification task from the BERT paper.

Other “failed” approaches we tested:

- All models trained from scratch converged much slower and plateaued quickly;
- All BPE-based models initialized with FastText converged much slower and plateaued quickly around 65% sequential task accuracy;
- FastText + embedding freeze: minus 5pp sequential task accuracy.

L2 embedding loss

Cosine embedding loss

Actually trying out the pre-trained model

This was by far the most disappointing part of this whole exercise.

As mentioned in the intro, no sort of transformer (from scratch, pre-trained, or from FastText) helped in our “easy” classification task on a complex domain (but FastText was the best).

On the challenging SberSQUAD task, we had the following results:

- A FastText-initialized model trained with a high LR of 1e-3 reached about 37%-40% EM. Probably more can be achieved with LR decay. Remarkably, the model diverged frequently and seemed to “jump” on each restart;
- When we tried the pre-trained model with a high LR of 1e-3, it trained much faster than FastText, but overfitted heavily;
- If we started with a lower LR, somewhere around 5e-4, the pre-trained model also trained much faster than FastText, but overfitted at around 30% EM.

I suppose that if we invested x10 resources into actually tuning the hyper-parameters, we would achieve a higher result.
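For reference, the LR decay mentioned above could look something like the sketch below. We did not actually run this; the exponential form, the function name `decayed_lr` and all of its defaults (halving every 10,000 steps, a floor of 1e-5) are hypothetical and only illustrate the idea of starting at 1e-3 and lowering the rate over time.

```python
def decayed_lr(step, base_lr=1e-3, decay_rate=0.5, decay_steps=10_000, min_lr=1e-5):
    """Exponential LR decay: multiply by `decay_rate` every `decay_steps` steps.

    All defaults are illustrative assumptions, not tuned values.
    """
    lr = base_lr * decay_rate ** (step / decay_steps)
    # Never go below the floor, so training does not stall completely.
    return max(lr, min_lr)


for step in (0, 10_000, 20_000, 100_000):
    print(step, decayed_lr(step))
```

With these made-up defaults the rate starts at the high 1e-3 and is halved every 10,000 steps until it hits the floor. Whether this would stabilize the FastText-initialized model or reduce the overfitting of the pre-trained one is exactly the kind of thing that would need the extra tuning budget.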

But you see, generative pre-training IS NOT A SILVER BULLET, especially for non-generative tasks.

On any SANE task, conventional RNNs / CNNs / TCNs blow transformers out of the water.

Top performance of FastText initialized transformer

Some comparisons

Low learning rate, pre-train vs.

