3 Ways to Optimize and Export BERT Model for Online Serving

Ng Wai Foong · Jul 7

Image taken from Google AI Blog

The topic for today is optimizing and exporting a BERT model in order to serve it online, making predictions from a list of strings.

Unlike my previous tutorial, I am going to keep this article short and simple.

This article contains the following sections:

- Problem statement
- Reducing the size of the fine-tuned BERT model
- Exporting the model as a pb file
- Prediction from a list of strings
- Conclusion

[Section 1] Problem statement

If you have been following my previous article on fine-tuning a BERT model for a multi-classification task, you will notice the following issues:

- The output model is 3 times larger than the original model (applicable to both BERT-Base and BERT-Large).

- The output model is a ckpt file. You might want a pb file instead.
- The code reads from an input file and writes the probabilities to an output file. You might want a single input text and a single probabilities output instead.

By reading this article, you will learn to solve the issues mentioned above.

Having said that, this article does not cover the following:

- How to use an exported pb file
- How to speed up the inference time
- How to serve the model online

[Section 2] Reducing the size of the fine-tuned BERT model

Based on the response provided by a member of the BERT team, the fine-tuned model is 3 times larger than the distributed checkpoint due to the inclusion of Adam momentum and variance variables for each weight variable.

Both variables are needed to be able to pause and resume training.

In other words, if you intend to serve your fine-tuned model without any further training, you can remove both variables, and the size will be more or less similar to the distributed model.

Create a Python file and type the following code in it (modify it accordingly). It is advisable to run it in a terminal/command prompt instead of a Jupyter notebook due to memory issues (if you encounter an error about being unable to load tensors, simply run it via cmd).
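The original embedded snippet is not reproduced here; below is a minimal sketch of the idea, assuming TF 1.x-style checkpoint APIs. The paths and checkpoint names are hypothetical, so adjust them to your own setup.

```python
def is_optimizer_slot(name):
    # BERT's Adam implementation stores an "adam_m" (momentum) and "adam_v"
    # (variance) variable alongside every weight, which is what triples the size.
    return name.endswith("adam_m") or name.endswith("adam_v")

def strip_adam_variables(ckpt_path, out_path):
    import tensorflow.compat.v1 as tf  # plain `tensorflow` on a TF 1.x install
    tf.disable_eager_execution()
    reader = tf.train.NewCheckpointReader(ckpt_path)
    with tf.Session() as sess:
        # Rebuild only the non-optimizer variables and save them to a new checkpoint
        kept = [
            tf.Variable(reader.get_tensor(name), name=name)
            for name in reader.get_variable_to_shape_map()
            if not is_optimizer_slot(name)
        ]
        sess.run(tf.global_variables_initializer())
        tf.train.Saver(kept).save(sess, out_path)

# Usage (hypothetical checkpoint names):
# strip_adam_variables("./bert_output/model.ckpt-1000",
#                      "./bert_output_slim/model.ckpt")
```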

You should get 3 ckpt files as follows (the exact names depend on what you set when calling saver.save):

- model.ckpt.data-00000-of-00001
- model.ckpt.index
- model.ckpt.meta

[Section 3] Exporting the model as a pb file

For those who prefer a pb graph, download the py file from the following link (credit goes to the original creator).

Put it in the directory of your choice and run the following command in a terminal:

CUDA_VISIBLE_DEVICES=0 python model_exporter.py --data_path=./bert_output/ --labels_num=3 --export_path=./model_export/pb_graph/

You will need to modify three parameters:

- data_path: Path to your fine-tuned model that contains the three ckpt files. I have all the files in the bert_output folder.
- labels_num: The number of labels that you have. I have 3 classes.
- export_path: Path of the output model. It will be created automatically. If you encounter an error related to the output path, kindly delete the output path from the directory.
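For reference, the core of such an exporter script can be sketched as follows. This is a sketch, not the linked script: it assumes an estimator built with the model_fn_builder from run_classifier.py, TF 1.x APIs, and a hypothetical feature set; the linked script may differ.

```python
def export_pb(estimator, export_path, max_seq_length=128):
    import tensorflow.compat.v1 as tf  # plain `tensorflow` on a TF 1.x install

    def serving_input_fn():
        # The exported graph accepts the feature tensors BERT's model_fn reads.
        # label_ids is included because run_classifier's model_fn looks it up.
        features = {
            "input_ids": tf.placeholder(tf.int32, [None, max_seq_length]),
            "input_mask": tf.placeholder(tf.int32, [None, max_seq_length]),
            "segment_ids": tf.placeholder(tf.int32, [None, max_seq_length]),
            "label_ids": tf.placeholder(tf.int32, [None]),
        }
        return tf.estimator.export.ServingInputReceiver(features, features)

    # Writes saved_model.pb plus a variables/ folder under export_path.
    estimator.export_savedmodel(export_path, serving_input_fn)
```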

Once you have completed it, you should have a pb_graph folder that contains the following files:

- saved_model.pb
- variables

In the following github link, the owner made a script that can be used to train a BERT model for multi-label classification.

The current model can only perform multi-class classification, that is, predicting a single label out of all the classes. If you intend to predict multiple labels out of all the classes, you can try this script. I have not tested it at the time of this writing, but feel free to explore it on your own.

[Section 4] Prediction from a list of strings

This section is about performing inference on a list of strings instead of reading them from a file.

Create another py file in the same directory as run_classifier.py and type in the following code. The function accepts a list of strings as its parameter.
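The original gist is not reproduced here; below is a minimal sketch of the function, assuming the helpers from the BERT repo's run_classifier.py (InputExample, convert_examples_to_features, input_fn_builder) and module-level variables (label_list, MAX_SEQ_LENGTH, tokenizer, estimator) that we will initialize later. The function name and dummy label value are assumptions.

```python
def getListPrediction(in_sentences):
    import run_classifier  # from the BERT repo, in the same directory

    # 1. Wrap each string as an InputExample (guid and label are dummy values)
    input_examples = [
        run_classifier.InputExample(guid="", text_a=s, text_b=None, label="0")
        for s in in_sentences
    ]
    # 2. Convert the examples into the features the model expects
    input_features = run_classifier.convert_examples_to_features(
        input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
    # 3. Build the input_fn closure to be passed to the TPUEstimator
    predict_input_fn = run_classifier.input_fn_builder(
        features=input_features, seq_length=MAX_SEQ_LENGTH,
        is_training=False, drop_remainder=False)
    # 4. Perform the prediction
    return estimator.predict(input_fn=predict_input_fn)
```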

Let’s have a look at each line of the function:

- Use a list comprehension to loop through each element in the list and convert it into the appropriate format needed for feature conversion.
- Convert the examples into the features required for prediction, based on the label list, max sequence length, and tokenizer. We will initialize the required variables later on.
- Create an input_fn closure to be passed to the TPUEstimator.
- Perform prediction based on the input features.

The conversion time is typically quite long, depending on the memory and GPU that you have. If you intend to optimize further, kindly check convert_examples_to_features and input_fn_builder; for your information, these two functions consume the most time and memory.

We will now initialize the required variables.
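The original embedded snippet is not reproduced here; below is a minimal sketch of the initialization, assuming the BERT repo's tokenization module. All paths and values are hypothetical placeholders, so change them to match your fine-tuned model.

```python
# Hypothetical paths and values; change them to match your own setup.
BERT_VOCAB = "./uncased_L-12_H-768_A-12/vocab.txt"
INIT_CHECKPOINT = "./bert_output/model.ckpt-1000"
MAX_SEQ_LENGTH = 128
label_list = ["0", "1", "2"]  # 3 classes in this example

def build_tokenizer():
    import tokenization  # from the BERT repo
    return tokenization.FullTokenizer(
        vocab_file=BERT_VOCAB, do_lower_case=True)

# The estimator itself is constructed the same way as in run_classifier.py's
# main(): model_fn_builder(...) wrapped in a tf.contrib.tpu.TPUEstimator(...),
# with init_checkpoint pointing at INIT_CHECKPOINT. Reuse that code as-is.
```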

Once you are done, the last part is to call the getListPrediction function by passing in a list as the parameter:

pred_sentences = ['I am a little burnt out after the event.']
result = list(getListPrediction(pred_sentences))

You are required to convert the result to a list in order to view it.

You should be able to obtain the following results (example for 3 classes):

[{'probabilities': array([2.4908668e-06, 2.4828103e-06, 9.9996364e-01], dtype=float32)}]

The result represents the probabilities for each label that you have, with the highest being the predicted label. You can simply take the output results and map them to the respective labels.
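Mapping the probabilities to a label can be done with a simple argmax. A sketch, where the label names are purely hypothetical examples:

```python
def map_to_label(probabilities, labels):
    # Pick the label whose probability is highest (argmax)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best]

# Example with the 3-class output shown above (label names are made up)
probs = [2.4908668e-06, 2.4828103e-06, 9.9996364e-01]
print(map_to_label(probs, ["negative", "neutral", "positive"]))  # prints "positive"
```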

[Section 5] Conclusion

As I have mentioned earlier, this is a rather short article meant to solve a few issues related to serving BERT online.

If you are interested in speeding up the inference time via GPU optimization, kindly check out the following link.

Bear in mind that one of the BERT team members has frozen the repository in the meantime to prevent any code failures. However, there might be a separate repository in the future that includes the code for GPU optimization.

Stay tuned for more upcoming articles on programming and artificial intelligence!