Introducing Mercury-ML: an open-source “messenger of the machine learning gods”

Karl Schriek · Mar 18

In ancient Roman mythology, the god Mercury was known as “the messenger of the gods”.

Wearing winged shoes and a winged hat, he zipped between Mount Olympus and the kingdoms of men and saw to it that the will of the gods was known.

He wasn’t the strongest, the wisest, the most revered, or the most feared of the gods, but he was fleet of foot and cunning and could be relied upon to steer events to their desired outcomes.

Without him Perseus could not have defeated Medusa; Odysseus would have fallen to Circe’s spells; and Hercules could not have dragged Cerberus from Hades, thereby completing the final of his 12 mythical labours…

With this post I would like to introduce a new initiative called Mercury-ML, an open-source “messenger of the machine learning gods”.

Machine learning workflows are deceptively simple yet rely on a complex arrangement of multiple tools and technologies

Recent developments in machine learning and data processing have led to a myriad of open-source libraries, each of which provides well-developed and transparent APIs and each of which has a role to play when constructing a robust machine learning workflow.

Frequently used machine learning libraries such as TensorFlow, PyTorch, Keras or SciKit-Learn will often form the backbone of such a workflow, but a myriad of functions from different libraries typically still need to be strung together to complete it.

As an example, consider an Image Recognition workflow where you need to fetch images from HDFS; fit a Keras model on those images; evaluate the model using SciKit-Learn; save the trained model to S3 on AWS; store metadata on the training run in MongoDB; and eventually serve the model using TensorFlow Serving.

How would you set up such a workflow? In addition, what if your requirements change in the middle of the project, so that you need to fetch data from somewhere else or use a different machine learning engine? How would you swap out one of these components and replace it with another? How much new code would you have to introduce (and test!) for this to work correctly?

A messenger of the gods for machine learning workflows

These are some of the very real problems that we at Alexander Thamm GmbH are frequently faced with when developing machine learning solutions for our clients.

In recent times it became quite clear to us that we needed a library that could break down machine learning projects into their typical components (such as read data, transform data, fit model, evaluate model, serve model, etc.) that are modular and generic enough to allow us to slot in different technologies depending on what we needed.

What started as an internal project to help us be better at our own work has grown into something that we believe is worth making available to the wider community as an open-source library.

We’ve therefore decided to make Mercury-ML — our internally developed “messenger of the machine learning gods” — available on GitHub under the MIT license.

The latest stable version will also always be available on PyPI and can be installed with `pip install mercury-ml`. (Note that although the library is already feature-rich, it is still far from feature-complete. We’ve tagged it for now as a “pre-release”, meaning that some wide-ranging changes may still happen.)

How it works

Mercury-ML aims to offer simplified access to functionality at different levels of abstraction.

As a brief example of how this works, let’s look at two small components that typically form part of a machine learning workflow: saving a fitted model and then storing the saved object in a remote location (from which it could later be served).

We’ll look at four levels of abstraction for doing this:

1. Without using Mercury-ML (i.e. directly using the underlying dependencies)
2. Using the providers API
3. Using the containers API
4. Using the tasks API (in conjunction with the containers API)

Each of these approaches is perfectly valid, though in certain circumstances one may make more sense than another.

Let’s have a look.

Parameterisation

For this example we’ll save a Keras model and store it to an S3 bucket on AWS. Let’s assume we have the following inputs:

```python
model = ...  # assume a fitted Keras model fetched here
filename = "my_model"
local_dir = "./local_models"
extension = ".h5"
remote_dir = "my-bucket/remote-model"
```

1. Example via directly accessing the underlying libraries (i.e. without using Mercury-ML)

Using the underlying libraries directly rather than the Mercury-ML APIs makes sense when you want maximum flexibility to configure how those libraries are used. Below is a typical script one might put together in order to do this.

```python
import os
import boto3

# save model
if not os.path.exists(local_dir):
    os.makedirs(local_dir)
filename = filename + extension
local_path = os.path.join(local_dir, filename)
model.save(local_path)

# copy to s3
session = boto3.Session()  # assuming connection parameters are implicit
s3 = session.resource("s3")
s3_bucket, s3_partial_key = remote_dir.split("/", 1)
s3_key = s3_partial_key + "/" + filename
s3.Object(s3_bucket, s3_key).put(Body=open(local_path, "rb"))
```

There is nothing terribly complicated here, but there are lots of small steps that need to happen correctly.

You need to manually check whether the directory to which you want to save your model locally exists (and be aware that Keras doesn’t do this for you).

You have to know that Keras saves its model objects as HDF5 files, which take a “.h5” extension. You have to know how to open an S3 connection and know that the function call that saves to an S3 location takes “bucket” and “key” inputs (which, taken together, could be regarded in simplified terms as a “path”).

To get this right you have to fiddle around with lots of calls that join and split string paths.

(And if you decide, for example, to store your model objects in Google Cloud Storage instead, you would need to do all of this again.)
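To illustrate the point, here is a rough sketch of what the equivalent upload to Google Cloud Storage might look like when written against the underlying library directly. This is not part of Mercury-ML; it assumes the `google-cloud-storage` client library and implicit credentials.

```python
from google.cloud import storage  # assumes the google-cloud-storage package is installed

# copy to Google Cloud Storage: a hypothetical equivalent of the S3 snippet above
client = storage.Client()  # assuming credentials are picked up from the environment
gcs_bucket_name, gcs_partial_key = remote_dir.split("/", 1)
bucket = client.bucket(gcs_bucket_name)
blob = bucket.blob(gcs_partial_key + "/" + filename)
blob.upload_from_filename(local_path)
```

Almost none of the S3-specific code carries over; only the path-splitting logic can be reused.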

2. Example via providers

The providers in Mercury-ML aim to abstract most of this away and take care of the nitty-gritty while exposing a simple (but also highly configurable) API.

```python
from mercury_ml.keras.providers import model_saving
from mercury_ml.common.providers.artifact_copying import from_disk
import os

# save model
path = model_saving.save_keras_hdf5(
    model=model,
    filename=filename,
    local_dir=local_dir,
    extension=extension
)

# copy to s3
from_disk.copy_from_disk_to_s3(
    source_dir=local_dir,
    target_dir=remote_dir,
    filename=os.path.basename(path)
)
```

Using the providers API (instead of the containers or tasks APIs) makes the most sense if you want to hardcode the providers you want to use.

For example, in the code snippet above you can only use `model_saving.save_keras_hdf5` and `from_disk.copy_from_disk_to_s3`. If you want to save the model in a different format, or copy it to a different store, you must change your code to do so. For example, to store to Google Cloud Storage you would replace `from_disk.copy_from_disk_to_s3` with `from_disk.copy_from_disk_to_gcs`.
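As a rough sketch, the change would be confined to the copy step (the argument names below are assumed to mirror those of `copy_from_disk_to_s3` and may not match the actual `copy_from_disk_to_gcs` signature exactly):

```python
from mercury_ml.common.providers.artifact_copying import from_disk

# copy to Google Cloud Storage instead of S3
# (argument names assumed to mirror copy_from_disk_to_s3)
from_disk.copy_from_disk_to_gcs(
    source_dir=local_dir,
    target_dir=remote_dir,
    filename=os.path.basename(path)
)
```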

3. Example via containers

Using the containers API makes the most sense when you want to steer your workflow via a configuration file.

The containers are just lightweight classes that allow you to access various similar providers from a single location. For example, the function used above, `model_saving.save_keras_hdf5`, can also be accessed via a container as `ModelSavers.save_hdf5`. Using the `getattr` function this can also be accessed as `getattr(ModelSavers, "save_hdf5")`, allowing us to easily parameterise it in a config.

```python
from mercury_ml.keras.containers import ModelSavers
from mercury_ml.common.containers import ArtifactCopiers
import os

config = {
    "save_model": "save_hdf5",
    "copy_model": "copy_from_disk_to_s3"
}

save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model
path = save_model(
    model=model,
    filename=filename,
    local_dir=local_dir,
    extension=extension
)

# copy to s3
copy_from_local_to_remote(
    source_dir=local_dir,
    target_dir=remote_dir,
    filename=os.path.basename(path)
)
```

Want to save the model as a TensorFlow graph instead? And store it in Google Cloud Storage rather than S3? Simply change the config:

```python
config = {
    "save_model": "save_tensorflow_graph",
    "copy_model": "copy_from_disk_to_gcs"
}
```

4. Example via tasks (in conjunction with containers)

Using the tasks API makes sense when you want a single function that defines a small workflow involving more than one provider and multiple steps. For example, the `store_model` task below is injected with a `save_model` and a `copy_from_local_to_remote` provider and proceeds to use them first to save a model locally and then to copy it to a remote location.

```python
from mercury_ml.keras.containers import ModelSavers
from mercury_ml.common.containers import ArtifactCopiers
from mercury_ml.common.tasks import store_model

save_model = getattr(ModelSavers, config["save_model"])
copy_from_local_to_remote = getattr(ArtifactCopiers, config["copy_model"])

# save model and copy to s3
store_model(
    save_model=save_model,
    copy_from_local_to_remote=copy_from_local_to_remote,
    model=model,
    filename=filename,
    local_dir=local_dir,
    remote_dir=remote_dir,
    extension=extension
)
```

The tasks API offers convenience more than anything else, but it can be quite useful for small blocks of workflow that frequently occur together.

Find out more

The above is just a small example. Fully-fledged example workflows can be found here.

A key part of how Mercury-ML is able to facilitate workflows is how it deals with data as it moves through the various phases of a machine learning pipeline.

Mercury-ML wraps data into generic `DataWrapper` objects (e.g. “features” and “targets”), arranges them into `DataSets` (e.g. “train”, “valid” and “test”) and eventually into `DataBunches`. You can find more information on how this is done here.
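As a purely illustrative sketch of this grouping (the class definitions below are hypothetical stand-ins, not Mercury-ML’s actual API), the hierarchy can be pictured roughly like this:

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

# Hypothetical illustration of the hierarchy described above;
# Mercury-ML's real DataWrapper / DataSet / DataBunch classes differ.

@dataclass
class DataWrapper:
    # wraps a single array, e.g. "features" or "targets"
    data: np.ndarray
    field_names: List[str]

@dataclass
class DataSet:
    # groups the wrappers belonging to one split, e.g. "train"
    wrappers: Dict[str, DataWrapper]

@dataclass
class DataBunch:
    # groups the splits, e.g. "train", "valid" and "test"
    data_sets: Dict[str, DataSet]

# usage sketch
train = DataSet({
    "features": DataWrapper(np.zeros((10, 3)), ["f1", "f2", "f3"]),
    "targets": DataWrapper(np.zeros((10, 1)), ["label"]),
})
bunch = DataBunch({"train": train})
```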

Become a contributor!

This also stands as an open invitation to any developers who are interested in using Mercury-ML. Tell us what features you need, let us know what isn’t working, or contribute your own changes or additions!
