Why Git and Git-LFS are not enough to solve the Machine Learning Reproducibility crisis

The determining factors include the following, and perhaps more:

- Training data — the image database or whatever data source is used in training the model
- The scripts used in training the model
- The libraries used by the training scripts
- The scripts used in processing data
- The libraries or other tools used in processing data
- The operating system and CPU/GPU hardware
- Production system code
- Libraries used by production system code

Obviously the result of training a model depends on a variety of conditions.

Since there are so many variables involved, it is hard to be precise, but the general problem is a lack of what's now called Configuration Management.

Software engineers have come to recognize the importance of being able to specify the precise system configuration used in deploying systems.

Solutions to machine learning reproducibility

Humans are an inventive lot, and there are many possible solutions to this “crisis”.

Environments like R Studio or Jupyter Notebook offer a kind of interactive Markdown document which can be configured to execute data science or machine learning workflows.

This is useful for documenting machine learning work, and specifying which scripts and libraries are used.

But these systems do not offer a solution to managing data sets.

Likewise, Makefiles and similar workflow scripting tools offer a method to repeatedly execute a series of commands.

Which commands are executed is determined by file-system time stamps.

These tools offer no solution for data management.

At the other end of the scale are companies like Domino Data Labs or C3 IoT offering hosted platforms for data science and machine learning.

Both package together an offering built upon a wide swath of data science tools.

In some cases, like C3 IoT, users are coding in a proprietary language and storing their data in a proprietary data store.

It can be enticing to use a one-stop-shopping service, but will it offer the needed flexibility?

In the rest of this article we’ll discuss DVC.

It was designed to closely match Git functionality, to leverage the familiarity most of us have with Git, but with features making it work well for both workflow and data management in the machine learning context.

DVC (https://dvc.org) takes on and solves a larger slice of the machine learning reproducibility problem than does Git-LFS or several other potential solutions.

It does this by managing the code (scripts and programs), alongside large data files, in a hybrid between DVC and a source code management (SCM) system like Git.

In addition DVC manages the workflow required for processing files used in machine learning experiments.

The data files and commands-to-execute are described in DVC files which we’ll learn about in the following sections.

Finally, with DVC it is easy to store data on many storage systems, from the local disk to an SSH server to cloud systems (S3, GCP, etc).

Data managed by DVC can be easily shared with others using this storage system.

Image courtesy dvc.org

DVC uses a command structure similar to Git's.

As we see here, just like git push and git pull are used for sharing code and configuration with collaborators, dvc push and dvc pull are used for sharing data.
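For example, a day-to-day session might pair the two tools like this. This is a minimal sketch; the commit message and edits are of course whatever fits your project:

$ git pull              # bring in the latest code, configuration, and DVC files
$ dvc pull              # fetch the data those DVC files reference
# ... edit code, re-run experiments ...
$ git add -A
$ git commit -m "Improve the prepare step"
$ git push              # share the code and DVC files
$ dvc push              # share the corresponding data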

All this is covered in more detail in the coming sections, or if you want to skip right to learning about DVC, see the tutorial at https://dvc.org/doc/tutorial.

DVC remembers precisely which files were used at which point in time

At the core of DVC is a data store (the DVC cache) optimized for storing and versioning large files.

The team chooses which files to store in the SCM (like Git) and which to store in DVC.

Files managed by DVC are stored such that DVC can maintain multiple versions of each file, and can use file-system links to quickly change which version of each file is being used.

Conceptually the SCM (like Git) and DVC both have repositories holding multiple versions of each file.

One can check out “version N” and the corresponding files will appear in the working directory, then later check out “version N+1” and the files will change around to match.
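In command terms that pairing looks roughly like this; the revision names are placeholders for whatever commits, tags, or branches exist in your repository:

$ git checkout <revision-N>      # restore the code and DVC files for version N
$ dvc checkout                   # make the workspace data files match those DVC files
$ git checkout <revision-N+1>    # later, move to version N+1
$ dvc checkout                   # the data files change around to match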

Image courtesy dvc.org

On the DVC side, this is handled in the DVC cache.

Files stored in the cache are indexed by a checksum (MD5 hash) of the content.

As the individual files managed by DVC change, their checksum will of course change, and corresponding cache entries are created.

The cache holds all instances of each file.

For efficiency, DVC uses several linking methods (depending on file system support) to insert files into the workspace without copying.

This way DVC can quickly update the working directory when requested.
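As an illustration, the cache lives under .dvc/cache in the workspace, with each entry filed by its checksum (the first two hex characters become a directory name). The link strategy can also be set explicitly; this is a sketch, and you should check the DVC documentation for the link types your file system actually supports:

$ dvc config cache.type reflink,hardlink,symlink,copy    # preferred link types, tried in order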

DVC uses what are called “DVC files” to describe both the data files and the workflow steps.

Each workspace will have multiple DVC files, each describing one or more data files with their corresponding checksums, and each describing a command to execute in the workflow.

cmd: python src/prepare.py data/data.xml
deps:
- md5: b4801c88a83f3bf5024c19a942993a48
  path: src/prepare.py
- md5: a304afb96060aad90176268345e10355
  path: data/data.xml
md5: c3a73109be6c186b9d72e714bcedaddb
outs:
- cache: true
  md5: 6836f797f3924fb46fcfd6b9f6aa6416.dir
  metric: false
  path: data/prepared
wdir: .

This example DVC file comes from the DVC Getting Started example (https://github.com/iterative/example-get-started) and shows the initial step of a workflow.

We’ll talk more about workflows in the next section.

For now, note that this command has two dependencies, src/prepare.py and data/data.xml, and an output data directory named data/prepared.

Everything has an MD5 hash, and as these files change their MD5 hashes will change and new instances of the changed data files are stored in the DVC cache.

DVC files are checked into the SCM-managed (Git) repository.

As commits are made to the SCM repository each DVC file is updated (if appropriate) with new checksums of each file.

Therefore with DVC one can recreate exactly the data set present for each commit, and the team can exactly recreate each development step of the project.

DVC files are roughly similar to the “pointer” files used in Git-LFS.

The DVC team recommends using different SCM tags or branches for each experiment.

Accessing the data files, code, and configuration appropriate to a given experiment is then as simple as switching branches.

The SCM will update the code and configuration files, and DVC will update the data files, automatically.

This means there is no more scratching your head trying to remember which data files were used for what experiment.

DVC tracks all that for you.
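A typical flow, with a hypothetical tag name, might look like this:

$ git tag -a bigram-experiment -m "Try bigram features"   # label the current experiment
# ... time passes, other experiments happen ...
$ git checkout bigram-experiment     # restore the code, configuration, and DVC files for that experiment
$ dvc checkout                       # restore the matching data files from the DVC cache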

DVC remembers the exact sequence of commands used at each point in time

The DVC files remember not only the files used in a particular execution stage, but also the command that is executed in that stage.

Reproducing a machine learning result requires not only the exact same data files, but also the same processing steps and the same code/configuration.

Consider a typical step in creating a model: preparing sample data for use in later steps.

You might have a Python script, prepare.py, to perform that preparation, and you might have input data in an XML file named data/data.xml.

$ dvc run -d data/data.xml -d code/prepare.py -o data/prepared python code/prepare.py

This is how we use DVC to record that processing step.

The DVC “run” command creates a DVC file based on the command-line options.

The -d option defines dependencies, and in this case we see an input file in XML format, and a Python script.

The -o option records output files; in this case there is an output data directory listed.

Finally, the executed command is a Python script.

Hence, we have input data, code and configuration, and output data, all dutifully recorded in the resulting DVC file, which corresponds to the DVC file shown in the previous section.

If prepare.py is changed from one commit to the next, the SCM will automatically track the change.

Likewise, any change to data.xml results in a new instance in the DVC cache, which DVC will automatically track.

The resulting data directory will also be tracked by DVC if it changes.
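If you want to see what has drifted before re-running anything, the dvc status command compares the checksums recorded in the DVC files against the current contents of the workspace:

$ dvc status     # reports dependencies and outputs that no longer match their recorded checksums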

A DVC file can also simply refer to a file, like so:

md5: 99775a801a1553aae41358eafc2759a9
outs:
- cache: true
  md5: ce68b98d82545628782c66192c96f2d2
  metric: false
  path: data/Posts.xml.zip
  persist: false
wdir: .

This results from the “dvc add <file>” command, which is used when you simply have a data file that is not the result of another command.

For example, https://dvc.org/doc/tutorial/define-ml-pipeline shows the following, which results in the immediately preceding DVC file:

$ wget -P data https://dvc.org/s3/so/100K/Posts.xml.zip
$ dvc add data/Posts.xml.zip

The file Posts.xml.zip is then the data source for a sequence of steps shown in the tutorial that derive information from this data.

Take a step back and recognize these are individual steps in a larger workflow, or what DVC calls a pipeline.

With “dvc add” and “dvc run” you can string together several stages, each created with a “dvc run” command and each described by a DVC file.
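For instance, a couple of further stages could be chained onto the prepare step like this. The script names and outputs here are illustrative placeholders, not part of the real example project:

$ dvc run -d data/prepared -d code/featurize.py -o data/features python code/featurize.py
$ dvc run -d data/features -d code/train.py -o model.pkl python code/train.py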

For a complete working example, see https://github.com/iterative/example-get-started and https://dvc.org/doc/tutorial.

This means that each working directory will have several DVC files, one for each stage in the pipeline used in that project.

DVC scans the DVC files to build up a Directed Acyclic Graph (DAG) of the commands required to reproduce the output(s) of the pipeline.

Each stage is like a mini-Makefile in that DVC executes the command only if the dependencies have changed.

It is also different in that DVC does not consider file-system timestamps, as Make does, but whether the file content has changed, as determined by comparing the checksum recorded in the DVC file with the current state of the file.
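Reproducing the pipeline is then a single command: dvc repro walks the DAG and re-runs only the stages whose dependencies have changed. The stage-file name here is a placeholder for whichever DVC file describes the final stage you care about:

$ dvc repro <final-stage>.dvc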

The bottom line is that there is no more scratching your head trying to remember which version of which script was used for each experiment.

DVC tracks all of that for you.

Image courtesy dvc.org

DVC makes it easy to share data and code between team members

A machine learning researcher is probably working with colleagues, and needs to share data, code, and configuration.

Or the researcher may need to deploy data to remote systems, for example to run software on a cloud computing system (AWS, GCP, etc), which often means uploading data to the corresponding cloud storage service (S3, GCP, etc).

The code and configuration side of a DVC workspace is stored in the SCM (like Git).

Using normal SCM commands (like “git clone”) one can easily share it with colleagues.

But how about sharing the data with colleagues?

DVC has the concept of remote storage.

A DVC workspace can push data to, or pull data from, remote storage.

The remote storage pool can exist on any of the cloud storage platforms (S3, GCP, etc) as well as an SSH server.

Therefore to share code, configuration and data with a colleague, you first define a remote storage pool.

The configuration file holding remote storage definitions is tracked by the SCM.

You next push the SCM repository to a shared server, which carries with it the DVC configuration file.

When your colleague clones the repository, they can immediately pull the data from the remote cache.
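A minimal sketch of that flow, with the remote name, bucket path, and repository URL as placeholders:

# on your machine
$ dvc remote add -d myremote s3://my-bucket/dvc-storage    # record the remote in .dvc/config
$ git add .dvc/config
$ git commit -m "Configure DVC remote storage"
$ dvc push     # upload the cached data files to the remote
$ git push     # share the code, configuration, and DVC files

# on a colleague's machine
$ git clone https://example.com/our-project.git
$ cd our-project
$ dvc pull     # download the data referenced by the DVC files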

This means your colleagues no longer have to scratch their heads wondering how to run your code.

They can easily replicate the exact steps, and the exact data, used to produce the results.

Image courtesy dvc.org

Conclusion

The key to repeatable results is using good practices: keeping proper versions of not only the data but also the code and configuration files, and automating the processing steps.

Successful projects sometimes require collaboration with colleagues, which is made easier through cloud storage systems.

Some jobs require running AI software on cloud computing platforms, which means storing data files on cloud storage platforms.

With DVC a machine learning research team can ensure their data, configuration and code are in sync with each other.

It is an easy-to-use system which efficiently manages shared data repositories alongside an SCM system (like Git) to store the configuration and code.

Resources

Back in 2014 Jason Brownlee wrote a checklist he claimed would encourage reproducible machine learning results by default: https://machinelearningmastery.com/reproducible-machine-learning-results-by-default/

A Practical Taxonomy of Reproducibility for Machine Learning Research, a research paper by staff of Kaggle and the University of Washington: http://www.rctatman.com/files/2018-7-14-MLReproducability.pdf

A researcher at McGill University, Joelle Pineau, has another checklist for machine learning reproducibility: https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf

She made a presentation at the NeurIPS 2018 conference: https://videoken.com/embed/jH0AgVcwIBc (start at about 6 minutes)

The 12 Factor Application is a take on the reproducibility and reliability of web services: https://12factor.net/

A survey of scientists by the journal Nature noted that over 50% of scientists agree there is a crisis in reproducing results: https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970

