Data version control with DVC. What do the authors have to say?

Are there any ways that the work you're doing, either with DVC or any of your other projects, ties into that?

Dmitry: It's a good question.

Because, in general, the toolset for data projects is not in the same state as the toolset for software projects.

There was an interview with Wes McKinney, like a month ago.

He said, in data science, we are still in the Wild West.

This is actually what is happening (laughing); we don't have great support for many scenarios.

But from the tooling point of view, what I am seeing today is that things are quite mature in terms of algorithms.

We have PyTorch, we have TensorFlow, and we have a bunch of other libraries, like random forest and other tree-based algorithms.

Today, there is a race among online monitoring tools.

For example, TensorBoard, where you can report your metrics online as you train and see what is actually going on during the training phase.

It is especially important for deep learning because the algorithms are still quite slow, I would say. There are a bunch of commercial products in this space, and MLflow is one of the open source projects that is becoming popular; it helps you track your metrics and visualize the training process.

This is a trend today.

Another trend is how to visualize your models, how to understand what is inside your models.

Again, there’s a bunch of tools in order to do this but the state of this tool, it’s still not perfect.

In terms of unit tests, you can just use a regular unit test framework.

But I couldn’t say it works really well for ML projects, specifically, what I have seen for many times is unit test or probably not unit, but functional tests for data engineer.

On the data engineering side, when a new set of data comes into your system, you can compute basic metrics and make sure there is no drift in the metrics, no big changes in the metrics.

So this is how unit tests, or tests at least, can work in the data world.
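
As a toy illustration of that idea, here is a minimal sketch of such a functional test in shell, with a hypothetical CSV drop and made-up baseline numbers; a real system would track much richer statistics:

```bash
#!/usr/bin/env bash
# Toy drift check for a new data drop (file, column, and baselines are hypothetical).
set -euo pipefail

NEW_BATCH="data/new_batch.csv"   # assumed incoming file with a header row
BASELINE_ROWS=100000             # assumed baseline row count
BASELINE_MEAN=42.0               # assumed baseline mean of column 3

rows=$(($(wc -l < "$NEW_BATCH") - 1))                              # exclude header
mean=$(awk -F, 'NR > 1 { s += $3; n++ } END { print s / n }' "$NEW_BATCH")

# Fail if row count moved more than 20% or the column mean drifted more than 10%.
awk -v r="$rows" -v br="$BASELINE_ROWS" -v m="$mean" -v bm="$BASELINE_MEAN" 'BEGIN {
  dr = (r - br) / br; if (dr < 0) dr = -dr
  dm = (m - bm) / bm; if (dm < 0) dm = -dm
  if (dr > 0.2 || dm > 0.1) { print "metric drift detected"; exit 1 }
  print "metrics within expected range"
}'
```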

But tools in general are still in the Wild West.

Tobias: Moving on to the data versioning that is built into DVC.

As I was reading through the documentation, it mentions that the actual data files themselves are stored in external systems, such as S3, Google Cloud Storage, or something like NFS.

So I'm wondering if you can talk through how DVC actually manages the different versions of that data, any sort of incremental changes it is able to track, and any difficulties or challenges you faced in building that system and integrating it with the source control that you use Git for in the overall project structure?

Dmitry: Yeah, of course, we don't commit data (laughs) to the repository; we push data to your servers, usually to your cloud, and you can configure which cloud it goes to.

As I said before, we treat data as binary blobs.

For each particular commit, we can bring you the actual datasets and all the data artifacts that were in use.

We don't do any diffs per file, because you need to understand the semantics of a file in order to diff it.

In Git, it makes sense to diff every file.

In data science, you would need to know exactly what format of data file is in use.

However, we track directories as a separate type of structure.

If you have a directory with, let's imagine, 100,000 files, and then you add a few more files into the directory and commit this as a new version of your dataset, then we understand that only a small portion of the files changed. Let's say 2,000 files were modified and 1,000 were added; then we version only the diff, so you can easily add your labels to your dataset on a weekly basis without any concern about the size of your directory. We do this kind of optimization.
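
For illustration, a minimal sketch of that workflow using standard DVC commands (the directory, bucket, and file names here are made up):

```bash
# Track a large dataset directory with DVC inside an existing Git repo.
dvc init
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc add data/images                  # hashes files, writes the data/images.dvc metafile
git add data/images.dvc .gitignore
git commit -m "Track images dataset v1"
dvc push                             # upload the blobs to remote storage

# Later: drop new labels into the directory and commit a new dataset version.
cp ~/new-labels/*.xml data/images/
dvc add data/images                  # only new or changed files are re-hashed
git commit -am "Weekly labels update"
dvc push                             # uploads only the new blobs
```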

Another important optimization that we do is in your workspace.

When Git checks out files from its internal structure, it creates a copy in your workspace. However, in the data world, sometimes that just does not make sense, because you don't want to create another copy of, let's say, 100 gigabytes of data.

We optimize this process by using references, so you don't have duplicated datasets.

So you can easily work with dozens or hundreds of gigabytes without these concerns.
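
The link-based checkout is controlled by a real DVC setting, `cache.type`; a sketch of enabling it (the preference order shown is just one reasonable choice):

```bash
# Tell DVC to link files out of its cache instead of copying them.
# It tries each strategy in order and falls back to copy as a last resort.
dvc config cache.type "reflink,hardlink,symlink,copy"
dvc checkout   # materializes the workspace using links where possible
```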

Tobias: For somebody who is onboarding onto an existing project and checking out the state of the repository for the first time, is there any built-in capacity for saying, I only want to pull the code, I'm not ready to pull down all the data yet, or, I just want to pull down a subset of the data? Somebody who's working on a multi-hundred-gigabyte dataset doesn't necessarily want all of that on their laptop as they work through these experiments.

I'm just curious what that overall workflow looks like as you're training models locally, and how DVC handles and interacts with these large data repositories to make sure it doesn't just completely fill up your local disk.

Dmitry: This is a good question.

We support this kind of granular pull; we optimized this as well.

You, as a data scientist, can decide exactly what you need.

For example, if you'd like to deploy your model, which is probably within 100 megabytes, you probably don't need to waste time on the 100 GB dataset that was used to produce the model.

Then you can specify exactly which data files you need: clone a fresh copy of your repository, which has the code and metadata, and then say, I just need the model file. Only the model file will be delivered to your production system, to your deployment machine, and the same works for datasets.
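
A minimal sketch of that deployment-side flow, with a hypothetical repository URL and model file name:

```bash
# Clone brings only the code and the small .dvc metafiles, not the data blobs.
git clone https://github.com/example/ml-project.git
cd ml-project

# Granular pull: fetch just the trained model, skipping the 100 GB dataset.
dvc pull model.pkl.dvc
```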

Tobias: There are some other projects that I've talked about with the people building them, such as Quilt or Pachyderm, that have built-in support for versioning data.

I'm wondering if you have any plans currently to work on integrating with those systems, or just the overall process of what's involved in adding additional support for the data storage piece of DVC?

Dmitry: For some of those systems, integration can be done easily. For example, Pachyderm is a project that is mostly about data engineering.

They have a concept of pipelines, kind of data engineering pipelines.

DVC can be used in a data engineering pipeline.

DVC has a notion of ML pipelines, a lightweight concept.

It's optimized specifically for machine learning engineers; it doesn't have all the complexity of data engineering pipelines, but it can easily be used as a single step in an engineering pipeline.

We have seen that many times, when people take, for example, a DVC pipeline and put it inside Airflow as a single step.

This kind of design is actually a good design, because you give a lightweight tool to ML engineers and data scientists, so they can easily produce a result.

They can iterate faster to get their jobs done.

Then the DVC pipeline can easily be injected inside your production system.
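
As a sketch, using the `dvc run` stage syntax from the era of this interview (newer DVC releases declare stages in dvc.yaml instead); the scripts and file names are hypothetical:

```bash
# Declare two pipeline stages; each records its dependencies (-d) and outputs (-o).
dvc run -d prepare.py -d raw.csv -o clean.csv python prepare.py
dvc run -d train.py -d clean.csv -o model.pkl -M metrics.json python train.py

# Reproduce the whole chain with one command. An Airflow or Luigi task in the
# outer data engineering pipeline can simply shell out to this as its single step.
dvc repro model.pkl.dvc
```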

There is a term, I don't remember which company uses it, Netflix probably: DAGs inside a DAG. It means you have a DAG of data pipelines, you have an ML DAG for one particular problem, and then you basically inject a lot of ML DAGs inside the data engineering DAG.

So from this point of view, there is no problem integrating; there are no issues with integration into Pachyderm or Airflow or other systems.

Regarding Quilt, they do versioning and they work with S3 as well; potentially we could be integrated with them.

We are thinking about this; we are trying to be customer driven.

The biggest need today is probably integration with MLflow, because MLflow shines really well on the online metrics tracking side.

Sometimes people like to use MLflow for tracking metrics online, and DVC for versioning data files.

This is one of the integrations that we are thinking about today.

Tobias: In terms of the actual implementation of DVC itself, I know that it's primarily written in Python.

And you mentioned that's largely driven by the fact that it is becoming the lingua franca for data scientists.

So I'm wondering, now that you have gone a bit further in the overall implementation and maintenance of DVC, if you think that is still the right choice? If you were to start over today, what are some of the things you would do differently, either in terms of language choice or system design or overall project structure?

Dmitry: I believe Python is a good choice for this kind of project, for two major reasons.

One, we are targeting data scientists, and most of them are comfortable with Python.

We expect the data scientists to contribute to our code base.

If you write this kind of project in, let's say, C or C++ or Golang, you probably won't see a lot of contributions from the community.

Because the community speaks different languages.

For us it works perfectly: data scientists are contributing code, which is great (laughs).

And the second reason was programming APIs.

Early on, we were thinking about exposing DVC through APIs, as another option for using DVC.

And if you write the code in Python, that kind of comes out of the box: you can reuse DVC as a library and inject it into your project.

If you use a different language, that just creates some overhead; you need to think about how to expose everything in a nice form.

These were the reasons.

So far we are happy with Python, and it works nicely.

Tobias: You mentioned being able to use DVC as a library as well.

So I'm wondering if there are any use cases that you've seen that were particularly interesting or unexpected or novel, either in that library use case or just in the command line oriented way it was originally designed?

Dmitry: Sometimes people ask for library support because they need to implement some crazier scenarios (laughing).

I can say, for example, people use DVC to build their own platforms, data science platforms if you wish. They'd like to build continuous integration frameworks, where DVC plays the role of glue between your local experience and the CI experience, and they ask for libraries. But we have such a great command line toolset that people just switch back to the command line experience.

But one day, I won't be surprised if someone uses DVC purely as a library.

Tobias: I'm also interested in what you were saying about integrating DVC into data engineering pipelines by wrapping it as a single step that runs the model training piece.

So I'm wondering if you can talk a bit more about that and some of the specific implementations that you've seen?

Dmitry: Yeah, absolutely.

This is actually a really good question.

I believe that data engineers need pipelines, right?

Data scientists and machine learning engineers need pipelines.

But the fact is, their needs are absolutely different.

Data engineers care about stable systems: if something fails, the system needs to handle it, it has to recover.

This is a primary goal of the data engineering framework.

In data science, it works kind of the opposite way.

You fail all the time (laughing).

When you come up with some idea, you write code, you run it, the score fails, you see that it failed, and so on.

Your goal is to have a framework to check ideas fast, to fail fast.

This is the goal of ML engineers.

It is good practice to separate the two frameworks, to have two kinds of pipeline frameworks.

One is the stable engineering pipeline, and the second is a fast, lightweight experimentation pipeline, if you wish.

When you separate these two worlds, you simplify the life of ML engineers a lot.

They don't need to deal with complicated stuff; they don't need to waste time understanding how Airflow works or how Luigi works. They just live in their world and produce models, and once a model is ready, they need a clear way to inject their pipeline into the data pipelines.

You can build a very simple tool in order to do this.

I remember when I worked at Microsoft, it took me maybe a couple of hours to productionize my pipeline.

Because I had a separate workflow, a separate tool for ML pipelines.

This works nicely.

I believe this is the future of this kind of engineering: we need to separate these two things.

Tobias: I'm also interested in the deployment capabilities DVC provides, as far as being able to put models into production, or revert the state of a model if it's producing erroneous output or its predictions are causing problems for the business. And just the overall tooling and workflow involved in running machine learning models in production, particularly metrics tracking to know when the model needs to be retrained, closing the loop of the overall process of building and deploying these models.

Dmitry: Yeah, deployment, we are keen to get to it because it's close to the business.

And there’s a little funny story about ML model deployment.

Sometimes it goes like this: a software engineer asks the data science team, can we revert our model, not to the previous one, but to the model from the previous week? And the data scientists are like, yeah, sure, we have the datasets, you just need five hours to retrain it (laughs).

It does not make any sense, right, to spend five hours to revert a model. In the software engineering world, it does not work this way.

You need to have everything available.

You need to revert it right away, because waiting five hours means wasting money for the business.

DVC basically helps you organize this process; it creates a common language between the data scientists who produce a model and the ML engineers who take models and deploy them.

So next time, with proper data management, you won't even be asking a data scientist to give you a previous model; you will have everything in your system. With DVC or without DVC, it doesn't matter; what you need is to have all the artifacts available.
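
A sketch of what that revert looks like in practice (the commit hash and file names are hypothetical):

```bash
# Find the commit that produced last week's model.
git log --oneline -- model.pkl.dvc

# Check out that revision of the code and .dvc metafiles.
git checkout abc1234

# Restore the matching model.pkl from the DVC cache,
# or fetch it from remote storage if it is not cached locally.
dvc checkout
dvc pull model.pkl.dvc
```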

From the online metrics tracking point of view, this is actually a separate question.

Because when you're talking about metrics tracking in production, it usually means online metrics.

It usually means metrics based on feedback from users.

This is a separate question.

So DVC is not really about deployment; it's mostly about the development phase.

It basically does nothing with online metrics.

Tobias: So you are actually building and maintaining this project under the auspices of iterative.ai, which is actually a venture-backed company.

So I’m curious what the overall value equation is for your investors that makes it worthwhile for them to fund your efforts on building and releasing this open source project.

And just the overall strategy that you have for the business.

Dmitry: You would be surprised (laughs) how interested investors are in open source projects today; the last year especially was super successful for open source projects.

Last year, MuleSoft was acquired for millions, billions, sorry, and last year Elastic had its IPO.

It's a purely open source company.

And when you do open source, it usually means that you are doing IT infrastructure.

In many cases, you are doing IT infrastructure, and infrastructure projects are good for monetization.

Around a successful open source project there are a bunch of companies monetizing it. It's very important to understand your business model, because with open source there are a few common models.

One is a service model, a kind of consultancy model.

The second is the open core model, where you build the software and offer a piece of it with advanced features for money, or a different version of your software as a product for enterprises.

And the third model is the ecosystem model.

You build an open source product and create services around it as a separate product.

One example might be Git and GitHub, where you have the open source tool and a SaaS service, which is an absolutely different product, with an absolutely different experience and use cases.

You need to understand which model you fit in.

For a successful project, there will be a lot of people interested in this kind of business.

Initially, it was my pet project for about a year.

And then I was thinking: how do I make something big out of this, how do I spend more time on this, how do I find more resources to do this?

It was clear that if this project was successful, there would be a few businesses monetizing this area.

Why shouldn't we be the business that builds the product and monetizes it?

So it's a natural path in the modern open source world, I would say.

Tobias: As far as the overall experience of building and maintaining the DVC project and the community, I'm wondering what you have found to be some of the most interesting or challenging or unexpected lessons learned in the process.

Dmitry: One of the lessons that I learned is, I think, a usual business lesson, actually.

So when you build your project, you know what you're doing, you know your roadmap, you know your goal, and you're just building. But one day users come to you, and they ask for a lot of different stuff.

Then you have tension between your vision and plans on one side and the demands from users on the other.

And today we are at the point where every day we get a few requests from users, sometimes like 10 requests per day (laughing).

It's not easy to balance these things.

Because if you do everything people ask, you have no time for your roadmap.

And actually, you don't have time to fix and implement everything that people ask for.

So you need to learn how to prioritize things.

You need to learn how to sometimes say no to users, or to say, we will do this, but probably not right now.

So this is not easy to do.

This is what you need to learn during the process.

As I said, this experience in open source is much the same as in a regular business environment.

I have seen that many times.

Tobias: Looking forward in terms of the work that you have planned for DVC, I'm curious what types of features or improvements you have on the roadmap.

Dmitry: Features for the near future: we are going to release better support for dataset versioning and ML model versioning.

We are introducing new commands into DVC, which will simplify the experience.

Some companies are using mono-repos with a bunch of datasets inside.

We need new commands to handle this better, because these datasets evolve at different speeds.

Sometimes you need to work with one version of one dataset and a different version of another.

So basically, this is one of the steps we are taking.

And another use case for datasets is cross-repository references.

Other companies are not using a mono-repo; they're using a set of repos.

For example, they might have 10 or 20 repos with datasets and 20 more with models.

They need to cross-reference datasets; this is the next command we are going to implement, to support these cross-reference, cross-repository scenarios.
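
For context, this capability landed in DVC as `dvc get` and `dvc import`; a sketch with made-up repository URLs and paths:

```bash
# `dvc get` downloads an artifact from another DVC repository;
# `dvc import` does the same but records the dependency so it can be updated.
dvc get https://github.com/example/datasets-repo data/images
dvc import https://github.com/example/datasets-repo data/images

# Later, pull in the upstream dataset's newer version.
dvc update data/images.dvc
```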

This is our near future.

And in the longer-term vision, the next step would be implementing more features for better experiment support, especially when people deal with scenarios like hyperparameter tuning.

They need to run, let's say, 1,000 experiments.

And they still need to keep control of them.

They don’t want to have 1000 branches (laughing).

This is the experience we need to improve.

We have a clear plan on how to do this.

And that is pretty much the plan for the next half a year or so.

Eventually, we believe that DVC can be a platform where people on one team can work in the same environment and share ideas with each other.

In the future, we believe we can create a great experience where people can share ideas even between teams and between companies.

This is the big future of DVC that I believe in.

Tobias: Are there any other aspects of the work that you're doing on DVC, or the overall workflow of machine learning projects, that we didn't discuss yet that you think we should cover before we close out the show?

Dmitry: I don't think I have something to add.

But what I believe is that we need to pay more attention to how we organize our work.

We need to pay more attention to how we structure our projects; we need to find the places where we waste our time instead of doing actual work.

It is very important to be more organized and more productive as a data scientist, because today we are still in the Wild West, and this needs to change as soon as possible.

It is important to pay attention to this, it’s important to understand this problem set.

Tobias: All right.

Well, for anybody who wants to follow along with the work that you’re doing or get in touch I’ll have you add your preferred contact information to the show notes.

And so with that, I'll move us into the picks. This week, I'm going to choose a tool that I started using and experimenting with recently, called otter.ai.

And it’s billed as a voice note taking service that will transcribe your meeting notes or just mental notes to yourself to text so that they’re searchable.

And I’ve been experimenting with using it to try out generating transcriptions for the podcast.

So I'm looking forward to using that more frequently and to start adding transcripts to the show.

So definitely worth checking out if you’re looking for something that does a pretty good job of generating transcripts automatically and at a reasonable price.

So with that, I'll pass it over to you, Dmitry.

Do you have any picks this week?

Dmitry: So I thought the open source part, and open source vs. venture capital, was a question that we would discuss.

Actually, I have nothing special to suggest, but the weather is nice today.

Spring just started.

So just spend more time outside, walking around your city or town.

Tobias: All right, well, thank you very much for taking the time today to join me and discuss the work you're doing on DVC, adding better structure to the overall development life cycle of machine learning projects.

So thank you for that.

I hope you enjoy the rest of your day.

Dmitry: Oh, thank you, Tobias.

Thank you.

Conclusion

https://dvc.org/

We need testing and versioning to understand what we're doing.

When we are programming, we're doing a lot of different things all the time: testing new ideas, trying new libraries, and more, and it's not uncommon to mess things up.

It’s the path of data science, fail fast and iterate to success.

DVC is one of the best tools we have at the moment for controlling your data projects, and as you can see it can be connected with other great tools.

I hope this interview and article helped you get started with them.

If you have any questions, please write me here: Favio Vazquez – Founder / Chief Data Scientist – Ciencia y Datos | LinkedIn (www.linkedin.com).

Have fun learning :)
