Data Versioning and Reproducible ML with DVC and MLflow

Databricks•23.4K views•26:46

About this video

Machine learning development involves comparing models and storing the artifacts they produce. We often compare several algorithms to select the most efficient ones, and we assess different hyperparameters to fine-tune the model. Git helps us store multiple versions of our code. Additionally, we need to keep track of the datasets we are using. This is important not only for audit purposes but also for assessing the performance of models developed at a later time.

Git is a standard code versioning tool in software development. It can be used to store your datasets, but it is not an optimal solution for that. An alternative is Data Version Control (DVC). Despite its name, it is not just a data versioning tool: it also enables model and pipeline tracking. It runs on top of Git, which makes it easy to learn for Git users. At the same time, it overcomes the limitations of storing big files in Git by storing them remotely (e.g. Azure, S3) and keeping only their metadata in Git.

MLflow is a tool that integrates easily with the code of your model and can track dependencies, model parameters, metrics, and artifacts. Every run is linked with its corresponding Git commit. Once the model is trained, MLflow can package it in different flavors (e.g. Python/R function, H2O, Spark, TensorFlow…) ready to be deployed. DVC also runs alongside Git. While MLflow helps you manage the machine learning lifecycle, DVC helps you manage your datasets.

In this tutorial, we will learn how to leverage the capabilities of these tools. We will go through a toy ML project and look at sample code showing how to increase the reproducibility of individual steps.

About: Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform
See all the previous Summit sessions: https://databricks.com/sparkaisummit/north-america/sessions

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks/
Instagram: https://www.instagram.com/databricksinc/

Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner
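
The description says that DVC stores big files remotely and keeps only their metadata in Git. The sketch below illustrates that general idea using only the Python standard library. It is not DVC's implementation or file format: the function names (`track`, `restore`), the `.pointer.json` suffix, the SHA-256 checksum, and the local `remote` directory standing in for Azure or S3 are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, remote_dir: Path) -> Path:
    """Copy the data to the (hypothetical) remote store, keyed by checksum,
    and write a small pointer file that could be committed to Git instead."""
    checksum = file_checksum(data_file)
    remote_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, remote_dir / checksum)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    metadata = {
        "path": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }
    pointer.write_text(json.dumps(metadata, indent=2))
    return pointer


def restore(pointer: Path, remote_dir: Path) -> Path:
    """Rebuild the data file from the pointer and verify its checksum."""
    metadata = json.loads(pointer.read_text())
    target = pointer.with_name(metadata["path"])
    shutil.copy2(remote_dir / metadata["sha256"], target)
    if file_checksum(target) != metadata["sha256"]:
        raise ValueError("restored file does not match recorded checksum")
    return target


def demo() -> dict:
    """Track a tiny dataset, delete it locally, then restore it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"
        data.write_bytes(b"x,y\n1,2\n3,4\n")
        pointer = track(data, root / "remote")
        data.unlink()  # simulate a fresh checkout that has only the pointer
        restored = restore(pointer, root / "remote")
        return {
            "meta": json.loads(pointer.read_text()),
            "content": restored.read_bytes(),
        }


if __name__ == "__main__":
    print(demo()["meta"])
```

Because the pointer file is only a few lines of text, it versions cleanly in Git, and checking out an older commit identifies exactly which dataset contents belong to it.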

Video Information

Views: 23.4K
Likes: 415
Duration: 26:46
Published: Jan 14, 2021
Quality: HD
Captions: Available
