
3 Lines of Code: Transitioning from Notebook to Production-Ready Machine Learning

TL;DR

Every machine learning project starts simple – that’s a good thing. But at some point you need to promote it to production grade, and that transition should be as easy and seamless as possible. Let’s see how, with DagsHub, we can boil it down to 3 lines of code and get from a notebook to a central source of truth – including your code, data, models, and experiments – which you can share with team members.

Machine Learning Has Scrappy Beginnings – It’s a Feature, Not a Bug!

You’ve just started working on a new machine learning project. You want to show results ASAP, so that other stakeholders in your organization understand the business value of the model, and you can continue working on the project and build the model into a real-world machine learning application.

Starting scrappy is the right way to go – this usually means a notebook. If you can, you might even ask for a non-sensitive data sample and use Colab, since it provides strong compute and is easily shareable. Your goal is to arrive at some result as fast as you can, so you don’t want to get bogged down in unnecessary processes and tooling – after all, if this direction turns out to be a dead end, you might throw everything out the window, and all that infrastructure and process investment would have been a waste.

Throughout the building process you might extract some code into functions for more convenient reuse, and even commit it to the team’s machine learning utility repo, but a lot of the meat will remain in the notebook itself.

A few days or weeks later, you show your company’s stakeholders the results, and they’re excited! Let’s get this to production ASAP! You know this means the project will be more long-term, and that requires more rigorous processes and tooling to make sure that the data, models, experiments and code are tracked, work can be split and shared between team members, and the project has a central source of truth.

Now that you’ve spent so much time in the prototyping phase, that’s a non-trivial amount of work, so you put it off for later. You need to get to production; processes can come later. But what if that didn’t need to be the case? What if you could organize your project and get all those benefits with just 3 lines of code? Let’s see how to do it with DagsHub, so that you never need to compromise again!

What is DagsHub?

If you’re already familiar with DagsHub, skip this part and get to the juice of the next section – if not, read on.

DagsHub is a platform for managing and organizing machine learning projects. It creates a central source of truth for your code, data, models, experiments, and more, and enables teams to collaborate more effectively and get their models to production. It is built on popular open source tools like Git, MLflow, DVC, and Label Studio, so you aren’t reinventing the wheel but using agreed-upon formats and tools for everything.

3 Lines of Code to Upgrade Your Machine Learning Project

Starting from a Colab (or local) notebook, let’s see how you can do the following with 3 lines of code:

  1. Track Data
  2. Track Experiments
  3. Track Notebooks + Code

The only prerequisites for these lines are installing the DagsHub client and creating a DagsHub repo.

  • To install the client, simply:

    pip install dagshub
    

    And don’t forget to import dagshub – a minimal setup sketch follows right after this list.

  • Then, to create a repo, sign up to DagsHub and click the “Create” button in the top right of the page.

    You can either create a blank repository, use a project template, OR if you already have a code repo you’d like to connect, connect an existing repo to DagsHub – with the integrations to all popular Git providers, you’ll be able to add data, experiments and notebooks to existing repos (this will create Git commits where necessary).
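
As a minimal setup sketch for the notebook itself (the commented-out token call is an assumed API – normally the client prompts you to authenticate interactively on first use):

# One-time setup – assumes you've already run: pip install dagshub
import dagshub

# The client will typically prompt you to log in to DagsHub the
# first time you use it; a token-based alternative (assumed API):
# dagshub.auth.add_app_token("<your_token>")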

1. Track Data

Let’s assume that in scrappy mode, you got a CSV or a bunch of image files, uploaded them to a folder named data on GDrive, and mounted the drive to your Colab notebook. The first line of code we’ll use is:

dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="drive/MyDrive/data", remote_path="data")

If you have your data in a different folder, just change the local_path= argument.
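
The same call also works for a single file rather than a whole folder – for example, to upload just one CSV (train.csv here is a hypothetical file name):

dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="drive/MyDrive/data/train.csv", remote_path="data/train.csv")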

✅ Phase 1 DONE! You should now see a folder named data with your data file on DagsHub.

2. Track Experiments

For experiment tracking, we'll use MLflow – the most popular open source experiment tracking tool. DagsHub is integrated with MLflow, so assuming your code uses MLflow for experiment tracking, you'll only need one line to track the experiment and its model on DagsHub.

We’ll use the following:

dagshub.init(repo_owner="<repo_owner>", repo_name="<repo_name>")

✅ Phase 2 DONE! If you go to the experiment table in your repository, located at https://dagshub.com/<repo_owner>/<repo_name>/experiments/, you’ll be able to see your first experiment. In the MLflow UI associated with your repository (located at https://dagshub.com/<repo_owner>/<repo_name>.mlflow), you’ll also be able to see the actual model logged.

Note: The easiest way to instrument your code with MLflow is to use the autolog API, which supports most standard ML libraries. For example, with scikit-learn, the code you need is:

import mlflow

with mlflow.start_run(run_name="my_run"):
    mlflow.sklearn.autolog()
    # Train your scikit-learn model here – autolog captures
    # params, metrics, and the fitted model automatically
    ...
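
If autolog doesn't cover your library, you can log values manually with MLflow's standard APIs – a minimal sketch, with illustrative parameter and metric names:

import mlflow

with mlflow.start_run(run_name="my_run"):
    # The names and values below are placeholders for your own
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)

Since dagshub.init() points MLflow's tracking URI at your repository, these runs land in the same experiment table on DagsHub.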

3. Track Code/Notebook

If you’re working locally, you can use the same line of code from “1. Track Data” to upload your code files too.

dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="path/to/notebook.ipynb", remote_path="notebook_to_production.ipynb")

However, if you’re in Google Colab, that’s a more involved process (you’d need to upload the notebook from its save location in GDrive, which might be hard to find). That’s why we created a dedicated save_notebook function, which simply saves the notebook to DagsHub.

dagshub.notebook.save_notebook(repo="<repo_owner>/<repo_name>")

✅ Phase 3 DONE! You’ll now see your notebook in your DagsHub project, which concludes our 3-line adventure.
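
For reference, here are all three lines side by side (same placeholders as above):

import dagshub

# 1. Track data
dagshub.upload_files(repo="<repo_owner>/<repo_name>", local_path="drive/MyDrive/data", remote_path="data")

# 2. Track experiments – points MLflow at your DagsHub repository
dagshub.init(repo_owner="<repo_owner>", repo_name="<repo_name>")

# 3. Track the notebook
dagshub.notebook.save_notebook(repo="<repo_owner>/<repo_name>")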

For Your Next Machine Learning Project

To make sure my promises hold true, I created a sample project that goes through these steps. Find it on DagsHub: https://dagshub.com/Dean/my_first_repo

Now that we’ve seen how easy it is to organize a machine learning project and make it production-ready, next time you’re working on one you won’t need to worry about organizing it from the get-go. You can move fast and get initial results, then easily go through this process to organize the project and share it with other stakeholders and collaborators.

Good luck building!
