Data scientists have challenges and need tools to overcome them. It’s best to use open-source, best-of-breed, modular solutions. It’s also a good idea to think about these challenges from a problem-solution perspective, as opposed to a “give me the awesomest tools” approach. I provide a list of common problems and OSS solutions for those problems, with comments on when they’re better/worse for that issue.
A familiar problem
If you’re a data scientist working on machine learning projects, you might have noticed a recurring issue in your work.
- You have a lot of issues, and there seem to be a lot of tools for them, but every tool claims it can do everything, so it’s hard to know which one to use for each task. Worse yet, you’ve tried a few of these tools and have been disappointed. Since they each have a learning curve, it’s scary to try another one, potentially wasting more time on something you’ll end up needing to build internally.
- Moving fast is of the essence since you need to show the rest of your organization that this new AI thing can actually translate to business value, and you want to focus on what you do best – understanding your data, transforming it, and either translating it into business insights, or training an ML-powered component for your product/company.
- You understand that you’re part of a team – your company is going to be a rocket ship, and you’ll have a huge team soon enough (if you don’t already have one), and you already need to work with other teams in your organization, like domain experts, software engineers, and DevOps.
- Finally, you realize that this field is fast-changing, so committing to a vendor that might become obsolete, or might not support the formats/tools you’ll use tomorrow, is problematic. But, on the other hand, analysis paralysis isn’t much better (back to point 2).
I’m happy to tell you that this issue is solvable. You don’t have to choose between buying an end-to-end platform, which is less customizable and might not give you the best solution for the challenges you face, and building something internally, which might be perfectly suited to your needs but means you’re no longer a data scientist (congrats, you’re now a software engineer – whether this is a promotion or demotion depends on you).
The way out is to use open-source software to build an open source data science stack. Because the ML world changes so fast, there are hundreds of open source solutions for the problems data scientists face, and they don’t suffer from the issues stated above, given the following assumptions:
- A best-of-breed solution is defined as the best solution for a given problem that can integrate with the other best-of-breed solutions you use relatively easily. This means that an awesome solution doesn’t count if using it requires you to give up other tools that are unrelated to the problem.
- You don’t need tools for the sake of tools – you have problems that you want to solve, or processes you want to polish, and you need tools to help you accomplish these things. This assumption is important. Otherwise, you’ll ask for everything, which means you’ll end up using an end-to-end platform that might break when you need to solve more specific challenges.
- Data scientists need different tools than developers – You should aspire to NOT reinvent the wheel (e.g. an automation tool for developers might be great for machine learning workloads), but working with code is not the full picture for data science. You need to at least consider how data, models, pipelines, and experiments will be handled. Specifically, reproducibility needs to be properly addressed in ML projects which will be promoted to production.
Now, an open-source stack that complies with the assumptions above solves the challenges in the following ways:
- Knowing which tools to use is what this article is for. Also, by choosing point solutions for your issues that are easy to integrate, you avoid being disappointed – if a dedicated tool doesn’t solve your problem, an end-to-end platform probably won’t either.
- You can move fast since once a problem crops up, you choose a tool that fits the assumptions – i.e. it solves a real problem and is the best solution for your problem while integrating with the rest of your stack. This means you spend less time choosing the tool, and less time integrating it with the rest of your stack.
- Working in a team is solved since you’ll start by looking for tools that are already adopted in other parts of the organization. If none solve your problem, that means you should look outside, and by choosing best-of-breed solutions, which are integrated with other parts of your stack, you’ll make it easy for everyone else to adopt them as well. This can potentially shorten your time to production significantly.
- The fast-changing nature of the field is addressed because if you choose your tools with the above assumptions, your platform is fully modular. Swapping out an obsolete tool with a new one should only require minimal integration and time.
By choosing an Open Stack (i.e. Open Source tools), you get two additional benefits:
- Modifying the tools to your needs is possible in case a unique need arises.
- You get community support from others in your position – using the tools and experiencing challenges. This also means it’s much easier to evaluate the pitfalls of tools (just go to their issues page).
Keep in mind that using open source is NOT equivalent to avoiding lock-in. You should always prefer generic formats that are widely supported over obscure or new standards.
Of course, everything I’ve written above is an ideal representation of reality. In real life, things are more complicated, and other considerations come into play. However, using this thought process and assumptions will save you time, headspace, and potentially grief and money.
The rest of this article lists problems I’ve heard data scientists raise over the past few years, along with a few open-source tools you can incorporate into your stack to solve them. I’ll try to add a few notes on each one, for when it is better/less suited for the task.
10 MLOps problems and how to solve them with an open stack:
1. Problem: You have changing data and you want time-traveling/disaster recovery/governance:
Your data is unstructured:
- DVC – https://github.com/iterative/dvc – probably the most lightweight solution in this list. If you want to get started the fastest, this is the choice for you.
- Pachyderm – https://github.com/pachyderm/pachyderm – Pachyderm is focused on pipelines on top of K8s, offering data versioning as a byproduct of its pipeline system. If you work on top of K8s and are looking for a heavier-duty pipeline solution, this might be for you.
- LakeFS – https://github.com/treeverse/lakeFS – Versions your entire data lake. This is good for very large scales and has a lot of optimizations for large-scale streaming, but it probably requires your whole organization to adopt it.
Your data is structured or stored in a DB you can query (you need to be ok with changing your DB in this case):
- Dolt – https://github.com/dolthub/dolt – A relational database with versioning capabilities
- TerminusDB – https://github.com/terminusdb/terminusdb – A graph DB with versioning capabilities
- Delta Lake – https://github.com/delta-io/delta – An entire versioned data lake solution with special data ingestion capabilities. This is significantly more heavyweight than the other two solutions in this list.
Note: None of the three structured data solutions is as lightweight as its unstructured data counterparts, since they require moving your data to dedicated data formats or databases. That may or may not be a deal-breaker for you.
This problem comes up a lot, but it is usually phrased as a request for a data versioning solution. In the spirit of solving problems, and not adding tools for the sake of adding them, it’s important to answer two main questions:
- What is the task that data versioning will accomplish?
  - Do you want to be able to play around with different versions of data to compare the model’s behavior?
  - Do you need a solution to revert to an earlier version in case of bugs?
  - Are you looking for a solution to know which data is used where?
  - Are you looking for a way to add/modify data without breaking downstream consumer projects?
- What type of data are you managing?
  - Is it stored in a database, and you want to manage versions of that entire database?
  - Is it stored in a database, but you want to manage only data that was queried into a specific project?
  - Is it unstructured data, like images, text, or audio?
A different answer to each question might demand a different tool.
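To make the first two questions (comparing versions and reverting) more concrete, here is a toy sketch of the idea that lightweight file-versioning approaches build on: store each file under the hash of its content, and record which hash a version name points to. This is not the implementation or API of any tool listed above. The `SnapshotStore` class, its method names, and the on-disk layout are hypothetical and exist only for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path


class SnapshotStore:
    """Toy content-addressed store (hypothetical, for illustration only)."""

    def __init__(self, root: Path) -> None:
        self.objects = root / "objects"        # file contents, keyed by hash
        self.index_path = root / "index.json"  # version name -> content hash
        self.objects.mkdir(parents=True, exist_ok=True)

    def _load_index(self) -> dict:
        if self.index_path.exists():
            return json.loads(self.index_path.read_text())
        return {}

    def commit(self, data_file: Path, version: str) -> str:
        """Store the current content of data_file under a version name."""
        digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
        target = self.objects / digest
        if not target.exists():  # identical content is stored only once
            shutil.copyfile(data_file, target)
        index = self._load_index()
        index[version] = digest
        self.index_path.write_text(json.dumps(index, indent=2))
        return digest

    def checkout(self, version: str, data_file: Path) -> None:
        """Restore data_file to the content recorded for a version."""
        digest = self._load_index()[version]
        shutil.copyfile(self.objects / digest, data_file)
```

Calling `commit` before and after a data change gives you two named versions to train against, and `checkout` reverts the working file to either one. Real tools add remote storage, deduplication across large files, and integration with Git on top of this basic idea.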
2. Problem: You’re not sure what structure to use for your ML project:
- Cookiecutter data science – https://github.com/drivendata/cookiecutter-data-science
It might be frustrating to think about which structure makes sense for your project. Should you put your data and models in the same folder, or not? There hasn’t been a lot of work done on this front, but Cookiecutter data science is a recurring name, so when in doubt, you can use it, and it will likely suit your needs (while sparing you the mental load).
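If you only want to see what a structured project skeleton looks like before adopting a template, the sketch below creates one with the standard library. The folder names are my own assumption for illustration, loosely inspired by common data science templates. They are not a faithful copy of the Cookiecutter data science layout, and `create_project` is a hypothetical helper.

```python
from pathlib import Path

# Folder names are an assumption for illustration; adapt them to your project.
DEFAULT_LAYOUT = (
    "data/raw",        # original, immutable data
    "data/processed",  # cleaned data ready for modeling
    "models",          # trained model artifacts, kept apart from data
    "notebooks",       # exploration
    "src",             # reusable project code
    "reports",         # generated analysis and figures
)


def create_project(root: Path, layout=DEFAULT_LAYOUT) -> list:
    """Create the folder skeleton and return the created directories."""
    created = []
    for relative in layout:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        # An empty .gitkeep lets Git track folders that have no files yet.
        (folder / ".gitkeep").touch()
        created.append(folder)
    return created
```

In this sketch, data and models live in separate top-level folders, which answers the question above in one particular way. Your project may reasonably choose differently.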
3. Problem: You’re not sure how to track your experiment metadata:
This is probably not a problem, but people keep asking about it so I might as well address it. There are so many solutions that the paradox of choice is strong here, but luckily, you can read our other post, which is exactly about choosing the experiment tracking solution that is best for you: https://dagshub.com/blog/how-to-compare-ml-experiment-tracking-tools-to-fit-your-data-science-workflow/
A few things to note here:
- You should log as much as possible while maintaining a way to differentiate between “trash” experiments and “good” experiments (labels are a good way to do this).
- Avoid completely deleting bad experiments. They can teach you as much as good experiments. It’s better to just put them in a separate view.
- It’s best to use generic formats, as opposed to proprietary or obscure ones. This will let you analyze your experiment metadata easily.
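The three notes above can be sketched without any tracking tool at all: append every run to a JSON Lines file (a generic format), tag it with labels, and filter by label instead of deleting. The function names, the record fields, and the `"trash"` label are hypothetical choices for this sketch, not the format of any particular experiment tracker.

```python
import json
import time
from pathlib import Path
from typing import Optional


def log_experiment(log_file: Path, params: dict, metrics: dict,
                   labels: list) -> dict:
    """Append one experiment record as a single JSON line."""
    record = {
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "labels": labels,  # e.g. ["trash"] or ["good"]; nothing is deleted
    }
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def load_experiments(log_file: Path,
                     exclude_label: Optional[str] = None) -> list:
    """Read all records, optionally hiding those that carry a given label."""
    records = []
    with log_file.open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            if exclude_label is None or exclude_label not in record["labels"]:
                records.append(record)
    return records
```

Calling `load_experiments(path, exclude_label="trash")` gives you the "good" view, while calling it without a label still returns every run, so failed experiments stay available for later analysis.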
4. Problem: You have a model somewhat ready, and you think you can squeeze a few extra percentage points by fine-tuning its hyperparameters:
Note: In most cases, hyperparameter tuning is less effective than it’s portrayed to be. Usually, it’s a better idea to work on improving your data quality.
Still, here are two great tools:
- Optuna – https://github.com/optuna/optuna – The best simple solution for hyperparameter tuning and black-box optimization.
- Ray Tune – https://github.com/ray-project/ray – Useful if you’re already using Ray for distributed training. It acts as an adapter layer that integrates with Optuna, among other tuning libraries and algorithms.
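As a point of reference for what these tools automate, here is a minimal random search, the simplest black-box tuning baseline. It is not the algorithm or API of Optuna or Ray Tune. The `random_search` function, the parameter names `lr` and `reg`, and the toy `objective` are hypothetical stand-ins for a real training run that returns a validation loss.

```python
import random
from typing import Callable


def random_search(objective: Callable[[dict], float],
                  space: dict,
                  n_trials: int = 50,
                  seed: int = 0) -> tuple:
    """Sample parameters uniformly from `space` and keep the lowest score."""
    rng = random.Random(seed)  # fixed seed keeps the search reproducible
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def objective(params: dict) -> float:
    """Toy stand-in for a validation loss, minimized at lr=0.1, reg=1.0."""
    return (params["lr"] - 0.1) ** 2 + (params["reg"] - 1.0) ** 2


# Each entry maps a parameter name to its (low, high) sampling range.
best, loss = random_search(objective, {"lr": (0.0, 1.0), "reg": (0.0, 2.0)})
```

Dedicated tuning libraries replace the uniform sampling with smarter strategies and add features such as stopping unpromising trials early, but the loop of "propose parameters, score them, keep the best" stays the same.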
5. Problem: You need a free GPU (don’t we all):
- Google Colab – https://colab.research.google.com/ – Nuff said.
6. Problem: You want to automate as many things as possible. You decide to start with the training of models, or more generally, the running of pipelines, upon changing parameters/code/data:
- Does it have to work with the tools the developers in your organization are using?
  - Jenkins – https://github.com/jenkinsci/jenkins – Probably the most widely used solution for automation in software engineering. It is not the most user-friendly to set up, but if your organization is already using it, integrating with it should be straightforward.
  - GitLab CI – https://gitlab.com/gitlab-org/gitlab-runner
  - GitHub Actions (Not OSS)
- Can you create a dedicated service for your ML pipelines?
  - Prefect – https://github.com/PrefectHQ/prefect – It’s like Airflow, but with an arguably more intuitive API and improved capabilities for data pipelines. It’s definitely not as mature as Airflow and offers fewer integrations, but the authors of Prefect claim to solve some of the shortcomings of Airflow, such as improved scheduling, dynamic DAG support, and more.
  - Dagster – https://github.com/dagster-io/dagster – Also works with Airflow and arguably has a more convenient UI than Prefect, but it is more opinionated than Prefect, which translates to more dedicated features but less flexibility.
  - KFPipelines – https://github.com/kubeflow/pipelines – A wrapper for Argo pipelines. Especially useful if you’re already using Kubeflow. Otherwise, it probably doesn’t provide an advantage over the other two options.
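Whichever orchestrator you pick, the core abstraction is the same: a pipeline is a DAG of steps, and each step runs only after the steps it depends on. The sketch below shows that idea with Python’s standard-library `graphlib` (Python 3.9+). It is not the API of Prefect, Dagster, or Kubeflow Pipelines. The `run_pipeline` helper and the step names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(steps: dict, depends_on: dict) -> list:
    """Run every step after the steps it depends on; return the run order."""
    # static_order() yields each node only after all of its predecessors,
    # and raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        step: Callable[[], None] = steps[name]
        step()
    return order


log = []  # records which steps actually ran, in order
steps = {
    "load": lambda: log.append("load"),
    "featurize": lambda: log.append("featurize"),
    "train": lambda: log.append("train"),
    "evaluate": lambda: log.append("evaluate"),
}
# Each key lists the steps that must finish before it can start.
depends_on = {
    "featurize": {"load"},
    "train": {"featurize"},
    "evaluate": {"train"},
}
order = run_pipeline(steps, depends_on)
```

What the dedicated tools add on top of this ordering is the part that is hard to build yourself: scheduling, retries, caching of unchanged steps, distributed execution, and a UI.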
7. Problem: You want to make your models accessible via an API:
Another issue that usually comes up is “how do you deploy your models to production?”. While this is an important task to complete, it’s usually very case-specific. It’s hard to find a one-size-fits-all solution, and you need to pay attention to the following topics:
- Do I only need to deploy the model? Or do I need a solution for deploying an entire data processing pipeline, otherwise the proper model input will never be generated?
- Am I using a standard framework like Scikit-Learn Pipelines, or have I developed custom data processing steps/models?
- Does my deployment system have “engineering” constraints? Should it support millions of requests per hour? Should it have a latency below some threshold? Should it be scalable in case of stress?
These questions will lead you to tradeoffs. The more constraints you put on the system, the more upfront engineering work it will require, but if you only need to deploy a “simple” model inference API, the following might serve you well:
- Seldon – https://github.com/SeldonIO/seldon-core or KFServing – https://github.com/kubeflow/kfserving – Both of these are tied to K8s, so that might be a major consideration. Seldon is integrated into KFServing, but you can also use it as a standalone project, and it is probably more mature. Supports standard model formats and lets you define custom wrappers if needed.
- BentoML – https://github.com/bentoml/BentoML – Provides the wrapper for the model (creating a docker container), but not the orchestration of the container itself. In this sense, it can be used with Seldon. More opinionated compared to Seldon, so might be less flexible.
- Cortex – https://github.com/cortexlabs/cortex – Started as an ML deployment service, and has since then moved on to focus on more broadly “deploying, managing, and scaling serverless containers”. It supports AWS only at the moment but is very widely adopted, and supports multiple types of APIs (real-time, batch, async) which might be critical for certain applications.
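To show what a “simple” model inference API boils down to, here is a minimal sketch using only the standard library. It is a stand-in for illustration, not how Seldon, KFServing, BentoML, or Cortex work. The fixed-weight linear `predict` function plays the role of a trained model, and the request shape `{"features": [...]}`, the handler names, and the port are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "model": fixed weights standing in for a trained artifact.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features: list) -> float:
    """Linear model used only as a placeholder for real inference."""
    if len(features) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} features, got {len(features)}")
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_request(body: bytes) -> tuple:
    """Turn a raw JSON request body into an HTTP status and a response."""
    try:
        payload = json.loads(body)
        return 200, {"prediction": predict(payload["features"])}
    except (ValueError, KeyError, TypeError) as error:
        # Malformed JSON, a missing key, or bad feature values.
        return 400, {"error": str(error)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, response = handle_request(self.rfile.read(length))
        data = json.dumps(response).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start a local server; the port is an arbitrary choice for testing."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Keeping `handle_request` separate from the HTTP handler makes the inference logic testable without a running server. Everything the questions above ask about (throughput, latency, scaling, packaging the preprocessing pipeline) is exactly what this sketch leaves out and what the listed tools are for.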
8. Problem: You want to better understand your model’s outputs, to increase your trust in its performance:
This can be a serious problem for you, especially if your project/organization requires deterministic and explainable predictions before a model is promoted to production (e.g. when you can’t take chances that your model might be biased). It is an active area of development, and you’re very likely to need to develop some custom visualizations for your work, especially if your domain is not a generic one such as face recognition or text translation. Nevertheless, some frameworks exist to make your life easier:
- SHAP – https://github.com/slundberg/shap – Explains predictions for all types of models. Arguably performs better at model-level or variable-level explainability than at explaining a single prediction.
- LIME – https://github.com/marcotcr/lime – Arguably better at single prediction explanation.
- SHAPash – https://github.com/MAIF/shapash – A more holistic approach that generates a report for your model, explaining its results via multiple charts. It is compatible with both SHAP and LIME.
It’s also a good idea to remember that in many cases, the best solution is to just use simpler models. Many companies opt for tree-based models, because they are inherently more explainable than neural networks, while not losing a lot on performance. Keep in mind, too, that this is not an either/or situation. You can use all the above options, and take out the insights for your dashboard if needed.
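For intuition about what a SHAP-style explanation measures, the sketch below computes exact Shapley values by brute force: each feature’s contribution is its average marginal effect on the prediction over all subsets of the other features. This is not how the SHAP library computes its values, since it relies on much faster approximations and model-specific algorithms. Replacing “missing” features with a fixed `baseline` is one assumption among several possible, and `shapley_values` and `linear_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Sequence


def shapley_values(model: Callable[[Sequence[float]], float],
                   instance: Sequence[float],
                   baseline: Sequence[float]) -> list:
    """Exact Shapley values; cost grows as 2^n, so only for few features."""
    n = len(instance)

    def value(subset: frozenset) -> float:
        # Features in `subset` take the instance's value, the rest the baseline's.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                subset = frozenset(combo)
                total += weight * (value(subset | {i}) - value(subset))
        contributions.append(total)
    return contributions


def linear_model(x: Sequence[float]) -> float:
    """Toy model for the example: 2*x0 + 3*x1 - x2 + 5."""
    return 2 * x[0] + 3 * x[1] - x[2] + 5


phi = shapley_values(linear_model, instance=[1, 2, 3], baseline=[0, 0, 0])
```

For a linear model, each contribution equals the weight times the feature’s distance from the baseline, and the contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which makes the result easy to sanity-check.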
9. Problem: You want to show a demo of your work to someone (probably an internal stakeholder), without learning HTML, CSS, and JS:
This one is pretty self-explanatory.
- Streamlit – https://github.com/streamlit/streamlit – More focused on fast prototyping. Is more opinionated compared to Dash, which means it’s easier to build simpler dashboards.
- Dash – https://github.com/plotly/dash – Is more flexible, but harder to get started. Geared more towards enterprise settings and production.
- Gradio – https://github.com/gradio-app/gradio – An even more opinionated tool, especially oriented towards demos of ML models, as opposed to the more general dashboarding capabilities of Streamlit and Dash.
10. Problem: You already have a model in production and you’re afraid it might misbehave, so you need some way to make sure predictions make sense:
Open source solutions for this problem are relatively few and far between. The ones below are at a very early stage but might be useful for your needs.
- MLWatcher – https://github.com/anodot/MLWatcher – More focused on real-time monitoring, specifically for concept drift, data anomalies, and model metrics.
- Evidently – https://github.com/evidentlyai/evidently – A report-based monitoring system, that analyzes models and provides interactive reports for specific topics such as data drifts, and model performance. As far as I can tell, they do not offer any real-time capabilities, which means if this is important for you, you will need to manually set that up.
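If you need a starting point while evaluating these tools, one generic way to check whether a feature or the model’s predictions have drifted is to compare a recent window against a reference window with the two-sample Kolmogorov-Smirnov statistic. The sketch below is a standard-library illustration, not how MLWatcher or Evidently compute drift. The function names and the `0.3` threshold are assumptions you would need to tune for your data.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Largest gap between the two empirical cumulative distributions."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    # Both empirical CDFs only change at observed values, so checking
    # those points is enough to find the maximum difference.
    for x in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


def has_drifted(reference: Sequence[float], current: Sequence[float],
                threshold: float = 0.3) -> bool:
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, current) > threshold
```

The statistic is 0 when the two samples have identical distributions and 1 when they do not overlap at all. Running a check like this on a schedule is one way to approximate the real-time capability mentioned above, at the cost of setting it up yourself.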
This was an honest attempt to alleviate some of the analysis paralysis and “buyer’s remorse” you might be having in choosing your ML tool stack. I’m a strong believer that open stacks will emerge as teams better understand the problems they are looking to solve, and the tradeoffs they are willing to make.
By choosing open source tools that are modular and solve real problems you’re having, you put yourself in the best position to take advantage of the automation and best practices already developed for and by the community, while still retaining the ability to upgrade, change, or add new components to your stack.
If you have other problems that you solved by incorporating a best-of-breed open source solution, please let me know and I’ll add it to the article.