The Best of Open Source Data Science Reviewed

Guy Smoilovsky
5 years ago

Co-Founder & CTO @ DAGsHub

TLDR: Open-Source Data Science (OSDS) is maturing and changing the field of Data Science. We believe Open Source will revolutionize Data Science just as it revolutionized the world of software. We examine some key trends and tools in the OSDS space.

Open Source Software has come to dominate the software world. 72% of Fortune 50 companies use open-source software and there are more than 56 million developers on GitHub, a popular platform for hosting open-source software. But what about Data Science and Machine Learning? Has ML, for all its revolutionary potential, taken advantage of the opportunity and promise of open source? In this post, we’ll take a deep dive into this question, highlight some of the key developments in open source data science, and share resources for those who are interested in further information. You can also read our deep dive into the importance and challenges of open source data science.

The Challenge of Open Source Data Science

Open Source Software has changed the way technology develops and has enabled tremendous growth and improvements for developers. However, when we transition into the world of ML and Data Science, the Open Source movement is far from living up to its full potential. In a field like ML that is growing rapidly, making breakthroughs and headlines, this is too big of an opportunity to miss.

Why is this? The main reason is that Open Source Data Science is a lot more complex than its software counterpart. Software can be shared as code alone, whereas Data Science also requires data, parameters, environments, experiments, and pipelines. An easy way to picture the difference in complexity is to compare a fried egg with a lemon meringue pie. Both contain eggs, but the pie involves significantly more components and steps.

Open Source is Essential for Data Scientists and ML Engineers

Challenges aside, here are a few reasons why we believe you need to invest in open source for data science:

Open-source data science gives you superpowers. Just as open-source software enabled projects that would have been unimaginable without open source communities, open source data science can do the same. Working on a project within a community means more eyes and partners to help you find bugs and more helping hands to fix those issues. In OSDS this means a more rigorous examination of results and more feedback and suggestions on how to improve your models. Additionally, Open-Source can be leveraged to crowdsource data, which as we all know is one resource that you can never have too much of.
Open Source Data Science can help people just starting out with data science as well as advanced practitioners. Imagine working in a team with the best data scientists in your sub-field in the world. With Open Source Data Science, that can easily happen. Open Source is a great starting point to familiarize yourself with best practices across the industry. Open-Source can also make high-quality resources, which take a scale to create that most users lack, accessible to the broader community. OSDS enables you to collaborate with others and tackle more complex problems than you could on your own.
One great thing about OSDS is that it gives you the ability to "peer behind the curtain" and see how the sausage is made. This backstage access enables you to change different elements to best suit your needs. This makes for easier troubleshooting in the case of data science bugs and can help you rely on your models with more confidence. For more about this, check out this post on our blog.
By solving the problem of reproducibility we can create reliable open source data science projects, which in turn enable us to achieve the benefits described above such as better results, fairer models, and more reliable performance.

Trends in Open Source Data Science

As with other fast-growing and emerging fields, it can be tricky to stay on top of recent trends and developments in Open Source Data Science. We want to walk you through some of the key trends we have noticed.

1. The Corporate Contribution

One key development is the steadily improving ease of use and accessibility of machine learning and deep learning, thanks to work done by large corporate players in the field such as Google, Facebook, Netflix, and others.

This is twofold:

Large Corporations have created tools and frameworks that make building and using ML and specifically DL models relatively easy. Creating these libraries and frameworks is beyond the scope of individuals and even most companies. Until these tools and frameworks were released, the barrier to building ML applications was very high, and one effectively needed a Ph.D. to start building projects independently. Sharing these tools has lowered that barrier and enabled many more people to work in this space, which has led to a quantum leap in the field.
The second way BigTech has enabled advances in OSDS is by releasing open source ML/DL model projects that can help you fine-tune your models and benchmark your results. Great examples of this are Fairseq and Tensor2Tensor.

Fairseq is a modeling toolkit released by Facebook that helps researchers study and develop models to advance the field of NLP.
Tensor2Tensor is a library by Google consisting of deep learning models and datasets for many different use cases.

These developments have made ML and DS more accessible to the community and helped users of different skill levels get more out of their data science work. These large dedicated research teams will continue to influence state-of-the-art models.

2. Growth of Specialized Tools

Another exciting development in OSDS is the growing number of platforms that offer powerful infrastructure to further collaborations. A few examples are:

DVC - DVC (Data Version Control) is a tool that extends Git for versioning large files, like data, models, and intermediate artifacts. It makes sharing data and results in a reproducible way much easier, which in turn promotes creating truly Open Source Data Science.
DAGsHub - DAGsHub is a community-first platform like GitHub for machine learning. It is built on top of Open-Source tools like Git, DVC (mentioned above), and MLflow. DAGsHub was created to support Open Source Data Science, by enabling things like free data and model hosting, and data science pull requests.
Google Colab – Colab is a hosted notebook service offered by Google, with free access to GPUs. This means that citizen data scientists have access to powerful hardware to run experiments and share them with the community.
Streamlit – Streamlit is an Open Source tool to build data apps, which means it can be used to make data, models, and other data science components accessible to team members and even non-technical collaborators.
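
To make the idea behind tools like DVC more concrete, here is a minimal sketch of content-addressed storage: large data is stored once under the hash of its content, and only a small pointer record (the kind of thing that fits comfortably in Git) refers to it. This illustrates the general technique under our own assumptions and is not DVC's actual implementation. The `ContentStore` class, the `make_pointer` helper, and the pointer fields are hypothetical names chosen for this example.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a hash-addressed cache of large files (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        """Store `data` under its SHA-256 digest and return the digest."""
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical content is stored once
        return digest

    def get(self, digest: str) -> bytes:
        """Return the bytes previously stored under `digest`."""
        return self._blobs[digest]

    def __len__(self) -> int:
        return len(self._blobs)


def make_pointer(path: str, data: bytes, store: ContentStore) -> dict:
    """Store the data and return a small record that Git could version."""
    return {"path": path, "sha256": store.put(data), "size": len(data)}


if __name__ == "__main__":
    store = ContentStore()
    v1 = make_pointer("data/train.csv", b"id,label\n1,cat\n", store)
    v2 = make_pointer("data/train.csv", b"id,label\n1,cat\n2,dog\n", store)
    # Each dataset version gets its own hash, so either one can be restored later.
    print(v1["sha256"] != v2["sha256"])
    print(store.get(v1["sha256"]))
```

Because a pointer changes whenever the data changes, committing the pointer alongside the code ties each result to the exact data that produced it, which is the property that makes shared experiments reproducible.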

3. Biases and Fairness

An emerging issue in the field of OSDS is the growing recognition of, and interest in solving, bias and fairness problems in AI. These concerns impact a variety of fields and functions. Examples include the propensity of facial recognition systems to misidentify minorities at an alarmingly high rate, and algorithm-based risk assessment tools used in sentencing that are disturbingly racist. This is a key issue in any conversation on the future and development of AI.

As of now, different solutions have attempted to solve this problem, but haven’t met with much success. We’re sure that OSDS will be a major part of the solution in correcting this worrisome phenomenon. For example, users can access datasets and find data bugs, like missing data for underrepresented minorities. Then, collaborators can help "fix" the dataset by submitting appropriate data points, or even start crowdsourcing campaigns to collect more of them and reduce the existing biases in the dataset. This is a powerful example of how the nature and scale of Open-Source can help solve this and other challenging issues.
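
As a rough illustration of what finding such a "data bug" might look like, the sketch below counts how often each group appears in a dataset and flags groups that fall below a chosen share. The function name, the record layout, and the `min_share` threshold are all assumptions made for this example; the threshold in particular is arbitrary and is not a fairness standard.

```python
from collections import Counter


def representation_report(records, group_key, min_share=0.1):
    """Return {group: (count, share, flagged)} for one categorical field.

    Records that lack `group_key` are counted under "<missing>", since
    missing annotations are themselves a kind of data bug.
    `min_share` is an illustrative cutoff, not an established standard.
    """
    counts = Counter(record.get(group_key, "<missing>") for record in records)
    total = sum(counts.values())
    return {
        group: (count, count / total, count / total < min_share)
        for group, count in counts.items()
    }


if __name__ == "__main__":
    # Toy records; a real audit would load an actual shared dataset.
    records = [{"group": "a"}] * 8 + [{"group": "b"}] + [{}]
    for group, (count, share, flagged) in representation_report(
        records, "group", min_share=0.2
    ).items():
        print(group, count, round(share, 2), "UNDERREPRESENTED" if flagged else "ok")
```

A report like this only surfaces gaps in the raw counts. Deciding which groups matter and how to fill the gaps still takes the kind of community review described above.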

4. The Rise of Transformers

Transformers are all the rage when it comes to deep learning architectures. New deep learning models are being introduced at an increasing rate, but transformers are particularly noteworthy for their effectiveness on NLP tasks, and they can be applied to real-world tasks with minimal changes and effort. Exciting as they are, transformers are just the latest stage in a sequence of architectures that have made ML, and specifically DL, more accessible and applicable to a larger community. This is important for OSDS because more people from diverse backgrounds can take part in projects and collaborate to achieve better results. The better and easier to use our models become, the bigger and more powerful the community can become as well. You can learn more about transformers here and here.

5. The next breakthrough domain - Graph Deep Learning?

As anyone who has worked in software long enough knows (especially from job interview questions!), a huge percentage of the interesting problems in life can be formalized as graphs and solved using graph algorithms, from social interactions to economics, communication, physical systems, and molecules.
Given its importance and widespread usability, it makes sense that this field would be a big target for gains from new Deep Learning techniques.
A recent big win has been AlphaFold, which models the protein folding problem as a graph, then uses a neural network (plus many other tricks) to solve it in a seriously impressive way, achieving state-of-the-art results that scientists using traditional research methods haven't come close to. This should have a huge impact on medicine and basic biology research.
Of course, since graphs are a great way to model many real-world and business problems, I expect these new architectures might become "the next transformers": pretrained models and architectures fine-tuned to unforeseen domains, creating a big impact for many practitioners.
DGL is an Open Source project aimed at making work with Graph Neural Networks easier, to get you started on this possible wave of the future.
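
For readers new to the area, the core operation in many graph neural networks is message passing: each node updates its features by aggregating the features of its neighbors. The sketch below shows one round of mean aggregation in plain Python. It is a simplified illustration and does not use DGL's API. The function name and data layout are assumptions, and a real layer would also apply learned weights and a non-linearity.

```python
def mean_aggregate(features, edges):
    """One round of mean-neighbor message passing on an undirected graph.

    features: {node: [float, ...]} with equal-length vectors.
    edges: iterable of (u, v) pairs between nodes present in `features`.
    Each node's new vector is the mean of its own vector and its neighbors'.
    """
    # Start every neighborhood with the node itself so its own signal is kept.
    neighbors = {node: [node] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, group in neighbors.items():
        dim = len(features[node])
        updated[node] = [
            sum(features[n][i] for n in group) / len(group) for i in range(dim)
        ]
    return updated


if __name__ == "__main__":
    # A three-node chain: a - b - c
    features = {"a": [1.0], "b": [3.0], "c": [5.0]}
    print(mean_aggregate(features, [("a", "b"), ("b", "c")]))
```

Stacking several such rounds lets information travel across longer paths in the graph, which is what allows these models to reason about structure rather than isolated data points.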

Significant Open Source Data Science Projects

After discussing some of the important trends and currents, let’s continue our journey by discussing some interesting and thought-provoking examples of how OSDS can advance DS and ML.

EleutherAI is a collective of researchers and developers focused on “AI alignment, scaling, and open-source AI research.” GPT-3 is one of the latest language models from OpenAI and one of the most advanced in the world. GPT-3 has performed above and beyond expectations on many NLP tasks such as generating code, writing like an attorney, and answering math problems. The issue is that GPT-3 isn’t open source. GPT-Neo is Eleuther’s flagship project and is the open-source community’s response to GPT-3. The released GPT-Neo models perform comparably to GPT-3 models of similar size and are readily available to the community via open source.
HuggingFace, an AI community dedicated to advancing and democratizing AI through open-source and open-science, recently had a community week with the goal of providing state-of-the-art XLSR-Wav2Vec2 speech recognition models in as many languages as possible. More than 370 users participated and tried to build the best model for their language. You can see details at HuggingFace.
One cool example of the power of OSDS came from the DAWNBench challenge, where a team from the Fast.AI community was able to beat some of the largest corporations in the world. The challenge was to train a model to identify items in an image dataset. Fast.AI’s models were faster and cheaper than those of their competitors, who included Facebook, Google, and Intel. You can read more about this crazy feat here and here.
An exciting example of an open-source tool is Facebook Prophet. Prophet is a procedure for forecasting time series data based on an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. Prophet was designed to lower the barrier to entry for DS users in this space. It is easy to use and can run fully automatically, while still offering data scientists numerous ways to adjust forecasts based on parameters and domain knowledge. Check it out.
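
To give a feel for what "additive model" means in the Prophet description above, here is a toy sketch in which a forecast is simply the sum of a trend term, a seasonal term, and a holiday term. It does not use Prophet and is far simpler than what Prophet fits. The function name, the linear trend, the single weekly sine wave, and all default parameter values are assumptions made for illustration.

```python
import math


def additive_forecast(day, slope=0.5, intercept=10.0,
                      weekly_amplitude=2.0, holidays=None):
    """Toy additive forecast: trend + weekly seasonality + holiday effect.

    day: integer day index.
    holidays: optional {day: extra_effect} mapping (hypothetical format).
    """
    trend = intercept + slope * day                              # long-term direction
    weekly = weekly_amplitude * math.sin(2 * math.pi * day / 7)  # repeats every 7 days
    holiday = (holidays or {}).get(day, 0.0)                     # one-off bump on listed days
    return trend + weekly + holiday


if __name__ == "__main__":
    holidays = {3: 5.0}  # pretend day 3 is a holiday that adds 5 units
    for day in range(7):
        print(day, round(additive_forecast(day, holidays=holidays), 2))
```

Because the components are added together, each one can be inspected or adjusted on its own, which is what makes this family of models approachable for users who want to bring in domain knowledge.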

Getting Started with Open Source for Data Scientists and ML

OSDS can be intimidating with so many platforms, resources, and frameworks.

We planned to finish off this post by compiling a list of some of the key resources in the field. But, we realized pretty quickly that such lists have already been compiled in a far more comprehensive and thorough manner than we would have done here.

So, in the interest of giving you the best resources possible, here are a few awesome lists of resources, communities, frameworks, platforms, and much, much more.

The first awesome list is curated by @visenger and can be accessed here.
The second is curated by @Kelvins and can be accessed here.
There is also this mega catalog of Awesome Open Source Machine Learning.

The Bottom Line

OSDS is growing and becoming more important. We hope this post will deepen your understanding of the current state of OSDS and help you get involved in this exciting space.