Making a case for Open Source Data Science (OSDS)

Open Source Machine Learning Aug 22, 2020

Open Source Software has changed our lives. But where is Open Source Data Science? And why should we care?

Storm troopers collaborating to charge an iPhone — Storm troopers collaborating. Photo by Will Porada / Unsplash

Almost all of us use Open Source Software (OSS) as part of our lives. Collaborating to create communal software has changed the world of software development and the world in general. But we aren’t seeing the same behavior surrounding data science. Where is Open Source Data Science (OSDS)?

Before claiming that Open Source Data Science makes sense, let’s start by explaining why Open Source Software makes sense.

The Case for Open Source Software

Many consider the benefits of OSS to be obvious, and most developers and software companies are engaged in some relationship with the open source community. But it wasn’t always like this.

A (really) brief history of open source software

Storm trooper looking at star wars intro scene — A long time ago, in a galaxy far, far away. Photo by Daniel Cheung / Unsplash

In the beginning (1950s-1960s) almost all software was open source. It was created by researchers and academics, who shared it with the community in the spirit of science.

But even later, when companies like IBM sold their first computers, software was distributed with source code to enable “Hackers” to fix bugs and create improvements.

Closed software gradually became the standard by the late 1960s, when the price of software grew significantly in comparison to hardware. The shift, like many others, began with an antitrust suit against IBM, which concluded that bundled software was anticompetitive. A few years later, in 1974, the US concluded that software was copyrightable, and thus began the age (and business model) of closed source software.

Although open source was never extinguished (consider projects such as TeX), it experienced a sort of renaissance that started with the GNU project in 1983 but only gathered real momentum from the late 1990s onward, with projects like Linux and Git (both started by the same person, Linus Torvalds), and eventually GitHub.

Why it wasn’t always obvious

There’s a famous quote about Wikipedia that works just as well for Open Source:

“The problem with Wikipedia [read: Open Source] is that it only works in practice. In theory, it’s a total disaster” - Gareth Owen

In the early days of Open Source Software (OSS), there was a strong claim that software companies would never share their source code, as it was their IP — their most valued treasure, and handing it out to competitors just didn’t make any sense.

Closed source software has its advantages: it tends to have better support and a richer feature set, and is sometimes more user-friendly. Indeed, companies have IP that they’d like to protect, and it’s not always clear how Open Source fits in with developers’ or companies’ business models. In other words, Open Source sometimes makes it more difficult to make money.

On the other hand…

The benefits of OSS

A New Hope. Photo by Xenomurphy on Visualhunt / CC BY-NC-ND

Today many benefits of OSS are obvious:

Transparency and freedom — Using OSS means you aren’t locked into a vendor, and can easily understand the inner workings of the tools you use
Reliability and security — Identifying bugs and security risks is much easier when you have thousands of eyes inspecting your code
Faster time to market — It’s much faster to have hundreds of contributors helping with feature development. This is especially true for startups who don’t have a large headcount

Some benefits, however, are less intuitive. For example, one of the hardest things in the tech industry is finding great developers, many of whom care deeply about OSS. Incorporating OSS into a company’s product portfolio enables developers to showcase their abilities and contribute back to the community. This can be a huge draw for talent.

And of course, let’s not mince words. Releasing an internal project as Open Source is great PR (Public Relations, not Pull Requests), especially if you have a non-Open Source product built on top of it, and/or if your users are developers.

It just makes sense (in some cases)

When does it make the most sense to create Open Source projects? I think we can define the ‘Open Source Criterion’ for organizations as follows:

It’s a good idea to open source a project when it solves a communal problem and is not part of your core product. (This is a rule of thumb, though; there are exceptions.)

As an example, let’s look at React.

"Draw me like one of your French girls" — The Empire designs their first UI, before React. Photo by Daniel Cheung / Unsplash

React is a JavaScript library for building user interfaces. It was released by Facebook in 2013. It has an MIT license which basically means you can use it for whatever you want, commercial projects included.

Now, analyzing React from a product point-of-view, it solves a serious problem in web development — it’s hard to build user interfaces from scratch. It’s clear that this is a very widespread problem, and also that solving this problem is not the goal of Facebook as a company — they ~~sell ads~~ build social networks^[1].

This means that React meets the criterion stated above: it solves a communal problem and is not the company’s core IP. Therefore, releasing it as OSS gives Facebook all the benefits without much downside.

A Case for Open Source Data Science

I think by now you should be convinced that OSS is valuable. Now, let’s talk about how these concepts transfer to the domain of data science. To do that, let’s paint a more detailed picture of how OSDS can work, using an example that abides by the ‘Open Source Criterion’: face recognition.

You can probably easily think of a few applications that can use face recognition. For this example, let’s look at face recognition in the context of disease diagnosis, and fashion recommendation. A quick Google search for each will show that there are companies doing exactly these things.

In the first use case, a photo of a patient would be taken as part of a medical checkup (or at home), which would then be analyzed in order to recognize possible diseases, helping doctors prioritize urgent cases and treat more people.

In the second use case, an app would recommend the best clothes according to the customer’s purchase history, and also their facial features or body structure.

These companies are definitely NOT in competition with one another. Their IP lies in providing a system for disease diagnosis or fashion recommendation respectively. Yet, today, both are most likely developing a crucial part of their system — face recognition — in parallel. This means duplicate work, and a waste of a lot of data scientist time and money.

In an OSDS world, both companies would work on an open face recognition project, contributing code and data to help battle edge cases and create a more robust solution. They could dedicate fewer data scientists to the task than they do today, and have them focus on tougher, more critical problems.

New data scientists wanting to learn, showcase their skills, and improve their portfolios would jump in to identify bugs and inefficiencies, and to create alternative models that prioritize various metrics for different use cases. They could see what a serious data science project looks like.

On top of all that, companies looking for prospective data scientists could find and reach out to those that have already contributed to the projects they care about, thus shortening the hiring and on-boarding process for new team members.

Recently, AI transparency, diversity, and inclusion have become a central focus in tech. OSDS can have a significant impact in these areas. For example, say a user of one of these products, who also happens to be a data scientist, discovers that the face recognition performs poorly on her image. She reviews the dataset and realizes it contains no images of a certain ethnic minority, one she belongs to. She adds images from another dataset that fill the gap and submits a pull request. Voilà: the new model performs much better for her, and the companies enjoy a more diverse and improved model.

A true win-win(-win-win) situation.

Almost 200 face recognition algorithms—a majority in the industry—performed worse on Asian, African American, and Native American faces than on Caucasian faces. https://t.co/OUe2tb5eYw
— MIT Technology Review (@techreview) December 24, 2019

When I started writing this article, this was a theoretical example. Now, it is no longer the case.

To finish off this piece, I’d like to write about two examples of how Open Source Data Science should and shouldn’t look in practice.

Open Source Software disguised as Open Source Data Science

Now, I can already hear your objection: “But I cloned tensor2tensor and BERT from GitHub, it’s already Open Sourced Data Science!” or “A lot of papers I read on arXiv have their code posted online”.

The Trinity — code, pre-trained models, and an arXiv paper. BERT on GitHub

This is not a jab at BERT’s authors and this section is not meant to disrespect these projects. These projects are extremely useful, and it takes a lot of hard work to make them accessible to the public. Their authors do everything they can given the tools they have and industry standards.

The argument I would like to make is that they are OSS but not OSDS.

If the code in one of these projects has a bug, for example one that breaks the API of the published model, an independent programmer can usually contribute a fix in the form of a pull request. On the other hand, if an independent data scientist realizes there is a problem in the data, say the model is biased against certain ethnic groups, in most of these cases they cannot modify the dataset and create an improved model.

This results in contributions made to the code, but not to the data science parts of the project. In other words, this project effectively becomes a software project.
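To make the data-side problem more concrete, here is a minimal sketch of the kind of check an outside data scientist might run before proposing a change to the dataset: comparing accuracy across groups. The function name `accuracy_by_group`, the group labels, and the evaluation records are all hypothetical and made up for illustration; they are not taken from any of the projects mentioned above.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records: (group, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 1),
]

per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())

for group, acc in sorted(per_group.items()):
    print(f"{group}: accuracy={acc:.2f}")
print(f"largest accuracy gap between groups: {gap:.2f}")
```

A large gap like this is the kind of finding that, in an OSS-only project, can be reported but not fixed, because the fix lives in the data rather than in the code.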

OSDS can also have an important side effect for data science research.

Improving the quality of data science research

Improving the quality of research is an important goal to aspire to. A few notable efforts towards this goal are:

A lot has been said about the problematic nature of some State Of The Art (SOTA) results in data science research. Usually, when a paper is published, you see only the end result: the percentage improvement over the previous SOTA. Often, a small improvement might be attributable to a lucky choice of random seed, or to running many experiments until one of them improves on a metric, without enough consideration given to the statistical significance of the result.

In simpler terms, since the research process is not transparent, we don’t know if a result actually represents an advancement in research or a fluke of luck.
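As a rough illustration of the random seed problem, the sketch below compares two models that were each trained with several seeds, using a simple permutation test on the difference of mean scores. This is one possible check among many, not a prescribed method; the function name `permutation_test` and all the scores are hypothetical, made-up values.

```python
import random
import statistics


def permutation_test(baseline, candidate, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of mean scores.

    Returns the observed difference (candidate - baseline) and an
    approximate p-value.
    """
    rng = random.Random(seed)
    observed = statistics.mean(candidate) - statistics.mean(baseline)
    pooled = list(baseline) + list(candidate)
    n_base = len(baseline)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign scores to the two models and recompute the gap.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[n_base:]) - statistics.mean(pooled[:n_base])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one smoothing so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical accuracy scores from five random seeds per model.
baseline_scores = [0.912, 0.907, 0.915, 0.909, 0.911]
candidate_scores = [0.914, 0.908, 0.916, 0.910, 0.913]

diff, p = permutation_test(baseline_scores, candidate_scores)
print(f"mean improvement: {diff:.4f}, p-value: {p:.3f}")
```

With only the best run reported, the candidate would look like a new SOTA. With every seed published, a reader can check whether the gap is larger than the seed-to-seed noise.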

In an OSDS world, research teams could publish their whole research history along with the paper being submitted for review. Reviewers would be able to provide high-quality feedback and ensure a rigorous research method. Everyone would enjoy higher-quality data science research.

Final Thoughts

At DAGsHub, we spend a lot of time thinking about OSDS and talking to data scientists.

This article is a summary of some of the conversations we’ve had with data scientists in the community. The purpose of this article was to explain why Open Source is an important part of software development today, and to put forward the argument that it will be an important part of data science in the near future.

The next article will dive into technologies and technicalities: why creating an OSDS community requires different tools from the ones used by OSS communities, what the difficulties are, and how they can be overcome.

If you have had the chance to collaborate on an OSDS project, I’d love to hear in the comments (and get a link to what you’re building).

[1]: Today Facebook has a lot of other products, but building user interfaces is not one of them.

Dean Pleban
