Resources to learn ML have exploded online, and even with a technical background, finding the right material and practical exercises can be a challenge. In this post I share my personal experience as a developer learning machine learning: what worked, the most useful resources, and tips for structuring your own learning path.
I'm Martin, a software developer with 10+ years of experience, mostly in PHP and JS, and zero experience in machine learning or anything close to it. A while ago I decided to dive into the world of machine learning.
The main reason I wanted to learn machine learning is simply that the field is growing, and growing fast! There are a lot of opportunities in ML and, of course, huge demand from companies for ML engineers. You can also find machine learning applied in almost every field, from advertising and self-driving cars to smart devices and text-recognition apps. Wouldn't it be nice to really understand how it all works?
For those of you on a similar journey, I decided to document the path I took to start learning and develop practical skills in machine learning. Maybe others will find it useful.
Where do I start? What do I need to do? I began my path by reading a lot of "How to" articles, but all of them started the same way: with a list of concepts and skills I should master...
TensorFlow! MLflow! NLP! Python! R! KubeFlow!
You would think I was reciting Harry Potter spells.
I was left feeling overwhelmed, confused, and as if I had an endless list of things to learn before even taking my first step. Feeling a bit hopeless, I spoke with some coworkers, and they recommended that I start with a Kaggle challenge and focus on the things that I wasn't understanding.
Since I have coding experience, this approach made sense to me: I could "isolate variables" by starting with a simple example to learn the basics of an ML problem and how to approach it, without worrying too much about the coding itself. Of course, I had to learn the nuances of a new language, but that wouldn't stop me.
With this approach in mind, I tackled my first challenge: the Kaggle competition Titanic: Machine Learning from Disaster.
Solving the Titanic challenge
It was time to solve the challenge presented in the Kaggle competition: from a dataset with all the passengers of the Titanic, you need to predict who survives and who does not. (Spoiler alert! Jack dies.)
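To make the task concrete, here is a minimal sketch of what "predicting who survives" looks like in code. The tiny table below is a toy stand-in for the competition's train.csv (the column names match the real dataset), and the gender-based rule is a common first baseline for this competition, not an actual solution.

```python
import pandas as pd

# A toy slice of the Titanic training data. The real file is train.csv
# from the Kaggle competition; these column names match the real dataset.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 0],
})

# A common first baseline: predict that women survived and men did not.
train["Predicted"] = (train["Sex"] == "female").astype(int)

# On this toy slice the rule happens to be right every time.
accuracy = (train["Predicted"] == train["Survived"]).mean()
print(accuracy)
```

On the real dataset this baseline scores around 76%, which is why the competition pushes you toward actual models.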
From the moment I opened the competition, I didn't know how to proceed. For instance, I didn't know what a notebook was or even how to create a project.
The first question that arose was: should I write the solution in a notebook or a Python script? What's the difference? Why do they both exist?
The answer I received was "Think of the notebook as a playground/sandbox where the final result should be a clear direction on where to go"
Another answer was "In a notebook you can run the same code block several times with different input parameters, which makes it easier to compare the outputs"
Last but not least, to submit your solution to Kaggle, you have to submit a notebook, so the final decision was to go with one.
The next question was which platform to use for my notebook. I was torn between Anaconda, Google Colab, and the notebook editor Kaggle offers.
But first, what is a notebook?
A notebook is an open-source, browser-based application that you can use to create and share documents that contain live code, equations, visualizations, and text.
Think of a notebook as a notepad where you can run some code and see how it performs before committing it to your scripts.
The main benefits of notebooks, as I see them, are:
- The notebook records the code you write as you manipulate your data, which makes the process easy to document and, importantly, to repeat if necessary.
- Graphs and other figures are rendered directly in the notebook
- Refresh and update the notebook (or parts thereof) with new data by re-running cells. (You could also copy the cell and re-run the copy only if you want to retain a record of the previous attempt.)
Below is an outline of some of the most popular notebook platforms.
Anaconda is a suite of products that has to be installed locally. It includes Python and an editor for notebooks, among other tools.
Google Colab, or Google Colaboratory, is a free environment where you can create and run your notebooks.
It is cloud-based and doesn't require much configuration. If a library you want to use is not on Colab, just use `pip install`, as usual, to install it in the virtual environment.
Colab is THE Google Docs of code. A notebook can be shared and edited in real time by different team members, who can add comments, see the edit history, and go back to previous versions, just like in a Google Doc.
It has two big disadvantages:
1. It only runs online (no offline work).
2. Data files have to be on Google Drive; otherwise, the data will be lost when the virtual machine shuts down.
Kaggle offers two types of editors: scripts and notebooks.
You can write in either R or Python.
Kaggle offers a similar interface to what Google Colab provides, with the ability to import datasets from their registry or copy another user's notebook.
For the sake of the challenge, I decided to use a notebook inside Kaggle. Still, in my opinion, the best tool overall is Google Colab, considering everything it offers: it is online, shareable, requires no installation, and provides some computational power.
I created a new notebook and loaded the data provided for the challenge, but again I wasn't sure how to proceed. Hence, I decided to look for a tutorial notebook that would walk me through a solution so I could understand the process.
Failed again! The notebook started by selecting some features (columns in the CSV), applied some manipulation to them, sent everything to a model to predict the survival of the passengers, and then stored the results in another file for submission to Kaggle.
Here are some of the things that felt off:
- Feature selection and manipulation seemed utterly arbitrary.
- What is a model?
- Why choose one model over another?
- How do I use a model?
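For reference, the overall shape of such a tutorial notebook can be sketched like this. Everything here is illustrative: toy tables stand in for Kaggle's real train.csv and test.csv, and the feature choice and model (a random forest classifier) are just one plausible combination, not the tutorial's exact code.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for Kaggle's train.csv and test.csv (same column names).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 3, 3, 2, 3],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
})

# 1. Select some features and 2. manipulate them (encode Sex as 0/1 columns).
features = ["Pclass", "Sex"]
X_train = pd.get_dummies(train[features])
X_test = pd.get_dummies(test[features])

# 3. Send everything to a model to predict survival for the test passengers.
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, train["Survived"])
predictions = model.predict(X_test)

# 4. Store the results in the file format Kaggle expects for submission.
submission = pd.DataFrame({"PassengerId": test["PassengerId"],
                           "Survived": predictions})
submission.to_csv("submission.csv", index=False)
```

The numbered comments are exactly the steps that confused me at the time: why those features, what the model is doing, and why this model in particular.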
At this moment, everything seemed to be falling apart, and I decided to pivot once more into learning some of the basics of machine learning.
Learning is power!
I realized that it was time to understand the fundamentals. Otherwise, I would end up beaten down without understanding a thing and abandoning my quest.
Here are some of the tutorials and MOOCs I found helpful and why:
- Andrew Ng's Coursera course. He talks about AI, explaining what machine learning is, the different types of problems in the field (supervised vs. unsupervised), what they mean, and how to approach each one. Another reason I decided to start with this course is that he is a star in the ML world. It's a non-technical course, but it covers the basics to get you started.
- ZuzooVn's GitHub repo, a sort of awesome list. In it, he describes going from software engineer to ML engineer and all the steps he plans to cover, following a "practice-learn-practice" approach.
I took several articles from this list of resources, especially the one called "A visual introduction to machine learning". It shows visually how an ML engineer, using a dataset about homes, creates a machine learning model to distinguish homes in New York from homes in San Francisco.
It also has links to several introductions to machine learning, podcasts (organized by seniority), interview questions, and different communities.
- Learn Machine Learning subreddit, where I found several people in the same situation as me: trying to understand the ML world and all the buzzwords. I wouldn't use it as my go-to place, but it's always refreshing to read what is going on out there.
- Kaggle's Hands-On Data Science Education, the platform's learning center.
It has several courses to introduce anyone to ML, and I found it very useful because it was exactly what I needed at that point. The beginner course starts by presenting a problem - predicting house prices - and then, through several sections, gradually introduces all the ML terminology.
It starts by explaining what a model is and how it works; the first module focuses on a type of model called a "decision tree".
In the second module, we dive into the basics of data exploration using Pandas. Pandas is one of the primary tools data scientists use for exploring and manipulating data.
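To give a flavor of that module, here is a minimal Pandas session on a toy housing table. The table and its column names are my own invention, loosely modeled on the course's housing data; the calls shown are the bread-and-butter of data exploration.

```python
import pandas as pd

# A toy housing table standing in for the course's real housing dataset.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Price": [300000, 450000, 620000, 480000],
})

# The basics the course covers: peek at the data and summarize it.
print(homes.head())      # first rows of the table
print(homes.describe())  # count, mean, min, max, quartiles per column
print(homes.columns)     # the available column names
```

Running `describe()` on a real dataset is usually the first step toward choosing which columns to use as features.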
Then it was time to build my first model! First, the tutorial guided me through some data exploration to define what we want to predict and which data I would use to do it. After that, it explained the steps to build and use a model. Everything was super clear, and you end up with a model that can predict some house prices. It's not the most accurate one, but it's a start, and by this point I already knew what a model is, how it works, and some basic data exploration: concepts that seemed like black boxes when I started.
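The model-building steps roughly look like this in code. This is a sketch under assumptions: the toy table and column names are mine, but the define/fit/predict pattern with scikit-learn's `DecisionTreeRegressor` mirrors what the tutorial walks through.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy housing data (my own made-up numbers, not the course dataset).
homes = pd.DataFrame({
    "Rooms":    [2, 3, 4, 3, 5],
    "Bathroom": [1, 1, 2, 2, 3],
    "Price":    [300000, 450000, 620000, 480000, 750000],
})

# Choose the prediction target (conventionally y) and the features (X).
y = homes["Price"]
X = homes[["Rooms", "Bathroom"]]

# Define the model, fit it to the data, then use it to predict.
model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)
print(model.predict(X.head()))
```

Predicting on the data you trained with is only a sanity check; the next section explains why that is not a fair measure of accuracy.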
After this module, there is a discussion of model validation and a technique to check whether the model is accurate. It also covers "overfitting and underfitting" and how they can affect your model.
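The validation idea can be sketched as follows: hold out part of the data and compare the model's error on data it trained on against data it has never seen. The synthetic housing data is my own; an unconstrained decision tree illustrates overfitting nicely because it can memorize its training set almost perfectly.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic housing data: price scales with a continuous room measure,
# plus noise. A real project would load this from a CSV instead.
rng = np.random.default_rng(0)
rooms = rng.uniform(1, 6, size=200)
price = 100000 * rooms + rng.normal(0, 20000, size=200)
X = pd.DataFrame({"Rooms": rooms})

# Hold out validation data the model never sees during training.
train_X, val_X, train_y, val_y = train_test_split(X, price, random_state=0)

# An unconstrained tree memorizes its training data (overfitting)...
model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)
train_mae = mean_absolute_error(train_y, model.predict(train_X))
# ...so the honest measure is its error on the held-out validation set.
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(round(train_mae), round(val_mae))
```

The training error is near zero while the validation error reflects the noise the tree memorized, which is exactly the gap the course's overfitting lesson is about.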
Finally, the tutorial jumps into "random forests", another helpful type of model, and finishes with a small introduction on how to submit to a Kaggle competition. It also introduces us to a competition designed around this tutorial where you can apply all of the knowledge you just acquired.
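A random forest can be swapped in with almost no code changes. Again this is a sketch on synthetic data of my own making; the point is that the forest averages many decision trees, each trained on a random resample of the data, which usually dampens the single tree's overfitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same kind of synthetic housing data as before.
rng = np.random.default_rng(1)
rooms = rng.uniform(1, 6, size=200)
price = 100000 * rooms + rng.normal(0, 20000, size=200)
X = pd.DataFrame({"Rooms": rooms})

train_X, val_X, train_y, val_y = train_test_split(X, price, random_state=1)

# A forest of 100 trees, each fit on a bootstrap sample; predictions
# are the average of the individual trees' predictions.
forest = RandomForestRegressor(n_estimators=100, random_state=1)
forest.fit(train_X, train_y)
forest_mae = mean_absolute_error(val_y, forest.predict(val_X))
print(round(forest_mae))
```

In practice the forest's validation error often comes out lower than a single deep tree's, with no manual tuning, which is why it is a popular next step after decision trees.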
My path to becoming an ML engineer is far from over; it has only begun. I have only scratched the surface, but I made quite an improvement from the first time I tried to solve a challenge. Now I have a basic comprehension of what ML is, what it can and cannot solve, and I also know what a model is! My recommendations from following this path so far: continue learning, create some projects, participate in Kaggle competitions, and follow your curiosity!