title | description |
---|---|
Version Control and ML Experimentation Tutorial with DagsHub – Data Exploration | Delve into version control and track machine learning experiments with DagsHub. This tutorial guides you through data exploration, experiment tracking with DagsHub and DVC, and creating a model to classify Stack Exchange questions on machine learning, offering practical workflow enhancements at each step. |
This level of the tutorial covers downloading the data and performing some basic analysis of it to see what we have.
The full analysis can be found in this Colab notebook, but we'll go over the main points and conclusions together.
If you want to just skip ahead to the code, you can go straight to the next level.
Our data is a CSV file describing questions on the Cross Validated Stack Exchange, a Q&A site for statistics.
It was generated from the Stack Exchange API with this query. To make things easier for you, we already ran the query and saved its result in our public storage, so you can download it straight from here.
The data itself looks like this:
The columns are pretty self-explanatory. We have:
- Two textual features (Title & Body). We can already tell that this text is full of HTML tags, which we will probably need to clean to get good results.
- One string column that is the list of Tags for this question.
- Some numeric features: Score, ViewCount, AnswerCount, CommentCount, FavoriteCount.
- One CreationDate feature that needs to be processed correctly.
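
As a rough illustration of the two cleanup steps hinted at above (parsing CreationDate and removing HTML tags from Body), here is a minimal sketch on a made-up one-row sample. The date format, the sample values, and the BodyText column name are assumptions for illustration only, and the regex is a crude tag stripper rather than a real HTML parser.

```python
import pandas as pd

# Hypothetical one-row sample; the real CSV's exact formats may differ.
df = pd.DataFrame({
    "Title": ["How to pick k in k-means?"],
    "Body": ["<p>I have <em>unlabeled</em> data.</p>"],
    "CreationDate": ["2019-03-01 12:34:56"],
})

# Parse the creation date strings into proper datetime values.
df["CreationDate"] = pd.to_datetime(df["CreationDate"])

# Replace anything that looks like an HTML tag with a space,
# then collapse the leftover whitespace into single spaces.
df["BodyText"] = (
    df["Body"]
    .str.replace(r"<[^>]+>", " ", regex=True)
    .str.split()
    .str.join(" ")
)

print(df[["CreationDate", "BodyText"]])
```

For heavily nested or malformed markup, a dedicated HTML parser would likely be more robust than a regular expression.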
Each question on Cross Validated can be labeled with a set of topic tags, to make it easier for experts to find & answer.
For this tutorial, our goal will be to predict whether a given Cross Validated question should be tagged as a machine-learning related question.
This is a supervised binary classification task, and the ground truth can be found in the Tags column:
df['MachineLearning'] = df['Tags'].str.contains('machine-learning')
One important thing to note is that only about 11.1% of the data is labeled positive. This means that we're dealing with an imbalanced classification problem, and we will need to take this into account when choosing our performance metrics, and possibly use special sampling strategies or model configurations.
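
To make the labeling and the imbalance check concrete, here is a small sketch on a hypothetical four-row sample. The tag string format and the sample values are assumptions; only the `str.contains` labeling step comes from the tutorial itself.

```python
import pandas as pd

# Hypothetical sample standing in for the real CSV; the Tags format is assumed.
df = pd.DataFrame({
    "Tags": [
        "<machine-learning><neural-networks>",
        "<regression>",
        "<hypothesis-testing><p-value>",
        "<time-series>",
    ]
})

# Literal substring match; na=False treats missing tags as "not machine-learning".
df["MachineLearning"] = df["Tags"].str.contains(
    "machine-learning", regex=False, na=False
)

# The mean of a boolean column is the fraction of positive labels.
positive_rate = df["MachineLearning"].mean()
print(f"Positive fraction: {positive_rate:.1%}")
```

With a positive rate this low, plain accuracy is a poor guide, which is why precision-recall based metrics are used in what follows.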
- MachineLearning is not too strongly related to any other single feature.
- The FavoriteCount column can likely be dropped, since it's highly correlated with Score and contains mostly NaN.
- Score, ViewCount, AnswerCount are highly skewed, so we'll take that into account in data preparation.

After massaging the numerical features so that they're scaled and less skewed, here are their distributions:
These scaled numerical features were good enough to train a simple logistic regression classifier that performs only slightly better than random. This can be seen in the model's precision-recall curve:
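
The following is a minimal sketch of this kind of baseline, not the notebook's actual code. It uses synthetic, non-negative data with random labels, so the score it prints says nothing about the real dataset. The choice of `log1p` for reducing skew and `class_weight="balanced"` for the imbalance are assumptions; note also that real Score values can be negative, which `log1p` alone would not handle.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic stand-in for the numeric columns (FavoriteCount left out).
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "Score": rng.poisson(2, n),
    "ViewCount": rng.lognormal(5, 1.5, n).round(),
    "AnswerCount": rng.poisson(1, n),
    "CommentCount": rng.poisson(2, n),
})
y = rng.random(n) < 0.11  # random labels, roughly 11% positive

# Stratify so both splits keep a similar positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(
    FunctionTransformer(np.log1p),  # compress the long right tails
    StandardScaler(),               # zero mean, unit variance
    LogisticRegression(class_weight="balanced"),
)
model.fit(X_train, y_train)

# Scores for the positive class, then the precision-recall curve.
scores = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)
ap = average_precision_score(y_test, scores)
print(f"Average precision: {ap:.3f} (positive rate: {y_test.mean():.3f})")
```

A useful reference point when reading such a curve: a classifier that guesses at random has an average precision close to the positive rate.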
It makes sense that most of the information on a question's topic will be contained in its text content.
To turn the two textual features of the data into something we can train an ML model on, we first concatenate them:
df['Text'] = df['Title'] + ' ' + df['Body']
And then train a TfidfVectorizer using this text column. For now, we don't do any fancy text processing; we just use the default logic contained in TfidfVectorizer.
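
A minimal sketch of this step, assuming a logistic regression classifier on top of the TF-IDF features (the tutorial does not say which classifier was used here). The four sample questions are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical sample questions; real bodies contain much more HTML.
df = pd.DataFrame({
    "Title": [
        "Choosing a loss for neural network training",
        "Interpreting a p-value in a t-test",
        "Random forest feature importance",
        "Confidence interval for a mean",
    ],
    "Body": [
        "<p>Which loss works for classification with gradient descent?</p>",
        "<p>What does the p-value tell me about the null hypothesis?</p>",
        "<p>How are importances computed for each tree in the forest?</p>",
        "<p>How do I compute the interval from a small sample?</p>",
    ],
    "MachineLearning": [True, False, True, False],
})

# Concatenate the two text columns, as in the tutorial.
df["Text"] = df["Title"] + " " + df["Body"]

# Default TfidfVectorizer settings: lowercasing, tokens of 2+ word characters.
vectorizer = TfidfVectorizer()
model = make_pipeline(vectorizer, LogisticRegression())
model.fit(df["Text"], df["MachineLearning"])

print(len(vectorizer.vocabulary_), "terms in the vocabulary")
print(model.predict(df["Text"]))
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary fitted only on whatever data is passed to `fit`, which helps avoid leaking test text into training.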
This is already enough to get a very decently performing model:
Looking at the terms learned by the trained TfidfVectorizer, we can note some possible directions for improvement:
- The vocabulary contains many number fragments, such as 00, 00000000e etc. It could be useful to prevent this splitting of numbers into many different terms in the vocabulary, since it probably won't matter for classifying the text.
- Some terms are joined by underscores, such as variable_2. This is probably an artifact of embedded Python or TeX code. It might help the model if we break these down into separate terms.
- Tuning the hyperparameters of TfidfVectorizer: vocabulary size, ngram range, etc.

We got a good sense of our data, the type of preprocessing required, and managed to train some decent classifiers with it.
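
Returning briefly to the vectorizer ideas listed above, one possible way to try the first two is a custom `token_pattern`. The sketch below is an assumption about how this could be done, not what the tutorial ends up doing: the pattern keeps only runs of two or more letters, which drops digit fragments and splits on underscores. The sample sentence is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["variable_2 equals 00 or 1.00000000e-05 in learning_rate"]

# Default tokenization: runs of 2+ word characters (letters, digits, underscore).
default_vec = TfidfVectorizer()
default_vec.fit(docs)

# Hypothetical alternative: runs of 2+ letters only, so digits and
# underscores act as separators instead of being part of a term.
letters_only_vec = TfidfVectorizer(token_pattern=r"[^\W\d_]{2,}")
letters_only_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(letters_only_vec.vocabulary_))
```

The third idea maps to constructor parameters such as `max_features` and `ngram_range`, which can be searched over like any other hyperparameter.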
At this point in a Python data science project, it's common to take the conclusions and working code from the exploratory notebook, and turn them into normal Python modules. This enables us to more easily:
In the next level of this tutorial, we'll take what works from this notebook and turn it into a Python project, before going forward with data versioning and experimentation to find the best performing model for our problem.