Level 1 - Setup

Level overview

This level of the tutorial covers setting up our project.

This includes the following tasks:

  • Creating an account and repo in DAGsHub.
  • Cloning it to your local machine.
  • Creating a virtual Python environment using venv and installing the required packages.

If you are familiar with these steps, you can skip to the next level, where we train some models and set up data and model versioning.

Joining DAGsHub...

...is really easy. Just sign up. Then, after logging in, create a new repo by clicking the plus sign in the navbar and choosing to create a repository.

[Screenshot: repo creation button]

This opens up a dialog, which should be somewhat familiar, in which you can set the repository name, description, and a few other options.

[Screenshot: repo creation dialog]

For this tutorial, fill in the name and description, and leave everything else in the default settings.

Done with repo creation. On to project initialization.

Setting up our project

Create a directory named dagshub_tutorial for the project somewhere on your computer. Open a terminal and input the following:

cd path/to/folder/dagshub_tutorial
git init

Now, we will set the remote to our repo on DAGsHub. This can be done using the following command:

git remote add origin https://dagshub.com/<username>/<repo-name>.git

Finally, let's create two folders, data and outputs, for our main project components.

mkdir -p data outputs
This will add a clear structure to the project.

Creating a virtual python environment

We assume you have a working Python 3 installation on your local system for the following explanations.

Warning

To check, open a terminal and run python3 -V. The command should succeed and report version 3.7 or higher. If the version is lower, or if the command fails, download and install a suitable version for your operating system.
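The same check can be done from inside Python, which is handy if you are not sure which interpreter a given command launches. This is only a minimal sketch: the helper name meets_minimum and the MINIMUM constant are illustrative, not part of the tutorial.

```python
import sys

# Minimum version named in the warning above.
MINIMUM = (3, 7)


def meets_minimum(version_info, minimum=MINIMUM):
    """Return True if the (major, minor) pair is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report on the interpreter that is running this script.
status = "OK" if meets_minimum(sys.version_info) else "too old"
print("Python", sys.version.split()[0], status)
```

Save it to a file and run it with the interpreter you plan to use, for example python3 followed by the file name.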

To create and activate our virtual python environment using venv, type the following commands into your terminal (still in the project folder):

Linux/macOS:

python3 -m venv .venv
echo .venv/ >> .gitignore
echo __pycache__/ >> .gitignore

source .venv/bin/activate

Windows:

python3 -m venv .venv
echo .venv/ >> .gitignore
echo __pycache__/ >> .gitignore

.venv\Scripts\activate.bat

The first command creates your virtual environment - a directory named .venv, located inside your project directory, where all the Python packages used by your project will be installed without affecting the rest of your computer.

The second and third commands make sure the virtual environment packages and __pycache__ directories are not tracked by Git.

The fourth command activates our virtual Python environment, which ensures that any Python packages we use don't contaminate our global Python installation.

The rest of this tutorial should be executed in the same shell session.
If you exit the shell session or want to create another, make sure to activate the virtual environment in that shell session first.
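If you are unsure whether the environment is active in a given shell, one way to check is to compare sys.prefix with sys.base_prefix: inside a venv they differ. The sketch below assumes an environment created with venv as above; the function name in_virtualenv is illustrative.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the environment directory,
    while sys.base_prefix points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


if in_virtualenv():
    print("Virtual environment active:", sys.prefix)
else:
    print("No virtual environment active; activate .venv first.")
```

Run it with the bare python3 command in the shell you want to check, so that it uses whichever interpreter that shell resolves.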

Installing requirements

To install the requirements for the first part of this project, simply download this requirements.txt into your project folder.

These are the direct dependencies:

dagshub==0.1.4
dvc==1.10.1
joblib==0.17.0
pandas==1.1.4
scikit-learn==0.23.2

Below is a list of full dependencies, including transitive ones:

Alternatively, you can create a file called requirements.txt and copy the following into it:

appdirs==1.4.4
atpublic==2.1.1
certifi==2020.11.8
chardet==3.0.4
colorama==0.4.4
commonmark==0.9.1
configobj==5.0.6
dagshub==0.1.4
decorator==4.4.2
dictdiffer==0.8.1
distro==1.5.0
dpath==2.0.1
dulwich==0.20.13
dvc==1.10.1
flatten-dict==0.3.0
flufl.lock==3.2
ftfy==5.8
funcy==1.15
future==0.18.2
gitdb==4.0.5
GitPython==3.1.11
grandalf==0.6
idna==2.10
joblib==0.17.0
jsonpath-ng==1.5.2
mailchecker==3.3.17
nanotime==0.5.2
networkx==2.4
numpy==1.19.4
packaging==20.4
pandas==1.1.4
pathlib2==2.3.5
pathspec==0.8.1
phonenumbers==8.12.13
ply==3.11
pyasn1==0.4.8
pydot==1.4.1
Pygments==2.7.2
pygtrie==2.3.2
pyparsing==2.4.7
python-benedict==0.22.0
python-dateutil==2.8.1
python-slugify==4.0.1
pytz==2020.4
PyYAML==5.3.1
requests==2.25.0
rich==9.2.0
ruamel.yaml==0.16.12
ruamel.yaml.clib==0.2.2
scikit-learn==0.23.2
scipy==1.5.4
shortuuid==1.0.1
shtab==1.3.2
six==1.15.0
smmap==3.0.4
tabulate==0.8.7
text-unidecode==1.3
threadpoolctl==2.1.0
toml==0.10.2
tqdm==4.53.0
typing-extensions==3.7.4.3
urllib3==1.26.2
voluptuous==0.12.0
wcwidth==0.2.5
xmltodict==0.12.0
zc.lockfile==2.0

Now, to install them, type:

pip install -r requirements.txt
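After installation, you may want to confirm that the installed versions match the pins. The following is a rough sketch, not part of the tutorial: it assumes Python 3.8 or newer (for importlib.metadata) and only understands simple name==version lines. The names parse_pins and find_mismatches are illustrative.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed)} for packages that differ.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Example usage, run from the project folder with the venv active:
#   with open("requirements.txt") as f:
#       print(find_mismatches(parse_pins(f.read())))
```

An empty dict means every pinned package is installed at the requested version.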

Downloading the data

We'll keep our data in a folder named, oddly enough, data.

It's also important to remember to add this folder to .gitignore! We definitely don't want to accidentally commit large data files to Git.

The following commands should take care of everything:

mkdir -p data
echo /data/ >> .gitignore
wget https://dagshub-public.s3.us-east-2.amazonaws.com/tutorials/stackexchange/CrossValidated-Questions-Nov-2020.csv -O data/CrossValidated-Questions.csv
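Once the download finishes, a quick sanity check can catch a truncated or empty file. The sketch below uses only the standard library and makes two assumptions not stated in the tutorial: that the file is UTF-8 encoded and that its first row is a header. The function name summarize_csv is illustrative.

```python
import csv


def summarize_csv(path):
    """Return (header, row_count) for a CSV file.

    Assumes UTF-8 encoding and a header in the first row.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])           # empty list if the file is empty
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    return header, row_count


# Example usage, run from the project folder:
#   header, rows = summarize_csv("data/CrossValidated-Questions.csv")
#   print(len(header), "columns,", rows, "rows")
```

An empty header or a row count of zero suggests the download did not complete.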

Committing progress to Git

Let's check the Git status of our project:

$ git status -s
?? .gitignore
?? requirements.txt
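Only .gitignore and requirements.txt show up because .venv/, __pycache__/ and /data/ are ignored. If your output lists more than this, an ignore entry is probably missing. As a minimal sketch, the helper below (an illustrative name, not part of the tutorial) reports which of the entries added earlier are absent from the ignore file.

```python
# Entries this level of the tutorial appends to .gitignore.
EXPECTED_ENTRIES = (".venv/", "__pycache__/", "/data/")


def missing_ignore_entries(gitignore_text, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not appear as lines in the text."""
    present = {line.strip() for line in gitignore_text.splitlines()}
    return [entry for entry in expected if entry not in present]


# Example usage, run from the project folder:
#   with open(".gitignore") as f:
#       print(missing_ignore_entries(f.read()))
```

An empty list means all three entries are present.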

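Now let's commit this to Git and push to DAGsHub using the command line. If your Git is configured to give the initial branch a name other than master, use that name in the push command instead: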

git add .
git commit -m "Initialized project"
git push -u origin master

You can now see the setup files on your DAGsHub repo. So far so good.

Next Steps

In the next level, we'll prepare our data processing code and use DVC to keep track of our data and model versions.