The reference project has been developed in Python, but the same concepts should apply to ML projects built with other technologies.
The core of this blog revolves around the Jenkinsfile. Stick around till the end to learn the details of all the moving parts. :smile:
It is a good practice to define jobs to run inside a Docker container.
This gives us an easy, maintainable, reproducible and isolated job environment, and debugging environment-specific issues becomes easier because we can reproduce the job's execution environment anywhere.
To do so, Jenkins enables us to define the agent to be a Docker container, which can be brought up either from an existing image or from a customised image defined in a Dockerfile.
More on this can be found in the Using Docker with Pipeline section of the Jenkins Pipeline documentation.
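For instance, an agent can be brought up directly from a stock image with a one-liner like this (a generic sketch, not the reference project's actual agent definition):

agent {
    docker { image 'python:3.8' }
}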
In the next section, we will go through how we have defined our Jenkins agent.
Here we define the agent to be a container brought up from the Dockerfile described below.
Agent Definition:
agent {
dockerfile {
args "-v ${env.WORKSPACE}:/project -w /project -v /extras:/extras -e PYTHONPATH=/project"
}
}
Details:
- -v ${env.WORKSPACE}:/project mounts our repository workspace at /project inside the container.
- -w /project makes sure that all our pipeline stage commands are executed inside our repo directory.
- -v /extras:/extras mounts a volume to cache files between multiple job runs, which helps reduce build latency. For more info check the Sync DVC Remotes pipeline stage.
- -e PYTHONPATH=/project puts the repo root on the Python path, so project modules can be imported from anywhere.

Agent Dockerfile:
Here we define the job container's base image, and install the required software and library dependencies.
# Base image for our job
FROM python:3.8

RUN pip install --upgrade pip && \
    pip install -U setuptools==49.6.0

RUN apt-get update && \
    apt-get install -y unzip groff

# Install aws-cli to use S3 as remote storage
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install

# Install project dependencies
COPY requirements.txt ./
RUN pip install -r requirements.txt
Agent .dockerignore:
This helps us trim the Docker build context when building the container image.

# Ignore everything
*
# except the requirements.txt file ;)
!requirements.txt
In the next section we will define the stages in our pipeline.
Here are a few stages that we will be defining in our Jenkins pipeline:
We have defined all our test cases in the test folder and use pytest to run them for us.
stage('Run Unit Test') {
steps {
sh 'pytest -vvrxXs'
}
}
For linting checks, as is standard practice, we will use flake8 and black.
stage('Run Linting') {
steps {
sh '''
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 . --count --max-complexity=10 --max-line-length=127 --statistics
black . --check --diff
'''
}
}
Once you have set up credentials in Jenkins, we can use them in a stage as follows.
stage('Setup DVC Creds') {
steps {
withCredentials(
[
usernamePassword(
credentialsId: 'PASSWORD',
passwordVariable: 'PASSWORD',
usernameVariable: 'USER_NAME'),
]
) {
sh '''
dvc remote modify origin --local auth basic
dvc remote modify origin --local user $USER_NAME
dvc remote modify origin --local password $PASSWORD
dvc status -r origin
'''
}
}
}
With dvc status -r origin we test our connection with the remote. The DVC remote information is defined in the .dvc/config file.
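A side note: the --local flag used above makes DVC write these values to .dvc/config.local, which stays out of Git, so the credentials are never committed. Assuming a remote named origin, that file would end up looking roughly like this (the values are placeholders):

['remote "origin"']
    auth = basic
    user = <USER_NAME>
    password = <PASSWORD>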
Before running any further DVC stages, we need to fetch the already-versioned data and model files from DVC. This can be done with the dvc pull command.
Every time we fetch DVC-versioned files from S3 or similar remote storage, it increases our network load, build latency and service usage cost.
To optimise this, we can cache the files already fetched by previous builds; then subsequent builds only need to fetch the required diff.
We will use the mounted volume /extras for this and refer to it via the DVC remote jenkins_local. For more info, see the .dvc/config file.
While origin is our primary storage, we use jenkins_local as a secondary local storage! :exploding_head:
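To make this concrete, here is a sketch of how the two remotes might be declared in .dvc/config; the URLs below are placeholders, not the reference project's actual ones:

[core]
    remote = origin
['remote "origin"']
    # primary remote storage, e.g. an S3 bucket or an HTTP remote
    url = s3://some-bucket/dvc-storage
['remote "jenkins_local"']
    # secondary cache remote on the volume mounted into the job container
    url = /extras/dvc-cache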
stage('Sync DVC Remotes') {
steps {
sh '''
dvc status
dvc status -r jenkins_local
dvc status -r origin
dvc pull -r jenkins_local || echo 'Some files are missing in local cache!' # 1
dvc pull -r origin # 2
dvc push -r jenkins_local # 3
'''
}
}
Explanation:
1. First we try to pull the versioned files from the local cache remote jenkins_local; some of them may be missing there.
2. Then we pull the remaining diff from origin.
3. Finally we push the freshly fetched files back to jenkins_local, so that the cache stays warm for future builds.

Once you have defined your DVC pipeline, running your experiment is straightforward with the dvc repro command. Every run of your DVC pipeline can potentially create new versions of data, models and metrics.
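For reference, a DVC pipeline is declared in a dvc.yaml file. A hypothetical single-stage example (the stage name and paths are illustrative, not taken from the reference project) looks like this:

stages:
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/train.csv
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

dvc repro walks this dependency graph and re-runs only the stages whose dependencies have changed.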
Hence the question is: When should you run your experiments?
Should we run them:
- for every commit?
- only for changes in the master branch?
- on a manual trigger?
- on a "special" commit message syntax?
- on every Pull Request?

Let's analyze the pros and cons of each of these options:
Options | Pros | Cons
---|---|---
For all commits | We will never miss any experiment. | This will increase build latency. It may be overkill to run for all commits/changes.
Only for changes in the master branch | Only master branch experiments are saved. | "Bad" experiments can slip through the PR review process and get merged to master before we can catch them.
Setup a manual trigger | We can decide when we want to run/skip an experiment. | Automation is not complete. There is room for manual error.
"Special" commit message syntax | We can decide when we want to run/skip an experiment. | Automation is not complete. There is room for manual error.
On Pull Request | We can run and compare the experiment before we approve the PR. No "bad" experiments can slip through the PR review process. | None
stage('Update DVC Pipeline') {
when { changeRequest() } //# 1
steps {
sh '''
dvc repro --dry -mP
dvc repro -mP # 2
git branch -a
cat dvc.lock
dvc push -r jenkins_local # 3
dvc push -r origin # 3
rm -r /extras/RPPP/repo/$CHANGE_BRANCH || echo 'All clean'
mkdir -p /extras/RPPP/repo/$CHANGE_BRANCH
cp -Rf . /extras/RPPP/repo/$CHANGE_BRANCH
'''
sh 'dvc metrics diff --show-md --precision 2 $CHANGE_TARGET' //# 4
}
}
Explanation:
Here $CHANGE_BRANCH refers to the Pull Request source branch and $CHANGE_TARGET refers to the Pull Request target branch.
1. when { changeRequest() } makes sure this stage runs only when a Pull Request is opened/modified/updated.
2. dvc repro -mP runs the pipeline end-to-end and also prints the final metrics.
3. dvc push saves the results (data & models) to the remote storages.
4. dvc metrics diff compares the metrics in the PR source vs the PR target.

Once the DVC pipeline has run, it versions the experiment results and modifies the corresponding metadata in the dvc.lock file.
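To get a feel for what gets committed, here is an illustrative fragment of a dvc.lock file (the stage name, paths and hashes are made up):

stages:
  train:
    cmd: python train.py
    deps:
      - path: data/train.csv
        md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d
    outs:
      - path: models/model.pkl
        md5: 9f8e7d6c5b4a3f2e1d0c9b8a7f6e5d4c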
When we commit this dvc.lock file into Git, we can say the experiment has been saved successfully.
This is important because, for a given Git commit, by looking at the dvc.lock file DVC understands which versions of the files to load from the cache. We can check out that particular version with the dvc checkout command.
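For example, restoring the exact state of a past experiment would look roughly like this:

git checkout <commit>   # restores the code, params and dvc.lock of that experiment
dvc checkout            # loads the matching data/model versions from the DVC cache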
Now, in the Commit back Results stage, all we have to do is check whether the dvc.lock file got modified. If yes, we commit and push it back to our Git feature/experiment branch.
stage('Commit back results') {
when { changeRequest() }
steps {
dir("/extras/RPPP/repo/${env.CHANGE_BRANCH}") {
sh '''
git branch -a
git status
if ! git diff --exit-code; then # 1
git add .
git status
git commit -m "$GIT_COMMIT_REV: Update dvc.lock and metrics" # 2
git push origin HEAD:$CHANGE_BRANCH # 3
else
echo 'Nothing to Commit!' # 4
fi
'''
}
}
}
Explanation:
1. git diff --exit-code checks whether there are uncommitted changes.
2. git commit -m "$GIT_COMMIT_REV: ..." commits with a reference to the parent commit $GIT_COMMIT_REV. This also helps us understand for which user commit the experiment was run by our Jenkins pipeline.
3. git push origin HEAD:$CHANGE_BRANCH pushes to our experiment/feature branch, whose name is saved in the environment variable $CHANGE_BRANCH.

There are various reasons why you would want to do remote training of your models.
This automation is achievable with the two stages of the pipeline shown above: Update DVC Pipeline and Commit back results.
All you need to do is define your new experiment in a branch, by changing either the code (i.e. data processing, model algorithm, etc.), the data, the params, or some other dependency.
As long as the change triggers a DVC pipeline execution, Jenkins and DVC will run the experiment for you.
Once you make a Pull Request from your experiment branch to a target branch, Jenkins will run the above two stages.
Jenkins will commit the results back (metrics to Git and data/models to DVC). You can fetch them as follows:
git pull origin {feature/experiment branch} --rebase # 1: Fetches the Jenkins commit, i.e. the metadata (metrics and the dvc.lock file).
dvc pull -r origin # 2: Now fetch the data/models from the DVC remote storage.
Now you have the latest metrics, data and models, which Jenkins produced for you.
Ignoring an Experiment:
Sometimes you may want to ignore the current execution of an experiment.
In such a case, all you need to do is drop the commit made by Jenkins and force-push a different change.
git push origin {feature/experiment branch} --force
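A minimal sketch of that flow, assuming the Jenkins result commit is the branch tip:

git reset --hard HEAD~1 # drop the Jenkins result commit locally
# ...make and commit your new changes, then overwrite the remote branch:
git push origin {feature/experiment branch} --force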
Data Science projects are more dynamic in nature than Software Delivery projects. This is because in a DS project, along with the factors that influence Software Delivery projects, pivotal decisions and the set of next steps can be influenced by many underlying factors.
Hence, during Pull Request review, it is not sufficient to "say" that the build is green, all tests are passing, etc.
We should dig deeper and understand the changes in the data/model better. Most importantly, we should validate whether our fundamental assumptions/hypotheses about the data still hold true.
We have already seen some of the factors which influence DS projects; we should keep them in mind when doing PR review.
Basically, for every Pull Request we should first verify that the build is green (i.e. all tests are passing, linting standards are followed, etc.). But, as we know, code review + build status is not sufficient. We should also be able to compare the experiment metrics of the PR source and target branches, with dvc metrics diff {source} {target}. This shows us the metrics of the latest experiments from these branches and also the difference in values between them. I call it a first step towards transparency. We achieve this with the Update DVC Pipeline stage defined in our pipeline.
To help us with better PR reviews for DS projects, and to apply learnings from standard software delivery practices, DagsHub has developed Data Science Pull Requests and the team behind DVC has developed CML. They address several nuances that differentiate DS projects from software projects. Do check them out to get more ideas on how to achieve a more transparent and informed PR review.