The reference project has been developed in Python, but the same concepts should be applicable to other technology ML projects.
The core of this blog revolves around the Jenkinsfile. Stick around till the end to learn about all the moving parts. :smile:
It is good practice to define jobs to run inside a Docker container.
This gives us an easy, maintainable, reproducible, and isolated job environment. Debugging environment-specific issues also becomes easier, since we can reproduce the job's execution environment anywhere.
To do so, Jenkins enables us to define the `agent` to be a Docker container, which can be brought up from an image or from a customised image defined in a Dockerfile.
More on this can be found in the Using Docker with Pipeline section of the Jenkins Pipeline documentation.
In our Jenkinsfile we define the:

- `agent` to be a container brought up from this Dockerfile.
- `/project` path inside the container.
- `/extras` volume to cache files between multiple job runs.

Agent Definition:
```groovy
agent {
    dockerfile {
        args "-v ${env.WORKSPACE}:/project -w /project -v /extras:/extras -e PYTHONPATH=/project"
    }
}
```
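To make the `args` concrete, the `docker run` invocation Jenkins issues for this agent looks roughly like the sketch below (the workspace path and image tag are hypothetical stand-ins):

```shell
# Rough equivalent of what Jenkins runs for this agent.
# WORKSPACE and the image tag are hypothetical stand-ins.
WORKSPACE=/var/jenkins_home/workspace/rppp
DOCKER_ARGS="-v ${WORKSPACE}:/project -w /project -v /extras:/extras -e PYTHONPATH=/project"
echo "docker run ${DOCKER_ARGS} rppp-agent:latest cat"
```

The two `-v` mounts map the Jenkins workspace to `/project` and attach the persistent `/extras` cache volume, while `-w` and `PYTHONPATH` make `/project` the working directory and import root.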
Agent Dockerfile:
Here we define the job container's base image and install the required software and library dependencies.
```dockerfile
# Base image for our job
FROM python:3.8

RUN pip install --upgrade pip && \
    pip install -U setuptools==49.6.0

RUN apt-get update && \
    apt-get install unzip groff -y

# Installing aws-cli to use S3 as remote storage
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install

COPY requirements.txt ./

# Installing project dependencies
RUN pip install -r requirements.txt
```
We also add a `.dockerignore` file to trim the Docker build context used when building the container image:

```
# Ignore everything
*
# except the requirements.txt file
!requirements.txt
```
Now that we have defined our `agent`, we can define the stages of our pipeline.
Here are a few stages that we will define in our Jenkins Pipeline:
We have defined our test cases in the `test` folder and use `pytest` to run them (`-vv` makes the output verbose, and `-rxXs` adds a summary of xfailed, xpassed, and skipped tests).
```groovy
stage('Run Unit Test') {
    steps {
        sh 'pytest -vvrxXs'
    }
}
```
For lint checks, as is standard practice, we use `flake8` and `black`.
```groovy
stage('Run Linting') {
    steps {
        sh '''
            flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
            flake8 . --count --max-complexity=10 --max-line-length=127 --statistics
            black . --check --diff
        '''
    }
}
```
Once you have set up credentials in Jenkins, you can use them in a stage as follows.
With `dvc status -r origin` we test our connection with the remote. The DVC remote information is defined in the `.dvc/config` file.
```groovy
stage('Setup DVC Creds') {
    steps {
        withCredentials(
            [
                usernamePassword(
                    credentialsId: 'PASSWORD',
                    passwordVariable: 'PASSWORD',
                    usernameVariable: 'USER_NAME'),
            ]
        ) {
            sh '''
                dvc remote modify origin --local auth basic
                dvc remote modify origin --local user $USER_NAME
                dvc remote modify origin --local password $PASSWORD
                dvc status -r origin
            '''
        }
    }
}
```
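A note on the `--local` flag used in this stage: DVC writes `--local` options to `.dvc/config.local`, which is gitignored, rather than to the committed `.dvc/config`, so the credentials never reach the repository. The sketch below recreates roughly what that file contains after the stage runs (the user and password values are hypothetical placeholders):

```shell
# Recreate roughly what .dvc/config.local holds after `dvc remote modify --local`.
# The user/password values are hypothetical placeholders.
mkdir -p .dvc
cat > .dvc/config.local <<'EOF'
['remote "origin"']
    auth = basic
    user = jenkins-ci
    password = s3cr3t
EOF
grep 'auth' .dvc/config.local
```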
Before running any further DVC stages, we need to fetch the data and models already versioned by DVC. This can be done with the `dvc pull` command.
Every time we fetch files from S3 or a similar remote storage, it increases our network load, build latency, and service usage costs.
To optimise this, we can cache files already fetched by previous builds and only fetch the diff required for subsequent builds.
We will use the mounted volume `/extras` for this, and refer to it by the DVC remote `jenkins_local`. More info is in the `.dvc/config` file.
While `origin` is our primary storage, we use `jenkins_local` as a secondary, local storage!
```groovy
stage('Sync DVC Remotes') {
    steps {
        sh '''
            dvc status
            dvc status -r jenkins_local
            dvc status -r origin
            dvc pull -r jenkins_local || echo 'Some files are missing in local cache!' # 1
            dvc pull -r origin # 2
            dvc push -r jenkins_local # 3
        '''
    }
}
```
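To see why this pull order saves bandwidth, here is a toy simulation of the cache-first fetch, with plain directories standing in for the DVC remotes (all paths and file names are made up for illustration):

```shell
# Toy simulation: plain directories stand in for the two DVC remotes.
set -e
mkdir -p origin_remote jenkins_local workspace
echo 'model-v1' > origin_remote/model.bin           # only the primary remote has the file yet
cp jenkins_local/model.bin workspace/ 2>/dev/null \
  || echo 'Some files are missing in local cache!'  # 1: cheap pull from the cache
cp origin_remote/model.bin workspace/               # 2: fetch the diff from origin
cp workspace/model.bin jenkins_local/               # 3: warm the cache for the next build
ls jenkins_local
```

On the next simulated build, step 1 would succeed and step 2 would have nothing left to transfer — which is exactly the effect the `jenkins_local` cache has on real `dvc pull` traffic.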
Explanation:

1. `dvc pull -r jenkins_local` tries to fetch files from the local cache, `jenkins_local`.
2. `dvc pull -r origin` fetches whatever is still missing from `origin`.
3. `dvc push -r jenkins_local` saves the newly fetched files to `jenkins_local`, so the cache is warm for the next build.

Once you have defined your DVC pipeline, running your experiment is straightforward with the `dvc repro` command.
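For context, a DVC pipeline lives in a `dvc.yaml` file at the repo root; `dvc repro` reads it, re-executes only the stages whose dependencies changed, and records the results in `dvc.lock`. Here is a minimal, hypothetical two-stage sketch (the script names, dependencies, and metric file are placeholders, not the reference project's actual pipeline):

```yaml
stages:
  featurize:
    cmd: python src/featurize.py   # hypothetical script
    deps:
      - data/raw
      - src/featurize.py
    outs:
      - data/features
  train:
    cmd: python src/train.py       # hypothetical script
    deps:
      - data/features
      - src/train.py
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false
```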
But the question is: when should you run your experiments?
Should we run them for:

- all commits?
- only changes in the `master` branch?
- a manual trigger?
- a "special" commit message syntax?
- every Pull Request?

Let's analyze the pros and cons of each of these options:
Option | Pros | Cons
---|---|---
For all commits | We will never miss any experiment. | Increases build latency. May not need to run for every commit/change.
Only for changes in the `master` branch | Only `master` branch experiments are saved. | Only `master` branch experiments are saved. "Bad" experiments or PRs get merged to `master` before we can catch them.
Setup a manual trigger | We can decide when we want to run/skip an experiment. | Automation is not complete. There is room for manual errors.
"Special" commit message syntax | We can decide when we want to run/skip an experiment. | Automation is not complete. There is room for manual errors.
On Pull Request | We can run and compare the experiment before we approve the PR. | None
```groovy
stage('Update DVC Pipeline') {
    when { changeRequest() } // 1
    steps {
        sh '''
            dvc repro --dry -mP
            dvc repro -mP # 2
            git branch -a
            cat dvc.lock
            dvc push -r jenkins_local # 3
            dvc push -r origin # 3
            rm -r /extras/RPPP/repo/$CHANGE_BRANCH || echo 'All clean'
            mkdir -p /extras/RPPP/repo/$CHANGE_BRANCH
            cp -Rf . /extras/RPPP/repo/$CHANGE_BRANCH
        '''
        sh 'dvc metrics diff --show-md --precision 2 $CHANGE_TARGET' // 4
    }
}
```
Explanation:

(`$CHANGE_BRANCH` refers to the Pull Request source branch, and `$CHANGE_TARGET` refers to the Pull Request target branch.)

1. `when { changeRequest() }` makes sure this stage runs only when a Pull Request is open.
2. `dvc repro -mP` runs the pipeline end-to-end and prints the metrics at the end.
3. `dvc push` saves the results (data & models) to the remote storages.
4. `dvc metrics diff` compares the metrics in the PR source vs. the PR target.