Cross-species repository



YSpecies

This repository prototypes DVC-based ML pipelines for the cross-species project. All dependencies are declared in the conda environment.yaml file, which also installs DVC and JupyterLab.

Project structure

The data folder holds input, interim, and output data.

Before running anything, remember to dvc pull the data, and after committing, remember to dvc push it!

The pipeline is run by reproducing DVC stages (see the stages folder).

Most of the analysis is written in Jupyter notebooks in the notebooks folder. Each stage runs the corresponding notebooks with papermill and keeps their inputs and outputs under version control; papermill also stores the executed notebooks in data/notebooks.
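The pull-before, push-after routine can also be scripted. The sketch below is only an illustration: it assumes the dvc executable is on the PATH of the active conda environment, and the helper names dvc_sync_command and dvc_sync are hypothetical, not part of this repository.

```python
import subprocess
from typing import List


def dvc_sync_command(action: str) -> List[str]:
    """Build the DVC command line for syncing data with remote storage."""
    if action not in ("pull", "push"):
        raise ValueError(f"expected 'pull' or 'push', got {action!r}")
    return ["dvc", action]


def dvc_sync(action: str) -> None:
    """Run `dvc pull` or `dvc push`; raises CalledProcessError on failure."""
    # Assumption: `dvc` is installed and the working directory is the repo root.
    subprocess.run(dvc_sync_command(action), check=True)
```

Calling dvc_sync("pull") before starting work and dvc_sync("push") after a git commit mirrors the reminder above.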
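To make the papermill step concrete, here is a minimal sketch of how a stage might execute one notebook. It is not the repository's actual stage code: the helper names, the example notebook name, and the choice to keep the same file name for the executed copy are all assumptions. Only papermill's execute_notebook call and the data/notebooks location come from papermill and this README respectively.

```python
from pathlib import Path
from typing import Any, Dict, Optional

# Executed notebooks are stored in data/notebooks, as described above.
OUTPUT_DIR = Path("data/notebooks")


def executed_copy_path(notebook: Path, output_dir: Path = OUTPUT_DIR) -> Path:
    """Return where the executed copy of a notebook would be written.

    Assumption: the executed copy keeps the source notebook's file name.
    """
    return output_dir / notebook.name


def run_notebook(notebook: Path, parameters: Optional[Dict[str, Any]] = None) -> Path:
    """Execute a notebook with papermill and return the executed copy's path."""
    import papermill as pm  # imported lazily so the path helper works without it

    output = executed_copy_path(notebook)
    output.parent.mkdir(parents=True, exist_ok=True)
    # papermill injects `parameters` into the notebook's parameters cell.
    pm.execute_notebook(str(notebook), str(output), parameters=parameters or {})
    return output
```

For a hypothetical notebooks/example.ipynb, run_notebook(Path("notebooks/example.ipynb")) would write the executed copy to data/notebooks/example.ipynb.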

Temporarily, some classes are copy-pasted from the xspecies repository to make the notebooks work.

yspecies package

The code in the yspecies folder is a conda package that is used inside the notebooks.

Running stages

DVC stages are in the stages folder (with accompanying YAML files in the parameters folder). To run a DVC stage, use the dvc repro command, for example:

dvc repro -f stages/1_select_genes_and_species.dvc

Most of the stages also produce executed notebooks alongside the files in their output.
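The dvc repro invocation above can be wrapped in Python when stages need to be triggered from a script. This is a sketch under assumptions: the function names repro_command and repro are hypothetical, and dvc must be available on the PATH. The -f flag is the short form of --force, which makes DVC rerun the stage even if its dependencies have not changed.

```python
import subprocess
from typing import List


def repro_command(stage_file: str, force: bool = True) -> List[str]:
    """Build a `dvc repro` command line for a single stage file."""
    command = ["dvc", "repro"]
    if force:
        command.append("-f")  # --force: rerun even when nothing has changed
    command.append(stage_file)
    return command


def repro(stage_file: str, force: bool = True) -> None:
    """Reproduce one stage; raises CalledProcessError if DVC fails."""
    # Assumption: this is called from the repository root.
    subprocess.run(repro_command(stage_file, force), check=True)
```

For example, repro("stages/1_select_genes_and_species.dvc") runs the same command as shown above, while passing force=False lets DVC skip the stage when its inputs are unchanged.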