This repository was created to prototype DVC-based ML pipelines for the cross-species project. All dependencies are listed in the conda environment.yaml file; DVC and JupyterLab are installed from it as well.
The data folder holds input, interim, and output data.
Before you run anything, do not forget to dvc pull the data, and after committing do not forget to dvc push it!
The pipeline is run by executing DVC stages (see the stages folder).
Most of the analysis is written in Jupyter notebooks in the notebooks folder. Each stage runs the corresponding notebooks with papermill and tracks their inputs and outputs; papermill also stores the executed notebooks in data/notebooks.
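As a rough illustration of what a stage does, the sketch below builds the papermill command line that executes a notebook and writes the executed copy under data/notebooks. The helper name papermill_command, the notebook name, and the parameter are hypothetical; only the `papermill <input> <output> -p <name> <value>` CLI form and the notebooks and data/notebooks folders are taken from papermill and this README.

```python
from pathlib import Path


def papermill_command(notebook, params=None,
                      source_dir="notebooks", output_dir="data/notebooks"):
    """Build (but do not run) a papermill command for one notebook.

    Hypothetical helper: the real stages define their own commands in dvc.yaml.
    """
    name = f"{notebook}.ipynb"
    cmd = ["papermill",
           (Path(source_dir) / name).as_posix(),
           (Path(output_dir) / name).as_posix()]
    # Each -p flag injects one parameter into the notebook's "parameters" cell.
    for key, value in (params or {}).items():
        cmd += ["-p", key, str(value)]
    return cmd


# The notebook name and the parameter are placeholders for illustration.
print(" ".join(papermill_command("some_notebook", {"min_samples": 3})))
```

The returned list can be passed to subprocess.run once papermill is installed in the environment.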
Temporarily, some classes are copy-pasted from the xspecies repository to make the notebooks work.
To create the environment, run:
conda env create --file environment.yaml
The code in the yspecies folder is a conda package that is used inside the notebooks. The package is included in environment.yaml, but you can also install it separately from conda (https://anaconda.org/antonkulaga/yspecies):
conda install -c antonkulaga yspecies
DVC stages are defined in the dvc.yaml file. To run a single stage, use dvc repro <stage_name>; to reproduce the whole pipeline, run it without a stage name:
dvc repro
Most of the stages also produce executed notebooks alongside their output files.
There are several key notebooks in the project. All notebooks can be run either from Jupyter (with jupyter lab notebooks) or from the command line with dvc repro.
You can run notebooks manually with:
jupyter lab notebooks
Then run the notebook of your choice. However, keep in mind that the notebooks depend on each other. In particular, the select_samples notebook generates the data for all the others.
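Because of these dependencies, the order matters when notebooks are run by hand. The sketch below uses Python's standard graphlib module to derive a valid order from a dependency map. Only select_samples is named in this README; the other notebook names and the map itself are hypothetical placeholders.

```python
from graphlib import TopologicalSorter

# Maps each notebook to the notebooks that must run before it.
# Only "select_samples" comes from this README; the rest are placeholders.
depends_on = {
    "select_samples": set(),
    "notebook_a": {"select_samples"},
    "notebook_b": {"select_samples"},
    "notebook_c": {"notebook_a", "notebook_b"},
}

# static_order() yields every notebook after all of its prerequisites.
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)
```

When the notebooks are run through dvc repro instead, DVC derives the order from the dependencies and outputs declared for each stage.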
Most of the code is packed into classes. The workflow is built on top of scikit-learn Pipelines.
The yspecies package has the following modules:
One of the key classes is ExpressionDataset:
e = ExpressionDataset("5_tissues", expressions, genes, samples)
e
It allows indexing by genes:
e[["ENSG00000073921", "ENSG00000139687"]]
#or
e.by_genes[["ENSG00000073921", "ENSG00000139687"]]
By samples:
e.by_samples[["SRR2308103","SRR1981979"]]
Both:
e[["ENSG00000073921", "ENSG00000139687"],["SRR2308103","SRR1981979"]]
The ExpressionDataset class has by_genes and by_samples properties, which allow indexing and filtering. For instance, to keep only blood tissue samples:
e.by_samples.filter(lambda s: s["tissue"]=="Blood")
The class is also Jupyter-friendly, with the _repr_html_() method implemented.
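To make the indexing semantics concrete, here is a deliberately tiny stand-in written with plain dictionaries. It is not the yspecies implementation: the class names ToyExpressionDataset and _SampleIndex, the storage layout, the expression values, and the tissue labels are all assumptions for illustration. Only the access patterns (indexing by genes, by samples, or both, filter on by_samples, and an HTML representation) follow the examples above.

```python
class ToyExpressionDataset:
    """Toy stand-in: expressions[sample][gene] -> value, samples[sample] -> metadata."""

    def __init__(self, name, expressions, samples):
        self.name = name
        self.expressions = expressions
        self.samples = samples

    def _subset(self, genes=None, sample_ids=None):
        ids = list(self.expressions) if sample_ids is None else sample_ids
        expressions = {}
        for s in ids:
            row = self.expressions[s]
            keep = row if genes is None else genes
            expressions[s] = {g: row[g] for g in keep}
        return ToyExpressionDataset(self.name, expressions,
                                    {s: self.samples[s] for s in ids})

    def __getitem__(self, key):
        # e[[genes]] selects genes; e[[genes], [samples]] selects both.
        if isinstance(key, tuple):
            genes, sample_ids = key
            return self._subset(genes, sample_ids)
        return self._subset(genes=key)

    @property
    def by_samples(self):
        return _SampleIndex(self)

    def _repr_html_(self):
        # Jupyter calls this hook to render the object as HTML.
        rows = "".join(
            f"<tr><th>{s}</th>" + "".join(f"<td>{v}</td>" for v in row.values()) + "</tr>"
            for s, row in self.expressions.items())
        return f"<table>{rows}</table>"


class _SampleIndex:
    def __init__(self, dataset):
        self.dataset = dataset

    def __getitem__(self, sample_ids):
        return self.dataset._subset(sample_ids=sample_ids)

    def filter(self, predicate):
        # Keep the samples whose metadata dict satisfies the predicate.
        ids = [s for s, meta in self.dataset.samples.items() if predicate(meta)]
        return self.dataset._subset(sample_ids=ids)


# Made-up expression values and tissue labels, for illustration only.
e = ToyExpressionDataset(
    "toy",
    {"SRR2308103": {"ENSG00000073921": 1.5, "ENSG00000139687": 0.2},
     "SRR1981979": {"ENSG00000073921": 3.0, "ENSG00000139687": 4.1}},
    {"SRR2308103": {"tissue": "Blood"}, "SRR1981979": {"tissue": "Liver"}},
)
blood = e.by_samples.filter(lambda s: s["tissue"] == "Blood")
print(list(blood.expressions))
```

Returning a new dataset from every selection is one possible design choice; it lets gene and sample selections be chained without mutating the original object.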
Key logic from the start of the pipeline up to the partitioning of the data according to sorted stratification.
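Sorted stratification is commonly understood as a way to build folds for a continuous target: sort the samples by the target value, then deal them out so that every fold spans the whole range of values. The sketch below shows that general idea with the standard library only. The function name, the fold count, and the toy target values are assumptions, and the module's actual partitioning logic may differ in its details.

```python
def sorted_stratification_folds(targets, n_folds):
    """Return n_folds lists of sample indices, each covering the target range."""
    # Indices of the samples, ordered by increasing target value.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    folds = [[] for _ in range(n_folds)]
    # Deal sorted samples round-robin, so each consecutive block of n_folds
    # similar values contributes one sample to every fold.
    for rank, index in enumerate(order):
        folds[rank % n_folds].append(index)
    return folds


# Made-up target values, for illustration only.
targets = [4.0, 122.0, 3.5, 30.0, 60.0, 2.0, 80.0, 15.0, 45.0]
for fold in sorted_stratification_folds(targets, n_folds=3):
    print(sorted(targets[i] for i in fold))
```

Each fold can then serve once as the held-out partition while the remaining folds are used for training.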
Classes with data:
Transformers:
This module is responsible for SHAP-based selection.
Classes with data:
Auxiliary classes:
Transformers:
Module that contains the final results.