Predict survival on the Kaggle Titanic dataset using DVC for reproducible machine learning
This repository uses Data Version Control (DVC) to create a machine learning pipeline and track experiments. We use a modified version of the Team Data Science Process as our data science lifecycle template. This repository template is based on the cookiecutter data science project template.
To get started, clone this repository and install DVC. Then follow the instructions below to proceed through the data science lifecycle, using DVC to manage parameters, scripts, artifacts, and metrics.
Project Charter:
- Problem definition: predict survival on the Kaggle Titanic dataset
- Dataset location: Kaggle
- Preferred tools and languages: SciKit-Learn, TensorFlow, HyperOpt; Python
The script make_dataset.py downloads the dataset from Kaggle, creates a data dictionary, and summarizes the dataset using TableOne. The key artifacts of this stage are the raw training and testing datasets, the data dictionary, and the summary table.
In your terminal, use the command-line interface to build the first stage of the pipeline.
dvc run -n make_dataset -p dtypes \
-d src/data/make_dataset.py \
-o data/raw/train.csv \
-o data/raw/test.csv \
-o reports/figures/table_one.tex \
-o reports/figures/data_dictionary.tex \
--desc "Download data from Kaggle, create data dictionary and summary table" \
python3 src/data/make_dataset.py -c titanic -tr train.csv -te test.csv -o "./data/raw"
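The `-p dtypes` flag tells DVC to track the `dtypes` section of params.yaml and re-run the stage when it changes. As an illustration only (the actual keys live in this repository's params.yaml), that section might look like:

```yaml
# params.yaml -- illustrative sketch, not the repository's actual contents
dtypes:
  Pclass: categorical
  Sex: categorical
  Embarked: categorical
  Age: float
  Fare: float
```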
The script encode_labels.py is an intermediate data processing script that accepts the raw training data and the "dtypes" parameter from the params.yaml file. It encodes columns containing categorical variables as integer values for machine processing and saves the updated dataset along with the encoding scheme. Importantly, the training and testing data are processed together to guarantee identical label encoding across both splits. Key artifacts from this stage include the interim categorized datasets and the label encoding scheme.
dvc run -n encode_labels -p dtypes \
-d src/data/encode_labels.py \
-d data/raw/train.csv \
-d data/raw/test.csv \
-o data/interim/train_categorized.csv \
-o data/interim/test_categorized.csv \
-o data/interim/label_encoding.yaml \
--desc "Convert categorical labels to integer values and save mapping" \
python3 src/data/encode_labels.py -tr data/raw/train.csv -te data/raw/test.csv -o data/interim
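The joint-encoding idea can be sketched as follows. This is a minimal illustration of the technique, not the actual contents of encode_labels.py; the function name and signature are assumptions:

```python
import pandas as pd

def encode_labels(train: pd.DataFrame, test: pd.DataFrame, cat_cols):
    """Encode categorical columns with a shared integer mapping.

    Building the category set on the concatenated data guarantees that
    the same label maps to the same integer in both splits.
    """
    encoding = {}
    combined = pd.concat([train, test], keys=["train", "test"])
    for col in cat_cols:
        categories = combined[col].astype("category").cat.categories
        mapping = {label: code for code, label in enumerate(categories)}
        combined[col] = combined[col].map(mapping)
        encoding[col] = mapping
    # Split back into the original train/test frames plus the saved mapping
    return combined.loc["train"], combined.loc["test"], encoding
```

The returned `encoding` dictionary is what a script like this would persist (here, as label_encoding.yaml) so the mapping can be inverted later.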
This stage uses two scripts to prepare the data for machine learning: first, missing values are imputed with statistics from the training data in replace_nan.py; second, the features are normalized in normalize_data.py. Key artifacts from this stage include the interim NaN-imputed datasets and the final processed datasets after feature normalization.
dvc run -n impute_nan -p imputation \
-d src/data/replace_nan.py \
-d data/interim/train_categorized.csv \
-d data/interim/test_categorized.csv \
-o data/interim/test_nan_imputed.csv \
-o data/interim/train_nan_imputed.csv \
--desc "Replace missing values for age with mean values from training dataset." \
python3 src/data/replace_nan.py -tr data/interim/train_categorized.csv -te data/interim/test_categorized.csv -o data/interim
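The core of mean imputation can be sketched in a few lines. This is a hedged illustration (the function name and signature are assumptions, not the actual replace_nan.py); the key point is that the test set is filled with the *training* means so no test-set information leaks into preprocessing:

```python
import pandas as pd

def impute_nan(train: pd.DataFrame, test: pd.DataFrame, columns):
    """Replace missing values with means computed on the training split only."""
    means = train[columns].mean()
    # fillna with a Series fills each named column with its training mean
    return train.fillna(means), test.fillna(means)
```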
dvc run -n normalize_data -p normalize \
-d src/data/normalize_data.py \
-d data/interim/train_nan_imputed.csv \
-d data/interim/test_nan_imputed.csv \
-o data/processed/train_processed.csv \
-o data/processed/test_processed.csv \
--desc "Optionally normalize features by fitting transforms on the training dataset." \
python3 src/data/normalize_data.py -tr data/interim/train_nan_imputed.csv -te data/interim/test_nan_imputed.csv -o data/processed/
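As with imputation, the transform is fitted on the training split and applied to both. A minimal standardization sketch (an assumption for illustration, not the actual normalize_data.py):

```python
import pandas as pd

def normalize(train: pd.DataFrame, test: pd.DataFrame, columns):
    """Standardize features using mean/std fitted on the training split only."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0)
    train, test = train.copy(), test.copy()
    train[columns] = (train[columns] - mean) / std
    test[columns] = (test[columns] - mean) / std
    return train, test
```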
Project based on the cookiecutter data science project template. #cookiecutterdatascience