

# Titanic DVC


## Project Goals

Predict survival on the Kaggle Titanic dataset using DVC for reproducible machine learning.

## Introduction

This repository uses Data Version Control (DVC) to create a machine learning pipeline and track experiments. We use a modified version of the Team Data Science Process as our data science lifecycle template. This repository is based on the cookiecutter data science project template.

To start, clone this repository and install Data Version Control (DVC). Then follow the instructions below to proceed through the data science lifecycle, using DVC to manage parameters, scripts, artifacts, and metrics.
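A typical setup might look like the following (the clone URL and directory name are placeholders; substitute this repository's actual URL):

```bash
git clone <repository-url>   # placeholder: use this repository's clone URL
cd <repository-name>
pip install dvc              # or: pip install -r requirements.txt if DVC is pinned there
dvc pull                     # fetch DVC-tracked artifacts from remote storage, if configured
```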

## 1. Domain understanding/problem definition

Project Charter:

- Problem definition: predict survival on the Kaggle Titanic dataset
- Dataset location: Kaggle
- Preferred tools and languages: scikit-learn, TensorFlow, Hyperopt; Python

### Downloading the dataset

The script make_dataset.py downloads the dataset from Kaggle, creates a data dictionary, and summarizes the dataset using TableOne. The key artifacts of this stage are the raw training and testing datasets, the data dictionary, and the summary table.

From your terminal, use the DVC command-line interface to build the first stage of the pipeline.

```bash
dvc run -n make_dataset -p dtypes \
-d src/data/make_dataset.py \
-o data/raw/train.csv \
-o data/raw/test.csv \
-o reports/figures/table_one.tex \
-o reports/figures/data_dictionary.tex \
--desc "Download data from Kaggle, create data dictionary and summary table" \
python3 src/data/make_dataset.py -c titanic -tr train.csv -te test.csv -o "./data/raw"
```
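The `-p dtypes` flag registers the `dtypes` section of `params.yaml` as a parameter dependency, so the stage re-runs whenever those values change. The actual file is not reproduced in this README; a hypothetical sketch of its layout, covering the parameter sections referenced by the stages in this document, might look like:

```yaml
# Hypothetical sketch of params.yaml; the real file in this repository
# defines the actual keys. Each top-level section is tracked by one
# pipeline stage via the -p flag.
dtypes:            # used by make_dataset and encode_labels
  Sex: category
  Embarked: category
  Age: float
imputation:        # used by impute_nan
  method: mean
normalize:         # used by normalize_data
  enabled: true
  method: standard
```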

### Encoding categorical labels as integer classes

The script encode_labels.py is an intermediate data processing script that accepts the raw training and testing data along with the `dtypes` parameter from the params.yaml file. It encodes columns containing categorical variables as integer values for machine processing, then saves the updated datasets and the encoding scheme. Importantly, the training and testing data are processed together to ensure identical label encoding across both splits. Key artifacts from this stage include the interim categorized datasets and the label encoding scheme.

```bash
dvc run -n encode_labels -p dtypes \
-d src/data/encode_labels.py \
-d data/raw/train.csv \
-d data/raw/test.csv \
-o data/interim/train_categorized.csv \
-o data/interim/test_categorized.csv \
-o data/interim/label_encoding.yaml \
--desc "Convert categorical labels to integer values and save mapping" \
python3 src/data/encode_labels.py -tr data/raw/train.csv -te data/raw/test.csv -o data/interim
```
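The script itself is not reproduced here, but the core idea, fitting one encoding over the combined training and testing data so both splits share the same integer codes, can be sketched in a few lines (the function and column handling below are illustrative, not the actual `encode_labels.py` API):

```python
import pandas as pd

def encode_categoricals(train: pd.DataFrame, test: pd.DataFrame, cols: list[str]):
    # Concatenate the splits so they share a single category -> code mapping
    combined = pd.concat([train, test], keys=["train", "test"])
    mapping = {}
    for col in cols:
        combined[col] = combined[col].astype("category")
        mapping[col] = dict(enumerate(combined[col].cat.categories))  # code -> label
        combined[col] = combined[col].cat.codes  # integer codes; -1 marks missing
    # Split back into the original train/test partitions
    return combined.loc["train"], combined.loc["test"], mapping
```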

### Preparing data

This section involves two scripts that prepare the data for machine learning. First, missing values are imputed using statistics from the training data in replace_nan.py; second, the features are normalized in normalize_data.py. Key artifacts from this stage include the interim NaN-imputed datasets and the final processed datasets after feature normalization.

#### Replace missing age values using mean imputation

```bash
dvc run -n impute_nan -p imputation \
-d src/data/replace_nan.py \
-d data/interim/train_categorized.csv \
-d data/interim/test_categorized.csv \
-o data/interim/test_nan_imputed.csv \
-o data/interim/train_nan_imputed.csv \
--desc "Replace missing values for age with mean values from training dataset." \
python3 src/data/replace_nan.py -tr data/interim/train_categorized.csv -te data/interim/test_categorized.csv -o data/interim
```
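As a sketch of the underlying idea (not the actual `replace_nan.py` implementation), mean imputation fits its statistics on the training split only and reuses them on the test split, which avoids leaking test-set information:

```python
import pandas as pd

def impute_mean(train: pd.DataFrame, test: pd.DataFrame, cols: list[str]):
    means = train[cols].mean()                         # statistics from training data only
    train = train.fillna({c: means[c] for c in cols})
    test = test.fillna({c: means[c] for c in cols})    # reuse training means on test data
    return train, test

# e.g. train, test = impute_mean(train, test, ["Age"])
```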
#### Normalize features

```bash
dvc run -n normalize_data -p normalize \
-d src/data/normalize_data.py \
-d data/interim/train_nan_imputed.csv \
-d data/interim/test_nan_imputed.csv \
-o data/processed/train_processed.csv \
-o data/processed/test_processed.csv \
--desc "Optionally normalize features by fitting transforms on the training dataset." \
python3 src/data/normalize_data.py -tr data/interim/train_nan_imputed.csv -te data/interim/test_nan_imputed.csv -o data/processed/
```
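Once the stages above are defined, DVC tracks the full dependency graph, so the pipeline can be inspected and reproduced end to end with standard DVC commands (remote storage behavior depends on this repository's remote configuration):

```bash
dvc dag      # visualize the stage dependency graph
dvc repro    # re-run only the stages whose dependencies changed
dvc push     # upload stage outputs to remote storage
```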

Project based on the cookiecutter data science project template. #cookiecutterdatascience
