
Initial commit

Bar, 11 months ago · commit e8dd8f5917
12 changed files with 1243 additions and 0 deletions
  1. .dvc/.gitignore (+8 -0)
  2. .dvc/config (+0 -0)
  3. .gitignore (+3 -0)
  4. Example.ipynb (+1123 -0)
  5. README.md (+51 -0)
  6. data/.gitignore (+2 -0)
  7. data/raw.dvc (+8 -0)
  8. eval.dvc (+14 -0)
  9. metrics/metrics.json (+7 -0)
  10. models.dvc (+12 -0)
  11. process_data.dvc (+12 -0)
  12. requirements.txt (+3 -0)

+ 8 - 0
.dvc/.gitignore

@@ -0,0 +1,8 @@
+/state
+/lock
+/config.local
+/updater
+/updater.lock
+/state-journal
+/state-wal
+/cache

+ 0 - 0
.dvc/config


+ 3 - 0
.gitignore

@@ -0,0 +1,3 @@
+/env/
+.ipynb_checkpoints/
+/models/

File diff suppressed because it is too large
+ 1123 - 0
Example.ipynb


+ 51 - 0
README.md

@@ -0,0 +1,51 @@
+# aaa
+
+
+
+## Instructions
+
+1. Clone the repo.
+2. (Recommended) Create and activate a [virtualenv](https://virtualenv.pypa.io/) under the `env/` directory. Git is already configured to ignore it.
+3. Install the minimal requirements with `pip install -r requirements.txt`.
+4. Run [Jupyter](https://jupyter.org/) in whatever way works for you. The simplest would be to run `pip install jupyter && jupyter notebook`.
+5. All relevant code and instructions are in [`Example.ipynb`](/Example.ipynb).
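Steps 2 and 3 can also be scripted. A minimal sketch of step 2 using only the standard library (targeting a temporary directory here, where in the repo you would use the ignored `env/` path):

```python
import os
import tempfile
import venv

# In the repo you would target "env/", which .gitignore already covers.
target = os.path.join(tempfile.mkdtemp(), "env")
venv.create(target, with_pip=False)  # with_pip=True also bootstraps pip for step 3

# Every virtualenv gets a pyvenv.cfg marker file at its root.
marker = os.path.join(target, "pyvenv.cfg")
```

After activating the environment (`source env/bin/activate` on POSIX), `pip install -r requirements.txt` installs into it rather than into your system Python.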
+
+## Explanation
+
+This project structure is an example of how to work with DVC from inside a Jupyter Notebook.
+
+This workflow should enable you to enjoy the full benefits of working with Jupyter Notebooks, while getting most of the benefits of DVC - 
+namely, **reproducible and versioned data science**.
+
+The project takes a toy problem as an example - the [California housing dataset](https://scikit-learn.org/stable/datasets/index.html#california-housing-dataset), which comes packaged with scikit-learn.
+You can just replace the relevant parts in the notebook with your own data and code.
+Significantly different project structures might require deeper intervention.  
+
+The idea is to leverage DVC in order to create immutable snapshots of your data and models as part of your git commits.
+To enable this, we created the following DVC stages:
+1. **Raw data** - kept in `data/raw/`, versioned in `data/raw.dvc` 
+2. **Processed data** - kept in `data/processed/`, versioned in `process_data.dvc` 
+3. **Trained models** - kept in `models/`, versioned in `models.dvc` 
+4. **Metrics** - kept in `metrics/metrics.json`, versioned as part of the git commit and referenced in `models.dvc`
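The last stage is the simplest: the metrics are just a small JSON file that notebook code writes after evaluation. A minimal sketch (the metric values here are made up; the keys mirror `metrics/metrics.json`):

```python
import json
import os

# Hypothetical evaluation results; in the notebook these come from the trained model.
metrics = {
    "R2": 0.61,
    "MAE": 0.53,
    "MSE": 0.52,
    "loss": 0.52,
}

os.makedirs("metrics", exist_ok=True)
with open("metrics/metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Read it back, as a consumer of the metrics file would.
with open("metrics/metrics.json") as f:
    saved = json.load(f)
```

Because the file is small, plain text, and deterministic, it can be checked straight into git alongside the code.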
+
+Unlike a typical DVC project, which requires you to refactor your code into modules runnable from the command line,
+this project aims to let you stay in your comfortable notebook home territory.
+
+So, instead of using `dvc repro` or `dvc run` commands, **just run your code as you normally would in [`Example.ipynb`](/Example.ipynb)**. 
+We prepared special cells (marked with green headers) inside this notebook that let you run `dvc commit` commands on the relevant
+DVC stages defined above, immediately after you create the relevant data files from your notebook code.
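Each such green-headed cell boils down to a single `dvc commit` on the matching stage file. A sketch of a hypothetical helper (`dvc_commit` is our name, not a DVC API; `-f` skips the confirmation prompt):

```python
import os
import shutil
import subprocess

def dvc_commit(stage_file):
    """Run `dvc commit -f <stage_file>` when inside a DVC repo; return the command."""
    cmd = ["dvc", "commit", "-f", stage_file]
    # Only actually invoke DVC when it is installed and we are in a DVC repo.
    if shutil.which("dvc") and os.path.isdir(".dvc"):
        subprocess.run(cmd, check=True)
    return cmd

# E.g. right after the training cell has written the files under models/:
cmd = dvc_commit("models.dvc")
```

In a notebook cell the equivalent one-liner is simply `!dvc commit -f models.dvc`.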
+
+[`dvc commit`](https://dvc.org/doc/commands-reference/commit) computes the hash of the versioned data and saves that hash
+as text inside the relevant `.dvc` file. The data itself is ignored and not versioned by git, instead being versioned with DVC.
+However, the `.dvc` files, being plain text files, ARE checked into git.
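A rough illustration of that hashing step in plain Python (simplified: for directories DVC actually hashes a manifest of per-file hashes, which is where the `.dir`-suffixed values in the `.dvc` files below come from):

```python
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks so large data files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example: hash a small throwaway file standing in for a data artifact.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"some raw data")
digest = md5_of_file(path)
```

Unchanged content always yields the same hash, so re-running a notebook without changing the data leaves the `.dvc` files, and therefore git status, clean.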
+
+So, to summarize, this workflow should enable you to create a git commit which contains all relevant code, together with
+*references* to the relevant data and the resulting models and metrics. **Painless reproducible data science!**
+
+It's intended as a guideline - definitely feel free to play around with its structure to suit your own needs.
+
+---
+
+To create a project like this, just go to https://dagshub.com/repo/create and select the **Jupyter Notebook + DVC** project template.
+
+Made with 🐶 by [DAGsHub](https://dagshub.com/).

+ 2 - 0
data/.gitignore

@@ -0,0 +1,2 @@
+/raw
+/processed

+ 8 - 0
data/raw.dvc

@@ -0,0 +1,8 @@
+md5: fd982196f18680fbbe87a00b96814861
+outs:
+- cache: true
+  md5: 04d53798ff7017d8749ea065911b5c51.dir
+  metric: false
+  path: data/raw
+  persist: false
+wdir: ..

+ 14 - 0
eval.dvc

@@ -0,0 +1,14 @@
+cmd: echo "This is the model evaluation stage" >&2 && exit 1
+deps:
+- md5: 62e60cfacc638ead57637dc9e34c7268.dir
+  path: models
+md5: b4a832eaf26dd32827d21ca1ba58b348
+outs:
+- cache: false
+  md5: 323a6f745a4a0c26a9353616785481e8
+  metric:
+    type: json
+    xpath: loss
+  path: metrics/metrics.json
+  persist: false
+wdir: .

+ 7 - 0
metrics/metrics.json

@@ -0,0 +1,7 @@
+{
+  "R2": 0.6098076048023119,
+  "MAE": 0.5284530092971268,
+  "MSE": 0.5151881391739428,
+  "median_absolute_error": 0.41252106231809627,
+  "loss": 0.5151881391739428
+}

+ 12 - 0
models.dvc

@@ -0,0 +1,12 @@
+cmd: echo "This is the model training stage" >&2 && exit 1
+deps:
+- md5: 01fe90153048b85be7638d3780aac84f.dir
+  path: data/processed
+md5: f9c10b09cdd63f3c23de4b015d521b15
+outs:
+- cache: true
+  md5: 62e60cfacc638ead57637dc9e34c7268.dir
+  metric: false
+  path: models
+  persist: false
+wdir: .

+ 12 - 0
process_data.dvc

@@ -0,0 +1,12 @@
+cmd: echo "This is the data processing stage" >&2 && exit 1
+deps:
+- md5: 04d53798ff7017d8749ea065911b5c51.dir
+  path: data/raw
+md5: 59cf0d9bfee6635d5513844b709c897f
+outs:
+- cache: true
+  md5: 01fe90153048b85be7638d3780aac84f.dir
+  metric: false
+  path: data/processed
+  persist: false
+wdir: .

+ 3 - 0
requirements.txt

@@ -0,0 +1,3 @@
+scikit-learn
+pandas
+dvc