tpu-workflow

DVC

Remote origin setup

dvc remote add origin https://dagshub.com/martin-fabbri/tpu-workflow.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user "$DAGSHUB_USER"
dvc remote modify origin --local password "$SUPER_SECRET_PASSWORD"
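
With the remote configured, DVC-tracked data and models can be pushed to and pulled from DagsHub (a minimal sketch, assuming the remote name origin set above):

dvc push -r origin
dvc pull -r origin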

Define pipeline stages

dvc run -n split \
          -d src/split.py \
          -o data/interim/train_split.json \
          -o data/interim/val_split.json \
          python3 src/split.py --gcs-path gs://kds-357fde648f21ba86b09520d51e296ad06846fd421d364336db3d426d --batch-size 16
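
dvc run records the stage in dvc.yaml (and its state in dvc.lock). For the command above, the generated entry should look roughly like this (a sketch, not copied from the repo):

split:
    cmd: python3 src/split.py --gcs-path gs://kds-357fde648f21ba86b09520d51e296ad06846fd421d364336db3d426d --batch-size 16
    deps:
    - src/split.py
    outs:
    - data/interim/train_split.json
    - data/interim/val_split.json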

GCS external dependencies

dvc run -n download_file \
-d gs://kaggle-data-tpu/test/test.txt \
-o test.txt \
gsutil cp gs://kaggle-data-tpu/test/test.txt test.txt
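
Depending on gs:// URLs directly requires DVC's Google Cloud Storage support, which ships as an extra (assuming a pip-based setup):

pip install "dvc[gs]"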

Remote aliases

A DVC remote can also act as an alias for an external storage location. In .dvc/config:

['remote "gcs_test"']
    url = gs://kaggle-data-tpu/test
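
The same alias can also be created from the command line:

dvc remote add gcs_test gs://kaggle-data-tpu/test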

With the alias in place, a stage can reference gs://kaggle-data-tpu/test/test_2.txt as remote://gcs_test/test_2.txt:

dvc run -n download_file_2 \
          -d remote://gcs_test/test_2.txt \
          -o test_2.txt \
          gsutil cp gs://kaggle-data-tpu/test/test_2.txt test_2.txt

The resulting stage in dvc.yaml:

download_file_2:
    cmd: gsutil cp gs://kaggle-data-tpu/test/test_2.txt test_2.txt
    deps:
    - remote://gcs_test/test_2.txt
    outs:
    - test_2.txt

Test Python script

from google.cloud import storage

# Authenticates via Application Default Credentials
client = storage.Client()

# Download the aliased blob and print its contents
bucket = client.get_bucket("kaggle-data-tpu")
blob = bucket.get_blob("test/test_2.txt")
print(blob.download_as_string())
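
The storage client authenticates through Application Default Credentials; outside of GCP this typically means pointing GOOGLE_APPLICATION_CREDENTIALS at a service account key (the path below is a placeholder):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json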

Test task stage

dvc run -n test_read_blob_gcs \
          -d remote://gcs_test/test_2.txt \
          python src/test.py
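
Note that this stage only tracks the data file. Declaring the script itself as a dependency as well (a sketch extending the command above) makes DVC rerun the stage when src/test.py changes:

dvc run -n test_read_blob_gcs \
          -d remote://gcs_test/test_2.txt \
          -d src/test.py \
          python src/test.py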

List objects in a bucket

from google.cloud import storage

storage_client = storage.Client()

# list_blobs takes a bucket name (not a gs:// URL); use prefix to scope to a "folder"
blobs = storage_client.list_blobs("kaggle-data-tpu", prefix="test/")
for blob in blobs:
    print(blob.name)

Stage wrapping the listing script:

dvc run -n test_list_objects_gcs \
          -d remote://gcs_test/ \
          python src/test_list_blobs.py

Train stage

dvc run -n train \
          -d src/pipeline/train.py \
          python src/pipeline/train.py --lr 1
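
With the stages defined, the whole pipeline can be reproduced end to end; DVC skips any stage whose dependencies are unchanged:

dvc repro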