This repo shows examples of how to stream a subset of a large dataset using Direct Data Access (DDA).
Each training script (one for PyTorch and one for TensorFlow) expects a DagsHub access token in an environment variable named DAGSHUB_TOKEN. Set it before training:
export DAGSHUB_TOKEN="..."
The scripts then read this token in and use it to authenticate:
import os
DAGSHUB_TOKEN = os.environ.get('DAGSHUB_TOKEN', None)
import dagshub
dagshub.auth.add_app_token(DAGSHUB_TOKEN)
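Because os.environ.get() returns None when the variable is unset, a small guard can fail early with a clearer message. The following is a minimal sketch, not code from this repo: read_dagshub_token is a hypothetical helper name, and it only checks that the variable is present and non-empty.

```python
import os


def read_dagshub_token(env=os.environ):
    """Return the DagsHub token from `env`, or raise if it is missing or empty.

    Hypothetical helper (not part of the repo or the dagshub client).
    """
    token = env.get("DAGSHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "DAGSHUB_TOKEN is not set; export it before starting training."
        )
    return token
```

The returned value could then be passed to dagshub.auth.add_app_token() in place of the raw os.environ.get() result.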
Once that's done, DDA can be set up with two lines of code:
from dagshub.streaming import install_hooks
install_hooks(project_root='.', repo_url='https://dagshub.com/DagsHub-Datasets/LAION-Aesthetics-V2-6.5plus', branch='main')
The PyTorch version of the code uses a custom Dataset to stream the images and aesthetics scores from the LAION Aesthetics dataset.
The LAIONAestheticsDataset streams the annotations file, which includes the image names, captions, and aesthetics scores. It also relies on the EfficientNetFeatureExtractor to stream the images and extract the features.
Streaming happens automatically because the code uses the standard open() function to read the annotations file and the PIL.Image.open() function to read the images. The installed hooks handle the file access transparently.
This code can be found in pytorch/data.py.
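To illustrate the pattern described above, here is a rough sketch of a map-style dataset that reads annotations with open() and images with PIL.Image.open(). It is not the repo's LAIONAestheticsDataset: the class name, the tab-separated layout, and the "path" and "score" column names are all assumptions, and the feature-extraction step is left out. It is written as a plain class so it runs without PyTorch; a torch.utils.data.Dataset subclass would expose the same __len__ and __getitem__ methods.

```python
import csv


class AnnotationBackedDataset:
    """Map-style dataset sketch: rows from an annotations file, images loaded lazily."""

    def __init__(self, annotations_path, load_image=None):
        # Plain open(): with the DDA hooks installed, this is the kind of
        # ordinary file read that the hooks are described as intercepting.
        # ASSUMPTION: tab-separated file with a header row.
        with open(annotations_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f, delimiter="\t"))
        # Allow a custom loader to be injected; default to Pillow.
        self.load_image = load_image or self._pil_load

    @staticmethod
    def _pil_load(path):
        # Imported lazily so the sketch can be exercised without Pillow installed.
        from PIL import Image

        with Image.open(path) as img:
            # convert() loads the pixel data, so the file can be closed afterwards.
            return img.convert("RGB")

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        # ASSUMPTION: columns named "path" and "score".
        image = self.load_image(row["path"])
        return image, float(row["score"])
```

Each image is opened only when its index is requested, so only the samples actually visited would need to be fetched.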
The TensorFlow data generator, LAIONAestheticsDataGenerator, also relies on an EfficientNetFeatureExtractor to stream the images via PIL.Image.open() calls.
The annotations file, however, is streamed in the train_valid_split function, which determines ahead of time which samples belong to the training and validation sets.
This code can be found in tensorflow/data.py.
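Deciding the split ahead of time can be sketched as a seeded shuffle over sample indices. This is only an illustration, not the repo's train_valid_split: the function name split_indices, its parameters, and the 20% default are assumptions.

```python
import random


def split_indices(num_samples, valid_fraction=0.2, seed=0):
    """Deterministically partition range(num_samples) into (train, valid) index lists.

    Hypothetical helper; the signature is an assumption, not the repo's API.
    """
    if not 0.0 <= valid_fraction <= 1.0:
        raise ValueError("valid_fraction must be between 0 and 1")
    indices = list(range(num_samples))
    # A dedicated Random instance keeps the split reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)
    num_valid = int(num_samples * valid_fraction)
    return indices[num_valid:], indices[:num_valid]
```

With a fixed seed, the same samples land in the same set on every run, so training and validation generators can be built from the two index lists independently.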