---
title: DagsHub Data Engine - Training a Model
description: Documentation on using Data Engine to create Dataloaders for model training
---
Now that you've created new datasets, the next step is to train and improve your model with them. Data Engine supports the PyTorch and TensorFlow frameworks out of the box, so you can easily create data loaders and datasets for your model training.
Data Engine’s DataLoaders extend the following classes:

- `torch.utils.data.DataLoader` for PyTorch
- `tf.keras.utils.Sequence` for TensorFlow

This means that you get a standard dataloader for each framework, which serves your data and labels for training.
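Both base classes rely on the same map-style contract: the object exposes `__len__` and `__getitem__`, and the framework handles batching and shuffling on top. A framework-free sketch of that contract (hypothetical names, not DagsHub code):

```python
class ToyDataset:
    """Minimal map-style dataset: the contract shared by
    torch.utils.data.Dataset and tf.keras.utils.Sequence."""

    def __init__(self, paths, labels):
        self.paths = paths
        self.labels = labels

    def __len__(self):
        # Number of samples the loader can request
        return len(self.paths)

    def __getitem__(self, idx):
        # Return one (data, label) pair; a real loader would load the
        # file at self.paths[idx] and tensorize it here
        return self.paths[idx], self.labels[idx]

ds = ToyDataset(["a.png", "b.png"], [0, 1])
print(len(ds))   # 2
print(ds[1])     # ('b.png', 1)
```

A Data Engine dataloader fills in the loading and tensorizing parts of this contract for you.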
The easiest way to get a dataloader from your enriched dataset is to use the following function:
```python
from dagshub.data_engine import datasets

ds = datasets.get_dataset('<your_dataset_name>')
query_res = ds.all()
dl = query_res.as_ml_dataloader(flavor='torch|tensorflow')
```
In this case, the flavor will be either `torch` or `tensorflow`, respectively. Behind the scenes, Data Engine automatically guesses the data type from the path (there’s built-in support for images, audio, video, and numeric formats), and converts each file to a tensor you can use for training.
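The extension-based guess can be sketched in a few lines of plain Python. This is illustrative only; the extension sets here are assumptions, not DagsHub’s actual mapping:

```python
import os

# Assumed extension -> datatype mapping, for illustration only
EXTENSION_MAP = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".wav": "audio", ".mp3": "audio",
    ".mp4": "video", ".avi": "video",
}

def guess_datatype(path: str) -> str:
    """Guess a column's datatype from its file extension."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_MAP.get(ext, "unknown")

print(guess_datatype("data/cat_001.PNG"))  # image
print(guess_datatype("notes.txt"))         # unknown
```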
By default, files are downloaded from your project only as requested, so you don’t download anything you don’t use. This allows you to start training immediately without needing to first download the entire dataset.
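The on-demand behaviour amounts to downloading and caching a file the first time its index is requested. A rough pure-Python sketch of the idea (hypothetical, not DagsHub’s implementation):

```python
class LazyLoader:
    """Download each item only on first access, then serve it from a cache."""

    def __init__(self, remote_paths, download_fn):
        self.remote_paths = remote_paths
        self.download_fn = download_fn  # e.g. fetches bytes from remote storage
        self._cache = {}

    def __getitem__(self, idx):
        if idx not in self._cache:
            # Only now does any network traffic happen for this item
            self._cache[idx] = self.download_fn(self.remote_paths[idx])
        return self._cache[idx]

downloads = []
def fake_download(path):
    downloads.append(path)  # record that a "download" happened
    return f"contents of {path}"

loader = LazyLoader(["a.png", "b.png", "c.png"], fake_download)
loader[1]
loader[1]          # second access hits the cache
print(downloads)   # ['b.png'] -- only the requested item was fetched
```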
However, all of this is customizable. Below is a full explanation of the dataloader’s functionality.
Given the above `query_res` object, you can convert it to a dataloader with the following options:
```python
dl = query_res.as_ml_dataloader(
    flavor='torch|tensorflow',
    metadata_columns=["ColumnName1", "ColumnName2"],         # Default: None
    strategy='preload|background|lazy',                      # Default: lazy
    savedir='/path/to/saved/dataset',                        # Default: None
    processes=8,                                             # Default: 8
    tensorizers='auto' | ['image|audio|video', <function>],  # Default: auto
    **kwargs  # Pass additional arguments to the dataloader object, e.g. batch size
)
```
`flavor`
: Supports `torch` or `tensorflow`. Choose the one you need depending on your training framework.

`metadata_columns`
: To support extensible multimodal data use cases with varying model I/O, as well as getting labels into your dataloader, we added the `metadata_columns` argument. Select which columns you would like to extract from the metadata by passing a list of metadata column names as strings. The dataloader will return a list of all the tensorized columns.

`strategy`
: Dataloaders stream data from the cloud remote in order to facilitate training. There are multiple ways of doing this, and the best strategy usually depends on the compute at hand, as well as the scenario in which you are using the dataloader. You can choose from the following inputs:
    - `'lazy'`: Downloads items as their indices are requested. Intended for compute or storage hardware where keeping your entire dataset at hand is not possible.
    - `'background'`: As its name suggests, downloads data in the background while letting you continue working in your IDE. If an item is requested that isn’t already downloaded, that item is prioritized and downloaded immediately. This is ideal for interactive work, or when the time it takes to train on a batch is much longer than the time it takes to download one.
    - `'preload'`: Downloads all the data before the function returns the dataloader object. This is best when training on batches is much faster than downloading them, or when avoiding dataloader delays is critical (e.g. GPU clusters where jobs have strict timeouts).

`tensorizers`
: Most data isn’t stored as tensors, and must be converted into tensor format for training. Since objects can be of different types (file paths, annotations, numbers, etc.), you can let Data Engine guess how best to convert each data and metadata column to tensor format, or provide a list of custom functions that take the column value and convert it to the appropriate tensor. This helps in cases where you need custom processing, for example normalizing an image before inputting it into your model. The supported options:
    - `'auto'`: The data types of the columns will be automatically detected and tensorized, by checking each file extension for a match:
      ```python
      dl = query_res.as_ml_dataloader(flavor='torch|tensorflow')
      # Manually detecting file columns; this may take a second.
      # `tensorizers` set to 'auto'; guessing the datatypes
      ```
    - `'image'|'audio'|'video'`: You can provide a single string if all the datatypes within the columns are the same. For example, in an image-to-image task, you can just pass one string, and the same tensorizer will be used in all cases:

      ```python
      dl = query_res.all().as_ml_dataloader(flavor='torch|tensorflow', metadata_columns=["label_path"], tensorizers='image')
      ```
      For multi-modal data, such as image-to-video, you can provide a list of datatype strings; each type will be sequentially matched with its column:

      ```python
      dl = query_res.all().as_ml_dataloader(flavor='torch|tensorflow', metadata_columns=["video_path"], tensorizers=['image','video'])
      ```
    - `<function>`: A list of custom functions with the signature `filepath: str → <torch.Tensor>|<tf.Tensor>`, depending on your framework of choice. The output type specification is not enforced, which means that if you have a model that takes as input, say, a dictionary built from a single column of your dataset, you can write a tensorizer that returns just that, and it won’t complain. For example, to normalize images in an image-to-image task:

      ```python
      def image_norm(file: str) -> torch.Tensor:
          # Read the image and min-max normalize it to the [0, 1] range
          img_tensor = torchvision.io.read_image(file).type(torch.float)
          img_tensor = (img_tensor - torch.min(img_tensor)) / (torch.max(img_tensor) - torch.min(img_tensor))
          return img_tensor

      dl = query_res.all().as_ml_dataloader(flavor='torch', metadata_columns=["label_path"], batch_size=4,
                                            tensorizers=[image_norm, image_norm])
      ```
`**kwargs`
: See the Torch documentation or TensorFlow documentation for additional keyword arguments you can supply to the dataloader.

!!! important "List of tensorizers"
    In all cases where you supply a list of tensorizers, the number of items in that list should be one more than the number of items in the `metadata_columns` list. The first tensorizer is always applied to the input data path.
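This pairing rule can be illustrated with a small sketch: given a tensorizer list of length `len(metadata_columns) + 1`, the first entry handles the input data path and the rest line up with the metadata columns one-to-one (plain-Python illustration, not DagsHub’s code):

```python
def apply_tensorizers(data_path, metadata_values, tensorizers):
    """First tensorizer -> input data path; the rest -> metadata columns."""
    assert len(tensorizers) == len(metadata_values) + 1, \
        "need one more tensorizer than metadata columns"
    outputs = [tensorizers[0](data_path)]
    for value, fn in zip(metadata_values, tensorizers[1:]):
        outputs.append(fn(value))
    return outputs

# Toy "tensorizers": tag each value with its modality
as_image = lambda p: ("image", p)
as_label = lambda p: ("label", p)

row = apply_tensorizers("cat.png", ["cat_mask.png"], [as_image, as_label])
print(row)  # [('image', 'cat.png'), ('label', 'cat_mask.png')]
```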