Training a model with your dataset

Now that you've created new datasets, the next step is to train and improve your model with them. Data Engine supports the PyTorch and TensorFlow frameworks out of the box, making it easy to create data loaders and datasets for your model training.
Working with Dataloaders

Data Engine's DataLoaders extend the following classes:

- `torch.utils.data.DataLoader` for PyTorch
- `tf.keras.utils.Sequence` for TensorFlow
- `datasets.arrow_dataset.Dataset` for HuggingFace Datasets
Info: For HuggingFace Datasets, the paths of the downloaded datapoints are passed, and can be used with `cast_column()` for conversions to a format required by your framework of choice.
For each `QueryResult`, you can get a native dataset for PyTorch, TensorFlow, or HuggingFace, which you can use to train your model.
Creating a Dataloader
The easiest way to get a dataloader from your enriched dataset is to use the following functions, depending on your framework:
```python
# PyTorch
from dagshub.data_engine import datasets

ds = datasets.get_dataset('<your_dataset_name>')
query_res = ds.all()
dl = query_res.as_ml_dataloader(flavor='torch')
```
```python
# TensorFlow
from dagshub.data_engine import datasets

ds = datasets.get_dataset('<your_dataset_name>')
query_res = ds.all()
dl = query_res.as_ml_dataloader(flavor='tensorflow')
```
```python
# HuggingFace Datasets
from dagshub.data_engine import datasets
from datasets import ClassLabel, Image, Audio

ds = datasets.get_dataset('<your_dataset_name>')
query_res = ds.all()
ds = query_res.as_hf_dataset().select_columns(['desired', 'column', 'names'])
ds = ds.cast_column('label_column', ClassLabel(names=['class 1', 'class 2']))

# Example: to cast image files to the proper format for HF datasets
ds = ds.cast_column('path', Image())

# Example: to cast audio files to the proper format for HF datasets
ds = ds.cast_column('path', Audio())
```
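Once the columns are cast, indexing the dataset decodes the files on access. A minimal sketch, assuming the hypothetical column names from the example above and an `Image()` cast on `'path'`:

```python
# After cast_column('path', Image()), indexing decodes the file automatically
sample = ds[0]
img = sample['path']            # a PIL.Image.Image
label = sample['label_column']  # an integer class index after the ClassLabel cast
```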
Info: For more information on how to best utilize HuggingFace datasets, check out the HuggingFace documentation.
For PyTorch and TensorFlow, Data Engine automatically guesses the data type from the path (there’s built-in support for images, audio, video, and numeric formats), and converts each file to a tensor you can use for training.
By default, files are downloaded from your project only as requested, so you don’t download anything you don’t use. This allows you to start training immediately without needing to first download the entire dataset.
However, all of this is customizable. Below is a full explanation of the dataloader functionality.
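Before diving into the options, here is a minimal training-loop sketch with a PyTorch dataloader. The `'label'` column name is hypothetical, `model`, `loss_fn`, and `optimizer` are assumed to be defined elsewhere, and we assume each batch yields the tensorized columns in order (datapoint first, then metadata columns), as described below:

```python
# Lazy loading (the default) means files are fetched only as batches request them
dl = query_res.as_ml_dataloader(flavor='torch', metadata_columns=['label'], batch_size=4)

for inputs, labels in dl:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)  # model/loss_fn/optimizer: defined elsewhere
    loss.backward()
    optimizer.step()
```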
Full Dataloader Functionality
Given the above `query_res` object, you can convert it to a dataloader with the following options:
```python
# PyTorch
dl = query_res.as_ml_dataloader(
    flavor='torch',
    metadata_columns=["ColumnName1", "ColumnName2"],  # Default: None
    strategy='preload|background|lazy',               # Default: 'lazy'
    savedir='/path/to/saved/dataset',                 # Default: None
    processes=8,                                      # Default: 8
    tensorizers='auto' | ['image|audio|video', <function>] | <function>,  # Default: 'auto'
    **kwargs  # Pass additional arguments for the dataloader object, e.g. batch size
)
```

```python
# TensorFlow
dl = query_res.as_ml_dataloader(
    flavor='tensorflow',
    metadata_columns=["ColumnName1", "ColumnName2"],  # Default: None
    strategy='preload|background|lazy',               # Default: 'lazy'
    savedir='/path/to/saved/dataset',                 # Default: None
    processes=8,                                      # Default: 8
    tensorizers='auto' | ['image|audio|video', <function>] | <function>,  # Default: 'auto'
    **kwargs  # Pass additional arguments for the dataloader object, e.g. batch size
)
```
- `flavor`: supports `torch` or `tensorflow`. Choose the one you need depending on your training framework.
- `metadata_columns`: To support extensible multimodal data use cases with varying model I/O, as well as getting labels into your dataloader, we added the `metadata_columns` argument. Select which columns you would like to extract from the metadata by specifying a list of metadata column names to `metadata_columns` as strings. The dataloaders will return a list of all the tensorized columns.
- `strategy`: Dataloaders stream data from the cloud remote in order to facilitate training. There are multiple ways of doing this, and the best strategy usually depends on the compute at hand as well as the scenario in which you are using the dataloaders. You can choose from the following inputs (see the combined sketch after this list):
    - `'lazy'`: Downloads datapoints as indices are requested. Intended for compute or storage hardware where keeping your entire dataset at hand is not possible.
    - `'background'`: As its name suggests, downloads data in the background while letting you continue to work in your IDE. If an item is requested that isn't already downloaded, that item is prioritized and downloaded immediately. This is ideal for interactive work, or when the time it takes to train on a batch is much longer than the time it takes to download one.
    - `'preload'`: Downloads all the data before the function returns the dataloader object. This is best when training on batches is much faster than downloading them, or when avoiding dataloader delays is critical (e.g., GPU clusters where jobs have strict timeouts).
- `tensorizers`: Most data isn't stored as tensors, and must be converted into tensor format for training. Since objects can be of different types (file paths, annotations, numbers, etc.), you can let Data Engine guess how best to convert each data and metadata column to tensor format, or provide a list of custom functions that take the column value and convert it to the appropriate tensor. This can help in cases where you need to do custom processing, for example normalizing an image, before inputting it into your model. The supported options:
    - `'auto'`: The data types of the columns are automatically detected and tensorized by checking the file extension for a match.
```python
# PyTorch
dl = query_res.as_ml_dataloader(flavor='torch')
# Manually detecting file columns; this may take a second.
# `tensorizers` set to 'auto'; guessing the datatypes
```

```python
# TensorFlow
dl = query_res.as_ml_dataloader(flavor='tensorflow')
# Manually detecting file columns; this may take a second.
# `tensorizers` set to 'auto'; guessing the datatypes
```
`'image'`|`'audio'`|`'video'`: You can provide a single string if all the datatypes within the columns are the same. For example, in an image-to-image task, you can pass a single string and the same tensorizer will be used for all columns:
```python
# PyTorch
dl = query_res.all().as_ml_dataloader(flavor='torch', metadata_columns=["label_path"], tensorizers='image')
```

```python
# TensorFlow
dl = query_res.all().as_ml_dataloader(flavor='tensorflow', metadata_columns=["label_path"], tensorizers='image')
```
For multi-modal data, such as image to video, you can provide a list of datatype strings; each type will be applied to the corresponding column in order:
```python
# PyTorch
dl = query_res.all().as_ml_dataloader(flavor='torch', metadata_columns=["video_path"], tensorizers=['image', 'video'])
```

```python
# TensorFlow
dl = query_res.all().as_ml_dataloader(flavor='tensorflow', metadata_columns=["video_path"], tensorizers=['image', 'video'])
```
You can also provide a single function that takes each field as input and merges or converts it to any desired format:
```python
# PyTorch
dl = query_res.all().as_ml_dataloader(flavor='torch', metadata_columns=["video_path"], tensorizers=<function>)
```

```python
# TensorFlow
dl = query_res.all().as_ml_dataloader(flavor='tensorflow', metadata_columns=["video_path"], tensorizers=<function>)
```
Custom functions: Data Engine also supports custom tensor conversion functions. You can supply a list of functions, each with the signature `filepath: str -> torch.Tensor | tf.Tensor`, depending on your framework of choice. The actual output type is not enforced, which means that if you have a model that takes as input a dictionary built from a single column of your dataset, you can write a tensorizer to do just that, and it won't complain. For example, to normalize images in an image-to-image task:

```python
import torch
import torchvision

def image_norm(file: str) -> torch.Tensor:
    # Read the image and min-max normalize it to the [0, 1] range
    img_tensor = torchvision.io.read_image(file).type(torch.float)
    img_tensor = (img_tensor - torch.min(img_tensor)) / (torch.max(img_tensor) - torch.min(img_tensor))
    return img_tensor

dl = query_res.all().as_ml_dataloader(flavor='torch', metadata_columns=["label_path"],
                                      batch_size=4, tensorizers=[image_norm, image_norm])
```
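A hypothetical TensorFlow counterpart of the same normalizer, shown as a sketch; it assumes the images are in a format `tf.io.decode_image` can handle, and otherwise mirrors the PyTorch version:

```python
import tensorflow as tf

def image_norm_tf(file: str) -> tf.Tensor:
    # Read the image and min-max normalize it to the [0, 1] range
    img = tf.cast(tf.io.decode_image(tf.io.read_file(file), expand_animations=False), tf.float32)
    return (img - tf.reduce_min(img)) / (tf.reduce_max(img) - tf.reduce_min(img))

dl = query_res.all().as_ml_dataloader(flavor='tensorflow', metadata_columns=["label_path"],
                                      batch_size=4, tensorizers=[image_norm_tf, image_norm_tf])
```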
- `**kwargs`: See the Torch documentation or TensorFlow documentation for additional keyword arguments you can supply to the data loader.
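As referenced in the `strategy` description above, here is a sketch combining several of these options; the column name, save path, and batch size are hypothetical:

```python
dl = query_res.as_ml_dataloader(
    flavor='torch',
    metadata_columns=['label'],   # hypothetical metadata column
    strategy='background',        # download in the background while you keep working
    savedir='./data_cache',       # keep downloaded files between runs
    batch_size=16,                # forwarded to torch.utils.data.DataLoader via **kwargs
)
```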
Note on lists of tensorizers: In all cases where you supply a list of tensorizers, the number of items in that list should be one more than the number of items in the `metadata_columns` list. The first tensorizer will always be applied to the input data path.
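For instance, reusing the `image_norm` function from above with a single metadata column:

```python
# One metadata column plus the datapoint path itself means two tensorizers:
# the first is applied to the datapoint path, the second to 'label_path'.
dl = query_res.all().as_ml_dataloader(
    flavor='torch',
    metadata_columns=['label_path'],
    tensorizers=[image_norm, image_norm],  # len(tensorizers) == len(metadata_columns) + 1
)
```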