Streaming 40+ Open Source Audio Datasets for free with 3 lines of code

Direct Data Access Nov 14, 2022

In this short article, I'll share with you an awesome resource for finding open source datasets, and an easy way to stream them for any use.

If you didn't know, streaming means you don't need to wait for the whole download to finish: files are fetched while you train your model. That way, training can start sooner and you save money on cloud computing :)
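To make the idea concrete, here's a toy sketch in plain Python with no DagsHub involved. The `stream` helper, `fake_fetch`, and the file names are all made up for illustration. A generator fetches each file only when the loop asks for it, so the first sample is ready after one fetch instead of after all of them.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def stream(names: Iterable[str],
           fetch: Callable[[str], bytes]) -> Iterator[Tuple[str, bytes]]:
    """Lazily yield (name, content) pairs, fetching a file only when it is requested."""
    for name in names:
        yield name, fetch(name)


# Hypothetical stand-in for a network download; it records every file fetched so far.
fetched: List[str] = []


def fake_fetch(name: str) -> bytes:
    fetched.append(name)
    return b"audio-bytes-of-" + name.encode()


samples = stream(["a.wav", "b.wav", "c.wav"], fake_fetch)
first_name, first_bytes = next(samples)  # only "a.wav" has been fetched at this point
print(first_name, len(fetched))  # prints "a.wav 1"
```

The other two files are fetched only if and when the loop keeps going, which is the whole trick behind streaming.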

Step 1: Pick a dataset you like from the DagsHub Explore Datasets page.

Step 2: Simple setup

pip3 install dagshub
dagshub login # you'll be prompted with instructions to authenticate

Step 3: Stream the dataset using the following snippet:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(repo_url="<url-of-your-chosen-dataset>",
                       project_root=".")

for f in fs.listdir("audio-train"):  # <- a folder inside the repository
    print(f)  # <- f is just the file name; use fs.open("audio-train/" + f) to access the content of the file
    # Do data science idk
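If you want the file contents and not just the names, a small helper along these lines could work. It's a sketch, not part of the dagshub package: `iter_files` and `LocalFS` are hypothetical names, and it assumes `fs.open` accepts a mode argument like Python's built-in `open`. The helper only relies on `listdir` and `open`, the two calls used in the snippet above, so here it runs against a local stand-in instead of a real repository.

```python
import os
import posixpath
import tempfile
from typing import Iterator, List, Tuple


def iter_files(fs, folder: str, suffix: str = ".wav") -> Iterator[Tuple[str, bytes]]:
    """Yield (path, raw bytes) for each file in `folder` whose name ends with `suffix`.

    `fs` is any object with listdir(path) and open(path, mode).
    """
    for name in sorted(fs.listdir(folder)):
        if not name.endswith(suffix):
            continue
        path = posixpath.join(folder, name)  # listdir gives bare names, so rebuild the path
        with fs.open(path, "rb") as handle:
            yield path, handle.read()


class LocalFS:
    """Hypothetical local stand-in so the sketch runs without a DagsHub repository."""

    def __init__(self, root: str) -> None:
        self.root = root

    def listdir(self, path: str) -> List[str]:
        return os.listdir(os.path.join(self.root, path))

    def open(self, path: str, mode: str = "r"):
        return open(os.path.join(self.root, path), mode)


# Build a tiny fake dataset folder: two "audio" files and one file to be skipped.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "audio-train"))
for name, data in [("a.wav", b"RIFF-a"), ("b.wav", b"RIFF-b"), ("notes.txt", b"skip me")]:
    with open(os.path.join(root, "audio-train", name), "wb") as out:
        out.write(data)

results = list(iter_files(LocalFS(root), "audio-train"))
for path, data in results:
    print(path, len(data))
```

With a real dataset you would pass the `fs` object from Step 3 instead of `LocalFS(root)`. Because the helper is a generator, each file is read only when your loop reaches it.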

Congrats! You just unlocked a whole new world of datasets and data science-ing.

For additional information, read the docs.