If you use this data, please cite the Flickr 8k Audio Caption Corpus paper as well as the original Flickr 8k text caption corpus.
You can download the original Flickr 8k corpus of text captions here: https://forms.illinois.edu/sec/1713398
The dataset is uploaded to DagsHub: Flickr-Audio-Caption-Corpus, which allows you to preview parts of the dataset before downloading it.
This data is distributed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
Here is a brief description of what is included in the Flickr 8k audio data:

- The wavs/ directory contains 40,000 spoken audio captions in .wav format, one for each caption in the train, dev, and test splits of the original Flickr 8k corpus (as defined by the files Flickr_8k.trainImages.txt, Flickr_8k.devImages.txt, and Flickr_8k.testImages.txt).
- The audio is sampled at 16,000 Hz with 16-bit depth and stored in Microsoft WAVE format.
- The file wav2capt.txt maps each .wav file name to its corresponding .jpg image and caption number. The .jpg file names and caption numbers can then be mapped to the caption text via the Flickr8k.token.txt file from the original Flickr 8k corpus (a parsing sketch follows this list).
- The file wav2spk.txt maps each .wav file name to its speaker. Unique speakers are numbered consecutively from 1 to 183 (the total number of unique speakers).
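A minimal parsing sketch in Python, assuming the field layouts implied above: wav2capt.txt lines of the form `<wav> <jpg> #<n>`, wav2spk.txt lines of the form `<wav> <speaker>`, and Flickr8k.token.txt lines of the form `<jpg>#<n><TAB><caption>`. The `flickr_audio/` directory name is hypothetical; verify the formats against your copy before relying on this.

```python
import wave
from pathlib import Path

root = Path("flickr_audio")  # hypothetical extraction directory

# jpg#n -> caption text, from the original Flickr 8k text corpus
tokens = {}
for line in open("Flickr8k.token.txt", encoding="utf-8"):
    key, caption = line.rstrip("\n").split("\t", 1)
    tokens[key] = caption

# wav file name -> speaker id (1..183)
speakers = {}
for line in open(root / "wav2spk.txt"):
    wav, spk = line.split()
    speakers[wav] = int(spk)

# wav file name -> (jpg, caption number) -> caption text
for line in open(root / "wav2capt.txt"):
    wav, jpg, num = line.split()   # num is assumed to look like "#0"
    print(wav, speakers[wav], tokens[jpg + num])

# spot-check the advertised audio format on one clip (16 kHz, 16-bit PCM)
example = next(iter(speakers))
with wave.open(str(root / "wavs" / example), "rb") as w:
    assert w.getframerate() == 16000 and w.getsampwidth() == 2
```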
Flickr Audio Corpus (4.2 GB): Download gzip'd tar file
MD5 checksum: 9d078f1f15
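Before extracting, it is worth verifying the download. A quick check with Python's hashlib, assuming the archive is saved as `flickr_audio.tar.gz` (a hypothetical name); only the checksum prefix is reproduced above, so compare against the full published value:

```python
import hashlib

h = hashlib.md5()
with open("flickr_audio.tar.gz", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)
print(h.hexdigest())  # should start with 9d078f1f15
```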
This open-source contribution is part of DagsHub x Hacktoberfest.
This dataset is uploaded to DagsHub, enabling you to preview (i.e. prehear) it before downloading.
Note: at the moment, only a sample of the whole dataset is uploaded:

- train/crowd/0
- test
- *.jsonl
Golos is a Russian corpus suitable for speech research. The dataset consists mainly of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1,240 hours.
For an overview, see the research article; for a detailed account and acoustic models, consult the original GitHub repository.
```
.
├── test
│   ├── crowd
│   │   └── files
│   └── farfield
│       └── files
└── train
    ├── crowd
    │   ├── 0
    │   ├── 1 (not uploaded yet)
    │   ├── 2 (not uploaded yet)
    │   ├── 3 (not uploaded yet)
    │   ├── 4 (not uploaded yet)
    │   ├── 5 (not uploaded yet)
    │   ├── 6 (not uploaded yet)
    │   ├── 7 (not uploaded yet)
    │   ├── 8 (not uploaded yet)
    │   └── 9 (not uploaded yet)
    ├── farfield
    ├── 1hour.jsonl
    ├── 10hours.jsonl
    ├── 10min.jsonl
    ├── 100hours.jsonl
    └── manifest.jsonl
```
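Each *.jsonl manifest holds one JSON record per line. A minimal sketch for tallying the duration of a subset, assuming the common NeMo-style fields `audio_filepath`, `text`, and `duration` (an assumption; inspect one line of your copy to confirm):

```python
import json

total_seconds = 0.0
with open("train/10min.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        total_seconds += rec.get("duration", 0.0)  # assumed field name
        # rec["audio_filepath"] and rec["text"] are likewise assumed
print(f"{total_seconds / 3600:.2f} hours in this subset")
```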
| Domain   | Train files | Train hours | Test files | Test hours |
|----------|-------------|-------------|------------|------------|
| Crowd    | 979,796     | 1,095       | 9,994      | 11.2       |
| Farfield | 124,003     | 132.4       | 1,916      | 1.4        |
| Total    | 1,103,799   | 1,227.4     | 11,910     | 12.6       |
This work is licensed under a variant of the Creative Commons Attribution-ShareAlike 4.0 International License. Please see the specific license for details.
Authors:
Alexander Denisenko
Angelina Kovalenko
Fedor Minkin
Nikolay Karpov
You can cite the data using the following BibTeX entry:

```bibtex
@article{karpov2021golos,
  title={Golos: Russian Dataset for Speech Research},
  author={Karpov, Nikolay and Denisenko, Alexander and Minkin, Fedor},
  journal={arXiv preprint arXiv:2106.10161},
  year={2021}
}
```
This dataset is uploaded to DagsHub, enabling you to preview (i.e. prehear) it before downloading.
Traditional speech and language technologies are trained on massive amounts of text and/or expert knowledge. This is not sustainable: the majority of the world's languages have no reliable textual or expert resources. Even in high-resource languages, there is a large domain mismatch between oral and written uses of language.
But infants learn to speak their native language, spontaneously, from raw sensory input, without supervision from text or linguists. It should be possible to do the same in machines!
The ultimate goal of the “Zero Resource Speech Challenge” is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only the raw sensory information available to an early language learner.
For a more detailed account and the competition timeline see here. For an overview of previous competitions see the challenge website.
```
.
├── lexical
│   ├── dev
│   │   ├── aAAfmkmQpVz.wav
│   │   └── ...
│   └── test
│       ├── aaaDSGZhrtbq.wav
│       └── ...
├── phonetic
│   ├── dev-clean
│   │   ├── 84-121123-0000.wav
│   │   └── ...
│   ├── dev-other
│   │   ├── 116-288045-0000.wav
│   │   └── ...
│   ├── test-clean
│   │   ├── 61-70968-0000.wav
│   │   └── ...
│   └── test-other
│       ├── 367-130732-0000.wav
│       └── ...
├── semantic
│   ├── dev
│   │   ├── librispeech
│   │   │   ├── aaRpeKDRbj.wav
│   │   │   └── ...
│   │   ├── synthetic
│   │   │   ├── aAbcsWWKCz.wav
│   │   │   └── ...
│   │   ├── gold.csv
│   │   └── pairs.csv
│   └── test
│       ├── librispeech
│       │   ├── AabWUdQiJx.wav
│       │   └── ...
│       └── synthetic
│           ├── aaEGIphSpE.wav
│           └── ...
└── syntactic
    ├── dev
    │   ├── aaacEBDmoCU.wav
    │   └── ...
    └── test
        ├── aAAAZvtMsGyf.wav
        └── ...
```
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Challenge Organizing Committee https://zerospeech.com/2021/#challenge-organizing-committee
The dataset contains pieces of the LibriSpeech dataset released under a Creative Commons Attribution 4.0 International License (LibriSpeech (c) 2014 by Vassil Panayotov):
phonetic/*.wav
semantic/{dev,test}/librispeech