Updated 2 years ago
Path: .
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery.
Updated 2 years ago
Path: .
CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI) is a dataset of opinion level sentiment intensity in online videos. It contains 2199 opinion utterances with sentiment annotated between very negative to very positive in seven Likert steps.
Updated 2 years ago
Path: .
This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.
Updated 2 years ago
Path: .
The DAPS (Device and Produced Speech) dataset is a collection of aligned versions of professionally produced studio speech recordings and recordings of the same speech on common consumer devices (tablet and smartphone) in real-world environments.