#9 fix grammatical errors in URDU-Dataset documentation

Merged
Nir Barazida merged 12 commits into DagsHub:main from idivyanshbansal:patch-1
3 changed files with 186 additions and 11 deletions
  1. EmoSynth/README.md (+73, -0)
  2. URDU-Dataset/README.md (+14, -11)
  3. voice_gender_detection/README.md (+99, -0)

EmoSynth: The Emotional Synthetic Audio Dataset

DOI

1. General information

The ability of sound to enhance human wellbeing has been known since ancient civilizations, and methods can be found today across domains of health and within a variety of cultures. EmoSynth is a dataset of 144 audio files which have been labelled by 40 listeners for their perceived emotion, with regard to the dimensions of Valence and Arousal.

A similar version of the dataset is uploaded to DagsHub: EmoSynth, enabling you to preview the dataset before downloading it.

2. Organization of the dataset

The dataset is small (106MB) and simple to navigate, as it has only one folder containing synthetic audio files. There is also an audio_labels.csv file, which contains details about the classification of the audio based on the dimensions of Valence and Arousal. Each audio file is approximately 5 seconds long and 430 KB in size.

For the best experience, keep your volume high when listening to the sounds.

<root directory>
    |
    .- README.md
    |
    .- meta.txt
    |
    .- citation.txt
    |
    .- audio_labels.csv
    |
    .- Audio-Data/
          |
          .- s1_a0_d1.wav
          |
          .- s1_a0_d2.wav
          |
          .- s1_a1_d1.wav
          | ...
  • meta.txt: contains information about how the labels were collected.
  • citation.txt: contains citation data for the research paper.
  • audio_labels.csv: contains labels of the audio based on perceived listener ratings from 1-5.
    • audio_file: wav audio file name
    • valence: average rating of valence (1~5)
    • arousal: average rating of arousal (1~5)
    • round_val: rounded valence mean rating
    • round_ar: rounded arousal mean rating
    • round_val_sd: rounded valence standard deviation
    • round_ar_sd: rounded arousal standard deviation
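
For illustration only, here is a minimal sketch of loading the labels and one clip, assuming `pandas` and `librosa` are installed (these libraries are an assumption, not part of the dataset's own tooling); the column names follow the list above and the paths are relative to the dataset root:

```python
import pandas as pd
import librosa

# Load the Valence/Arousal ratings described in the column list above.
labels = pd.read_csv("audio_labels.csv")
print(labels[["audio_file", "valence", "arousal"]].head())

# Load one of the ~5-second synthetic clips at its native sample rate.
y, sr = librosa.load("Audio-Data/s1_a0_d1.wav", sr=None)
print(f"{len(y) / sr:.1f} s at {sr} Hz")
```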

3. Results

Results on the dataset show that Arousal correlates moderately with fundamental frequency, and that the sine waveform is perceived as significantly different from the square and sawtooth waveforms when evaluating perceived Arousal. The general results suggest that isolated synthetic audio can be modelled as a means of evoking affective states of emotion.

Acknowledgments

First, I would like to thank Alice Baird, Emilia Parada-Cabaleiro, Cameron Fraser, Simone Hantke, and Björn Schuller for publishing the dataset on Zenodo and explaining the results. Secondly, I would like to thank Zenodo for maintaining this amazing open-source dataset.


Alice Baird; Emilia Parada-Cabaleiro, Aug 20, 2019


Original Dataset: EmoSynth | Zenodo

DAGsHub Dataset: kingabzpro/EmoSynth

Photo by Jonathan Borba on Unsplash


This open source contribution is part of DagsHub x Hacktoberfest

Some lines were truncated since they exceed the maximum allowed length of 500, please use a local Git client to see the full diff.
````diff
@@ -2,28 +2,31 @@

 ## 1. General information

-URDU dataset contains emotional utterances of Urdu speech gathered from Urdu talk shows. It contains 400 utterances of four basic emotions: Angry, Happy, Neutral, and Emotion. There are 38 speakers (27 male and 11 female). This data is created from YouTube. Speakers are selected randomly.
+The URDU dataset contains emotional utterances of Urdu speech gathered from Urdu talk shows. There are 400 utterances of four basic emotions in the book: Angry, Happy, Neutral, and Emotion. There are 38 speakers (27 male and 11 female). This data is created from YouTube. Speakers are randomly selected.

 **The similar version of dataset is uploaded to DagsHub: [URDU-Dataset](https://dagshub.com/kingabzpro/URDU-Dataset), enabling you to preview the dataset before downloading it.**

 ## 2. Structure

-Nomenclature followed while naming the files in the dataset is to provide information about the speaker, gender, number of the file for that speaker and overall numbering of the file in particular emotion. Files are named as follows:
+Nomenclature followed while naming the files in the dataset is to provide information about the speaker, gender, number of the file for that speaker, and overall numbering of the file in a particular emotion. Files are named as follows:

 General Name: SGX_FXX_EYY

-For more details about dataset, please refer the complete paper "Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages". https://arxiv.org/pdf/1812.10411.pdf
+For more details about the dataset, please refer to the complete paper "Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages". https://arxiv.org/pdf/1812.10411.pdf

 ### 2.1 Audio files:

 * [21M] Angry - of approximately 242 seconds of "clean" speech in `.wav` format **(pushed to DagsHub)**
+
 * [21M] Happy - of approximately 244 seconds of "clean" speech `.wav` format **(pushed to DagsHub)**
+
 * [21M] Neutral - of approximately 244 seconds of "clean" speech `.wav` format **(pushed to DagsHub)**
+
 * [21M] Sad - of approximately 244 seconds of "clean" speech `.wav` format **(pushed to DagsHub)**

 ### 2.2 Organization of the Emotions dataset

-The dataset is small (88MB) and simple to navigate as it has 4 folders based on emotions. Each folder contains 100 `.wav` audio files containing the emotions of Urdu speakers. The audio file range from 2~3 second of audio taken from a various video uploaded on YouTube. The following ASCII diagram depicts the directory structure.
+The dataset is small (88MB) and simple to navigate as it has 4 folders based on emotions. Each folder contains 100 `.wav` audio files containing the emotions of Urdu speakers. The audio file range from 2~3 second of audio taken from a various video uploaded on YouTube. A representation of the directory structure can be seen in the ASCII diagram below.

 ```
 <root directory>
@@ -49,27 +52,27 @@ The name of audio file can be divided into three segments which is segregated by

 Where,

-- In SGX, G indicates the gender of the speaker either it can be M for male speaker and F for female speaker, while X represents the speaker ID which remains the same for a particular speaker in all the emotions.
+- In SGX, G indicates the gender of the speaker, which can be M for male speaker and F for female speaker, while X represents the speaker ID which remains the same for a particular speaker in all the emotions.

-- In FXX, F is a keyword presenting file and XX indicates the number of file for particular speaker.
+- In FXX, F is a keyword presenting file and XX indicates the number of files for a particular speaker.

-- In EYY, E provides the information about emotion i.e., A,H,N and S for Angery, Happy, Neutral and Sad. respectively.
+- In EYY, E provides the information about emotion i.e., A, H, N, and S for Angry, Happy, Neutral, and Sad. respectively.

-For example, file name SM1_F01_A12 indicates that this is 1st file recorded by speaker No. 1 and A12 indicates that this is 12th file of Angery emotion.
+For example, file name SM1_F01_A12 indicates that this is 1st file recorded by speaker No. 1 and A12 indicates that this is the 12th file of Angry emotion.

 ## 3. Use Case

-Cross-lingual speech emotion recognition is an important task for practical applications. The performance of automatic speech emotion recognition systems degrades in cross corpus scenarios, particularly in scenarios involving multiple languages or a previously unseen language such as Urdu for which limited or no data is available.
+Cross lingual speech emotion recognition is an important task for practical applications. The performance of automatic speech emotion recognition systems degrades in cross-corpus scenarios, particularly in scenarios involving multiple languages or a previously unseen language such as Urdu for which limited or no data is available.

 ## 4. Results

-The data of multiple languages are used for training, results for emotion detection is increased even for URDU dataset, which is highly dissimilar from other databases. Also, accuracy boosted when a small fraction of testing data is included in the training of the model with single corpus. These findings would be very helpful for designing a robust emotion recognition systems even for the languages having limited or no dataset. [Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languag
+The data of multiple languages are used for training. Results for emotion detection are increased even for the URDU dataset, which is highly dissimilar from other databases. Also, accuracy is boosted when a small fraction of testing data is included in the training of the model with a single corpus. These findings would be very helpful for designing robust emotion recognition systems even for languages having limited or no dataset. [Cross Lingual Speech Emotion Recognition: Urdu vs. Western Lan

 ## Acknowledgments

 First and foremost, I would like to thank Siddique Latif and his team from *Information Technology University (ITU)-Punjab* and *COMSATS University Islamabad (CUI), Islamabad* for pushing the audio dataset to GitHub. 

-We would like to thank Farwa Anees, Muhammad Usman, Muhammad Atif, and Farid Ullah Khan for assisting us in preparation of URDU dataset.
+We would like to thank Farwa Anees, Muhammad Usman, Muhammad Atif, and Farid Ullah Khan for assisting us in the preparation of the URDU dataset.

 ---
````

Voice Gender Detection

1. General information

A cleaned dataset for voice gender detection using the VoxCeleb dataset (7000+ unique speakers and utterances, 3683 males / 2312 females). VoxCeleb is an audio-visual dataset consisting of short clips of human speech, extracted from interview videos uploaded to YouTube. VoxCeleb contains speech from speakers spanning a wide range of different ethnicities, accents, professions, and ages.

A similar version of the dataset is uploaded to DagsHub, enabling you to preview the dataset before downloading it.

2. Data Preprocessing

The author downloaded all the files from VoxCeleb2. After this, he cleaned the data to separate all the males from the females. He took one voice file at random for each male and female so as to provide unique files.


To prepare the dataset, he put the 'males' and 'females' folders in the data directory of this repository. This allows us to featurize the files and train machine learning models via the provided training scripts.


3. Audio File Conversion

The original files that I downloaded were in .m4a format, which is not detectable by DagsHub audio visualization, so I used a Python script to convert m4a files to wav files (github.com) to convert my dataset into .wav format. I ran the code for the males and females folders separately.
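
The script linked above is not reproduced here; purely as an illustrative sketch (assuming `pydub` with ffmpeg installed, which is not necessarily what the linked script uses), the conversion could look roughly like this:

```python
from pathlib import Path

from pydub import AudioSegment  # relies on ffmpeg being installed

def convert_folder(folder: str) -> None:
    """Convert every .m4a file in `folder` to a .wav file alongside the original."""
    for m4a_path in Path(folder).glob("*.m4a"):
        audio = AudioSegment.from_file(str(m4a_path), format="m4a")
        audio.export(str(m4a_path.with_suffix(".wav")), format="wav")

# Mirroring the README, run the conversion for the males and females folders separately.
convert_folder("males")
convert_folder("females")
```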

4. Organization of the dataset

The dataset is large (1.26GB) and simple to navigate, as it has 2 folders based on binary gender. The males folder contains 3682 .wav audio files from unique speakers all over the world. Similar to the males folder, we have a females folder containing 2312 .wav files of unique female speakers. The audio files range from 5~30 seconds in duration and are approximately 194 KB in size. The following ASCII diagram depicts the directory structure.

<root directory>
    |
    .- README.md
    |
    .- fileconvert.py
    |
    .- females/
    |
    .- males/
          |
          .- 0.wav
          |
          .- 1.wav
          |
          .- 2.wav
          | ...
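
As a sketch of how this layout might be traversed (the helper below is an illustrative assumption and not part of the repository's training scripts; it assumes `pandas`):

```python
import os

import pandas as pd

def list_wav_files(root: str = ".") -> pd.DataFrame:
    """Collect every .wav path under males/ and females/ together with its gender label."""
    rows = []
    for gender in ("males", "females"):
        folder = os.path.join(root, gender)
        for name in sorted(os.listdir(folder)):
            if name.endswith(".wav"):
                rows.append({"path": os.path.join(folder, name), "gender": gender})
    return pd.DataFrame(rows)

files = list_wav_files()
print(files["gender"].value_counts())  # expected: 3682 males, 2312 females
```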

5. Use Case & Results

The dataset is used to train a machine learning model to distinguish males from females from audio files (90.7% +/- 1.3% accuracy). You can find more about the code and results here.

| Classifier | Accuracy | +/- |
| --- | --- | --- |
| Decision tree | 0.7398596519424567 | 0.007327676542764603 |
| Gaussian NB | 0.8682797740896762 | 0.016660391044338484 |
| SKlearn classifier | 0.5157270607408913 | 0.00079538963465451 |
| Adaboost classifier | 0.8892763651333413 | 0.013940745120583124 |
| Gradient boosting | 0.8669747415791165 | 0.01950292233912751 |
| Logistic regression | 0.894515837971657 | 0.012678238150779661 |
| Hard voting | 0.9076178049591996 | 0.013226860908589952 |
| K Nearest Neighbors | 0.731352177051436 | 0.017244722910655787 |
| Random forest | 0.8079923672086033 | 0.02258623279374182 |
| svm | 0.8781480823563248 | 0.022841304608332974 |
The most accurate classifier is Hard Voting with audio features (MFCC coefficients).
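
As an illustrative sketch only (not the repository's actual training code), a hard-voting ensemble over MFCC features could be assembled roughly as follows, assuming `librosa` and `scikit-learn` and reusing the hypothetical `files` table from the sketch above:

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def mfcc_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Summarise a clip as the mean and standard deviation of its MFCC coefficients."""
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# `files` is the hypothetical DataFrame built in the traversal sketch above.
X = np.vstack([mfcc_features(p) for p in files["path"]])
y = (files["gender"] == "females").astype(int).to_numpy()

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
scores = cross_val_score(voting, X, y, cv=5)
print(f"hard voting accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The repository itself evaluates the wider set of classifiers listed in the table above; this snippet only shows the general shape of the voting approach.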

Acknowledgments

First, I would like to thank Jim Schwoebel for publishing the dataset on GitHub and explaining in depth how to use it. Secondly, I would like to thank VoxCeleb for providing this amazing open-source dataset.

VoxCeleb is supported by the EPSRC programme grant Seebibyte EP/M013774/1: Visual Search for the Era of Big Data.

License


Jim Schwoebel, Aug 8, 2020


Original Dataset: Voice Gender Detection

DAGsHub Dataset: kingabzpro/voice_gender_detection


This open source contribution is part of DagsHub x Hacktoberfest
