This file describes some of the more important bits of information in the
SQLite databases stored in this directory. The plain text metadata files
(e.g. SPEAKERS.TXT) contain a subset of the information in these databases.
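
Since this file does not list every column, it may help to inspect the schema
of a database directly. The following is a minimal sketch using Python's
standard sqlite3 module. The helper name describe_tables is made up for
illustration, and the in-memory "demo" table only stands in for one of the real
database files.

```python
import sqlite3

def describe_tables(conn):
    """Map each table name to the list of its column names."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return {t: [col[1] for col in conn.execute(f'PRAGMA table_info("{t}")')]
            for t in tables}

# Demo on an in-memory database; for the real files one would instead call
# sqlite3.connect() with the path to e.g. ia.sqlite.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, title TEXT)")
schema = describe_tables(conn)
print(schema)
```

Running the same helper against the actual files shows the real table and
column names, which take precedence over any names assumed in the sketches in
this file.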

ia.sqlite.db
============

This file contains the metadata extracted about the LibriVox recordings
hosted on the Internet Archive. The database contains two tables:

* meta - each row represents a LibriVox (LV) project stored on the Internet
         Archive (IA). Columns hold the URL of the main IA page for the
         project, and the title and author of the book on which the recordings
         are based.

* mp3 - each row is an audio chapter. It references the project this chapter
        is part of (i.e. a row in 'meta'), the URL from which the .mp3 file can
        be downloaded, and its size and checksums. For the 64 kbit/s .mp3
        recordings, "parent_id" points to another row containing information
        about the 128 kbit/s version from which it was derived.
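
The "parent_id" self-reference described above can be followed with a self-join.
This is only a sketch on an in-memory database: apart from "parent_id", every
column name here (id, meta_id, url) and all the sample URLs are assumptions, not
the actual schema of ia.sqlite.db.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: only "parent_id" is named in this file.
conn.executescript("""
CREATE TABLE meta (id INTEGER PRIMARY KEY, url TEXT, title TEXT, author TEXT);
CREATE TABLE mp3 (
    id INTEGER PRIMARY KEY,
    meta_id INTEGER REFERENCES meta(id),
    url TEXT,
    parent_id INTEGER REFERENCES mp3(id)   -- NULL for the 128 kbit/s original
);
INSERT INTO meta VALUES (1, 'http://example.org/project', 'Some Book', 'Some Author');
INSERT INTO mp3 VALUES (10, 1, 'http://example.org/ch1_128kb.mp3', NULL);
INSERT INTO mp3 VALUES (11, 1, 'http://example.org/ch1_64kb.mp3', 10);
""")

# Self-join: pair each derived file with the file it was derived from.
pairs = conn.execute("""
    SELECT child.url, parent.url
    FROM mp3 AS child
    JOIN mp3 AS parent ON child.parent_id = parent.id
""").fetchall()
print(pairs)
```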

pg.sqlite.db
============

Contains the parts of the metadata provided by Project Gutenberg's XML/RDF files
that are relevant to the LibriSpeech corpus. The tables are:

* books - each row represents a PG book. Has columns for the book's title,
          various classification codes, and the URLs for the ASCII and/or UTF-8
          version of the text.

* authors - each row represents an author, and has a many-to-many relationship
            with the "books" table, i.e. a book can have more than one author
            and one author can be associated with more than one book.
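
A many-to-many relationship like the one above is usually stored through a link
table. The sketch below assumes such a table and calls it "book_authors". That
name, the column names and the sample rows are all hypothetical, and the real
pg.sqlite.db may be organized differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: "book_authors" and all column names are illustrative only.
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book_authors (
    book_id INTEGER REFERENCES books(id),
    author_id INTEGER REFERENCES authors(id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO books VALUES (1, 'Book A'), (2, 'Book B');
INSERT INTO authors VALUES (1, 'Author X'), (2, 'Author Y');
INSERT INTO book_authors VALUES (1, 1), (1, 2), (2, 1);
""")

def authors_of(book_id):
    """Return the names of all authors linked to the given book."""
    rows = conn.execute("""
        SELECT a.name FROM authors AS a
        JOIN book_authors AS ba ON ba.author_id = a.id
        WHERE ba.book_id = ?
        ORDER BY a.name
    """, (book_id,))
    return [name for (name,) in rows]

print(authors_of(1))
```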


lv-annotated.sqlite.db
======================

This database is a hodge-podge collection of various things that were relevant
for the alignment process and subsequent corpus creation.

First there are tables with information about LibriVox projects. This was
the original information in the database; the other things were then added
as needed:

* projects - each row describes a LibriVox audio book project. Has columns for
             the project's title, the URL for the associated LibriVox page, the
             number of audio chapters, total time in seconds, the URL for the
             relevant Internet Archive page and so on.

* audio_chapters - each row corresponds to an audio chapter of a LibriVox project.
                   The "project_id" column contains the foreign key pointing to
                   the parent project, and there are columns containing the
                   duration of the chapter in seconds, the URL from which it
                   can be downloaded and so on. Another foreign key links the
                   chapter to a row in the "readers" table below.

* readers - contains the ID and the name of a LibriVox volunteer.

* authors - contains the name and dates of birth and death of a book author

* genres - ID and name of the genres under which the LibriVox projects are 
           classified
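
As an illustration of how these tables relate, the sketch below sums the chapter
durations per reader. Only "project_id" is named in this file. The other column
names (reader_id, duration) and the sample rows are assumptions made for the
example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified versions of the "readers" and "audio_chapters" tables.
conn.executescript("""
CREATE TABLE readers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE audio_chapters (
    id INTEGER PRIMARY KEY,
    project_id INTEGER,
    reader_id INTEGER REFERENCES readers(id),
    duration REAL                       -- seconds
);
INSERT INTO readers VALUES (1, 'Reader One'), (2, 'Reader Two');
INSERT INTO audio_chapters VALUES
    (1, 100, 1, 600.0), (2, 100, 1, 300.0), (3, 101, 2, 450.0);
""")

# Total recorded seconds per reader, across all projects.
totals = conn.execute("""
    SELECT r.name, SUM(c.duration)
    FROM audio_chapters AS c
    JOIN readers AS r ON r.id = c.reader_id
    GROUP BY r.id
    ORDER BY r.name
""").fetchall()
print(totals)
```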


The tables that follow were used in the process of scheduling alignment jobs:

* jobs - contains the basic information about each scheduled job. Things like
         dates when the task was scheduled, enqueued on SGE, started and
         finished, as well as a status code for the outcome of the job (e.g.
         success or failure). The 'successor_id' field points to the
         re-scheduled instance of this job if we had to restart it for some
         reason.
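
Because a restarted job points to its replacement through 'successor_id', the
final instance of a job can be found by following that chain. A minimal sketch,
assuming an integer "id" primary key and a textual "status" column (both
hypothetical; only 'successor_id' is named in this file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "id", "status" and the sample rows are assumptions for illustration.
conn.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, successor_id INTEGER);
INSERT INTO jobs VALUES (1, 'failure', 2), (2, 'failure', 3), (3, 'success', NULL);
""")

def final_instance(job_id):
    """Follow successor_id links until a job with no successor is reached."""
    seen = set()
    while True:
        if job_id in seen:
            raise ValueError("cycle in successor_id chain")
        seen.add(job_id)
        row = conn.execute(
            "SELECT successor_id FROM jobs WHERE id = ?", (job_id,)).fetchone()
        if row is None:
            raise LookupError(f"no job with id {job_id}")
        if row[0] is None:
            return job_id
        job_id = row[0]

print(final_instance(1))
```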

Two types of jobs were used to align LibriSpeech's audio; in the parlance of
the source code they are called "top-half" and "bottom-half" jobs. The former
are responsible for normalizing a book's text, building the phase 1 decoding
graph, and g2p lexicon generation (for more details search for the
yet-to-be-published paper). The "bottom-half" jobs are user-oriented, and are
responsible for the actual alignment. This design was chosen in order to share
the initial processing of a book, for which there could be LibriVox audio
chapters read by different readers. The tables below extend the definition of
"jobs", through the magic of SQLAlchemy:

* top_half_jobs - contains the location of the source text to process.

* bottom_half_jobs - contains the ID of the reader whose audio should be
                     processed by this job. Also stores a number of statistics
                     about the job, filled in after its completion: the
                     percentage of the audio successfully aligned, the number
                     of audio chapters for this reader which succeeded and
                     failed, the real-time factor (that is, the ratio between
                     the useful, aligned audio and the time it took to obtain
                     it), and so on. The "worst_verification_wer" field records
                     the result of the post-processing verification of the
                     obtained utterances (a process that produces a certain
                     number of false alarms).

* bottom_half_chapters - each row contains information about the processing
                         of an individual chapter within the bottom-half job.
                         Mostly the same information as above but at per-chapter
                         granularity.
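
One common way for SQLAlchemy to "extend" a table is joined-table inheritance,
where the child table's primary key is also a foreign key to the parent row.
Whether the original code used exactly this mapping is an assumption. The sketch
below mimics that layout in plain sqlite3 and computes the real-time factor as
defined above. All column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed layout: bottom_half_jobs.id doubles as a foreign key to jobs.id.
conn.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE bottom_half_jobs (
    id INTEGER PRIMARY KEY REFERENCES jobs(id),
    reader_id INTEGER,
    aligned_seconds REAL,
    processing_seconds REAL
);
INSERT INTO jobs VALUES (1, 'success');
INSERT INTO bottom_half_jobs VALUES (1, 42, 3600.0, 1800.0);
""")

def real_time_factor(aligned_seconds, processing_seconds):
    """Ratio of useful aligned audio to the time spent obtaining it."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return aligned_seconds / processing_seconds

# Joining on the shared key yields the "full" job record.
row = conn.execute("""
    SELECT j.status, b.reader_id, b.aligned_seconds, b.processing_seconds
    FROM jobs AS j JOIN bottom_half_jobs AS b ON b.id = j.id
""").fetchone()
rtf = real_time_factor(row[2], row[3])
print(row[0], rtf)
```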


The next set of tables contains information that was added by quickly
reviewing a tiny fraction of the audio, to make sure that there aren't
multi-speaker recordings and other undesirable artifacts in the corpus. Note
that this process should not be considered infallible.


* reader_annotations - contains annotations about the individual reader.
                       Perhaps the only (more or less) reliable field here
                       is the gender information.

* audio_chapter_annotations - per-chapter "noisy" and "multi-speaker" flags.
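
The per-chapter flags could be used to filter out suspect chapters, for example
as sketched below. The column names (chapter_id, noisy, multi_speaker) and the
0/1 encoding of the flags are assumptions. Since only a small part of the audio
was reviewed, this sketch keeps chapters that have no annotation row at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema and sample rows.
conn.executescript("""
CREATE TABLE audio_chapters (id INTEGER PRIMARY KEY, reader_id INTEGER);
CREATE TABLE audio_chapter_annotations (
    chapter_id INTEGER REFERENCES audio_chapters(id),
    noisy INTEGER,
    multi_speaker INTEGER
);
INSERT INTO audio_chapters VALUES (1, 7), (2, 7), (3, 8);
INSERT INTO audio_chapter_annotations VALUES (1, 0, 0), (2, 1, 0);
""")

# LEFT JOIN keeps chapters that were never reviewed (no annotation row);
# COALESCE treats their missing flags as "not flagged".
clean = [cid for (cid,) in conn.execute("""
    SELECT c.id
    FROM audio_chapters AS c
    LEFT JOIN audio_chapter_annotations AS a ON a.chapter_id = c.id
    WHERE COALESCE(a.noisy, 0) = 0 AND COALESCE(a.multi_speaker, 0) = 0
    ORDER BY c.id
""")]
print(clean)
```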