SFF and library metadata files for July 12, 2010 snapshot of SRP002395
        
Jonathan Crabtree
jcrabtree@som.umaryland.edu

OVERVIEW

This directory contains 7518 .sff files and a gzipped tar file of 7518
.lmd files generated from the Human Microbiome Project 16S rRNA 454
Clinical Production Phase I (corresponding to SRA study accession
number SRP002395).
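
The one-to-one correspondence between .sff and .lmd files can be
checked with a short script.  The following Python sketch is only an
illustration: it ASSUMES that each .lmd file shares its base name with
its .sff file (e.g., SRR012345.sff and SRR012345.lmd), which this
README does not actually state.  The function names are hypothetical.

```python
import os
import tarfile
from pathlib import PurePosixPath


def unpaired(sff_names, lmd_names):
    """Return (sff stems with no .lmd, lmd stems with no .sff).

    ASSUMPTION: a .sff file and its .lmd file share the same base name.
    Names that do not end in .sff / .lmd are ignored.
    """
    sff = {PurePosixPath(n).stem for n in sff_names if n.endswith(".sff")}
    lmd = {PurePosixPath(n).stem for n in lmd_names if n.endswith(".lmd")}
    return sorted(sff - lmd), sorted(lmd - sff)


def check_directory(directory, tarball):
    """Compare the .sff files in `directory` with the .lmd members of `tarball`."""
    with tarfile.open(tarball, "r:gz") as tar:
        lmd_names = tar.getnames()
    return unpaired(os.listdir(directory), lmd_names)


# Small in-memory illustration with invented file names.
missing_lmd, missing_sff = unpaired(
    ["SRR000001.sff", "SRR000002.sff", "README"],
    ["lmd/SRR000001.lmd", "lmd/SRR000003.lmd"],
)
print("sff without lmd:", missing_lmd)
print("lmd without sff:", missing_sff)
```

If the naming assumption holds, both lists should come back empty for
a complete copy of this directory.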

SFF FILES

The 7518 .sff files (one for each of the 7518 SRA runs that are supposed
to be associated with study SRP002395) in this directory were generated
by the following process:

 1. Download all 7518 runs in SRA native format from NCBI using the Aspera client.
 2. Convert all 7518 runs from SRA to SFF format using the "sffdump"
    utility from the NCBI SRA toolkit (the May 25, 2010 version, from
    http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software).

Note that 6 of the runs failed to convert cleanly to SFF via sffdump,
resulting in the loss of roughly 12,000 of the ~72 million reads.  These
are the 6 runs that had problems:

SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
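
The approximate figure for lost reads can be reproduced from the six
lines above.  The following Python sketch parses them and sums the
difference between expected and extracted spots for each run; the
regular expression and variable names are illustrative only and are
not part of the original processing.

```python
import re

# The six mismatch lines, copied verbatim from this README.
REPORT = """\
SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
"""

LINE_RE = re.compile(
    r"^(SRR\d+): mismatch - (\d+) spot\(s\) expected, but (\d+) "
    r"(?:\(due to failure\) )?spot\(s\) were extracted$"
)


def parse_mismatches(text):
    """Return {run accession: (expected spots, extracted spots)}."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            result[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return result


mismatches = parse_mismatches(REPORT)
lost = {run: expected - got for run, (expected, got) in mismatches.items()}
total_lost = sum(lost.values())

for run, count in lost.items():
    print(f"{run}: {count} reads lost")
print(f"total: {total_lost}")  # 11598, i.e. roughly 12,000
```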

LIBRARY METADATA (.lmd) FILES

The ".lmd" files in this directory (in the gzipped tar file
SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz) are in an
ad-hoc tab-delimited format and were generated by downloading and
parsing the SRA XML files for all 7518 runs and the corresponding SRA
samples using a custom Perl script.  There should be exactly one .lmd
file for each .sff file.  Since most of the SFF files are already
deconvoluted, these library metadata files contain quite a bit of
duplicated information (see below).  Each tab-delimited row in one of
the lmd files contains the following fields, in this order:

    -SRA run accession (e.g., SRR012345)
    -SRA experiment accession (e.g., SRX012345)
    -run alias
    -sequencing center
    -experiment pool member_name (descriptor pulled from the XML that might serve as a library identifier)
    -reverse barcode description
    -reverse barcode sequence
    -reverse primer description
    -reverse primer sequence
    -SRA sample accession (e.g., SRS012345)
    -submitted anonymized subject id
    -EMMES body site (with spelling errors corrected)
    -submitted anonymized sample id
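
As an illustration of the field order above, the following Python
sketch splits one row into named fields.  The attribute names are
chosen here for readability and do not come from the Perl script that
produced the files.  It also ASSUMES that a missing run accession is
written as the literal text "NULL"; adjust NULL_MARKER if the files use
a different representation.  The example row is invented.

```python
from typing import NamedTuple, Optional


class LmdRow(NamedTuple):
    """One row of a .lmd file, in the field order listed in this README."""
    run_accession: Optional[str]
    experiment_accession: str
    run_alias: str
    sequencing_center: str
    pool_member_name: str
    reverse_barcode_description: str
    reverse_barcode_sequence: str
    reverse_primer_description: str
    reverse_primer_sequence: str
    sample_accession: str
    subject_id: str
    body_site: str
    sample_id: str


NULL_MARKER = "NULL"  # ASSUMPTION: how a missing run accession is spelled


def parse_lmd_line(line: str) -> LmdRow:
    """Split one tab-delimited row and map it onto LmdRow."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(LmdRow._fields):
        raise ValueError(
            f"expected {len(LmdRow._fields)} fields, got {len(fields)}"
        )
    run = None if fields[0] == NULL_MARKER else fields[0]
    return LmdRow(run, *fields[1:])


# Invented example row; only the accession patterns follow this README.
example = "\t".join([
    "NULL", "SRX012345", "example-alias", "EXAMPLE-CENTER", "example-member",
    "example barcode", "ACGT", "example primer", "TGCA",
    "SRS012345", "subject-1", "Stool", "sample-1",
])
row = parse_lmd_line(example)
print(row.run_accession, row.experiment_accession, row.body_site)
```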

The crucial thing to note about the .lmd files is that each one
typically contains a number of rows with the first field (SRA run
accession) set to NULL and then one or more rows with the first field
set to a non-NULL value.  The rows with the initial NULLs enumerate
all the samples for the specified _experiment_ (typically
corresponding to one or more 454 machine runs) and then the rows with
the non-NULL SRA run accessions tell you which samples you should
expect to see _in that particular SFF file_.  So for most of the SFF
files the corresponding .lmd file will start with a set of sample rows
with NULL in the first column and then will have a single sample row
with the accession of the SFF file in the first column.  This is
because the sequencing centers have already deconvoluted the data and
so each SFF file downloaded from the SRA contains data from only one
sample.  However, some of the SFF files in this directory (those from
JCVI) are not fully deconvoluted, so the corresponding .lmd file will
contain _multiple_ rows with non-NULL initial fields, and these SFF
files will have to be deconvoluted.  The
issue here is not that JCVI failed to deconvolute the files prior to
submission, but rather that they used an alternate SRA submission
format that encodes the deconvolution differently, and in such a way
that the NCBI sffdump tool doesn't automatically split the output SFF
into one SFF file per sample.
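
The distinction between the two kinds of rows can be sketched in
Python as follows.  This is a minimal illustration, not the script
used to build the files: the function names are hypothetical, the
sample rows are invented, and it ASSUMES that a missing run accession
is written as the literal text "NULL".

```python
def split_lmd_rows(lines):
    """Separate experiment-level rows (first field NULL) from run-level rows.

    ASSUMPTION: a missing run accession is the literal string "NULL".
    Each returned row is the list of tab-delimited fields.
    """
    experiment_rows, run_rows = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if fields[0] == "NULL":
            experiment_rows.append(fields)
        else:
            run_rows.append(fields)
    return experiment_rows, run_rows


def needs_deconvolution(run_rows):
    """True when more than one row names the SFF file's run accession."""
    return len(run_rows) > 1


def fake_row(run, sample):
    """Build an invented 13-field row; only fields 1 and 10 are filled in."""
    fields = ["-"] * 13
    fields[0] = run
    fields[9] = sample
    return "\t".join(fields)


# Typical case: the experiment lists two samples, the SFF file holds one.
typical = [
    fake_row("NULL", "SRS000001"),
    fake_row("NULL", "SRS000002"),
    fake_row("SRR000001", "SRS000001"),
]
# Case described for JCVI: several samples remain in one SFF file.
multiplexed = typical + [fake_row("SRR000001", "SRS000002")]

for name, lines in (("typical", typical), ("multiplexed", multiplexed)):
    experiment_rows, run_rows = split_lmd_rows(lines)
    print(name, len(experiment_rows), len(run_rows), needs_deconvolution(run_rows))
```

Counting run-level rows follows the description above literally; a
stricter check might instead count distinct sample accessions among
those rows.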

CAVEATS

Study SRP002395 has been in flux, but at the time this snapshot
was made (July 12, 2010) it is believed to have the correct data for
the Clinical Production Phase I 16S 454 sequencing.  There was one
spurious WGS Illumina run in the dataset on July 12, but that spurious
run is NOT included in the set of files provided here.

We are currently investigating the 6 runs that did not convert cleanly
to SFF and will update this file when their status is resolved.  The
.lmd files for these runs should be correct, however.

Additional discrepancies between the SRA metadata and the actual
contents of the SFF files were observed by Pat Schloss during the
initial processing of these files and the .lmd files in
SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz have been
updated to reflect the actual contents of some of the SFF files,
particularly those that contain reads from multiple distinct V-regions
(but the same barcode and sample).

QUESTIONS/COMMENTS

Please e-mail jcrabtree@som.umaryland.edu with any questions and/or
comments.