- SFF and library metadata files for July 12, 2010 snapshot of SRP002395
-
- Jonathan Crabtree
- jcrabtree@som.umaryland.edu
- OVERVIEW
- This directory contains 7518 .sff files and a gzipped tar file of 7518
- .lmd files generated from the Human Microbiome Project 16S rRNA 454
- Clinical Production Phase I (corresponding to SRA study accession
- number SRP002395).
- SFF FILES
- The 7518 .sff files (one for each of the 7518 SRA runs that are supposed
- to be associated with study SRP002395) in this directory were generated
- by the following process:
- 1. Download all 7518 runs in SRA native format from NCBI using the Aspera client.
- 2. Convert all 7518 runs from SRA to SFF format using the "sffdump"
- utility from the NCBI SRA toolkit (the May 25, 2010 version, from
- http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software)
- Note that 6 of the runs failed to convert cleanly to SFF via sffdump,
- resulting in the loss of roughly 12,000 of the ~72 million reads. These
- are the 6 runs that had problems:
- SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
- SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
- SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
- SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
- SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
- SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
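- As a cross-check on the "roughly 12,000 reads" figure, the sketch below
- parses the six mismatch messages listed above and sums the difference
- between expected and extracted spots. The regular expression and the
- names REPORT, PATTERN and lost_reads are illustrative only; they are
- not part of sffdump or the SRA toolkit.

```python
import re

# Mismatch messages copied from the list above.
REPORT = """\
SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
"""

# Captures the run accession, the expected count and the extracted count.
PATTERN = re.compile(r"^(SRR\d+): mismatch - (\d+) spot\(s\) expected, but (\d+) ")


def lost_reads(report):
    """Return {run accession: expected - extracted} for each mismatch line."""
    lost = {}
    for line in report.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            run = match.group(1)
            expected = int(match.group(2))
            extracted = int(match.group(3))
            lost[run] = expected - extracted
    return lost


losses = lost_reads(REPORT)
total_lost = sum(losses.values())
print(total_lost)  # 11598
```

- Most of the loss (7350 of the 11598 reads) comes from SRR043690.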
- LIBRARY METADATA (.lmd) FILES
- The ".lmd" files in this directory (in the gzipped tar file
- SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz) are in an
- ad-hoc tab-delimited format and were generated by downloading and
- parsing the SRA XML files for all 7518 runs and the corresponding SRA
- samples using a custom Perl script. There should be exactly one .lmd
- file for each .sff file. Since most of the SFF files are already
- deconvoluted, these library metadata files contain quite a bit of
- duplicated information (see below). Each tab-delimited row in one of
- the lmd files contains the following fields, in this order:
- -SRA run accession (e.g., SRR012345)
- -SRA experiment accession (e.g., SRX012345)
- -run alias
- -sequencing center
- -experiment pool member_name (descriptor pulled from the XML that might serve as a library identifier)
- -reverse barcode description
- -reverse barcode sequence
- -reverse primer description
- -reverse primer sequence
- -SRA sample accession (e.g., SRS012345)
- -submitted anonymized subject id
- -EMMES body site (with spelling errors corrected)
- -submitted anonymized sample id
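- A minimal sketch of reading one row into named fields, in the order
- listed above. The attribute names on LmdRow and the function
- parse_lmd_line are hypothetical, and the sketch assumes that a missing
- run accession appears as the literal text NULL. The example row is
- made up for illustration and is not taken from any real .lmd file.

```python
from typing import NamedTuple, Optional


class LmdRow(NamedTuple):
    """One tab-delimited .lmd row; attribute names are illustrative."""
    run_accession: Optional[str]
    experiment_accession: str
    run_alias: str
    sequencing_center: str
    pool_member_name: str
    reverse_barcode_description: str
    reverse_barcode_sequence: str
    reverse_primer_description: str
    reverse_primer_sequence: str
    sample_accession: str
    subject_id: str
    body_site: str
    sample_id: str


def parse_lmd_line(line):
    """Split one row on tabs and map the 13 fields onto LmdRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LmdRow._fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(LmdRow._fields), len(fields))
        )
    # Assumption: an absent run accession is written as the text "NULL".
    run = None if fields[0] == "NULL" else fields[0]
    return LmdRow(run, *fields[1:])


# Made-up row: the accessions follow the "e.g." patterns above and every
# other value is a placeholder.
example = "\t".join([
    "SRR012345", "SRX012345", "alias-1", "CENTER", "member-1",
    "barcode-desc", "ACGTACGT", "primer-desc", "TTGGCCAA",
    "SRS012345", "subject-1", "Stool", "sample-1",
])
row = parse_lmd_line(example)
print(row.run_accession, row.sample_accession, row.body_site)
```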
- The crucial thing to note about the .lmd files is that each one
- typically contains a number of rows with the first field (SRA run
- accession) set to NULL and then one or more rows with the first field
- set to a non-NULL value. The rows with the initial NULLs enumerate
- all the samples for the specified _experiment_ (typically
- corresponding to one or more 454 machine runs) and then the rows with
- the non-NULL SRA run accessions tell you which samples you should
- expect to see _in that particular SFF file_. So for most of the SFF
- files the corresponding .lmd file will start with a set of sample rows
- with NULL in the first column and then will have a single sample row
- with the accession of the SFF file in the first column. This is
- because the sequencing centers have already deconvoluted the data and
- so each SFF file downloaded from the SRA contains data from only one
- sample. However, for some of the SFF files (those from JCVI) the SFF
- files in this directory are not fully deconvoluted and so the
- corresponding .lmd file will contain _multiple_ rows with non-NULL
- initial fields, and these SFF files will have to be deconvoluted. The
- issue here is not that JCVI failed to deconvolute the files prior to
- submission, but rather that they used an alternate SRA submission
- format that encodes the deconvolution differently, and in such a way
- that the NCBI sffdump tool doesn't automatically split the output SFF
- into one SFF file per sample.
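- The NULL / non-NULL convention described above can be sketched as
- follows. The function names are hypothetical, the sketch assumes that
- NULL appears as literal text in the first field, and the rows used in
- the example are invented placeholders rather than real metadata.

```python
def split_lmd_rows(lines):
    """Split .lmd rows into (experiment-level rows, rows for this SFF file).

    Rows whose first field is NULL enumerate every sample in the
    experiment; rows with a run accession name the samples expected in
    the SFF file itself.
    """
    experiment_rows = []
    sff_rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if fields[0] == "NULL":
            experiment_rows.append(fields)
        else:
            sff_rows.append(fields)
    return experiment_rows, sff_rows


def needs_deconvolution(lines):
    """True when more than one sample row carries a run accession."""
    _, sff_rows = split_lmd_rows(lines)
    return len(sff_rows) > 1


def make_row(run, sample):
    """Build a placeholder 13-field row; only run and sample vary."""
    fields = [run, "SRX012345"] + ["x"] * 7 + [sample, "x", "x", "x"]
    return "\t".join(fields)


# Typical case: experiment-wide NULL rows, then a single row for the SFF.
single = [
    make_row("NULL", "SRS000001"),
    make_row("NULL", "SRS000002"),
    make_row("SRR012345", "SRS000001"),
]
# JCVI-style case: several rows carry the same run accession.
multi = single + [make_row("SRR012345", "SRS000002")]

print(needs_deconvolution(single), needs_deconvolution(multi))  # False True
```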
- CAVEATS
- Study SRP002395 has been in flux but at the time when this snapshot
- was made (July 12, 2010), it is believed to have the correct data for
- the Clinical Production Phase I 16S 454 sequencing. There was one
- spurious WGS Illumina run in the dataset on July 12, but that spurious
- run is NOT included in the set of files provided here.
- We are currently investigating the 6 runs that did not convert cleanly
- to SFF and will update this file when their status is resolved. The
- .lmd files for these runs should be correct, however.
- Additional discrepancies between the SRA metadata and the actual
- contents of the SFF files were observed by Pat Schloss during the
- initial processing of these files and the .lmd files in
- SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz have been
- updated to reflect the actual contents of some of the SFF files,
- particularly those that contain reads from multiple distinct V-regions
- (but the same barcode and sample).
- QUESTIONS/COMMENTS
- Please e-mail jcrabtree@som.umaryland.edu with any questions and/or
- comments.