README.txt

SFF and library metadata files for July 12, 2010 snapshot of SRP002395

Jonathan Crabtree
jcrabtree@som.umaryland.edu

OVERVIEW

This directory contains 7518 .sff files and a gzipped tar file of 7518
.lmd files generated from the Human Microbiome Project 16S rRNA 454
Clinical Production Phase I (corresponding to SRA study accession
number SRP002395).

SFF FILES

The 7518 .sff files in this directory (one for each of the 7518 SRA runs
that are supposed to be associated with study SRP002395) were generated
by the following process:

1. Download all 7518 runs in SRA native format from NCBI using the
   Aspera client.
2. Convert all 7518 runs from SRA to SFF format using the "sffdump"
   utility from the NCBI SRA toolkit (the May 25, 2010 version, from
   http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software)

Note that 6 of the runs failed to convert cleanly to SFF via sffdump,
resulting in the loss of roughly 12,000 out of ~72 million reads. These
are the 6 runs that had problems:

SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
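
The "roughly 12,000" figure can be checked directly from the six
mismatch lines above. The following is a minimal Python sketch, not part
of the original processing pipeline; the regular expression and the
function name lost_reads are assumptions made for illustration.

```python
import re

# Mismatch lines copied verbatim from the list above.
REPORT = """\
SRR041240: mismatch - 2623 spot(s) expected, but 1616 spot(s) were extracted
SRR042823: mismatch - 1110 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042925: mismatch - 1820 spot(s) expected, but 0 (due to failure) spot(s) were extracted
SRR042952: mismatch - 656 spot(s) expected, but 416 spot(s) were extracted
SRR043261: mismatch - 96 spot(s) expected, but 25 spot(s) were extracted
SRR043690: mismatch - 7367 spot(s) expected, but 17 spot(s) were extracted
"""

# Assumed pattern for the lines above; "(due to failure)" is optional.
LINE_RE = re.compile(
    r"^(SRR\d+): mismatch - (\d+) spot\(s\) expected, "
    r"but (\d+) (?:\(due to failure\) )?spot\(s\) were extracted$"
)


def lost_reads(report):
    """Return {run accession: expected - extracted} for each matching line."""
    lost = {}
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            run = match.group(1)
            expected = int(match.group(2))
            extracted = int(match.group(3))
            lost[run] = expected - extracted
    return lost


lost = lost_reads(REPORT)
total_lost = sum(lost.values())
print(total_lost)  # 11598, consistent with "roughly 12,000"
```

Most of the loss comes from SRR043690 (7350 reads) and from the two
runs for which no spots were extracted at all.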

LIBRARY METADATA (.lmd) FILES

The ".lmd" files in this directory (in the gzipped tar file
SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz) are in an
ad-hoc tab-delimited format and were generated by downloading and
parsing the SRA XML files for all 7518 runs and the corresponding SRA
samples using a custom Perl script. There should be exactly one .lmd
file for each .sff file. Since most of the SFF files are already
deconvoluted, these library metadata files contain quite a bit of
duplicated information (see below). Each tab-delimited row in one of
the lmd files contains the following fields, in this order:

 -SRA run accession (e.g., SRR012345)
 -SRA experiment accession (e.g., SRX012345)
 -run alias
 -sequencing center
 -experiment pool member_name (descriptor pulled from the XML that might serve as a library identifier)
 -reverse barcode description
 -reverse barcode sequence
 -reverse primer description
 -reverse primer sequence
 -SRA sample accession (e.g., SRS012345)
 -submitted anonymized subject id
 -EMMES body site (with spelling errors corrected)
 -submitted anonymized sample id
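
As an illustration of the field order listed above, here is a minimal
Python sketch that reads one .lmd row into named fields. It is not the
Perl script used to generate the files. The attribute names, the
function name parse_lmd_line, and the assumption that a missing run
accession appears as the literal text "NULL" are all hypothetical, and
the example row uses invented placeholder values.

```python
from typing import NamedTuple, Optional


class LmdRow(NamedTuple):
    """One tab-delimited .lmd row; attribute names are illustrative only."""
    run_accession: Optional[str]  # None when the file has NULL here
    experiment_accession: str
    run_alias: str
    sequencing_center: str
    pool_member_name: str
    reverse_barcode_description: str
    reverse_barcode_sequence: str
    reverse_primer_description: str
    reverse_primer_sequence: str
    sample_accession: str
    subject_id: str
    body_site: str
    sample_id: str


def parse_lmd_line(line):
    """Split one row on tabs and map the 13 fields, in order, onto LmdRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LmdRow._fields):
        raise ValueError(
            "expected %d fields, got %d" % (len(LmdRow._fields), len(fields))
        )
    # Assumption: NULL is written as the literal string "NULL".
    if fields[0] == "NULL":
        fields[0] = None
    return LmdRow(*fields)


# Placeholder row; none of these values come from a real .lmd file.
example = "\t".join([
    "SRR012345", "SRX012345", "run-alias", "CENTER", "member-1",
    "barcode-desc", "ACGTACGT", "primer-desc", "TTGGCCAA",
    "SRS012345", "subject-1", "Stool", "sample-1",
])
row = parse_lmd_line(example)
print(row.run_accession, row.sample_accession, row.body_site)
```

Rejecting rows that do not have exactly 13 fields is a design choice
for the sketch; since the format is ad hoc, it seems safer to fail
loudly than to guess at a shifted column.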

The crucial thing to note about the .lmd files is that each one
typically contains a number of rows with the first field (SRA run
accession) set to NULL, followed by one or more rows with the first
field set to a non-NULL value. The rows with the initial NULLs enumerate
all the samples for the specified _experiment_ (typically
corresponding to one or more 454 machine runs), and the rows with
the non-NULL SRA run accessions tell you which samples you should
expect to see _in that particular SFF file_.

So for most of the SFF files the corresponding .lmd file will start
with a set of sample rows with NULL in the first column and then will
have a single sample row with the accession of the SFF file in the
first column. This is because the sequencing centers have already
deconvoluted the data, so each SFF file downloaded from the SRA
contains data from only one sample.

However, some of the SFF files in this directory (those from JCVI) are
not fully deconvoluted, so the corresponding .lmd file will contain
_multiple_ rows with non-NULL initial fields, and these SFF files will
have to be deconvoluted. The issue here is not that JCVI failed to
deconvolute the files prior to submission, but rather that they used an
alternate SRA submission format that encodes the deconvolution
differently, and in such a way that the NCBI sffdump tool doesn't
automatically split the output SFF into one SFF file per sample.
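
The NULL / non-NULL convention described above can be applied
mechanically. The sketch below is a hedged Python illustration, not a
tool shipped with this directory. The function names are hypothetical,
it assumes NULL is written as the literal text "NULL", and the example
rows are shortened placeholders rather than full 13-field rows.

```python
def split_lmd_rows(lines):
    """Separate experiment-level rows (NULL run accession) from run-level rows."""
    experiment_rows = []
    run_rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        # Assumption: a missing run accession is the literal string "NULL".
        if fields[0] == "NULL":
            experiment_rows.append(fields)
        else:
            run_rows.append(fields)
    return experiment_rows, run_rows


def needs_deconvolution(lines):
    """Apply the rule in the text: more than one non-NULL row means the
    matching SFF file is expected to hold several samples."""
    _, run_rows = split_lmd_rows(lines)
    return len(run_rows) > 1


# Shortened placeholder rows (run accession, experiment, sample only).
single_sample = [
    "NULL\tSRX000001\tsample-A",
    "NULL\tSRX000001\tsample-B",
    "SRR000001\tSRX000001\tsample-A",
]
multi_sample = [
    "NULL\tSRX000002\tsample-C",
    "NULL\tSRX000002\tsample-D",
    "SRR000002\tSRX000002\tsample-C",
    "SRR000002\tSRX000002\tsample-D",
]

print(needs_deconvolution(single_sample))  # False
print(needs_deconvolution(multi_sample))   # True
```

Counting non-NULL rows is only a first-pass check. A file flagged this
way still has to be split by barcode using the reverse barcode sequence
field of each non-NULL row.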

CAVEATS

Study SRP002395 has been in flux, but at the time when this snapshot
was made (July 12, 2010), it is believed to have the correct data for
the Clinical Production Phase I 16S 454 sequencing. There was one
spurious WGS Illumina run in the dataset on July 12, but that spurious
run is NOT included in the set of files provided here.

We are currently investigating the 6 runs that did not convert cleanly
to SFF and will update this file when their status is resolved. The
.lmd files for these runs should be correct, however.

Additional discrepancies between the SRA metadata and the actual
contents of the SFF files were observed by Pat Schloss during the
initial processing of these files, and the .lmd files in
SRP002395-2010-Aug-03-run-metadata-corrected-v1.tar.gz have been
updated to reflect the actual contents of some of the SFF files,
particularly those that contain reads from multiple distinct V-regions
(but the same barcode and sample).

QUESTIONS/COMMENTS

Please e-mail jcrabtree@som.umaryland.edu with any questions and/or
comments.