DNAStack COVID19 SRA Data

Stream data with DagsHub Direct Data Access (DDA):

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/dnastack-covid-19-sra-data-dataset")

fs.listdir("s3://dnastack-covid-19-sra-data")

Description:

The Sequence Read Archive (SRA), hosted by the National Institutes of Health (NIH), is the primary archive of high-throughput sequencing data and the largest publicly available repository of SARS-CoV-2 sequencing data. DNAstack created this dataset from SARS-CoV-2 sequencing data sourced from the SRA. Where possible, DNAstack processed the raw sequence data through a unified bioinformatics pipeline to produce genome assemblies and variant calls. Because a standardized workflow produced this harmonized dataset, public data generated with different methodologies can be combined and compared for a more powerful global analysis of available SARS-CoV-2 data, and researchers get rapid access to aggregated downstream results.

Methodology:

Reads from the SRA were extracted in FASTQ format and then routed to one of three pipelines, depending on the sequencing technology that produced them: the ARTIC protocol for Oxford Nanopore-derived reads; the SIGNAL pipeline for paired-end Illumina reads; and the CoSA pipeline (using DeepVariant for variant calling) for PacBio reads. In brief, reads were primer-trimmed and aligned to the SARS-CoV-2 reference genome, after which contiguous regions were assembled and variant sites were called. Pangolin was then used to assign a viral lineage based on the assembled genome.
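The routing of reads by sequencing technology can be sketched roughly as follows. This is an illustration only, not DNAstack's actual code: the function name `choose_pipeline` is hypothetical, and the platform and layout strings (`OXFORD_NANOPORE`, `ILLUMINA`, `PACBIO_SMRT`, `PAIRED`) are assumed to follow the values used in SRA run metadata.

```python
from typing import Optional


def choose_pipeline(platform: str, layout: str = "SINGLE") -> Optional[str]:
    """Return the pipeline name for a run, or None if no pipeline applies.

    Hypothetical helper: the platform/layout strings are assumptions
    modelled on SRA run metadata, not values taken from this dataset.
    """
    platform = platform.strip().upper()
    layout = layout.strip().upper()

    if platform == "OXFORD_NANOPORE":
        return "ARTIC"    # ARTIC protocol for Oxford Nanopore-derived reads
    if platform == "ILLUMINA" and layout == "PAIRED":
        return "SIGNAL"   # SIGNAL pipeline for paired-end Illumina reads
    if platform == "PACBIO_SMRT":
        return "CoSA"     # CoSA pipeline (DeepVariant for variant calling)
    return None           # anything else is left unprocessed in this sketch
```

Returning `None` for unmatched runs mirrors the "where possible" wording of the description: a run that fits none of the three cases is simply not processed in this sketch.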

Update Frequency:

Rolling

Managed By:

https://dnastack.com/

Resources:

  1. resource:
    • Description: SARS-CoV-2 raw sequencing and output data (FASTQ, BAM, FASTA, VCF)
    • ARN: arn:aws:s3:::dnastack-covid-19-sra-data
    • Region: us-west-2
    • Type: S3 Bucket
    • RequesterPays: False
    • Explore: Browse bucket
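As an alternative to streaming through DagsHub, the ARN and region above are enough to build a plain S3 listing request. The sketch below uses only the Python standard library and the S3 `ListObjectsV2` REST call (`list-type=2`). The helper names are hypothetical, and it assumes the bucket permits anonymous listing, which this page does not confirm.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def bucket_from_arn(arn: str) -> str:
    """Extract the bucket name from an S3 ARN such as arn:aws:s3:::my-bucket."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # An object ARN would look like bucket/key; keep only the bucket part.
    return parts[5].split("/", 1)[0]


def list_url(bucket: str, region: str, prefix: str = "", max_keys: int = 5) -> str:
    """Build a virtual-hosted-style ListObjectsV2 URL for the bucket."""
    params = {"list-type": 2, "max-keys": max_keys}
    if prefix:
        params["prefix"] = prefix
    return f"https://{bucket}.s3.{region}.amazonaws.com/?{urlencode(params)}"


def parse_keys(xml_bytes: bytes) -> List[str]:
    """Pull the object keys out of a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_bytes)
    return [el.text for el in root.iter(f"{S3_NS}Key")]


def list_keys(arn: str, region: str, prefix: str = "", max_keys: int = 5) -> List[str]:
    """Fetch up to max_keys object keys (requires network access;
    assumes anonymous listing is allowed on the bucket)."""
    url = list_url(bucket_from_arn(arn), region, prefix, max_keys)
    with urlopen(url, timeout=30) as resp:
        return parse_keys(resp.read())
```

With network access, `list_keys("arn:aws:s3:::dnastack-covid-19-sra-data", "us-west-2")` would return the first few object keys if the assumption about anonymous access holds. Because RequesterPays is False, no billing header is needed on the request.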

Tags:

aws-pds, bam, bioinformatics, coronavirus, COVID-19, fasta, fastq, global, genetic, genomic, health, life sciences, long read sequencing, SARS-CoV-2, vcf, virus, whole genome sequencing

About

dnastack-covid-19-sra-data-dataset originates from the Registry of Open Data on AWS.