COVID-19 Genome Sequence Dataset

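Stream data with DagsHub Direct Data Access (DDA):

from dagshub.streaming import DagsHubFilesystem

# Mount the DagsHub repository that mirrors this dataset in the current directory
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/ncbi-covid-19-dataset")

# List the top-level entries of the source bucket
fs.listdir("s3://sra-pub-sars-cov2")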
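
The bucket listing may be long, so it can help to look at only a few entries first. The sketch below is a hypothetical helper (`preview_listing` and `SupportsListdir` are illustrative names, not part of the `dagshub` package). It assumes only that the filesystem object exposes `listdir(path)` and returns an iterable of entry names, as the snippet above suggests.

```python
from itertools import islice
from typing import Iterable, List, Protocol


class SupportsListdir(Protocol):
    """Anything with a listdir(path) method, e.g. the `fs` object above (assumed)."""

    def listdir(self, path: str) -> Iterable[str]: ...


def preview_listing(fs: SupportsListdir, path: str, limit: int = 10) -> List[str]:
    """Return at most `limit` entry names from fs.listdir(path), in listing order."""
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice stops after `limit` items, so a long listing is not copied in full.
    return list(islice(fs.listdir(path), limit))
```

With the `fs` object created above, a call such as `preview_listing(fs, "s3://sra-pub-sars-cov2", limit=5)` would return up to five entry names.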

Description:

A centralized repository for all records containing sequences associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). It includes both the original sequences submitted by the principal investigators and SRA-processed sequences, which require the SRA Toolkit for analysis. Submitter-provided metadata from the associated BioSample and BioProject records is available alongside NCBI-calculated data, such as k-mer-based taxonomy analysis results, contiguous assemblies (contigs) and associated statistics such as contig length, BLAST results for the assembled contigs, contig annotation, BLAST databases of contigs and their annotated peptides, and VCF files generated for each record relative to the SARS-CoV-2 RefSeq record. Metadata is also available in Parquet format to facilitate search and filtering with Amazon Athena.

Update Frequency:

Hourly

Managed By:

NLM

Resources:

  1. resource:

    • Description: Genomic sequence reads of SARS-CoV-2 and related Coronaviridae, organized by NCBI accession. Files in the sra-src folder are in FASTQ, BAM, or CRAM format (original submissions); files in the run folder are in .sra format and require the SRA Toolkit.
    • ARN: arn:aws:s3:::sra-pub-sars-cov2
    • Region: us-east-1
    • Type: S3 Bucket
  2. resource:

    • Description: Metadata for sra-pub-sars-cov2 in an Athena-queryable format
    • ARN: arn:aws:s3:::sra-pub-sars-cov2-metadata-us-east-1
    • Region: us-east-1
    • Type: S3 Bucket
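
The two resources above are identified by S3 ARNs, while the streaming snippet uses the `s3://` URI form. As a minimal sketch, the helpers below derive one from the other and apply the folder convention described for the first resource. All function names are hypothetical, and the check assumes that object keys begin with the folder name (`sra-src/` or `run/`), which this README does not spell out.

```python
S3_ARN_PREFIX = "arn:aws:s3:::"


def bucket_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 ARN such as arn:aws:s3:::my-bucket."""
    if not arn.startswith(S3_ARN_PREFIX):
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Anything after the first "/" would be an object key, not the bucket name.
    return arn[len(S3_ARN_PREFIX):].split("/", 1)[0]


def s3_uri(arn: str, key: str = "") -> str:
    """Build an s3:// URI for the bucket, optionally with an object key."""
    bucket = bucket_from_arn(arn)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


def needs_sra_toolkit(key: str) -> bool:
    """Assumption: keys under run/ are .sra files; sra-src/ holds FASTQ/BAM/CRAM."""
    top_level_folder = key.lstrip("/").split("/", 1)[0]
    return top_level_folder == "run"
```

For example, `s3_uri("arn:aws:s3:::sra-pub-sars-cov2")` yields the same `s3://sra-pub-sars-cov2` path passed to `fs.listdir` in the streaming snippet.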

Tags:

aws-pds, bioinformatics, biology, coronavirus, COVID-19, fastq, bam, cram, genomic, genetic, health, life sciences, MERS, SARS, virus, STRIDES, whole genome sequencing, transcriptomics

About

ncbi-covid-19-dataset originates from the Registry of Open Data on AWS.
