Biology Datasets 

The biology domain, which spans genetics, genomics, neuroscience, and the broader life sciences, offers a wealth of data for machine learning applications. Using genetic sequences, brain imaging, and other biological markers, models can make predictions and uncover insights related to health, evolution, and brain function. The emerging field of “precision medicine” draws on these data sources for personalized diagnosis and treatment, drug discovery, and the analysis of neurodegenerative diseases. With the right tools and algorithms, biological data can deepen our understanding of life and support advances in health research. The datasets listed below are a starting point for exploring these applications.

1000 Genomes

Binding DB – Data Lakehouse Ready

IBL Behavioral Data on AWS

OpenCell on AWS

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 – Data Lakehouse Ready

Allen Ivy Glioblastoma Atlas

Encyclopedia of DNA Elements (ENCODE)

Allen Brain Observatory – Visual Coding AWS Public Data Set

Tabula Sapiens

SiPeCaM (Sitios Permanentes de la Calibración y Monitoreo de la Biodiversidad)

Variant Effect Predictor (VEP) and the Loss-Of-Function Transcript Effect Estimator (LOFTEE) Plugin

COVID-19 Genome Sequence Dataset

Oxford Nanopore Technologies Benchmark Datasets

PubSeq – Public Sequence Resource

DNAStack COVID19 SRA Data

AWS iGenomes

stdpopsim species resources

Hecatomb Databases

Cloud Indexes for Bowtie, Kraken, HISAT, and Centrifuge

International Neuroimaging Data-Sharing Initiative (INDI)

Cell Organelle Segmentation in Electron Microscopy (COSEM) on AWS

Toxicant Exposures and Responses by Genomic and Epigenomic Regulators of Transcription (TaRGET)

UCSC Genome Browser Sequence and Annotations

Natural Scenes Dataset

Pacific Ocean Sound Recordings

Cancer Cell Line Encyclopedia (CCLE)

Allen Cell Imaging Collections

Distributed Archives for Neurophysiology Data Integration (DANDI)

COVID-19 Data Lake

Ohio State Cardiac MRI Raw Data (OCMR)

NOAA Water-Column Sonar Data Archive

REDASA COVID-19 Open Data

PhysioNet

CoMMpass from the Multiple Myeloma Research Foundation

GATK Test Data

Human Cancer Models Initiative (HCMI) Cancer Model Development Center

National Cancer Institute Center for Cancer Research – Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression

4D Nucleome (4DN)

recount3

OpenProteinSet

BossDB Open Neuroimagery Datasets

Cell Painting Image Collection

Sea Around Us Global Fisheries Catch Data

Tabula Muris Senis

IBL Neuropixels Reproducible Ephys Data on AWS

COVID-19 Open Research Dataset (CORD-19)

1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7

GATK Structural Variation (SV) Data

InRad COVID-19 X-Ray and CT Scans

UK Biobank Pan-Ancestry Summary Statistics

Clinical Trial Sequencing Project – Diffuse Large B-Cell Lymphoma

CAncer MEtastases in LYmph nOdes challeNge (CAMELYON) Dataset

Biological and Physical Sciences (BPS) Microscopy Benchmark Training Dataset

3000 Rice Genomes Project

Open Targets – Data Lakehouse Ready

Genome in a Bottle on AWS

ZINC Database

The Genome Modeling System

Human PanGenomics Project

Sounds of Central African landscapes

Oregon Health & Science University Chronic Neutrophilic Leukemia Dataset

IBL Neuropixels Brainwide Map on AWS

iNaturalist Licensed Observation Images

Open NeuroData

Genome Aggregation Database (gnomAD)

Refgenie reference genome assets

University of British Columbia Sunflower Genome Dataset

CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

Australasian Genomes

Medical Segmentation Decathlon

OpenCRAVAT

Africa Soil Information Service (AfSIS) Soil Chemistry

STOIC2021 Training

iSDAsoil

Foldingathome COVID-19 Datasets

Synthea synthetic patient generator data in OMOP Common Data Model

Foundation Medicine Adult Cancer Clinical Dataset (FM-AD)

OpenNeuro

COVID-19 Harmonized Data

Allen Mouse Brain Atlas

Mouse Brain Anatomy: MouseLight Imagery

Serratus: Ultra-deep Search for Novel Viruses – Versioned Data Release

Nanopore Reference Human Genome

Protein Data Bank 3D Structural Biology Data

Google Brain Genomics Sequencing Dataset for Benchmarking and Development

Conformational Space of Short Peptides

Seattle Alzheimer’s Disease Brain Cell Atlas (SEA-AD)

Cell Painting Gallery

The Singapore Nanopore Expression Data Set

NIH NCBI PubMed Central (PMC) Article Datasets – Full-Text Biomedical and Life Sciences Journal Articles on AWS

ChEMBL – Data Lakehouse Ready

Genome Aggregation Database (gnomAD) – Data Lakehouse Ready

National Herbarium of NSW

Broad Genome References

COVID-19 Molecular Structure and Therapeutics Hub

Tabula Muris

Basic Local Alignment Search Tool (BLAST) Databases

Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2)

Orcasound – bioacoustic data for marine conservation

Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3)

Fly Brain Anatomy: FlyLight Gen1 and Split-GAL4 Imagery

The Human Microbiome Project

ClinVar – Data Lakehouse Ready

QIIME 2 User Tutorial Datasets

UniProt

Open Bioinformatics Reference Data for Galaxy

CIViC (Clinical Interpretation of Variants in Cancer)

Biological and Physical Sciences (BPS) RNA Sequencing Benchmark Training Dataset

TIGER Training

Therapeutically Applicable Research to Generate Effective Treatments (TARGET)

Genome Ark

