
NIH NCBI PubMed Central (PMC) Article Datasets – Full-Text Biomedical and Life Sciences Journal Articles on AWS Dataset for Machine Learning
Install DagsHub:
pip install dagshub
To stream this data directly from DagsHub:
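from dagshub.streaming import DagsHubFilesystem
# Create a streaming filesystem rooted at the current directory and backed by the DagsHub repository
fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/ncbi-pmc-dataset")
# List the top-level contents of the PMC bucket
fs.listdir("s3://pmc-oa-opendata")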
Description
PubMed Central® (PMC) is a free full-text archive of biomedical and life sciences journal articles at the U.S. National Institutes of Health’s National Library of Medicine (NIH/NLM). The PubMed Central (PMC) Article Datasets include full-text articles archived in PMC and made available under license terms that allow for text mining and other types of secondary analysis and reuse. The articles are organized on AWS based on general license type:
The PMC Open Access (OA) Subset, which includes all articles in PMC with a machine-readable Creative Commons license
The Author Manuscript Dataset, which includes all articles collected under a funder policy in PMC and made available in machine-readable formats for text mining
These datasets collectively span more than half of PMC’s total collection of full-text articles. PMC enables access to these datasets to expand the impact of open access and publicly funded research; enable greater use of machine learning across the spectrum of scientific research; reach new audiences; and open new doors for discovery. The bucket in this registry contains individual articles in NISO Z39.96-2015 JATS XML format as well as in plain text extracted from the XML. The bucket is updated daily with new and updated articles. File lists containing metadata for the articles in each dataset are also included.
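As a rough illustration of how plain text can be derived from a JATS XML article, the sketch below pulls the title and body paragraphs out of a small inline sample using only the Python standard library. The sample document and the jats_to_text helper are hypothetical and written for this example; this is not the extraction procedure PMC uses, and real articles contain far more markup (tables, figures, references) than the sample covers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS fragment for illustration only;
# real PMC articles carry much more metadata and markup.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An <italic>example</italic> article</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph.</p>
      <p>Second paragraph citing prior work <xref ref-type="bibr" rid="b1">[1]</xref>.</p>
    </sec>
  </body>
</article>"""


def jats_to_text(xml_string):
    """Return (title, list of body paragraphs) from a JATS XML string.

    Hypothetical helper: inline markup such as <italic> or <xref> is
    flattened to its text content, and whitespace is normalized.
    """
    root = ET.fromstring(xml_string)

    # The article title may contain inline elements, so join all text nodes.
    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    # Collect every <p> element that appears anywhere under <body>.
    paragraphs = []
    for p in root.iterfind("body//p"):
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return title, paragraphs


title, paragraphs = jats_to_text(SAMPLE_JATS)
print(title)
for paragraph in paragraphs:
    print(paragraph)
```

To try this on a real article, the XML string could be read from a file opened through the streaming filesystem shown earlier; the path to any particular article inside the bucket is left out here because the bucket layout is not described on this page.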
Additional information
Documentation
Update frequency
Daily
Managed by
NLM