NLP Datasets

Natural language processing (NLP) enables computers to analyze and generate human language. It underpins voice assistants, chatbots, and other text-based interfaces. The datasets listed below cover NLP tasks including speech recognition, question answering, named entity recognition, summarization, and dialogue.

Automatic Speech Recognition (ASR) Error Robustness

Helpful Sentences from Reviews

Learning to Rank and Filter – community question answering

AI2 TabMCQ: Multiple Choice Questions aligned with the Aristo Tablestore

The Klarna Product-Page Dataset

MultiCoNER Dataset

Low Context Name Entity Recognition (NER) Datasets with Gazetteer

WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

Common Screens

REDASA COVID-19 Open Data

Sudachi Language Resources

Japanese Tokenizer Dictionaries

Answer Reformulation

Common Crawl

NLP – fast.ai datasets

DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue

VoiSeR

OpenAlex dataset

ZEST: ZEroShot learning from Task descriptions

Pre- and post-purchase product questions

Amazon-PQA

CMS 2008-2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF) in OMOP Common Data Model

AI2 Diagram Dataset (AI2D)

Textbook Question Answering (TQA)

Synthea synthetic patient generator data in OMOP Common Data Model

AI2 Tablestore (November 2015 Snapshot)

Humor Detection from Product Question Answering Systems

Aristo Tuple KB

Humor patterns used for querying Alexa traffic

Discrete Reasoning Over the content of Paragraphs (DROP)

The Massively Multilingual Image Dataset (MMID)

Wizard of Tasks

Reasoning Over Paragraph Effects in Situations (ROPES)

Quoref

Provision of Web-Scale Parallel Corpora for Official European Languages (ParaCrawl)

Enriched Topical-Chat Dataset for Knowledge-Grounded Dialogue Systems

National Archives Catalog

Google Books Ngrams

PASS: Perturb-and-Select Summarizer for Product Reviews

Multilingual Name Entity Recognition (NER) Datasets with Gazetteer

Phrase Clustering Dataset (PCD)

The Multilingual Amazon Reviews Corpus

Software Heritage Graph Dataset
