Open Source Data Science Datasets

Path: .

No description

dataset nlp dvc git github

Path: .

Transactions messages NLP

dataset model nlp dvc label studio git

Path: .

This repository contains the code to import and integrate the book and rating data that we work with. It imports and integrates data from several sources into homogeneous tabular outputs; the import scripts are primarily Rust, with Python used to implement the analyses.

dataset nlp dvc git github

Path: .

Classification of mail text in scanned PDF images

dataset nlp classification object detection image classification dvc git

Path: datasets

A DagsHub implementation of BioBERT: a pre-trained biomedical language representation model for biomedical text mining

dataset model nlp named entity recognition dvc git

Path: data tests

DPT is a QA bot designed to help answer questions about DagsHub. It is a fork of the brilliant buster project. Using DagsHub's documentation as reference and sentence-transformers/all-MiniLM-L6-v2 for sentence similarity, we identify documents that contain information relevant to a given query. These documents are then passed to OpenAI's GPT-3.5 Turbo, which uses the retrieved information and the query, together with a prompt, to return a hopefully helpful answer to the user.

dataset nlp question answering chatbot dvc git

morrisalp / unikud

Updated 6 months ago

Path: . data

UNIKUD is an open-source tool for adding vowel signs (nikud) to Hebrew text with deep learning, using absolutely no rule-based logic.

dataset model nlp dvc git mlflow github

Path: .

A subset of the LAION Aesthetics V2 dataset that contains only images with an aesthetics score of 6.5 or larger.

dataset nlp computer vision text-to-image generation dvc git

Path: .

databricks-dolly-15k is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

dataset nlp dvc git

Path: .

Code for the TriviaQA reading comprehension dataset

dataset nlp dvc git github

Path: data

Fastai community entry to 2020 Reproducibility Challenge

dataset nlp dvc git github

Path: .

No description

dataset nlp dvc git

Path: .

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

dataset nlp language modelling dvc git

Path: .

The purpose of the project is to make available a standard training and test setup for language modeling experiments.

dataset nlp language modelling dvc git

Path: .

Subsets of IMDb data are available for access to customers for personal and non-commercial use

dataset nlp tabular dvc git

Path: .

The test data for the Large Text Compression Benchmark is the first 10^9 bytes of the English Wikipedia

dataset nlp dvc git

Path: .

SQuAD (Stanford Question Answering Dataset) is a dataset for reading comprehension. It consists of questions posed by crowdworkers on a set of Wikipedia articles. The answer to each question is a segment of text, or span, from the corresponding Wikipedia reading passage; some questions may instead be unanswerable.

dataset nlp question answering reading comprehension dvc git