
bakeoff-sample

This repository holds a sample application that bakeoff participants will implement in their respective frameworks.

Data

We have three question/answer datasets from Indeed, as described below:

original.8.15.2019 original.5.26.2020 original.8.10.2020
no of questions 65381 208559
no of answers 321287 919656

All datasets are on server-san, in the /mnt/data/dataset/indeed_qa/{original.8.15.2019 | original.5.26.2020 | original.8.10.2020} directories. To use them in the project, copy them into the <repository>/data directory. The datasets in original.8.15.2019 and original.5.26.2020 are in CSV format, whereas original.8.10.2020 contains one JSON file per question, holding the question together with its answers.
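
For the JSON-per-question layout, a minimal loading sketch could look like the following. It assumes the files sit directly in the dataset directory and end in .json; the function name iter_question_files and the example path are illustrative and not part of this repository. The schema of each file is not documented here, so the sketch returns the parsed objects untouched.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Union


def iter_question_files(dataset_dir: Union[str, Path]) -> Iterator[Any]:
    """Yield the parsed content of every *.json file in dataset_dir.

    Assumption: one question (with its answers) per file, stored directly
    in the directory. Files are visited in sorted filename order.
    """
    for path in sorted(Path(dataset_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


if __name__ == "__main__":
    # Hypothetical location after copying the dataset into <repository>/data.
    for question in iter_question_files("data/original.8.10.2020"):
        print(type(question).__name__)
```

Iterating lazily keeps memory use flat, which may matter given the number of questions in the larger datasets.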

Run Textrank

Setup

Set up Python 3.7.4 with pyenv and create a virtualenv named bakeoff-sample:

cd scripts
pyenv local 3.7.4
pyenv virtualenv 3.7.4 bakeoff-sample

Libraries

Install the required libraries in the bakeoff-sample virtualenv:

cd scripts
pip install -r requirements.txt

Run

The following command runs TextRank on questions containing the keywords interview and process. It produces summaries of at most 5 sentences and 100 words and writes them to the data/output directory, where each JSON file contains summarizations.

cd scripts
python run_textrank.py \
  --keywords interview,process \
  --num_sentences 5 \
  --num_words 100

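To make the two limits concrete, here is one way a sentence cap and a word cap could be applied together to a ranked list of sentences. This is only a sketch under stated assumptions: cap_summary is a hypothetical helper, words are counted by whitespace splitting, and run_textrank.py may combine the limits differently.

```python
from typing import List


def cap_summary(ranked_sentences: List[str], num_sentences: int, num_words: int) -> List[str]:
    """Keep top-ranked sentences while staying within both limits.

    Assumption: ranked_sentences is already sorted best-first, and a word
    is any whitespace-separated token.
    """
    selected: List[str] = []
    words_used = 0
    for sentence in ranked_sentences:
        if len(selected) == num_sentences:
            break
        length = len(sentence.split())
        # Skip a sentence that would exceed the word budget;
        # a shorter, lower-ranked sentence may still fit.
        if words_used + length > num_words:
            continue
        selected.append(sentence)
        words_used += length
    return selected


if __name__ == "__main__":
    ranked = ["The interview had three rounds.", "It took about two weeks.", "Be on time."]
    print(cap_summary(ranked, num_sentences=5, num_words=100))
```
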
Run TextRank on one or more specific topics:

python run_textrank.py \
  --topics DRESS_CODE \
  --num_sentences 5 \
  --num_words 100

Run TextRank on specific questions:

python run_textrank.py \
  --questions <questionids> \
  --num_sentences 5 \
  --num_words 100

Output format

Each summarization JSON file has the following format:

{
	"question_id": "...",
	"question_text": "...?",
	"question_code": "...",
	"question_topics": ["...", "..."],
	"company_id": 123,
	"company": "...",
	"num_answers": 7,
	"summary": [{
		"text": "...",
		"support": 2,
		"coverage": [{
			"answer_id": "...",
			"answer_text": "...",
			"similarity": 0.9,
			"matching_tokens": [{
				"token": "...",
				"spans": [ {
					"text": "...",
					"start": 0,
					"end": 8
				}, ...]
			}, ...]
		}, {
			"answer_id": "...",
			"answer_text": "..."
		}],
		"tokens": ["...", "...", ...]
	}, ...],
	"stats": {
		"ROUGE-1": 0.9937106868240972,
		"ROUGE-2": 0.9852216698769686,
		"ROUGE-L": 0.9936320879085392
	}
}
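
As a rough illustration of consuming this format, the sketch below loads one summarization file and produces a readable line per summary sentence. It relies only on the field names shown above; the helper names load_result and describe_summary and the example file path are hypothetical, and the sketch makes no assumption about how support or similarity are computed.

```python
import json
from pathlib import Path
from typing import List


def load_result(path) -> dict:
    """Parse one summarization JSON file."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def describe_summary(result: dict) -> List[str]:
    """Return one line per summary sentence with its support and answer ids."""
    lines = []
    for sentence in result.get("summary", []):
        # Assumption: every coverage entry carries an "answer_id",
        # as in the format shown above.
        answer_ids = [item["answer_id"] for item in sentence.get("coverage", [])]
        lines.append(
            "{} (support={}, answers={})".format(
                sentence["text"], sentence.get("support"), ", ".join(answer_ids)
            )
        )
    return lines


if __name__ == "__main__":
    # Hypothetical file name inside the output directory.
    for line in describe_summary(load_result("data/output/example.json")):
        print(line)
```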