
bakeoff-sample

This repository holds a sample application that bakeoff participants will implement in their respective frameworks.

Data

We have three question/answer datasets from Indeed, as described below:

	original.8.15.2019	original.5.26.2020	original.8.10.2020
No. of questions	65381	208559
No. of answers	321287	919656

All datasets are on server-san, in the /mnt/data/dataset/indeed_qa/{original.8.15.2019 | original.5.26.2020 | original.8.10.2020} directories. To use them in the project, copy them into the <repository>/data directory. The original.8.15.2019 and original.5.26.2020 datasets are in CSV format, whereas original.8.10.2020 contains one JSON file per question, holding the question and its answers.
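
The copy step described above could be scripted roughly as follows. This is a minimal sketch, not part of the repository: `copy_dataset` is a hypothetical helper, and it assumes each dataset is a plain directory that can be copied as a whole.

```python
import shutil
from pathlib import Path

# Dataset directory names as listed in the Data section.
DATASETS = ("original.8.15.2019", "original.5.26.2020", "original.8.10.2020")


def copy_dataset(name, source_root, repo_root):
    """Copy one dataset directory into <repo_root>/data and return the new path.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    if name not in DATASETS:
        raise ValueError(f"unknown dataset: {name}")
    source = Path(source_root) / name
    target = Path(repo_root) / "data" / name
    if target.exists():
        # Leave an existing copy untouched rather than overwrite it.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, target)
    return target
```

For example, `copy_dataset("original.8.10.2020", "/mnt/data/dataset/indeed_qa", ".")` would copy that dataset into ./data when run from the repository root on server-san.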

Run TextRank

Setup

Set up Python 3.7.4 and create a virtualenv for bakeoff-sample:

cd scripts
pyenv local 3.7.4
pyenv virtualenv 3.7.4 bakeoff-sample

Libraries

Install the required libraries in the bakeoff-sample virtualenv:

cd scripts
pip install -r requirements.txt

Run

The following command runs TextRank on questions containing the keywords interview and process, and writes summaries of at most 5 sentences and 100 words to the data/output directory. Each JSON file in that directory contains the summarizations.

cd scripts
python run_textrank.py \
  --keywords interview,process \
  --num_sentences 5 \
  --num_words 100
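
For readers unfamiliar with the technique, here is a minimal sketch of extractive TextRank with a sentence cap and a word cap. It is not the code in run_textrank.py: the tokenizer, the word-overlap similarity, the damping factor, and the way the two limits are applied are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import re


def tokenize(sentence):
    """Lowercase word tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def similarity(tokens_a, tokens_b):
    """Shared-word count normalized by the log lengths of both sentences."""
    denom = math.log(len(tokens_a) or 1) + math.log(len(tokens_b) or 1)
    if denom == 0:
        return 0.0
    return len(set(tokens_a) & set(tokens_b)) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style iteration over the similarity graph."""
    n = len(sentences)
    tokens = [tokenize(s) for s in sentences]
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, num_sentences=5, num_words=100):
    """Pick top-ranked sentences within both limits, in original order."""
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, words_used = [], 0
    for i in ranked:
        if len(chosen) == num_sentences:
            break
        length = len(sentences[i].split())
        if words_used + length > num_words:
            continue  # skip sentences that would exceed the word budget
        chosen.append(i)
        words_used += length
    return [sentences[i] for i in sorted(chosen)]
```

Calling `summarize(answers, num_sentences=5, num_words=100)` on a list of answer sentences mirrors the limits passed on the command line, under the assumptions stated above.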

Run TextRank on one or more specific topics:

python run_textrank.py \
  --topics DRESS_CODE \
  --num_sentences 5 \
  --num_words 100

Run TextRank on specific questions:

python run_textrank.py \
  --questions <questionids> \
  --num_sentences 5 \
  --num_words 100
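
The flags used in the three commands above could be parsed along these lines. This is a hedged sketch, not the argument handling in run_textrank.py: the flag names come from the commands above, while the comma-separated format for --topics and --questions, the defaults, and the helper names `comma_list` and `build_parser` are assumptions.

```python
import argparse


def comma_list(value):
    """Split a comma-separated flag value such as 'interview,process'."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    """Build a parser for the flags shown in this README (hypothetical)."""
    parser = argparse.ArgumentParser(description="Sketch of the TextRank flags.")
    parser.add_argument("--keywords", type=comma_list, default=[])
    # Assumption: topics and question ids are also comma-separated.
    parser.add_argument("--topics", type=comma_list, default=[])
    parser.add_argument("--questions", type=comma_list, default=[])
    # Assumption: the defaults mirror the values used in the examples.
    parser.add_argument("--num_sentences", type=int, default=5)
    parser.add_argument("--num_words", type=int, default=100)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--keywords", "interview,process", "--num_sentences", "5", "--num_words", "100"]
)
print(args.keywords, args.num_sentences, args.num_words)
```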

Output format

Each summarization JSON file has the following format:

{
	"question_id": "...",
	"question_text": "...?",
	"question_code": "...",
	"question_topics": ["...", "..."],
	"company_id": 123,
	"company": "...",
	"num_answers": 7,
	"summary": [{
		"text": "...",
		"support": 2,
		"coverage": [{
			"answer_id": "...",
			"answer_text": "...",
			"similarity": 0.9,
			"matching_tokens": [{
				"token": "...",
				"spans": [ {
					"text": "...",
					"start": 0,
					"end": 8
				}, ...]
			}, ...]
		}, {
			"answer_id": "...",
			"answer_text": "..."
		}],
		"tokens": ["...", "...", ...]
	}, ...],
	"stats": {
		"ROUGE-1": 0.9937106868240972,
		"ROUGE-2": 0.9852216698769686,
		"ROUGE-L": 0.9936320879085392
	}
}
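
A consumer of these files might read them as sketched below. The field names are taken from the format above; the assumptions are that the files carry a .json extension and sit in a single directory such as data/output, and `load_summaries` and `format_summary` are hypothetical helpers, not part of the repository.

```python
import json
from pathlib import Path


def load_summaries(output_dir):
    """Yield the parsed content of every .json file in output_dir."""
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


def format_summary(doc):
    """Render one summarization document as a list of plain-text lines."""
    lines = [
        f'{doc["question_id"]}: {doc["question_text"]} ({doc["num_answers"]} answers)'
    ]
    for sentence in doc["summary"]:
        # Print the support value as stored, next to the summary sentence.
        lines.append(f'  [{sentence["support"]}] {sentence["text"]}')
    for name, value in doc.get("stats", {}).items():
        lines.append(f"  {name}: {value:.3f}")
    return lines
```

For example, `for doc in load_summaries("data/output"): print("\n".join(format_summary(doc)))` prints each question followed by its summary sentences and ROUGE scores.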
