README.md


This repository contains the code to import and integrate the book and rating data that we work with. It imports and integrates data from several sources into a single PostgreSQL database; import scripts are primarily in Python, with Rust code for high-throughput processing of raw data files.

If you use these scripts in any published research, cite our paper:

Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, pp. 242–250. DOI:10.1145/3240323.3240373. arXiv:1808.07586v1 [cs.IR].

Note: the limitations section of the paper contains important information about the limitations of the data these scripts compile. Do not use this data or these tools without understanding those limitations. In particular, VIAF's gender information is incomplete and, in a number of cases, incorrect.

We use Data Version Control (dvc) to script the import and wire its various parts together.

Requirements

  • PostgreSQL 10 or later with orafce and pg_prewarm (from the PostgreSQL Contrib package) installed.
  • Python 3.6 or later with the following packages:
    • psycopg2
    • numpy
    • tqdm
    • pandas
    • numba
    • colorama
    • chromalog
    • humanize
    • dvc
  • The Rust compiler (available from Anaconda)
  • 2TB disk space for the database
  • 100GB disk space for data files

It is best if you do not store the data files on the same disk as your PostgreSQL database.

The environment.yml file defines an Anaconda environment that contains all the required packages except for the PostgreSQL server. It can be set up with:

conda env create -f environment.yml

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD

Alternatively, they will read from DB_URL.
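
As a rough illustration of how a script might resolve these settings, the sketch below builds a connection URL from the environment. The function name `connection_url`, the choice to let DB_URL take precedence over the PG* variables, and the `localhost` fallback are all assumptions made for this example; the import scripts may resolve them differently.

```python
import os
from urllib.parse import quote


def connection_url(env=None):
    """Resolve a PostgreSQL connection URL from environment-style settings.

    Assumption: DB_URL wins when it is set; otherwise the URL is assembled
    from the standard PG* client variables.
    """
    env = os.environ if env is None else env

    if env.get("DB_URL"):
        return env["DB_URL"]

    database = env.get("PGDATABASE")
    if not database:
        raise RuntimeError("set PGDATABASE or DB_URL")

    # Assumption for this sketch: fall back to localhost when PGHOST is unset.
    host = env.get("PGHOST", "localhost")
    user = env.get("PGUSER")
    password = env.get("PGPASSWORD")

    auth = ""
    if user:
        # Percent-encode credentials so characters like '@' stay unambiguous.
        auth = quote(user, safe="")
        if password:
            auth += ":" + quote(password, safe="")
        auth += "@"

    return f"postgresql://{auth}{host}/{database}"
```

The resulting string is a libpq-style connection URI, which `psycopg2.connect` accepts as its connection string.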

Initializing and Configuring the Database

After creating your database, initialize the extensions (as the database superuser):

CREATE EXTENSION orafce;
CREATE EXTENSION pg_prewarm;
CREATE EXTENSION "uuid-ossp";

The default PostgreSQL performance configuration settings will probably not be very effective; we recommend turning on parallelism and increasing work memory, at a minimum.

Downloading Data Files

This imports the following data sets:

Several of these files can be auto-downloaded with the DVC scripts; others will need to be manually downloaded.

Running Everything

You can run the entire import process with:

dvc repro

Individual steps can be run with their corresponding .dvc files.
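
If you drive the import from Python rather than the shell, a thin wrapper around `dvc repro` may be convenient. This is only a sketch: the helper names `repro_command` and `run_step` are hypothetical, and any step name you pass must match a .dvc file that actually exists in the repository.

```python
import subprocess


def repro_command(step=None):
    """Build the DVC command line; with no step, reproduce the whole pipeline."""
    cmd = ["dvc", "repro"]
    if step is not None:
        # Individual steps are addressed by their stage file, e.g. "step.dvc".
        cmd.append(f"{step}.dvc")
    return cmd


def run_step(step=None):
    """Run DVC in the current directory and raise if it exits non-zero."""
    subprocess.run(repro_command(step), check=True)
```

Calling `run_step()` with no argument corresponds to the plain `dvc repro` shown above.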

Layout

The import code consists of Python, Rust, and SQL code, wired together with DVC.

DVC Usage and Stage Files

To make DVC aware of the current database state, many of our DVC scripts use a somewhat unconventional layout. Many import steps have two .dvc files, each with an associated output:

  • step.dvc runs import stage step.
  • step.transcript is (consistent) output from running step, recording the actions taken. It is registered with DVC as the output of step.dvc.
  • step.status.dvc is an always-changed DVC stage that depends on step.transcript and produces step.status, to check the current status in the database of that import stage.
  • step.status is an uncached output (so it isn't saved with DVC, and it is also ignored by Git) that is registered as the output of step.status.dvc. It contains a stable status dump from the database, to check whether step is actually in the database or has changed in a meaningful way.

Steps that depend on step then depend on step.status, not step.transcript.

The reason for this somewhat bizarre layout is that if we only wrote the output files and the database was later reloaded or corrupted, the DVC status-checking logic would not be able to detect it. This double-file design allows us to make subsequent steps depend on the actual results of the import, not our memory of the import in the Git repository.
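
The idea behind the double-file design can be sketched in a few lines of Python. Everything here is illustrative: the helper names and the contents of `db_state` are hypothetical, and the real status dump comes from queries against the database. The point is only that a status file rendered deterministically from database state changes its checksum when the database changes, even if the transcript stored in Git does not.

```python
import hashlib


def stage_files(step):
    """Return the four file names this layout associates with an import step."""
    return {
        "stage": f"{step}.dvc",
        "transcript": f"{step}.transcript",
        "status_stage": f"{step}.status.dvc",
        "status": f"{step}.status",
    }


def render_status(step, db_state):
    """Render a stable status dump from (hypothetical) database facts.

    Keys are sorted so that identical database state always yields
    byte-identical text, and therefore an identical checksum.
    """
    lines = [f"STAGE {step}"]
    for key in sorted(db_state):
        lines.append(f"{key}: {db_state[key]}")
    return "\n".join(lines) + "\n"


def md5_of(text):
    """Checksum of a text file's contents."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def downstream_is_stale(recorded_md5, step, db_state):
    """True if the current status differs from the one a dependent step last saw."""
    return md5_of(render_status(step, db_state)) != recorded_md5
```

For example, if a dependent step recorded the checksum of the status when the import had completed, and the database is later wiped so that `db_state` comes back empty, `downstream_is_stale` returns True even though step.transcript is unchanged.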

The file init.status is an initial check for database initialization, and forces the creation of the meta-structures used for tracking stage status. Everything touching the database should depend on it, directly or indirectly.

