This repository contains the code to import and integrate the book and rating data that we work with. It imports and integrates data from several sources in a single PostgreSQL database; import scripts are primarily in Python, with Rust code for high-throughput processing of raw data files.
If you use these scripts in any published research, cite our paper:
Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, pp. 242–250. DOI:10.1145/3240323.3240373. arXiv:1808.07586v1 [cs.IR].
Note: the limitations section of the paper contains important information about the limitations of the data these scripts compile. Do not use this data or tools without understanding those limitations. In particular, VIAF's gender information is incomplete and, in a number of cases, incorrect.
We use Data Version Control (dvc) to script the import and wire its various parts together. You will also need pg_prewarm (from the PostgreSQL Contrib package) installed. It is best if you do not store the data files on the same disk as your PostgreSQL database.
The environment.yml
file defines an Anaconda environment that contains all the required packages except for the PostgreSQL server. It can be set up with:
conda env create -f environment.yml
All scripts read database connection info from the standard PostgreSQL client environment variables:
- PGDATABASE
- PGHOST
- PGUSER
- PGPASSWORD
Alternatively, they will read a connection URL from DB_URL.
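As an illustration, a script could resolve these variables along the following lines (the helper below is hypothetical, for illustration only; the repository's scripts may assemble their connections differently):

```python
import os

def connection_url(env=None):
    """Build a PostgreSQL connection URL, preferring DB_URL over PG* variables.

    Hypothetical helper for illustration; not the repository's actual code.
    """
    env = os.environ if env is None else env
    if env.get('DB_URL'):
        return env['DB_URL']
    # Fall back to the standard PostgreSQL client environment variables.
    user = env.get('PGUSER', '')
    password = env.get('PGPASSWORD', '')
    host = env.get('PGHOST', 'localhost')
    db = env.get('PGDATABASE', '')
    auth = user + (':' + password if password else '')
    return 'postgresql://{}{}/{}'.format(auth + '@' if auth else '', host, db)

print(connection_url({'PGHOST': 'localhost', 'PGUSER': 'books', 'PGDATABASE': 'bookdata'}))
# → postgresql://books@localhost/bookdata
```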
After creating your database, initialize the extensions (as the database superuser):
CREATE EXTENSION orafce;
CREATE EXTENSION pg_prewarm;
CREATE EXTENSION "uuid-ossp";
The default PostgreSQL performance configuration settings will probably not be very effective; we recommend turning on parallelism and increasing work memory, at a minimum.
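As a rough starting point, that might mean postgresql.conf settings along these lines (illustrative values only, not tuned recommendations; adjust for your hardware and data size):

```
# postgresql.conf — illustrative starting values
shared_buffers = 4GB                  # more cache for large imports
work_mem = 256MB                      # per-operation sort/hash memory
maintenance_work_mem = 1GB            # speeds index builds
max_parallel_workers_per_gather = 4   # enable parallel query execution
```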
This imports a number of data sets, stored under data. Several of these files can be auto-downloaded with the DVC scripts; others will need to be manually downloaded.
You can run the entire import process with:
dvc repro
Individual steps can be run with their corresponding .dvc
files.
The import code consists of Python, Rust, and SQL code, wired together with DVC.
Python scripts live under scripts
, as a Python package. They should not be launched directly, but
rather via run.py
, which will make sure the environment is set up properly for them.
In order to allow DVC to be aware of current database state, we use a little bit of an unconventional layout for many of our DVC scripts. Many steps have two .dvc files with associated outputs:

- step.dvc runs import stage step.
- step.transcript is (consistent) output from running step, recording the actions taken. It is registered with DVC as the output of step.dvc.
- step.status.dvc is an always-changed DVC stage that depends on step.transcript and produces step.status, to check the current status in the database of that import stage.
- step.status is an uncached output (so it isn't saved with DVC, and we also ignore it from Git) that is registered as the output of step.status.dvc. It contains a stable status dump from the database, to check whether step is actually in the database or has changed in a meaningful way.

Steps that depend on step then depend on step.status, not step.transcript.
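Concretely, such a pair of stage files might look roughly like this (a sketch in the single-output .dvc stage format; the commands, paths, and the always_changed/cache flags are illustrative assumptions, not copied from the repository):

```
# step.dvc — runs the import stage and records its transcript
cmd: python run.py import step
deps:
- path: data/step-input.csv
outs:
- path: step.transcript

# step.status.dvc — always re-run; snapshots database state, uncached
cmd: python run.py stage-status step
always_changed: true
deps:
- path: step.transcript
outs:
- path: step.status
  cache: false
```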
The reason for this somewhat bizarre layout is that if we just wrote the output files, and the database was reloaded or corrupted, the DVC status-checking logic would not be able to keep track of it. This double-file design allows us to make subsequent steps depend on the actual results of the import, not our memory of the import in the Git repository.
The file init.status
is an initial check for database initialization, and forces the creation of the
meta-structures used for tracking stage status. Everything touching the database should depend on it,
directly or indirectly.
Import steps are tracked in the stage_status
table in the database. For completed stages, this can
include a key (checksum, UUID, or other identifier) to identify a 'version' of the stage. Stages
can also have dependencies, which are solely used for computing the status of a stage (all actual
dependency relationships are handled by DVC):
- stage_deps tracks stage-to-stage dependencies, to say that one stage used another as input.
- stage_file tracks stage-to-file dependencies, to say that a stage used a file as input.
- The source_file table tracks input file checksums.
Projects using the book database can also use stage_status
to obtain data version information, to
see if they are up-to-date.
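For example, a downstream project might check the recorded stage versions with a query along these lines (the column names are assumptions about the stage_status layout, shown for illustration only; consult the actual schema before use):

```sql
-- Hypothetical column names; check the real stage_status schema
SELECT stage_name, stage_key
FROM stage_status;
```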