This repository contains the code to import and integrate the book and rating data that we work with. It combines data from several sources into homogeneous tabular outputs; the import scripts are primarily written in Rust, with Python implementing the analyses.

If you use these scripts in any published research, cite our paper:

Michael D. Ekstrand, Mucun Tian, Mohammed R. Imran Kazi, Hoda Mehrpouyan, and Daniel Kluver. 2018. Exploring Author Gender in Book Rating and Recommendation. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys '18). ACM, pp. 242–250. DOI:10.1145/3240323.3240373. arXiv:1808.07586v1 [cs.IR].

Note: the limitations section of the paper contains important information about the limitations of the data these scripts compile. Do not use these data or tools without understanding those limitations.

Requirements

  • PostgreSQL 10 or later with orafce and pg_prewarm (from the PostgreSQL Contrib package) installed.
  • Python 3.6 or later with the following packages:
    • psycopg2
    • invoke
    • numpy
    • tqdm
    • pandas
    • numba
    • colorama
    • chromalog
    • humanize
  • A Rust compiler (available from Anaconda)
  • psql executable on the machine where the import scripts will run
  • 1TB disk space for the database
  • 100GB disk space for data files
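Before going further, it can help to confirm the core tools are on your PATH. A minimal check (the version output will vary):

psql --version     # PostgreSQL client, should report 10 or later
python --version   # should report 3.6 or later
cargo --version    # Rust toolchain
invoke --version   # the task runner used below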

The environment.yml file defines an Anaconda environment that contains all the required packages except for the PostgreSQL server. It can be set up with:

conda env create -f environment.yml
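Then activate the environment before running any tasks. The environment name below is an assumption; use whatever name environment.yml defines:

conda activate bookdata   # substitute the name defined in environment.yml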

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
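For example, to point the scripts at a local database (all of the values here are placeholders):

export PGDATABASE=bookdata   # placeholder database name
export PGHOST=localhost
export PGUSER=bookdata       # placeholder user name
export PGPASSWORD=secret     # placeholder password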

Initializing and Configuring the Database

After creating your database, initialize the extensions (as the database superuser):

CREATE EXTENSION orafce;
CREATE EXTENSION pg_prewarm;

The default PostgreSQL performance settings will probably not work well for this workload; at a minimum, we recommend turning on parallelism and increasing work memory.
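As a sketch of what that might look like, applied as the database superuser with ALTER SYSTEM (the values are illustrative, not tuned recommendations):

psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 4"   # allow parallel query
psql -c "ALTER SYSTEM SET work_mem = '256MB'"                    # per-operation sort/hash memory
psql -c "ALTER SYSTEM SET maintenance_work_mem = '1GB'"          # speeds up index builds
psql -c "SELECT pg_reload_conf()"                                # apply without a restart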

Downloading Data Files

This imports the following data sets:

  • Library of Congress catalog records (books and names)
  • VIAF (Virtual International Authority File)
  • OpenLibrary book data
  • GoodReads book and rating data
  • Amazon ratings
  • BookCrossing ratings

Running Import Tasks

The import process is scripted with invoke. The first tasks to run are the import tasks:

invoke loc.import-books
invoke loc.import-names
invoke viaf.import
invoke openlib.import-authors openlib.import-works openlib.import-editions
invoke goodreads.import
invoke ratings.import-az
invoke ratings.import-bx
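If you are unsure of a task name, invoke can list everything that is defined:

invoke --list   # enumerate all available tasks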

Once all the data is imported, you can begin to run the indexing and linking tasks:

invoke viaf.index
invoke loc.index-books
invoke loc.index-names
invoke openlib.index
invoke goodreads.index-books
invoke analyze.cluster --scope loc
invoke analyze.cluster --scope ol
invoke analyze.cluster --scope gr
invoke analyze.cluster
invoke ratings.index
invoke goodreads.index-ratings
invoke analyze.authors

The tasks keep track of the import status in an import_status table, and will keep you from running tasks in the wrong order.
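If a task refuses to run, you can inspect the recorded status directly. The table name comes from the tooling, but its exact columns may differ:

psql -c "SELECT * FROM import_status"   # show what has been imported so far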

Setting Up Schemas

The -schema files contain the base schemas for the data:

  • common-schema.sql — common tables
  • loc-mds-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
  • gr-schema.sql — GoodReads data schema
  • loc-ids-schema.sql — LOC ID schemas
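If you need to apply one of these files by hand, a plain psql invocation works, assuming the connection environment variables described above are set:

psql -f common-schema.sql   # for example, load the common tables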