Last commit: Michael Ekstrand, defa4eff0d, "try to work on psql", 5 years ago

README.md

This repository contains the code to import and integrate the book and rating data that we work with.

Requirements

  • PostgreSQL 10 or later with orafce
  • Python 3.6 or later with the following packages:
    • psycopg2
    • invoke
    • numpy
    • tqdm
    • pandas
    • sqlalchemy
    • numba
    • colorama
    • chromalog
    • humanize
  • A Rust compiler
  • psql executable on the machine where the import scripts will run
  • 500GB disk space for the database
  • 30GB disk space for data files

The environment-linux-x64.yml file defines an Anaconda environment that contains all the required packages, with the exception of the PostgreSQL server and client executables.

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
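
As a rough illustration of how a script might pick up these settings, the sketch below maps whichever of the variables are set to psycopg2-style keyword names. The helper name `connection_params` is hypothetical and not taken from this repository; libpq-based clients such as psql and psycopg2 also read these variables on their own, so a helper like this is mostly useful for logging or sanity checks.

```python
import os

# Standard PostgreSQL client environment variables, mapped to the
# keyword names that psycopg2.connect() accepts.
PG_ENV_TO_KEYWORD = {
    "PGDATABASE": "dbname",
    "PGHOST": "host",
    "PGUSER": "user",
    "PGPASSWORD": "password",
}


def connection_params(environ=os.environ):
    """Return connection keywords for the PG* variables that are set.

    Hypothetical helper for illustration; unset variables are skipped
    so that the client library can fall back to its own defaults.
    """
    return {
        keyword: environ[name]
        for name, keyword in PG_ENV_TO_KEYWORD.items()
        if name in environ
    }


if __name__ == "__main__":
    params = connection_params()
    # Avoid echoing the password when reporting what was found.
    shown = {k: ("***" if k == "password" else v) for k, v in params.items()}
    print(shown)
```

The resulting dictionary could be passed as `psycopg2.connect(**params)`, assuming psycopg2 is installed and the server is reachable.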

Downloading Data Files

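This imports the following data sets:

  • Library of Congress catalog data
  • VIAF
  • OpenLibrary
  • GoodReads
  • Amazon ratings
  • BookCrossing ratings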

Running Import Tasks

The import process is scripted with invoke. The first tasks to run are the import tasks:

invoke loc.import
invoke viaf.import
invoke openlib.import-authors openlib.import-works openlib.import-editions
invoke goodreads.import
invoke ratings.import-az
invoke ratings.import-bx

Once all the data is imported, you can begin to run the indexing and linking tasks:

invoke viaf.index
invoke loc.index
invoke openlib.index
invoke goodreads.index-books
invoke analyze.cluster --scope loc
invoke analyze.cluster --scope ol
invoke analyze.cluster --scope gr
invoke analyze.cluster
invoke ratings.index
invoke goodreads.index-ratings
invoke analyze.authors
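
The two command lists above can be driven from a small script. This is only a sketch: the `run_tasks` helper and its `dry_run` flag are hypothetical, and the task names are copied verbatim from the lists above. It stops at the first failing command, since later tasks depend on earlier ones.

```python
import shlex
import subprocess

# Import tasks, in the order given above.
IMPORT_TASKS = [
    "invoke loc.import",
    "invoke viaf.import",
    "invoke openlib.import-authors openlib.import-works openlib.import-editions",
    "invoke goodreads.import",
    "invoke ratings.import-az",
    "invoke ratings.import-bx",
]

# Indexing and linking tasks, in the order given above.
INDEX_TASKS = [
    "invoke viaf.index",
    "invoke loc.index",
    "invoke openlib.index",
    "invoke goodreads.index-books",
    "invoke analyze.cluster --scope loc",
    "invoke analyze.cluster --scope ol",
    "invoke analyze.cluster --scope gr",
    "invoke analyze.cluster",
    "invoke ratings.index",
    "invoke goodreads.index-ratings",
    "invoke analyze.authors",
]


def run_tasks(commands, dry_run=False):
    """Run each command in order; return the argument lists that were planned.

    With dry_run=True nothing is executed. Otherwise check=True makes
    subprocess.run raise CalledProcessError on the first failure.
    """
    planned = []
    for command in commands:
        argv = shlex.split(command)
        planned.append(argv)
        if not dry_run:
            subprocess.run(argv, check=True)
    return planned


if __name__ == "__main__":
    # Dry run by default: print the plan without touching the database.
    for argv in run_tasks(IMPORT_TASKS + INDEX_TASKS, dry_run=True):
        print(" ".join(argv))
```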

The tasks keep track of the import status in an import_status table, and will keep you from running tasks in the wrong order.
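
To make the idea concrete, here is a minimal sketch of status tracking with an ordering check. Everything beyond the table name `import_status` is an assumption: the column names, the `PREREQS` map, and the helper functions are hypothetical, and `sqlite3` stands in for PostgreSQL only so that the example runs without a server.

```python
import sqlite3

# Hypothetical prerequisites, chosen to be consistent with the task
# order described above; the real dependency rules may differ.
PREREQS = {
    "loc.index": ["loc.import"],
    "viaf.index": ["viaf.import"],
}


def open_status_db():
    """Create an in-memory stand-in for the import_status table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE import_status (step TEXT PRIMARY KEY, finished_at TEXT)"
    )
    return db


def finish(db, step):
    """Record that a step has completed."""
    db.execute(
        "INSERT OR REPLACE INTO import_status VALUES (?, datetime('now'))",
        (step,),
    )


def check_prereqs(db, step):
    """Raise RuntimeError if any prerequisite of step has not finished."""
    for dep in PREREQS.get(step, []):
        row = db.execute(
            "SELECT 1 FROM import_status WHERE step = ?", (dep,)
        ).fetchone()
        if row is None:
            raise RuntimeError(f"{step} requires {dep} to finish first")
```

A task would call `check_prereqs` before doing any work and `finish` once it succeeds, so that running `loc.index` before `loc.import` fails early instead of indexing an empty table.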

Setting Up Schemas

The *-schema.sql files contain the base schemas for the data:

  • common-schema.sql — common tables
  • loc-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
  • gr-schema.sql — GoodReads data schema
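
One plausible way to load these files is to pass each one to `psql -f`. The sketch below only builds the command lines; the helper name `psql_commands` and the load order (common tables first, then the order listed above) are assumptions, not something this README specifies. `-v ON_ERROR_STOP=1` makes psql exit on the first SQL error.

```python
# Schema files as listed above; loading common-schema.sql first is an
# assumption, since the other schemas may reference the common tables.
SCHEMA_FILES = [
    "common-schema.sql",
    "loc-schema.sql",
    "ol-schema.sql",
    "viaf-schema.sql",
    "az-schema.sql",
    "bx-schema.sql",
    "gr-schema.sql",
]


def psql_commands(files=SCHEMA_FILES):
    """Build one psql invocation per schema file (hypothetical helper).

    psql takes its connection settings from the PG* environment
    variables, so no connection flags are added here.
    """
    return [["psql", "-v", "ON_ERROR_STOP=1", "-f", name] for name in files]


if __name__ == "__main__":
    for argv in psql_commands():
        print(" ".join(argv))
```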
About

This repository imports and integrates book and rating data from several sources into homogeneous tabular outputs. The import scripts are primarily Rust, with Python implementing the analyses.
