Michael Ekstrand 7131e39787
better docs and environment
5 years ago

This repository contains the code to import and integrate the book and rating data that we work with.

Requirements

  • PostgreSQL 10 or later, with the orafce extension and pg_prewarm (part of the PostgreSQL Contrib package) installed.
  • Python 3.6 or later with the following packages:
    • psycopg2
    • invoke
    • numpy
    • tqdm
    • pandas
    • numba
    • colorama
    • chromalog
    • humanize
  • A Rust compiler (available from Anaconda)
  • psql executable on the machine where the import scripts will run
  • 1 TB of disk space for the database
  • 100 GB of disk space for data files

The environment.yml file defines an Anaconda environment that contains all the required packages except for the PostgreSQL server. It can be set up with:

conda env create -f environment.yml

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
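
As a rough illustration of how a script might pick these variables up, the sketch below maps them to connection keyword arguments. The helper name `pg_params_from_env` is hypothetical and not part of this repository. Clients built on libpq, such as psycopg2, also read these variables on their own when no explicit parameters are given, so a helper like this is only needed when you want to inspect or override the values.

```python
import os

# Mapping from the PostgreSQL client environment variables to the
# keyword names commonly used for connection parameters.
ENV_TO_PARAM = {
    "PGDATABASE": "dbname",
    "PGHOST": "host",
    "PGUSER": "user",
    "PGPASSWORD": "password",
}


def pg_params_from_env(environ=None):
    """Collect connection parameters from the PG* variables that are set.

    Hypothetical helper for illustration; unset or empty variables are skipped.
    """
    if environ is None:
        environ = os.environ
    return {
        param: environ[var]
        for var, param in ENV_TO_PARAM.items()
        if environ.get(var)
    }


# Show only which parameters were found, never the password itself.
print("connection parameters found:", sorted(pg_params_from_env()))
```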

Initializing and Configuring the Database

After creating your database, initialize the extensions (as the database superuser):

CREATE EXTENSION orafce;
CREATE EXTENSION pg_prewarm;
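
If you prefer to script this step, the statements can be issued from Python. The sketch below is an assumption about how that might look, not code from this repository: `create_extensions` is a hypothetical helper that accepts any DB-API cursor (for example one obtained from psycopg2), and it uses `IF NOT EXISTS` so that re-running it on an initialized database does not fail.

```python
REQUIRED_EXTENSIONS = ("orafce", "pg_prewarm")


def extension_statements(extensions=REQUIRED_EXTENSIONS):
    """Build one CREATE EXTENSION statement per required extension."""
    # IF NOT EXISTS makes each statement safe to run more than once.
    return [f"CREATE EXTENSION IF NOT EXISTS {name};" for name in extensions]


def create_extensions(cursor, extensions=REQUIRED_EXTENSIONS):
    """Run the statements through a DB-API cursor (hypothetical helper).

    The connection behind the cursor must belong to a database superuser.
    """
    for statement in extension_statements(extensions):
        cursor.execute(statement)
```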

The default PostgreSQL performance settings are unlikely to work well for a database of this size. At a minimum, we recommend enabling parallel query execution and increasing work memory.
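
To make that recommendation concrete, the sketch below renders a few candidate `postgresql.conf` lines. The setting names (`max_parallel_workers_per_gather`, `work_mem`, `maintenance_work_mem`) are real PostgreSQL parameters, but the choice of `maintenance_work_mem` and every value shown are assumptions for illustration only; appropriate numbers depend on your hardware and should be tuned rather than copied.

```python
# Illustrative values only (assumptions); tune them for your own server.
EXAMPLE_SETTINGS = {
    "max_parallel_workers_per_gather": 4,  # workers a single query node may use
    "work_mem": "256MB",                   # memory per sort or hash operation
    "maintenance_work_mem": "1GB",         # memory for index builds and vacuum
}


def render_conf(settings):
    """Render a settings mapping as postgresql.conf lines."""
    lines = []
    for name, value in settings.items():
        # Quote string values (such as sizes with units); leave integers bare.
        rendered = f"'{value}'" if isinstance(value, str) else str(value)
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines)


print(render_conf(EXAMPLE_SETTINGS))
```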

Downloading Data Files

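This imports the following data sets:

  • Library of Congress catalog data
  • VIAF (Virtual International Authority File) data
  • OpenLibrary book data
  • Amazon ratings
  • BookCrossing ratings
  • GoodReads data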

Running Import Tasks

The import process is scripted with invoke. The first tasks to run are the import tasks:

invoke loc.import
invoke viaf.import
invoke openlib.import-authors openlib.import-works openlib.import-editions
invoke goodreads.import
invoke ratings.import-az
invoke ratings.import-bx
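
For unattended runs, the same sequence can be driven from a small Python script. This is a minimal sketch, assuming `invoke` is on the PATH and the script is started from the repository root; `invoke_commands` and `run_all` are hypothetical helpers, not part of the repository.

```python
import subprocess

# Import tasks, grouped exactly as in the command list above.
IMPORT_TASKS = [
    ["loc.import"],
    ["viaf.import"],
    ["openlib.import-authors", "openlib.import-works", "openlib.import-editions"],
    ["goodreads.import"],
    ["ratings.import-az"],
    ["ratings.import-bx"],
]


def invoke_commands(task_groups=IMPORT_TASKS):
    """Turn each group of task names into an `invoke` command line."""
    return [["invoke", *group] for group in task_groups]


def run_all(commands, run=subprocess.run):
    """Run the commands one at a time, stopping at the first failure."""
    for argv in commands:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later imports do not start after an earlier one has failed.
        run(argv, check=True)
```

Calling `run_all(invoke_commands())` would then run the imports in order. The `run` parameter exists only so the sequencing can be exercised without starting real imports.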

Once all the data is imported, you can begin to run the indexing and linking tasks:

invoke viaf.index
invoke loc.index
invoke openlib.index
invoke goodreads.index-books
invoke analyze.cluster --scope loc
invoke analyze.cluster --scope ol
invoke analyze.cluster --scope gr
invoke analyze.cluster
invoke ratings.index
invoke goodreads.index-ratings
invoke analyze.authors

The tasks keep track of the import status in an import_status table, and will keep you from running tasks in the wrong order.
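
The idea behind this ordering check can be sketched in a few lines. The code below is not the repository's implementation, and the layout of the import_status table is not described here: an in-memory set stands in for the table, and the prerequisite map is a hypothetical one that loosely follows the task order above.

```python
# Hypothetical prerequisites, loosely following the task order in this README;
# the real rules are defined by the tasks themselves.
PREREQUISITES = {
    "loc.index": ["loc.import"],
    "viaf.index": ["viaf.import"],
    "analyze.cluster": ["loc.index", "openlib.index", "goodreads.index-books"],
}


class StepOrderError(RuntimeError):
    """Raised when a step is started before its prerequisites have finished."""


def check_ready(step, finished, prerequisites=PREREQUISITES):
    """Raise StepOrderError unless every prerequisite of `step` is finished."""
    missing = [dep for dep in prerequisites.get(step, []) if dep not in finished]
    if missing:
        raise StepOrderError(f"{step} requires: {', '.join(missing)}")


def run_step(step, finished, action):
    """Check prerequisites, run the action, then record the step as finished."""
    check_ready(step, finished)
    action()
    # Recorded only after the action returns without raising.
    finished.add(step)
```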

Setting Up Schemas

The *-schema.sql files contain the base schemas for the data:

  • common-schema.sql — common tables
  • loc-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
  • gr-schema.sql — GoodReads data schema
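
If you need to load these files by hand, one option is to pass each to psql, which reads the same PG* environment variables as the import scripts. The sketch below only builds the command lines. It rests on several assumptions: that the files are loaded manually rather than by the invoke tasks, that common-schema.sql should go first, and that the files sit in the directory given by the hypothetical `schema_dir` parameter.

```python
from pathlib import Path

# Schema files named above. Loading common-schema.sql first is an assumption,
# made because the other schemas may refer to the common tables.
SCHEMA_FILES = [
    "common-schema.sql",
    "loc-schema.sql",
    "ol-schema.sql",
    "viaf-schema.sql",
    "az-schema.sql",
    "bx-schema.sql",
    "gr-schema.sql",
]


def psql_commands(schema_dir=".", files=SCHEMA_FILES):
    """Build one psql command line per schema file (hypothetical helper)."""
    return [
        # ON_ERROR_STOP makes psql exit with an error status if a statement fails.
        ["psql", "-v", "ON_ERROR_STOP=1", "-f", str(Path(schema_dir) / name)]
        for name in files
    ]


for argv in psql_commands():
    print(" ".join(argv))
```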