Michael Ekstrand 8c3af6d280
Add support for LOC MDS name data
5 years ago

README.md


This repository contains the code to import and integrate the book and rating data that we work with.

Requirements

  • PostgreSQL 10 or later with orafce
  • Python 3.6 or later with the following packages:
    • psycopg2
    • invoke
    • numpy
    • tqdm
    • pandas
    • sqlalchemy
    • numba
    • colorama
    • chromalog
    • humanize
  • A Rust compiler
  • psql executable on the machine where the import scripts will run
  • 500GB disk space for the database
  • 30GB disk space for data files

The environment-linux-x64.yml file defines an Anaconda environment that contains all the required packages, with the exception of the PostgreSQL server and client executables.
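To confirm that the Python packages listed above are importable in the active environment, a check along these lines may help. It is only a sketch: the helper name `missing_modules` is hypothetical and not part of this repository, and it assumes each package's import name matches the name in the requirements list.

```python
import importlib.util

# Import names assumed to match the package names in the requirements list.
REQUIRED = [
    "psycopg2", "invoke", "numpy", "tqdm", "pandas",
    "sqlalchemy", "numba", "colorama", "chromalog", "humanize",
]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required Python packages are importable.")
```

This does not check the PostgreSQL server, the `psql` client, or the Rust compiler, which have to be verified separately.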

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
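
As an illustration of how a script might pick up these settings, the sketch below collects the four standard variables into a dictionary. The function name `connection_params` and the mapping table are assumptions made for this example; the repository's own scripts may read the variables differently.

```python
import os

# Map the standard PostgreSQL client variables to connection keyword names.
ENV_TO_PARAM = {
    "PGDATABASE": "dbname",
    "PGHOST": "host",
    "PGUSER": "user",
    "PGPASSWORD": "password",
}


def connection_params(env=None):
    """Collect the connection settings that are set and non-empty."""
    env = os.environ if env is None else env
    return {param: env[var] for var, param in ENV_TO_PARAM.items() if env.get(var)}


if __name__ == "__main__":
    params = connection_params()
    # Mask the password so it never reaches the terminal or a log file.
    shown = {k: ("***" if k == "password" else v) for k, v in params.items()}
    print(shown)
```

The resulting keys match keyword arguments accepted by `psycopg2.connect`, so `psycopg2.connect(**connection_params())` is one way to use them. Because the PostgreSQL client library reads these variables on its own, passing them explicitly is often unnecessary.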

Downloading Data Files

This imports the following data sets:

Running Import Tasks

The import process is scripted with invoke. The first tasks to run are the import tasks:

invoke loc.import
invoke viaf.import
invoke openlib.import-authors openlib.import-works openlib.import-editions
invoke goodreads.import
invoke ratings.import-az
invoke ratings.import-bx

Once all the data is imported, you can begin to run the indexing and linking tasks:

invoke viaf.index
invoke loc.index
invoke openlib.index
invoke goodreads.index-books
invoke analyze.cluster --scope loc
invoke analyze.cluster --scope ol
invoke analyze.cluster --scope gr
invoke analyze.cluster
invoke ratings.index
invoke goodreads.index-ratings
invoke analyze.authors
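
The import and indexing commands above can also be driven from a single script that stops at the first failure. The following is a minimal sketch, not something shipped with the repository: the task lines are copied from this README, while `build_commands`, `run_all`, and the `dry_run` flag are hypothetical helpers introduced for the example.

```python
import subprocess

# Task lines copied from the README, in the order they are listed.
IMPORT_STEPS = [
    "loc.import",
    "viaf.import",
    "openlib.import-authors openlib.import-works openlib.import-editions",
    "goodreads.import",
    "ratings.import-az",
    "ratings.import-bx",
]
INDEX_STEPS = [
    "viaf.index",
    "loc.index",
    "openlib.index",
    "goodreads.index-books",
    "analyze.cluster --scope loc",
    "analyze.cluster --scope ol",
    "analyze.cluster --scope gr",
    "analyze.cluster",
    "ratings.index",
    "goodreads.index-ratings",
    "analyze.authors",
]


def build_commands(steps):
    """Turn each task line into an argument list for subprocess."""
    return [["invoke", *step.split()] for step in steps]


def run_all(commands, dry_run=True):
    """Print each command; when dry_run is False, also execute it.

    check=True raises CalledProcessError on a non-zero exit status,
    so later steps never run after a failed one.
    """
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(build_commands(IMPORT_STEPS + INDEX_STEPS), dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to review the sequence before committing to a long import.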

The tasks keep track of the import status in an import_status table, and will keep you from running tasks in the wrong order.
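
The ordering guard can be pictured with a small sketch. Everything in it is an assumption made for illustration: this README does not show the columns of `import_status`, so the table layout, the dependency map, and the helper functions are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so that the example runs on its own.

```python
import sqlite3

# Hypothetical dependency map: each step lists the steps that must finish first.
PREREQS = {
    "loc.import": [],
    "loc.index": ["loc.import"],
}


def init(conn):
    # Assumed layout; the real import_status table may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS import_status ("
        "step TEXT PRIMARY KEY, finished_at TEXT NOT NULL)"
    )


def is_done(conn, step):
    """Return True if the step has a row in import_status."""
    row = conn.execute(
        "SELECT 1 FROM import_status WHERE step = ?", (step,)
    ).fetchone()
    return row is not None


def run_step(conn, step, action):
    """Run action only if every prerequisite is recorded as finished."""
    missing = [p for p in PREREQS[step] if not is_done(conn, p)]
    if missing:
        raise RuntimeError(f"{step} requires {', '.join(missing)} first")
    action()
    conn.execute(
        "INSERT OR REPLACE INTO import_status VALUES (?, datetime('now'))",
        (step,),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init(conn)
    try:
        run_step(conn, "loc.index", lambda: None)
    except RuntimeError as err:
        print("blocked:", err)
    run_step(conn, "loc.import", lambda: None)
    run_step(conn, "loc.index", lambda: None)
    print("loc.index done:", is_done(conn, "loc.index"))
```

Recording completion in the database rather than in local files means the guard still works when tasks are run from different machines against the same server.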

Setting Up Schemas

The *-schema.sql files contain the base schemas for the data:

  • common-schema.sql — common tables
  • loc-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
  • gr-schema.sql — GoodReads data schema
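
If the schema files need to be loaded by hand, one option is to feed each one to `psql` with `ON_ERROR_STOP` enabled so that a failing statement aborts the script instead of being skipped. This is a sketch under assumptions: the README does not state a load order, so putting `common-schema.sql` first is a guess, and `psql_command` and `load_schemas` are hypothetical helpers.

```python
import subprocess

# Schema files named in the README. Loading common-schema.sql first is an
# assumption; the README does not state the required order.
SCHEMA_FILES = [
    "common-schema.sql",
    "loc-schema.sql",
    "ol-schema.sql",
    "viaf-schema.sql",
    "az-schema.sql",
    "bx-schema.sql",
    "gr-schema.sql",
]


def psql_command(sql_file):
    """Build a psql call that aborts the script at the first SQL error."""
    return ["psql", "-v", "ON_ERROR_STOP=1", "-f", sql_file]


def load_schemas(files, dry_run=True):
    """Print each psql command; when dry_run is False, also execute it."""
    for sql_file in files:
        cmd = psql_command(sql_file)
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    load_schemas(SCHEMA_FILES, dry_run=True)
```

No connection options are passed because `psql` reads `PGDATABASE`, `PGHOST`, `PGUSER`, and `PGPASSWORD` from the environment.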

About

The code imports and integrates data from several sources into homogeneous tabular outputs; the import scripts are primarily Rust, with Python implementing the analyses.
