Latest commit: Michael Ekstrand, b75c458a60, "Refactor JSON cleaning" (5 years ago)
README.md

This repository contains the code to import and integrate the book and rating data that we work with.

Requirements

  • PostgreSQL 10 with the orafce extension
  • Node.js (tested on Carbon, the 8.x LTS line)
  • R with the Tidyverse and RPostgreSQL
  • psql executable on the machine where the import scripts will run
  • 300GB disk space for the database
  • 20-30GB disk for data files

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
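
As a minimal sketch of setting these up before an import, the snippet below exports the four variables and checks that none is empty. The values and the `check_pg_env` helper are hypothetical and not part of this repository; substitute your own connection details.

```shell
# Hypothetical values; replace them with your own connection details.
export PGDATABASE=bookdata
export PGHOST=localhost
export PGUSER=bookuser
export PGPASSWORD=change-me

# Hypothetical helper: report any connection variable that is unset or empty.
check_pg_env() {
    missing=""
    for var in PGDATABASE PGHOST PGUSER PGPASSWORD; do
        # Indirect lookup of the variable named in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing" >&2
        return 1
    fi
}

check_pg_env
```

Because psql and the other PostgreSQL clients read these variables themselves, no connection flags need to be passed on the command lines in the rest of this document.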

Setting Up Schemas

The *-schema.sql files contain the base schemas for the data to import:

  • common-schema.sql — common tables
  • loc-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
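
This README does not say how the schema files are loaded. One possible approach, sketched here, is to run each file through psql in turn. The `load_schemas` function is hypothetical, and running common-schema.sql first is an assumption that the other schemas depend on the common tables.

```shell
# Schema files from the list above. common-schema.sql is placed first on the
# assumption (not confirmed by the source) that the others depend on it.
SCHEMAS="common-schema.sql loc-schema.sql ol-schema.sql viaf-schema.sql az-schema.sql bx-schema.sql"

# Hypothetical helper: load each schema file, stopping at the first failure.
load_schemas() {
    # $1 is the command used to run each file; it defaults to psql.
    runner="${1:-psql}"
    for schema in $SCHEMAS; do
        echo "loading $schema"
        # ON_ERROR_STOP makes psql exit non-zero on the first SQL error.
        $runner -v ON_ERROR_STOP=1 -f "$schema" || return 1
    done
}
```

Calling `load_schemas` with no argument uses psql and the connection variables described earlier; passing another command name (for example `echo`) gives a dry run that only prints what would be executed.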

Importing Data

The importer is run with Gulp.

npm install
npx gulp importOpenLib
npx gulp importLOC
npx gulp importVIAF
npx gulp importBX
npx gulp importAmazon

The full import takes 1–3 days.

Indexing and Integrating

Start tying the data together:

psql <viaf-index.sql
psql <loc-index.sql
psql <ol-index.sql

Clustering is done by the ClusterISBNs.r script:

Rscript ClusterISBNs.r
psql <load-clusters.sql

With the clusters in place, we're ready to index the rating data:

psql <az-index.sql
psql <bx-index.sql

And finally, compute author information for ISBN clusters:

psql <author-info.sql
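
Since each stage depends on the one before it, the whole sequence can be wrapped so that a failure stops the run early. The sketch below is one way to do that; `run_integration` is a hypothetical helper, and it uses `psql -f` with `ON_ERROR_STOP` in place of the input redirection shown above so that SQL errors produce a non-zero exit status.

```shell
# Hypothetical helper: run the indexing and integration steps in the order
# given above, returning non-zero as soon as any step fails.
run_integration() {
    # $1: command for SQL files (default psql); $2: command for R scripts.
    sql="${1:-psql}"
    rscript="${2:-Rscript}"

    # Index the bibliographic sources first.
    for f in viaf-index.sql loc-index.sql ol-index.sql; do
        $sql -v ON_ERROR_STOP=1 -f "$f" || return 1
    done

    # Cluster ISBNs, then load the resulting clusters.
    $rscript ClusterISBNs.r || return 1
    $sql -v ON_ERROR_STOP=1 -f load-clusters.sql || return 1

    # Index the rating data, then compute author information.
    for f in az-index.sql bx-index.sql author-info.sql; do
        $sql -v ON_ERROR_STOP=1 -f "$f" || return 1
    done
}
```

Run it as `run_integration` from the directory containing the scripts, with the PostgreSQL connection variables already exported.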

About

The project imports and integrates data from several sources into homogeneous tabular outputs; the import scripts are primarily Rust, with Python used to implement analyses.
