
This repository contains the code to import and integrate the book and rating data that we work with.

Requirements

  • PostgreSQL 10 (9.x may also work)
  • Node.js (tested on Carbon, the 8.x LTS line)
  • R with the Tidyverse and RPostgreSQL
  • psql executable on the machine where the import scripts will run
  • 300GB disk space for the database
  • 20-30GB disk for data files

All scripts read database connection info from the standard PostgreSQL client environment variables:

  • PGDATABASE
  • PGHOST
  • PGUSER
  • PGPASSWORD
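
For example, a session might export these variables before running any of the scripts. All of the values below are placeholders; substitute your own database name, host, and credentials:

```shell
# Placeholder connection settings; replace with your own values.
export PGDATABASE=bookdata
export PGHOST=localhost
export PGUSER=books
export PGPASSWORD=secret   # a ~/.pgpass file is an alternative to this env var

# Sanity-check the connection before starting the import:
# psql -c 'SELECT 1;'
```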

Setting Up Schemas

The *-schema.sql files contain the base schemas for the data to import:

  • common-schema.sql — common tables
  • loc-schema.sql — Library of Congress catalog tables
  • ol-schema.sql — OpenLibrary book data
  • viaf-schema.sql — VIAF tables
  • az-schema.sql — Amazon rating schema
  • bx-schema.sql — BookCrossing rating data schema
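
Assuming psql picks up the standard connection variables, the schema files can be loaded directly. The order below is a sketch, not prescribed by the repository; common-schema.sql is loaded first on the assumption that it defines tables the other schemas build on:

```shell
# Load the base schemas; common-schema.sql first, since the
# source-specific schemas may reference its shared tables.
psql -f common-schema.sql
psql -f loc-schema.sql
psql -f ol-schema.sql
psql -f viaf-schema.sql
psql -f az-schema.sql
psql -f bx-schema.sql
```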

Importing Data

The importer is run with Gulp.

npm install
npx gulp importOpenLib
npx gulp importLOC
npx gulp importVIAF
npx gulp importBX
npx gulp importAmazon

The full import takes 1–3 days.

Indexing and Integrating

Start tying the data together:

psql <viaf-index.sql
psql <loc-index.sql
psql <ol-index.sql

Clustering is done by the ClusterISBNs.r script:

Rscript ClusterISBNs.r
psql <load-clusters.sql

With the clusters in place, we're ready to index the rating data:

psql <az-index.sql
psql <bx-index.sql

And finally, compute author information for ISBN clusters:

psql <author-info.sql
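
Taken together, the steps above can be wrapped in a single driver script. This is only a sketch, not part of the repository: it assumes the PG* variables are already set, that there is enough disk space, and that the gulp tasks can be listed in one invocation.

```shell
#!/bin/sh
# Sketch of an end-to-end run; stops at the first failing step.
set -e

# Import the raw data (takes 1-3 days)
npm install
npx gulp importOpenLib importLOC importVIAF importBX importAmazon

# Index the source catalogs
psql <viaf-index.sql
psql <loc-index.sql
psql <ol-index.sql

# Cluster ISBNs and load the results
Rscript ClusterISBNs.r
psql <load-clusters.sql

# Index the rating data against the clusters
psql <az-index.sql
psql <bx-index.sql

# Compute author information for the ISBN clusters
psql <author-info.sql
```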