openlib.md 2.6 KB


title: OpenLibrary parent: Data Model

nav_order: 3

OpenLibrary

We also source book data from OpenLibrary, as downloaded from their developer dumps.

The DVC control files automatically download the appropriate version. The version can be updated by modifying the data/ol_dump_*.txt.gz.dvc files.

Imported data lives in the ol schema.

Import Steps

The import is controlled by the following DVC steps:

schemas/ol-schema.dvc : Run ol-schema.sql to set up the base schema.

import/ol-works.dvc : Import raw OpenLibrary works from data/ol_dump_works.txt.gz.

import/ol-editions.dvc : Import raw OpenLibrary editions from data/ol_dump_editions.txt.gz.

import/ol-authors.dvc : Import raw OpenLibrary authors from data/ol_dump_authors.txt.gz.

index/ol-index.dvc : Run ol-index.sql to index the book data and extract tables.

index/ol-book-info.dvc : Run ol-book-info.sql to extract additional book data into tables.

Raw Data

OpenLibrary provides its data as JSON. It is imported as-is into a JSONB column in three tables:

  • ol.author
  • ol.work
  • ol.edition

Each of these has the following columns:

type_id : A numeric record identifier generated at import.

type_key : The OpenLibrary identifier key (e.g. /books/3180A3).

type_data : The raw JSON data containing the record.

We use PostgreSQL's JSON operators and functions to extract the data from these tables for the rest of the OpenLibrary data model.

Extracted Edition Tables

We extract the following tables from OpenLibrary editions:

edition_author : Links edition and author to record an edition's authors.

edition_first_author : Links edition and author to record an edition's first author.

edition_work : Links each edition to its work(s)

edition_isbn : The raw ISBNs for each edition (not ISBN IDs)

isbn_link : Link ISBNs, editions, and works, along with the book code derived from an edition's

work and edition IDs.  If an edition belongs to multiple works, it will appear multiple
times here.  This table violates 4NF.

Extracted Work Tables

We extract the following tables from OpenLibrary works:

work_author : Links work and author to record an work's authors.

work_first_author : Links work and author to record an work's first author.

work_subject : The subjects entries for each work.

Extracted Author Tables

author_name : The names for each author. An author may have more than one listed name; this extracts

all of them.
Tip!

Press p or to see the previous file or, n or to see the next file