title: GoodReads
parent: Data Model
We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:
We do not yet support reviews.
If you use this data, cite the paper(s) documented on the data set web site.
Imported data lives in the
The import is controlled by the following DVC steps:
gr-schema.sql to set up the base schema.
: Import raw GoodReads data from files under
gr-index-books.sql to index the book data and extract identifiers.
gr-book-info.sql to extract additional book and work metadata.
gr-index-ratings.sql to index the rating and interaction data.
The raw rating data, with invalid characters cleaned up, is in the various
Each table has the following columns:
`gr_type_rid`
: Numeric record identifier generated at import time. Throughout this page, we refer to these as record identifiers; they are distinct from the identifiers GoodReads uses for books and works, as those are not known until the JSON is unpacked.
JSONB column containing imported data.
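
To illustrate this layout, the following is a minimal sketch of pairing each raw JSON record with an import-time record identifier. It is not the actual importer: the helper name `load_raw_records`, the one-JSON-object-per-line input, and numbering from 1 in input order are all assumptions made for illustration.

```python
import json
from typing import Iterable, Iterator, Tuple


def load_raw_records(lines: Iterable[str]) -> Iterator[Tuple[int, dict]]:
    """Yield (record_id, data) pairs for each non-empty JSON line.

    Hypothetical helper: the real import assigns record identifiers in the
    database; numbering from 1 in input order is an assumption.
    """
    rid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rid += 1
        yield rid, json.loads(line)


# Made-up sample records, not real GoodReads data.
sample = [
    '{"book_id": "101", "work_id": "55"}',
    '{"book_id": "102", "work_id": "55"}',
]
records = list(load_raw_records(sample))
```

The record identifier in each pair is independent of the GoodReads `book_id` stored inside the JSON, which only becomes available once the JSON is unpacked.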
We extract the following tables for book and work data:
: GoodReads work identifiers.
: GoodReads book identifiers. This maps each GoodReads book record identifier to the following identifiers:
- book ID
- work ID
- ASIN
- ISBN 10 (`gr_isbn`)
- ISBN 13 (`gr_isbn13`)

This table extracts the *textual* versions of ISBNs and ASINs directly from the `raw_book` table. It does not resolve them to ISBN IDs.
: Map GoodReads books to ISBN IDs and book codes. This uses only ISBN-10s and ISBN-13s, not ASINs.
: Genre membership (and scores) for each book. This is a direct extract of the book genres file from UCSD.
: The title of each work.
: The publication date of each book. It extracts the year, month, and day from the `publication_*` fields in the book JSON data; if all three are present, then `pub_date` contains the date as an SQL date.
: The original publication date of each work. Extracted like
book_pub_date, but from a work's
: The book cluster each book is a member of.
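
The publication-date extraction described above can be sketched in Python as follows. This is an illustration rather than the SQL used by the import: the function name `extract_pub_date` is hypothetical, and the field names `publication_year`, `publication_month`, and `publication_day` are assumed from the `publication_*` pattern. Treating empty strings as missing and dropping impossible dates are also assumptions.

```python
from datetime import date
from typing import Optional, Tuple


def _to_int(value) -> Optional[int]:
    """Convert a raw JSON field to int; missing, empty, or bad values become None."""
    if value is None or str(value).strip() == "":
        return None
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def extract_pub_date(
    book: dict,
) -> Tuple[Optional[int], Optional[int], Optional[int], Optional[date]]:
    """Return (year, month, day, pub_date) for one book record.

    pub_date is only set when all three parts are present and form a valid date.
    """
    year = _to_int(book.get("publication_year"))
    month = _to_int(book.get("publication_month"))
    day = _to_int(book.get("publication_day"))
    pub_date = None
    if year is not None and month is not None and day is not None:
        try:
            pub_date = date(year, month, day)
        except ValueError:
            pub_date = None  # impossible date: keep the parts, drop the date
    return year, month, day, pub_date


full = extract_pub_date(
    {"publication_year": "2005", "publication_month": "7", "publication_day": "16"}
)
partial = extract_pub_date(
    {"publication_year": "2005", "publication_month": "", "publication_day": ""}
)
```

Here `full` carries a complete date, while `partial` keeps the year but leaves the month, day, and date empty.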
We extract the following tables for book ratings and interactions (add-to-shelf actions):
: Mapping between user record IDs and GoodReads user IDs.
: Extract of basic information about each entry in the Interactions file. Each interaction represents an add-to-shelf action, optionally with a rating. We extract the following:

    `gr_interaction_rid`
    : The interaction record identifier (PK)

    `gr_book_id`
    : GoodReads book ID

    `gr_user_rid`
    : User record identifier (we use record IDs instead of user IDs to keep them numeric)

    `rating`
    : The 5-star rating value (if provided)

    `is_read`
    : `isRead` flag from original JSON data.

    `date_added`
    : The date the book was added to the shelf.

    `date_updated`
    : The update date for this interaction.
: Rating table suitable for use in LensKit. This is aggregated
by book cluster, and contains both the median rating and the last rating, along with the median update date as the timestamp.
: Add-action table suitable for use in LensKit. Also aggregated by book cluster,
with the first and last (update) date as the timestamps, and number of interactions with this book.
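
The two cluster-level aggregations above can be sketched in plain Python. This is an illustration under stated assumptions, not the SQL the import runs: it assumes interactions are grouped per (user, cluster) pair, that a missing rating is represented as `None`, that "last" means the most recent update date, and that the median timestamp is taken over rated interactions only. The function name `aggregate` and the output keys are hypothetical.

```python
from collections import defaultdict
from statistics import median

# (user, cluster, rating, date_updated as a Unix timestamp); made-up sample data.
interactions = [
    (1, 10, 4, 100),
    (1, 10, 2, 300),
    (1, 10, 5, 200),
    (1, 11, None, 150),  # shelved without a rating
    (2, 10, 3, 400),
]


def aggregate(rows):
    """Build rating and add-action summaries keyed by (user, cluster)."""
    groups = defaultdict(list)
    for user, cluster, rating, updated in rows:
        groups[(user, cluster)].append((updated, rating))

    ratings = {}
    actions = {}
    for key, entries in groups.items():
        entries.sort(key=lambda e: e[0])  # order by update time
        times = [t for t, _ in entries]
        # Add-action summary: first and last update time, plus interaction count.
        actions[key] = {
            "first_time": times[0],
            "last_time": times[-1],
            "nactions": len(entries),
        }
        # Rating summary: only interactions that actually carry a rating.
        rated = [(t, r) for t, r in entries if r is not None]
        if rated:
            ratings[key] = {
                "rating": median(r for _, r in rated),
                "last_rating": rated[-1][1],
                "timestamp": median(t for t, _ in rated),
            }
    return ratings, actions


ratings, actions = aggregate(interactions)
```

With this sample data, user 1's three interactions with cluster 10 collapse into a single rating row and a single add-action row, while the unrated interaction with cluster 11 appears only in the add-action output.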