loc.md 3.3 KB

title: Library of Congress

Library of Congress

One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.

We download and import the XML versions of these files.

Imported data lives under the locmds schema.

Data Model Diagram

LOC data model

Import Steps

The import is controlled by the following DVC steps:

schemas/loc-mds-schema.dvc : Run loc-mds-schema.sql to set up the base schema.

import/loc-mds-books.dvc : Import raw MARC data from data/loc-books/.

import/loc-mds-extract-isbns.dvc : Parse ISBNs from LOC ISBN records.

index/loc-mds-index-books.dvc : Run loc-mds-index-books.sql to index the book data and extract tables.

index/loc-mds-book-info.dvc : Run loc-mds-book-info.sql to extract additional book data into tables.

Raw Book Data

The locmds.book_marc_fields table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.

It has the following columns:

rec_id : The record identifier (generated at import)

fld_no : The field number. This corresponds to a single MARC field entry; rows in this table

containing data from MARC subfields will share a `fld_no` with their containing field.

tag : The MARC tag; either a three-digit number, or LDR for the MARC leader.

ind1, ind2 : MARC indicators. Their meanings are defined in the MARC specification.

sf_code : MARC subfield code.

contents : The raw textual content of the MARC field or subfield.

Extracted Book Tables

We then extract a number of tables and views from this MARC data. These tables include:

book_record_info : Code information for each book record.

- MARC Control Number
- Library of Congress Control Number (LCCN)
- Record status
- Record type
- Bibliographic level

More information about the last three is in the [leader specification](https://www.loc.gov/marc/bibliographic/bdleader.html).

book : A subset of book_record_info intended to capture the actual books in the collection,

as opposed to other types of materials.  We consider a book to be anything that has MARC
record type ‘a’ or ‘t’ (language material), and is not also classified as a government
record in MARC field 008.

book_extracted_isbn : Textual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are

quite messy; the Rust program `parse-isbns` parses out ISBNs, along with additional tags or
descriptors, from the ISBN strings using a number of best-effort heuristics.  This table contains
the results of that process.

book_rec_isbn : Map book records to their ISBNs.

book_author_name : Author names for book records. This only extracts the primary author name (MARC field 100

subfield ‘a’).

book_pub_year : Book publication year (MARC field 260 subfield ‘c’).

book_title : Book title (MARC field 245 subfield ‘a’).


Press p or to see the previous file or, n or to see the next file