[
  {
    "objectID": "papers.html",
    "href": "papers.html",
    "title": "Papers Using BookData",
    "section": "",
    "text": "Papers Using BookData\nThese are papers we know to be using this book data integration. If you use this data in published research, please cite our paper."
  },
  {
    "objectID": "implementation/index.html",
    "href": "implementation/index.html",
    "title": "Implementation",
    "section": "",
    "text": "These data and integration tools are designed to support several goals:\n\nComplete end-to-end reproducibility with a single command (dvc repro)\nSelf-documenting import stage dependencies\nAutomatically re-run downstream steps when a data file or integration logic changes\nSupport updates (e.g. new OpenLibrary dumps) by replacing the file and re-running\nEfficient import and integration\n\n\n\nThese goals are realized through a few technology and design decisions:\n\nScript all import steps with a tool that can track stage dependencies and check whether a stage is up-to-date (DVC).\nMake individual import stages self-contained and limited.\nExtract data from raw sources into tabular form, then integrate as a separate step.\nWhen feasible and performant, implement integration and processing steps with straightforward data join operations.\n\n\n\n\n\nAdd the new data file(s), if necessary, to data, and update the documentation to describe how to download them.\nImplement a scan stage to process the raw imported data into tabular form. The code can be written in either Rust or Python, depending on performance needs.\nIf necessary, add the inputs to the ISBN collection (under book-links) and clustering to connect it with the rest of the code.\nImplement stages to integrate the data with the rest of the tools. Again, this code can be in Rust or Python. We usually use Polars (either the Rust or the Python API) to efficiently process large data files.\n\nSee the Pipeline DSL for information about how to update the pipeline."
  },
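A minimal sketch of the "scan stage" pattern this page describes — read a raw source file and emit a tabular Parquet file for downstream integration. The file names and columns here are illustrative assumptions, not the repository's actual stage inputs:

```python
# Hypothetical scan stage: raw CSV in, typed Parquet table out.
# Paths and column names are assumptions for illustration only.
import polars as pl

def scan_ratings(src: str, dst: str) -> None:
    # Lazily scan the raw file, normalize column types, write Parquet.
    raw = pl.scan_csv(src)
    tbl = raw.select(
        pl.col("user").cast(pl.Int32),
        pl.col("isbn").cast(pl.Utf8),
        pl.col("rating").cast(pl.Float32),
    )
    tbl.sink_parquet(dst)

scan_ratings("data/example/ratings.csv", "example/ratings.parquet")
```

Keeping each scan stage this small is what makes the stages self-contained: integration happens later, as joins over the extracted tables.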
  {
    "objectID": "implementation/index.html#implementation-principles",
    "href": "implementation/index.html#implementation-principles",
    "title": "Implementation",
    "section": "",
    "text": "These goals are realized through a few technology and design decisions:\n\nScript all import steps with a tool that can track stage dependencies and check whether a stage is up-to-date (DVC).\nMake individual import stages self-contained and limited.\nExtract data from raw sources into tabular form, then integrate as a separate step.\nWhen feasible and performant, implement integration and processing steps with straightforward data join operations."
  },
  {
    "objectID": "implementation/index.html#adding-or-modifying-data",
    "href": "implementation/index.html#adding-or-modifying-data",
    "title": "Implementation",
    "section": "",
    "text": "Add the new data file(s), if necessary, to data, and update the documentation to describe how to download them.\nImplement a scan stage to process the raw imported data into tabular form. The code can be written in either Rust or Python, depending on performance needs.\nIf necessary, add the inputs to the ISBN collection (under book-links) and clustering to connect it with the rest of the code.\nImplement stages to integrate the data with the rest of the tools. Again, this code can be in Rust or Python. We usually use Polars (either the Rust or the Python API) to efficiently process large data files.\n\nSee the Pipeline DSL for information about how to update the pipeline."
  },
  {
    "objectID": "implementation/pipeline.html",
    "href": "implementation/pipeline.html",
    "title": "Pipeline Specification",
    "section": "",
    "text": "Data Version Control is a great tool, but its pipelines are static YAML files with limited configurability, and substantial redundancy. That redundancy makes updates error-prone, and also limits our ability to do things such as enable and disable data sets, and reconfigure which version of the GoodReads interaction files we want to use.\n\n\nHowever, these YAML files are relatively easy to generate, so it’s feasible to generate them with scripts or templates. We use jsonnet, a programming language for generating JSON and similar configuration structures that allows us to generate the pipeline with loops, conditionals, etc. The pipeline’s primary sources are in the dvc.jsonnet files, which we render to produce dvc.yaml.\nA Python script renders the pipelines to YAML using the Python jsonnet bindings. You can run this with:\n./update-pipeline.py\nThe lib.jsonnet file provides helper routines for generating pipelines:\n\npipeline produces a DVC pipeline given a record of stages.\ncmd takes a book data command (that would be passed to the book data executable) and adds the relevant bits to run it through Cargo (so the import software is automatically recompiled if necessary).\n\n\n\n\nThe pipeline can be configured through the config.yaml file. We keep this file, along with the generated pipeline, committed to git; if you change it, we recommend working in a branch. After changing the file, you need to regenerate the pipeline with update-pipeline.py for changes to take effect.\nSee the comments in that file for details. Right now, two things can be configured:\n\nWhich sources of book rating and interaction data are used.\nWhether to use full review data."
  },
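The core of the rendering step described above can be sketched with the Python jsonnet bindings; the real update-pipeline.py likely does more (for example, threading config.yaml values into the jsonnet evaluation), so treat this as a minimal model, not the script itself:

```python
# Minimal sketch of rendering dvc.jsonnet to dvc.yaml via the
# Python jsonnet bindings. The actual update-pipeline.py may differ.
import json
import yaml
import _jsonnet  # from the "jsonnet" PyPI package

def render_pipeline(src: str = "dvc.jsonnet", dst: str = "dvc.yaml") -> None:
    # Evaluate the jsonnet program to a JSON string, then re-serialize as YAML.
    doc = json.loads(_jsonnet.evaluate_file(src))
    with open(dst, "w") as f:
        yaml.safe_dump(doc, f, sort_keys=False)

render_pipeline()
```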
  {
    "objectID": "implementation/pipeline.html#render",
    "href": "implementation/pipeline.html#render",
    "title": "Pipeline Specification",
    "section": "",
    "text": "However, these YAML files are relatively easy to generate, so it’s feasible to generate them with scripts or templates. We use jsonnet, a programming language for generating JSON and similar configuration structures that allows us to generate the pipeline with loops, conditionals, etc. The pipeline’s primary sources are in the dvc.jsonnet files, which we render to produce dvc.yaml.\nA Python script renders the pipelines to YAML using the Python jsonnet bindings. You can run this with:\n./update-pipeline.py\nThe lib.jsonnet file provides helper routines for generating pipelines:\n\npipeline produces a DVC pipeline given a record of stages.\ncmd takes a book data command (that would be passed to the book data executable) and adds the relevant bits to run it through Cargo (so the import software is automatically recompiled if necessary)."
  },
  {
    "objectID": "implementation/pipeline.html#config",
    "href": "implementation/pipeline.html#config",
    "title": "Pipeline Specification",
    "section": "",
    "text": "The pipeline can be configured through the config.yaml file. We keep this file, along with the generated pipeline, committed to git; if you change it, we recommend working in a branch. After changing the file, you need to regenerate the pipeline with update-pipeline.py for changes to take effect.\nSee the comments in that file for details. Right now, two things can be configured:\n\nWhich sources of book rating and interaction data are used.\nWhether to use full review data."
  },
  {
    "objectID": "reports/index.html",
    "href": "reports/index.html",
    "title": "Reports and Audits",
    "section": "",
    "text": "We provide several notebooks that describe aspects of the data set and its evolution.\n\n\nThese notebooks report the current status of the data.\n\n\n\nThese notebooks describe how the data has changed from version to version, to detect regressions."
  },
  {
    "objectID": "reports/index.html#current-status",
    "href": "reports/index.html#current-status",
    "title": "Reports and Audits",
    "section": "",
    "text": "These notebooks report the current status of the data."
  },
  {
    "objectID": "reports/index.html#change-audits",
    "href": "reports/index.html#change-audits",
    "title": "Reports and Audits",
    "section": "",
    "text": "These notebooks describe how the data has changed from version to version, to detect regressions."
  },
  {
    "objectID": "reports/audit-gender-changes.html",
    "href": "reports/audit-gender-changes.html",
    "title": "Cluster Gender Changes",
    "section": "",
    "text": "This notebook audits for significant changes in cluster gender annotations, to allow us to detect the significance of shifts over time. It depends on the aligned cluster identities in isbn-version-clusters.parquet.\nfrom pathlib import Path\nfrom functools import reduce\nimport pandas as pd\nimport polars as pl\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns"
  },
  {
    "objectID": "reports/audit-gender-changes.html#load-data",
    "href": "reports/audit-gender-changes.html#load-data",
    "title": "Cluster Gender Changes",
    "section": "Load Data",
    "text": "Load Data\nDefine the versions we care about:\n\nversions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', '2023-07', 'current']\n\nLoad the aligned ISBNs:\n\nisbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')\nisbn_clusters.info()\n\n<class 'pandas.core.frame.DataFrame'>\nRangeIndex: 43027360 entries, 0 to 43027359\nData columns (total 9 columns):\n # Column Dtype \n--- ------ ----- \n 0 isbn object \n 1 isbn_id int32 \n 2 current float64\n 3 2023-07 float64\n 4 2022-11-2.1 float64\n 5 2022-10 float64\n 6 2022-07 float64\n 7 2022-03-2.0 float64\n 8 pgsql float64\ndtypes: float64(7), int32(1), object(1)\nmemory usage: 2.7+ GB"
  },
  {
    "objectID": "reports/audit-gender-changes.html#different-genders",
    "href": "reports/audit-gender-changes.html#different-genders",
    "title": "Cluster Gender Changes",
    "section": "Different Genders",
    "text": "Different Genders\nHow many clusters changed gender?\nTo get started, we need a list of genders in order.\n\ngenders = [\n 'ambiguous', 'female', 'male', 'unknown',\n 'no-author-rec', 'no-book-author', 'no-book', 'absent'\n]\n\nLet’s make a function to read gender info:\n\ndef read_gender(path, map_file=None):\n cg = pl.scan_parquet(path)\n cg = cg.select([\n pl.col('cluster').cast(pl.Int32),\n pl.when(pl.col('gender') == 'no-loc-author')\n .then('no-book-author')\n .when(pl.col('gender') == 'no-viaf-author')\n .then('no-author-rec')\n .otherwise(pl.col('gender'))\n .cast(pl.Categorical)\n .alias('gender')\n ])\n if map_file is not None:\n map = pl.scan_parquet(map_file)\n cg = cg.join(map, on='cluster', how='left')\n cg = cg.select([\n pl.col('common').alias('cluster'),\n pl.col('gender')\n ])\n return cg\n\nRead each data source’s gender info and map to common cluster IDs:\n\ngender_cc = {\n v: read_gender(f'{v}/cluster-genders.parquet', f'{v}/cluster-map.parquet')\n for v in versions if v != 'current'\n}\ngender_cc['current'] = read_gender('../book-links/cluster-genders.parquet')\n\n/tmp/ipykernel_69125/183506089.py:6: DeprecationWarning: in a future version, string input will be parsed as a column name rather than a string literal. To silence this warning, pass the input as an expression instead: `pl.lit('no-book-author')`\n .then('no-book-author')\n/tmp/ipykernel_69125/183506089.py:8: DeprecationWarning: in a future version, string input will be parsed as a column name rather than a string literal. To silence this warning, pass the input as an expression instead: `pl.lit('no-author-rec')`\n .then('no-author-rec')\n\n\nSet up a sequence of frames for merging:\n\nto_merge = [\n gender_cc[v].select([\n pl.col('cluster'),\n pl.col('gender').alias(v)\n ]).unique()\n for v in versions\n]\n\nMerge and collect results:\n\ncluster_genders = reduce(lambda df1, df2: df1.join(df2, on='cluster', how='outer'), to_merge)\ncluster_genders = cluster_genders.collect()\n\nFor unclear reasons, a few versions have a null cluster. Drop that.\n\ncluster_genders = cluster_genders.filter(cluster_genders['cluster'].is_not_null())\n\nNow we will convert to Pandas and fix missing values:\n\ncluster_genders = cluster_genders.to_pandas().set_index('cluster')\n\nNow we’ll unify the categories and their orders:\n\ncluster_genders = cluster_genders.apply(lambda vdf: vdf.cat.set_categories(genders, ordered=True))\ncluster_genders.fillna('absent', inplace=True)\ncluster_genders.head()\n\n\n\n\n\n\n\n\npgsql\n2022-03-2.0\n2022-07\n2022-10\n2022-11-2.1\n2023-07\ncurrent\n\n\ncluster\n\n\n\n\n\n\n\n\n\n\n\n416243397\nabsent\nabsent\nabsent\nabsent\nabsent\nabsent\nno-book-author\n\n\n410767599\nabsent\nabsent\nabsent\nabsent\nabsent\nabsent\nno-book-author\n\n\n421374693\nabsent\nabsent\nabsent\nabsent\nabsent\nabsent\nno-book-author\n\n\n449455849\nabsent\nabsent\nabsent\nabsent\nabsent\nabsent\nno-book-author\n\n\n415350734\nabsent\nabsent\nabsent\nabsent\nabsent\nabsent\nno-book-author\n\n\n\n\n\n\n\nLet’s save this file for further analysis:\n\ncluster_genders.to_parquet('cluster-version-genders.parquet', compression='zstd')"
  },
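The DeprecationWarning captured in the notebook output above comes from passing bare strings to `.then()`, which newer Polars versions parse as column names rather than literals. A sketch of the warning-free form of the same remapping expression, wrapping each literal in `pl.lit`:

```python
# Warning-free form of the gender remapping from read_gender():
# string literals passed to .then() must be wrapped in pl.lit().
import polars as pl

gender_expr = (
    pl.when(pl.col("gender") == "no-loc-author")
    .then(pl.lit("no-book-author"))
    .when(pl.col("gender") == "no-viaf-author")
    .then(pl.lit("no-author-rec"))
    .otherwise(pl.col("gender"))
    .cast(pl.Categorical)
    .alias("gender")
)
```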
  {
    "objectID": "reports/audit-gender-changes.html#postgresql-to-current",
    "href": "reports/audit-gender-changes.html#postgresql-to-current",
    "title": "Cluster Gender Changes",
    "section": "PostgreSQL to Current",
    "text": "PostgreSQL to Current\nNow we are ready to actually compare cluster genders across categories. Let’s start by comparing original data (PostgreSQL) to current:\n\nct = cluster_genders[['pgsql', 'current']].value_counts().unstack()\nct = ct.reindex(labels=genders, columns=genders)\nct\n\n\n\n\n\n\n\ncurrent\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\npgsql\n\n\n\n\n\n\n\n\n\n\n\n\nambiguous\n98387.0\n4331.0\n11287.0\n1856.0\n996.0\n2773.0\n4.0\n4574.0\n\n\nfemale\n16360.0\n1120232.0\n978.0\n12941.0\n9439.0\n19.0\n29.0\n34113.0\n\n\nmale\n28527.0\n2690.0\n3493131.0\n18109.0\n31095.0\n688.0\n152.0\n70824.0\n\n\nunknown\n3004.0\n102929.0\n215359.0\n1545706.0\n19026.0\n15.0\n12.0\n14324.0\n\n\nno-author-rec\n10533.0\n58486.0\n330352.0\n226923.0\n1395181.0\n436.0\n125.0\n13658.0\n\n\nno-book-author\n8356.0\n114884.0\n219279.0\n125984.0\n211482.0\n2457210.0\n903525.0\n273353.0\n\n\nno-book\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\nNaN\n\n\nabsent\n121068.0\n1022710.0\n2439415.0\n1026095.0\n4539059.0\n17046837.0\n734647.0\n65883.0\n\n\n\n\n\n\n\n\nctf = ct.divide(ct.sum(axis='columns'), axis='rows')\ndef style_row(row):\n styles = []\n for col, val in zip(row.index, row.values):\n if col == row.name:\n styles.append('font-weight: bold')\n elif val > 0.1:\n styles.append('color: red')\n else:\n styles.append(None)\n return styles\nctf.style.apply(style_row, 'columns')\n\n\n\n\n\n\ncurrent\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\npgsql\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.792115\n0.034869\n0.090872\n0.014943\n0.008019\n0.022325\n0.000032\n0.036825\n\n\nfemale\n0.013701\n0.938131\n0.000819\n0.010837\n0.007905\n0.000016\n0.000024\n0.028568\n\n\nmale\n0.007826\n0.000738\n0.958278\n0.004968\n0.008530\n0.000189\n0.000042\n0.019429\n\n\nunknown\n0.001581\n0.054162\n0.113324\n0.813369\n0.010012\n0.000008\n0.000006\n0.007537\n\n\nno-author-rec\n0.005174\n0.028730\n0.162280\n0.111472\n0.685359\n0.000214\n0.000061\n0.006709\n\n\nno-book-author\n0.001937\n0.026630\n0.050829\n0.029203\n0.049021\n0.569580\n0.209437\n0.063363\n\n\nno-book\nnan\nnan\nnan\nnan\nnan\nnan\nnan\nnan\n\n\nabsent\n0.004485\n0.037884\n0.090363\n0.038010\n0.168140\n0.631465\n0.027213\n0.002440\n\n\n\n\n\nMost of the change is coming from clusters absent in the original but present in the new.\nThere are also quite a few that had no book author in PGSQL, but no book in the current data - not sure what’s up with that. Let’s look at more crosstabs.\n\ndef gender_crosstab(old, new, fractional=True):\n ct = cluster_genders[[old, new]].value_counts().unstack()\n ct = ct.reindex(labels=genders, columns=genders)\n\n if fractional:\n ctf = ct.divide(ct.sum(axis='columns'), axis='rows')\n return ctf\n else:\n return ct\n\n\ndef plot_gender(set):\n cluster_genders[set].value_counts().sort_index().plot.barh()\n plt.title(f'Gender Distribution in {set}')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#postgresql-to-march-2022-2.0-release",
    "href": "reports/audit-gender-changes.html#postgresql-to-march-2022-2.0-release",
    "title": "Cluster Gender Changes",
    "section": "PostgreSQL to March 2022 (2.0 release)",
    "text": "PostgreSQL to March 2022 (2.0 release)\nThis marks the change from PostgreSQL to pure-Rust.\n\nct = gender_crosstab('pgsql', '2022-03-2.0')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\n2022-03-2.0\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\npgsql\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.977924\n0.002963\n0.013928\n0.000636\n0.000878\nnan\nnan\n0.003671\n\n\nfemale\n0.002192\n0.993942\n0.000001\n0.000301\n0.000430\n0.000003\n0.000005\n0.003126\n\n\nmale\n0.000591\n0.000000\n0.995942\n0.000528\n0.000796\n0.000002\n0.000014\n0.002127\n\n\nunknown\n0.000043\n0.002760\n0.005303\n0.988899\n0.001953\n0.000001\n0.000003\n0.001038\n\n\nno-author-rec\n0.000104\n0.007922\n0.049966\n0.031481\n0.908594\nnan\n0.000008\n0.001925\n\n\nno-book-author\n0.000002\n0.000051\n0.000198\n0.000107\n0.000053\n0.649173\n0.335017\n0.015398\n\n\nno-book\nnan\nnan\nnan\nnan\nnan\nnan\nnan\nnan\n\n\nabsent\n0.000006\n0.000049\n0.000254\n0.000120\n0.000178\n0.002007\n0.000070\n0.997316\n\n\n\n\n\nThis is where we change from no-book-author to no-book for a bunch of books; otherwise things are pretty consistent. This major change is likely a result of changes that count more books and book clusters - we had some inner joins in the PostgreSQL version that were questionable, and in particular we didn’t really cluster solo ISBNs but now we do. But now, if we have a solo ISBN from rating data, it gets a cluster with no book record instead of being excluded from the clustering.\nLet’s look at the distribution of statuses for each, starting with PostgreSQL:\n\nplot_gender('pgsql')\n\n\n\n\nAnd the Rust version:\n\nplot_gender('2022-03-2.0')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#march-to-july-2022",
    "href": "reports/audit-gender-changes.html#march-to-july-2022",
    "title": "Cluster Gender Changes",
    "section": "March to July 2022",
    "text": "March to July 2022\nWe updated a lot of data files and changed the name and ISBN parsing logic.\n\nct = gender_crosstab('2022-03-2.0', '2022-07')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\n2022-07\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\n2022-03-2.0\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.836189\n0.035578\n0.083571\n0.014798\n0.004862\n0.000087\n0.000016\n0.024900\n\n\nfemale\n0.010076\n0.963180\n0.000488\n0.007911\n0.001253\n0.000002\n0.000012\n0.017079\n\n\nmale\n0.006706\n0.000646\n0.974629\n0.003702\n0.001364\n0.000079\n0.000014\n0.012859\n\n\nunknown\n0.001899\n0.040307\n0.092948\n0.856037\n0.003412\n0.000200\nnan\n0.005197\n\n\nno-author-rec\n0.003532\n0.020634\n0.108700\n0.101631\n0.762109\n0.000009\n0.000037\n0.003349\n\n\nno-book-author\n0.002058\n0.030443\n0.057348\n0.035320\n0.056237\n0.809983\n0.000008\n0.008604\n\n\nno-book\n0.000156\n0.002342\n0.005269\n0.002737\n0.004768\n0.000634\n0.980664\n0.003431\n\n\nabsent\n0.002246\n0.020150\n0.043122\n0.015901\n0.062986\n0.003484\n0.000001\n0.852108\n\n\n\n\n\nMostly fine; some more are resolved, existing resolutions are pretty consistent.\n\nplot_gender('2022-07')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#july-2022-to-oct.-2022",
    "href": "reports/audit-gender-changes.html#july-2022-to-oct.-2022",
    "title": "Cluster Gender Changes",
    "section": "July 2022 to Oct. 2022",
    "text": "July 2022 to Oct. 2022\nWe changed from DataFusion to Polars and made further ISBN and name parsing changes.\n\nct = gender_crosstab('2022-07', '2022-10')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\n2022-10\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\n2022-07\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.989408\n0.004969\n0.003626\n0.000336\n0.001647\nnan\nnan\n0.000014\n\n\nfemale\nnan\n0.995091\nnan\n0.000361\n0.004471\nnan\nnan\n0.000078\n\n\nmale\n0.000001\nnan\n0.994582\n0.000431\n0.004975\n0.000000\nnan\n0.000011\n\n\nunknown\nnan\nnan\nnan\n0.995469\n0.004492\nnan\nnan\n0.000040\n\n\nno-author-rec\nnan\n0.000001\n0.000003\n0.000005\n0.999824\n0.000131\nnan\n0.000037\n\n\nno-book-author\nnan\n0.000000\n0.000001\n0.000000\n0.000000\n0.996620\nnan\n0.003378\n\n\nno-book\n0.000001\n0.000100\n0.000029\n0.000066\n0.000091\n0.198053\n0.670216\n0.131445\n\n\nabsent\nnan\nnan\nnan\nnan\nnan\nnan\nnan\n1.000000\n\n\n\n\n\n\nplot_gender('2022-10')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#oct.-2022-to-release-2.1-nov.-2022",
    "href": "reports/audit-gender-changes.html#oct.-2022-to-release-2.1-nov.-2022",
    "title": "Cluster Gender Changes",
    "section": "Oct. 2022 to release 2.1 (Nov. 2022)",
    "text": "Oct. 2022 to release 2.1 (Nov. 2022)\nWe added support for GoodReads CSV data and the Amazon 2018 rating CSV files.\n\nct = gender_crosstab('2022-10', '2022-11-2.1')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\n2022-11-2.1\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\n2022-10\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.999995\nnan\nnan\nnan\nnan\nnan\nnan\n0.000005\n\n\nfemale\nnan\n0.999999\nnan\nnan\nnan\nnan\nnan\n0.000001\n\n\nmale\nnan\nnan\n0.999999\nnan\nnan\nnan\nnan\n0.000001\n\n\nunknown\nnan\nnan\nnan\n0.999998\nnan\nnan\nnan\n0.000002\n\n\nno-author-rec\nnan\nnan\nnan\nnan\n0.999999\nnan\nnan\n0.000001\n\n\nno-book-author\nnan\nnan\nnan\n0.000000\nnan\n0.999982\nnan\n0.000017\n\n\nno-book\nnan\nnan\nnan\nnan\nnan\nnan\n1.000000\nnan\n\n\nabsent\nnan\n0.000000\n0.000000\n0.000000\n0.000000\n0.000003\n0.033872\n0.966125\n\n\n\n\n\n\nplot_gender('2022-11-2.1')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#release-2.1-to-jul.-2023",
    "href": "reports/audit-gender-changes.html#release-2.1-to-jul.-2023",
    "title": "Cluster Gender Changes",
    "section": "Release 2.1 to Jul. 2023",
    "text": "Release 2.1 to Jul. 2023\nWe updated OpenLibrary and VIAF, and made some technical changes.\n\nct = gender_crosstab('2022-11-2.1', '2023-07')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\n2023-07\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\n2022-11-2.1\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n0.879463\n0.006926\n0.018911\n0.001573\n0.000239\n0.074744\n0.000009\n0.018136\n\n\nfemale\n0.004890\n0.975552\n0.000307\n0.004275\n0.000396\n0.000482\n0.000002\n0.014097\n\n\nmale\n0.002247\n0.000044\n0.986206\n0.001428\n0.000497\n0.001463\n0.000005\n0.008110\n\n\nunknown\n0.000267\n0.019108\n0.032233\n0.945152\n0.001086\n0.000425\n0.000000\n0.001728\n\n\nno-author-rec\n0.000278\n0.002898\n0.007174\n0.009780\n0.975848\n0.000360\n0.000004\n0.003657\n\n\nno-book-author\n0.000411\n0.006228\n0.011051\n0.004080\n0.006719\n0.967640\n0.000001\n0.003869\n\n\nno-book\n0.000360\n0.005640\n0.011799\n0.007429\n0.024741\n0.003807\n0.940827\n0.005397\n\n\nabsent\n0.003096\n0.021000\n0.056276\n0.026941\n0.127455\n0.014997\n0.000001\n0.750233\n\n\n\n\n\n\nplot_gender('2023-07')"
  },
  {
    "objectID": "reports/audit-gender-changes.html#jul.-2023-to-current",
    "href": "reports/audit-gender-changes.html#jul.-2023-to-current",
    "title": "Cluster Gender Changes",
    "section": "Jul. 2023 to Current",
    "text": "Jul. 2023 to Current\nMostly technical code updates.\n\nct = gender_crosstab('2023-07', 'current')\nct.style.apply(style_row, 'columns')\n\n\n\n\n\n\ncurrent\nambiguous\nfemale\nmale\nunknown\nno-author-rec\nno-book-author\nno-book\nabsent\n\n\n2023-07\n \n \n \n \n \n \n \n \n\n\n\n\nambiguous\n1.000000\nnan\nnan\nnan\nnan\nnan\nnan\nnan\n\n\nfemale\nnan\n1.000000\nnan\nnan\nnan\nnan\nnan\nnan\n\n\nmale\nnan\nnan\n1.000000\nnan\nnan\nnan\nnan\nnan\n\n\nunknown\nnan\nnan\nnan\n1.000000\nnan\nnan\nnan\nnan\n\n\nno-author-rec\nnan\nnan\nnan\nnan\n1.000000\nnan\nnan\nnan\n\n\nno-book-author\nnan\nnan\n0.000000\nnan\n0.000000\n0.999999\nnan\nnan\n\n\nno-book\nnan\nnan\nnan\nnan\nnan\nnan\n1.000000\nnan\n\n\nabsent\nnan\nnan\nnan\nnan\nnan\n0.971987\nnan\n0.028013\n\n\n\n\n\n\nplot_gender('current')"
  },
  {
    "objectID": "using/index.html",
    "href": "using/index.html",
    "title": "Importing",
    "section": "",
    "text": "Using the Tools\nThis section of the documentation describes how to set up and use the book data integration tools."
  },
  {
    "objectID": "using/setup.html",
    "href": "using/setup.html",
    "title": "Environment Setup",
    "section": "",
    "text": "These tools require an Anaconda installation. It is possible to use them without Anaconda, but we have provided the environment definitions to automate use with Anaconda.\nThis project uses Git submodules, so you should clone it with:\ngit clone --recursive https://github.com/PIReTship/bookdata-tools.git\n\n\nYou will need:\n\nA Unix-like environment (macOS or Linux)\nAnaconda or Miniconda\n250GB of disk space\nAt least 24 GB of memory (lower may be possible)\n\n\n\n\nThe import tools are written in Python and Rust. The provided Conda lockfiles, along with environment.yml, define an Anaconda environment that contains all required runtimes and libraries:\nconda-lock install -n bookdata\nconda activate bookdata\nIf you don’t want to use Anaconda, see the following for more details on dependencies. If you don’t yet have conda-lock installed in your base environment, run:\nconda install -c conda-forge -n base conda-lock=2\n\n\nThis needs the following Python dependencies:\n\nPython 3.8 or later\nnumpy\npandas\nseaborn\njupyter\njupytext\ndvc (3 or later)\n\nThe Python dependencies are defined in environment.yml.\n\n\n\nThe Rust tools need Rust version 1.59 or later. The easiest way to install this — besides Anaconda — is with rustup.\nThe cargo build tool will automatically download all Rust libraries required. The Rust code does not depend on any specific system libraries.\n\n\n\nIf you update dependencies, you can re-generate the Conda lockfiles with conda-lock:\nconda-lock lock --mamba -f pyproject.toml"
  },
  {
    "objectID": "using/setup.html#system-requirements",
    "href": "using/setup.html#system-requirements",
    "title": "Environment Setup",
    "section": "",
    "text": "You will need:\n\nA Unix-like environment (macOS or Linux)\nAnaconda or Miniconda\n250GB of disk space\nAt least 24 GB of memory (lower may be possible)"
  },
  {
    "objectID": "using/setup.html#import-tool-dependencies",
    "href": "using/setup.html#import-tool-dependencies",
    "title": "Environment Setup",
    "section": "",
    "text": "The import tools are written in Python and Rust. The provided Conda lockfiles, along with environment.yml, define an Anaconda environment that contains all required runtimes and libraries:\nconda-lock install -n bookdata\nconda activate bookdata\nIf you don’t want to use Anaconda, see the following for more details on dependencies. If you don’t yet have conda-lock installed in your base environment, run:\nconda install -c conda-forge -n base conda-lock=2\n\n\nThis needs the following Python dependencies:\n\nPython 3.8 or later\nnumpy\npandas\nseaborn\njupyter\njupytext\ndvc (3 or later)\n\nThe Python dependencies are defined in environment.yml.\n\n\n\nThe Rust tools need Rust version 1.59 or later. The easiest way to install this — besides Anaconda — is with rustup.\nThe cargo build tool will automatically download all Rust libraries required. The Rust code does not depend on any specific system libraries.\n\n\n\nIf you update dependencies, you can re-generate the Conda lockfiles with conda-lock:\nconda-lock lock --mamba -f pyproject.toml"
  },
  {
    "objectID": "using/running.html",
    "href": "using/running.html",
    "title": "Running the Tools",
    "section": "",
    "text": "Running the Tools\nThe data import and integration process is scripted by DVC. The top-level dvc.yaml pipeline depends on all required steps for the core data, so to import the data, just run:\ndvc repro\nThe import process will take approximately 2–3 hours on a reasonably fast computer.\nThere are some additional useful outputs that the main pipeline does not invoke; you can generate these with:\ndvc repro --all-pipelines\nIf you have configured a remote to store your data files, you can then run dvc push to push the files to the remote to share with others on your team, copy to another computer, or import into another project."
  },
  {
    "objectID": "data/amazon.html",
    "href": "data/amazon.html",
    "title": "Amazon Ratings",
    "section": "",
    "text": "This processes two data sets from Julian McAuley’s group at UCSD:\n\nThe 2014 Amazon reviews data set\nThe 2018 Amazon reviews data set\n\nEach consists of user-provided reviews and ratings for a variety of products.\nCurrently we import the ratings-only data from the Books segment of the 2014 data set, and the books reviews from the 2018 data set.\n\n\n\n\n\n\nImportant\n\n\n\nIf you use this data, cite the paper(s) documented on the data set web site.\nFor 2014 data, the citations are:\n\nR. He and J. McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW 2016. DOI:10.1145/2872427.2883037.\n\n\nJ. McAuley, C. Targett, J. Shi, and A. van den Hengel. Image-based recommendations on styles and substitutes. In Proc. SIGIR 2015. DOI:10.1145/2766462.2767755.\n\nFor 2018 data:\n\nJ. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Empirical Methods in Natural Language Processing (EMNLP), 2019.\n\n\n\nImported data lives in the az2014 and az2018 directories. The source files are not automatically downloaded — you will need to download the ratings-only data for the Books category from each data site and save them in the data/az2014 and data/az2018 directories.\n\n\nconfig.yaml allows you to specify whether the review data is used:\naz2014:\n enabled: true\n\naz2018:\n enabled: true\n source: reviews\n\n\n\nThe import is controlled by the following DVC steps:\n\nscan-ratings\n\nScan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.\n\ncluster-ratings\n\nLink ratings with book clusters and aggregate by cluster, to produce user ratings for book clusters. Produces az2014/az-cluster-ratings.parquet.\n\n\n\n\n\n\n\naz2014/ratings.parquet\n\nThe raw rating data, with user strings converted to numeric IDs, is in this file.\nFile details\n\nSchema for az2014/ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\ntimestamp\n\n\nInt64\n\n\n\n\n\n\n\n\naz2018/ratings.parquet\n\nThe raw rating data, with user strings converted to numeric IDs, is in this file.\nFile details\n\nSchema for az2018/ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\ntimestamp\n\n\nInt64\n\n\n\n\n\n\n\n\n\n\n\naz2014/az-cluster-ratings.parquet\n\nThis file contains the integrated Amazon ratings, with cluster IDs in the item column.\nFile details\n\nSchema for az2014/az-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\naz2018/az-cluster-ratings.parquet\n\nThis file contains the integrated Amazon ratings, with cluster IDs in the item column.\nFile details\n\nSchema for az2018/az-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32"
  },
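A hedged sketch of the cluster-ratings step for one data set: link raw ratings to book clusters through an ISBN-to-cluster map, then aggregate per (user, cluster). The cluster-map file path and its column names are assumptions; only the output columns (user, item, rating, last_rating, first_time, last_time, nratings) come from the schemas documented above:

```python
# Illustrative cluster-ratings aggregation. The cluster map path and
# its column names are assumed, not taken from the actual pipeline.
import polars as pl

ratings = pl.scan_parquet("az2014/ratings.parquet")
clusters = pl.scan_parquet("book-links/isbn-clusters.parquet")  # assumed map

cluster_ratings = (
    ratings.join(clusters, left_on="asin", right_on="isbn")
    .group_by("user", "cluster")
    .agg(
        pl.col("rating").median().alias("rating"),
        pl.col("rating").sort_by("timestamp").last().alias("last_rating"),
        pl.col("timestamp").min().alias("first_time"),
        pl.col("timestamp").max().alias("last_time"),
        pl.len().cast(pl.UInt32).alias("nratings"),
    )
    .rename({"cluster": "item"})
)
cluster_ratings.collect().write_parquet("az2014/az-cluster-ratings.parquet")
```

The median here is one plausible way to consolidate a user's multiple ratings of a cluster; the real stage's aggregation rule may differ.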
  {
    "objectID": "data/amazon.html#configuration",
    "href": "data/amazon.html#configuration",
    "title": "Amazon Ratings",
    "section": "",
    "text": "config.yaml allows you to specify whether the review data is used:\naz2014:\n enabled: true\n\naz2018:\n enabled: true\n source: reviews"
  },
  {
    "objectID": "data/amazon.html#import-steps",
    "href": "data/amazon.html#import-steps",
    "title": "Amazon Ratings",
    "section": "",
    "text": "The import is controlled by the following DVC steps:\n\nscan-ratings\n\nScan the rating CSV file into a Parquet file, converting user strings into numeric IDs. Produces az2014/ratings.parquet.\n\ncluster-ratings\n\nLink ratings with book clusters and aggregate by cluster, to produce user ratings for book clusters. Produces az2014/az-cluster-ratings.parquet."
  },
  {
    "objectID": "data/amazon.html#raw-data",
    "href": "data/amazon.html#raw-data",
    "title": "Amazon Ratings",
    "section": "",
    "text": "az2014/ratings.parquet\n\nThe raw rating data, with user strings converted to numeric IDs, is in this file.\nFile details\n\nSchema for az2014/ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\ntimestamp\n\n\nInt64\n\n\n\n\n\n\n\n\naz2018/ratings.parquet\n\nThe raw rating data, with user strings converted to numeric IDs, is in this file.\nFile details\n\nSchema for az2018/ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\ntimestamp\n\n\nInt64"
  },
  {
    "objectID": "data/amazon.html#extracted-rating-tables",
    "href": "data/amazon.html#extracted-rating-tables",
    "title": "Amazon Ratings",
    "section": "",
    "text": "az2014/az-cluster-ratings.parquet\n\nThis file contains the integrated Amazon ratings, with cluster IDs in the item column.\nFile details\n\nSchema for az2014/az-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\naz2018/az-cluster-ratings.parquet\n\nThis file contains the integrated Amazon ratings, with cluster IDs in the item column.\nFile details\n\nSchema for az2018/az-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32"
  },
  {
    "objectID": "data/openlib.html",
    "href": "data/openlib.html",
    "title": "OpenLibrary",
    "section": "",
    "text": "We also source book data from OpenLibrary, as downloaded from their developer dumps.\nThe DVC control files automatically download the appropriate version. The version can be updated by modifying the data/ol_dump_*.txt.gz.dvc files.\nImported data lives under the openlibrary directory.\n\n\n\n\nerDiagram\n editions ||--o{ edition-isbn-ids : \"\"\n edition-isbn-ids }o--|| all-isbns : \"\"\n editions {\n Int32 id PK\n Utf8 key\n Utf8 title\n }\n editions }o--o{ works : \"edition-works\"\n editions |o--o{ edition-subjects : \"\"\n edition-subjects {\n Int32 id\n Utf8 subject\n }\n works {\n Int32 id PK\n Utf8 key\n Utf8 title\n }\n works |o--o{ work-subjects : \"\"\n work-subjects {\n Int32 id\n Utf8 subject\n }\n authors {\n Int32 id PK\n Utf8 key\n Utf8 name\n }\n authors ||--o{ author-names : \"\"\n editions }o--o{ authors : \"edition-authors\"\n works }o--o{ authors : \"work-authors\"\n\n\n\n\n\n\n\nThe import is controlled by the following DVC steps:\n\nscan-*\n\nThe various scan-* stages (e.g. scan-authors) scan an OpenLibrary JSON file into the resulting Parquet files. There are dependencies, to resolve OpenLibrary keys to numeric identifiers for cross-referencing. These scan stages do not currently extract all available data from the OpenLibrary JSON; they only extract the fields we currently use, and need to be extended to extract and save additional fields.\n\nedition-isbn-ids\n\nConvert edition ISBNs into ISBN IDs, producing openlibrary/edition-isbn-ids.parquet.\n\n\n\n\n\nThe raw data lives in the data/openlib directory, as compressed JSON files. Right now we do not extract very many fields from OpenLibrary; additional fields can be extracted by extending the import scripts.\n\n\n\nWe extract the following tables from OpenLibrary editions:\n\n\nopenlibrary/editions.parquet\n\nThis file contains a primary record for each edition, with the numeric edition ID, OpenLibrary key, and edition data.\nFile details\n\nSchema for openlibrary/editions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-authors.parquet\n\nThis file contains mappings between editions and their authors.\nFile details\n\nSchema for openlibrary/edition-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\npos\n\n\nInt16\n\n\n\n\nauthor\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/edition-works.parquet\n\nThis maps editions to their works.\nFile details\n\nSchema for openlibrary/edition-works.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nwork\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/edition-isbns.parquet\n\nThis contains the ISBN fields extracted from each OpenLibrary edition. This is primarily for internal purposes and most people won’t need to use it. ISBNs are cleaned (with clean_isbn_chars or clean_asin_chars) prior to being stored in this file.\nFile details\n\nSchema for openlibrary/edition-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-subjects.parquet\n\nThis table contains the subjects for OpenLibrary editions. Each row contains an edition ID and one subject. Its schema is in EditionSubjectRec.\nFile details\n\nSchema for openlibrary/edition-subjects.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsubj_type\n\n\nUInt8\n\n\n\n\nsubject\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-isbn-ids.parquet\n\nThis file maps editions to numeric ISBN identifiers. It is derived from openlibrary/edition-isbns.parquet.\nFile details\n\nSchema for openlibrary/edition-isbn-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\n\n\n\n\n\nWe extract the following tables from OpenLibrary works:\n\n\nopenlibrary/works.parquet\n\nThis file contains the primary record for each work, mapping a numeric ID to its OpenLibrary key and containing other per-work fields.\nFile details\n\nSchema for openlibrary/works.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/work-authors.parquet\n\nThis file links work records to the work’s author list (works may have separate author lists from their editions).\nFile details\n\nSchema for openlibrary/work-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\npos\n\n\nInt16\n\n\n\n\nauthor\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/work-subjects.parquet\n\nThis table contains the subjects for OpenLibrary works. Each row contains a work ID and one subject. Its schema is in WorkSubjectRec.\nFile details\n\nSchema for openlibrary/work-subjects.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsubj_type\n\n\nUInt8\n\n\n\n\nsubject\n\n\nUtf8\n\n\n\n\n\n\n\n\n\n\n\nopenlibrary/authors.parquet\n\nThis file contains basic information about OpenLibrary authors.\nFile details\n\nSchema for openlibrary/authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/author-names.parquet\n\nThis file contains the names associated with each author in openlibrary/authors.parquet.\nFile details\n\nSchema for openlibrary/author-names.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsource\n\n\nUInt8\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\n\n\n\nopenlibrary/work-clusters.parquet\n\nThis file is a helper table to make it easier to connect OpenLibrary data to clusters by mapping OpenLibrary work IDs to book data cluster IDs.\nFile details\n\nSchema for openlibrary/work-clusters.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32"
  },
  {
    "objectID": "data/openlib.html#import-steps",
    "href": "data/openlib.html#import-steps",
    "title": "OpenLibrary",
    "section": "",
    "text": "The import is controlled by the following DVC steps:\n\nscan-*\n\nThe various scan-* stages (e.g. scan-authors) scan an OpenLibrary JSON file into the resulting Parquet files. There are dependencies, to resolve OpenLibrary keys to numeric identifiers for cross-referencing. These scan stages do not currently extract all available data from the OpenLibrary JSON; they only extract the fields we currently use, and need to be extended to extract and save additional fields.\n\nedition-isbn-ids\n\nConvert edition ISBNs into ISBN IDs, producing openlibrary/edition-isbn-ids.parquet."
  },
  {
    "objectID": "data/openlib.html#raw-data",
    "href": "data/openlib.html#raw-data",
    "title": "OpenLibrary",
    "section": "",
    "text": "The raw data lives in the data/openlib directory, as compressed JSON files. Right now we do not extract very many fields from OpenLibrary; additional fields can be extracted by extending the import scripts."
  },
  {
    "objectID": "data/openlib.html#extracted-edition-tables",
    "href": "data/openlib.html#extracted-edition-tables",
    "title": "OpenLibrary",
    "section": "",
    "text": "We extract the following tables from OpenLibrary editions:\n\n\nopenlibrary/editions.parquet\n\nThis file contains a primary record for each edition, with the numeric edition ID, OpenLibrary key, and edition data.\nFile details\n\nSchema for openlibrary/editions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-authors.parquet\n\nThis file contains mappings between editions and their authors.\nFile details\n\nSchema for openlibrary/edition-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\npos\n\n\nInt16\n\n\n\n\nauthor\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/edition-works.parquet\n\nThis maps editions to their works.\nFile details\n\nSchema for openlibrary/edition-works.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nwork\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/edition-isbns.parquet\n\nThis contains the ISBN fields extracted from each OpenLibrary edition. This is primarily for internal purposes and most people won’t need to use it. ISBNs are cleaned (with clean_isbn_chars or clean_asin_chars) prior to being stored in this file.\nFile details\n\nSchema for openlibrary/edition-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-subjects.parquet\n\nThis table contains the subjects for OpenLibrary editions. Each row contains an edition ID and one subject. Its schema is in EditionSubjectRec.\nFile details\n\nSchema for openlibrary/edition-subjects.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsubj_type\n\n\nUInt8\n\n\n\n\nsubject\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/edition-isbn-ids.parquet\n\nThis file maps editions to numeric ISBN identifiers. It is derived from openlibrary/edition-isbns.parquet.\nFile details\n\nSchema for openlibrary/edition-isbn-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nedition\n\n\nInt32\n\n\n\n\nisbn_id\n\n\nInt32"
  },
  {
    "objectID": "data/openlib.html#extracted-work-tables",
    "href": "data/openlib.html#extracted-work-tables",
    "title": "OpenLibrary",
    "section": "",
    "text": "We extract the following tables from OpenLibrary works:\n\n\nopenlibrary/works.parquet\n\nThis file contains the primary record for each work, mapping a numeric ID to its OpenLibrary key and containing other per-work fields.\nFile details\n\nSchema for openlibrary/works.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/work-authors.parquet\n\nThis file links work records to the work’s author list (works may have separate author lists from their editions).\nFile details\n\nSchema for openlibrary/work-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\npos\n\n\nInt16\n\n\n\n\nauthor\n\n\nInt32\n\n\n\n\n\n\n\n\nopenlibrary/work-subjects.parquet\n\nThis table contains the subjects for OpenLibrary works. Each row contains a work ID and one subject. Its schema is in WorkSubjectRec.\nFile details\n\nSchema for openlibrary/work-subjects.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsubj_type\n\n\nUInt8\n\n\n\n\nsubject\n\n\nUtf8"
  },
  {
    "objectID": "data/openlib.html#extracted-author-tables",
    "href": "data/openlib.html#extracted-author-tables",
    "title": "OpenLibrary",
    "section": "",
    "text": "openlibrary/authors.parquet\n\nThis file contains basic information about OpenLibrary authors.\nFile details\n\nSchema for openlibrary/authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nkey\n\n\nUtf8\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\nopenlibrary/author-names.parquet\n\nThis file contains the names associated with each author in openlibrary/authors.parquet.\nFile details\n\nSchema for openlibrary/author-names.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nid\n\n\nInt32\n\n\n\n\nsource\n\n\nUInt8\n\n\n\n\nname\n\n\nUtf8"
  },
  {
    "objectID": "data/openlib.html#utility-tables",
    "href": "data/openlib.html#utility-tables",
    "title": "OpenLibrary",
    "section": "",
    "text": "openlibrary/work-clusters.parquet\n\nThis file is a helper table to make it easier to connect OpenLibrary data to clusters by mapping OpenLibrary work IDs to book data cluster IDs.\nFile details\n\nSchema for openlibrary/work-clusters.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32"
  },
  {
    "objectID": "data/viaf.html",
    "href": "data/viaf.html",
    "title": "VIAF",
    "section": "",
    "text": "We source author data from the Virtual International Authority File, as downloaded from their data dumps. This file is slow to download, as the VIAF server is rather slow.\n\n\n\n\n\n\nNote\n\n\n\nVIAF also does not keep old copies of the dump file. You may need to edit data/params.yaml to update the VIAF URL to fetch in order to import this data.\n\n\nImported data lives under the viaf directory.\n\n\nThe import is controlled by the following DVC steps:\n\nscan-authors\n\nImport the VIAF MARC data into viaf/viaf.parquet.\n\nauthor-genders\n\nExtract author genders from the VIAF MARC data, producing viaf/author-genders.parquet.\n\nindex-names\n\nNormalize and expand author names and map to VIAF record IDs, producing viaf/author-name-index.parquet.\n\n\n\n\n\nThe VIAF data is in MARC 21 Authority Record format. The initial scan stage extracts this into a table using the MARC schema.\n\n\nviaf/viaf.parquet\n\nThe table storing raw MARC fields from VIAF.\nFile details\n\nSchema for viaf/viaf.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nfld_no\n\n\nUInt32\n\n\n\n\ntag\n\n\nInt16\n\n\n\n\nind1\n\n\nUInt8\n\n\n\n\nind2\n\n\nUInt8\n\n\n\n\nsf_code\n\n\nUInt8\n\n\n\n\ncontents\n\n\nUtf8\n\n\n\n\n\n\n\n\n\nWe process the MARC records to produce several derived tables.\n\n\nviaf/author-name-index.parquet\n\nThe author-name index file maps record IDs to author names, as defined in field 700a. For each record, it stores each of the names extracted by bookdata::cleaning::names. This file is also available in csv.gz format.\nFile details\n\nSchema for viaf/author-name-index.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\nviaf/author-genders.parquet\n\nThis file contains the extracted gender information for each author record (field 375a). If a record has multiple gender fields, they are all recorded. Merging gender records happens later in the integration.\nFile details\n\nSchema for viaf/author-genders.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\ngender\n\n\nUtf8\n\n\n\n\n\n\n\n\n\nThe MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.\nThe Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume they are universally followed by all VIAF contributors.\nFurther, as near as we can tell, the VIAF removes all non-binary gender identities or converts them to ‘unknown’.\nThis data should only be used with great care. We discuss these limitations in the extended paper."
  },
  {
    "objectID": "data/viaf.html#import-steps",
    "href": "data/viaf.html#import-steps",
    "title": "VIAF",
    "section": "",
    "text": "The import is controlled by the following DVC steps:\n\nscan-authors\n\nImport the VIAF MARC data into viaf/viaf.parquet.\n\nauthor-genders\n\nExtract author genders from the VIAF MARC data, producing viaf/author-genders.parquet.\n\nindex-names\n\nNormalize and expand author names and map to VIAF record IDs, producing viaf/author-name-index.parquet."
  },
  {
    "objectID": "data/viaf.html#raw-data",
    "href": "data/viaf.html#raw-data",
    "title": "VIAF",
    "section": "",
    "text": "The VIAF data is in MARC 21 Authority Record format. The initial scan stage extracts this into a table using the MARC schema.\n\n\nviaf/viaf.parquet\n\nThe table storing raw MARC fields from VIAF.\nFile details\n\nSchema for viaf/viaf.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nfld_no\n\n\nUInt32\n\n\n\n\ntag\n\n\nInt16\n\n\n\n\nind1\n\n\nUInt8\n\n\n\n\nind2\n\n\nUInt8\n\n\n\n\nsf_code\n\n\nUInt8\n\n\n\n\ncontents\n\n\nUtf8"
  },
  {
    "objectID": "data/viaf.html#extracted-author-tables",
    "href": "data/viaf.html#extracted-author-tables",
    "title": "VIAF",
    "section": "",
    "text": "We process the MARC records to produce several derived tables.\n\n\nviaf/author-name-index.parquet\n\nThe author-name index file maps record IDs to author names, as defined in field 700a. For each record, it stores each of the names extracted by bookdata::cleaning::names. This file is also available in csv.gz format.\nFile details\n\nSchema for viaf/author-name-index.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\nviaf/author-genders.parquet\n\nThis file contains the extracted gender information for each author record (field 375a). If a record has multiple gender fields, they are all recorded. Merging gender records happens later in the integration.\nFile details\n\nSchema for viaf/author-genders.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\ngender\n\n\nUtf8"
  },
  {
    "objectID": "data/viaf.html#viaf-gender-vocabulary",
    "href": "data/viaf.html#viaf-gender-vocabulary",
    "title": "VIAF",
    "section": "",
    "text": "The MARC gender field is defined as the author’s gender identity. It allows identities from an open vocabulary, along with start and end dates for the validity of each identity.\nThe Program for Cooperative Cataloging Task Group on Gender in Name Authority Records produced a report with recommendations for how to record this field. Many libraries contributing to the Library of Congress file, from which many VIAF records are sourced, follow these recommendations, but it is not safe to assume they are universally followed by all VIAF contributors.\nFurther, as near as we can tell, the VIAF removes all non-binary gender identities or converts them to ‘unknown’.\nThis data should only be used with great care. We discuss these limitations in the extended paper."
  },
  296. {
  297. "objectID": "data/gender.html",
  298. "href": "data/gender.html",
  299. "title": "Book Author Gender",
  300. "section": "",
  301. "text": "We compute the author gender for book clusters using the integrated data set.\n\n\n\n\n\n\nWarning\n\n\n\nSee the paper for important limitations and ethical considerations.\n\n\n\n\n\ncluster-genders (in book-links/)\n\nMatch book genders with clusters. Produces cluster-genders.parquest.\n\n\n\n\n\nFor each book cluster, the integration does the following:\n\nAccumulate all names for the first author from OpenLibrary\nAccumulate all names for the first/primary author from the Library of Congress\nObtain gender identities from all VIAF records matching an author name in this pool\nConsolidate gender into a cluster author gender identity\n\nThe results of this are stored in book-links/cluster-genders.parquet.\n\n\nbook-links/cluster-genders.parquet\n\nThe author gender identified for each book cluster.\nFile details\n\nSchema for book-links/cluster-genders.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\ngender\n\n\nUtf8\n\n\n\n\n\n\n\n\nbook-links/gender-stats.csv\n\nThis file records the number of books with each gender resolution in each data set, for auditing and analysis purposes.\n\n\n\n\nSee the paper for a fuller discussion. Some known limitations include:\n\nVIAF does not record non-binary gender identities.\nRecent versions of the OpenLibrary data contain VIAF identifiers for book authors, but we do not yet make use of this information. When available, they should improve the reliability of book-author linking.\nGoodReads includes author names, but we do not yet use these for linking to gender records."
  302. },
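The join structure of this integration can be sketched with Polars using the files documented here; note that the final consolidation rule below (collecting unique values) is illustrative only, and the pipeline's actual resolution of conflicting gender records is more involved:

```python
import polars as pl

authors = pl.read_parquet("book-links/cluster-first-authors.parquet")  # cluster, author_name
names = pl.read_parquet("viaf/author-name-index.parquet")              # rec_id, name
genders = pl.read_parquet("viaf/author-genders.parquet")               # rec_id, gender

# Steps 3-4: pool VIAF records matching the cluster's author names,
# then gather the gender values recorded for those records.
cluster_genders = (
    authors.join(names, left_on="author_name", right_on="name")
    .join(genders, on="rec_id")
    .group_by("cluster")
    .agg(pl.col("gender").unique().alias("genders"))  # simplified consolidation
)
```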
  303. {
  304. "objectID": "data/gender.html#import-steps",
  305. "href": "data/gender.html#import-steps",
  306. "title": "Book Author Gender",
  307. "section": "",
  308. "text": "cluster-genders (in book-links/)\n\nMatch book genders with clusters. Produces cluster-genders.parquest."
  309. },
  310. {
  311. "objectID": "data/gender.html#gender-integration",
  312. "href": "data/gender.html#gender-integration",
  313. "title": "Book Author Gender",
  314. "section": "",
  315. "text": "For each book cluster, the integration does the following:\n\nAccumulate all names for the first author from OpenLibrary\nAccumulate all names for the first/primary author from the Library of Congress\nObtain gender identities from all VIAF records matching an author name in this pool\nConsolidate gender into a cluster author gender identity\n\nThe results of this are stored in book-links/cluster-genders.parquet.\n\n\nbook-links/cluster-genders.parquet\n\nThe author gender identified for each book cluster.\nFile details\n\nSchema for book-links/cluster-genders.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\ngender\n\n\nUtf8\n\n\n\n\n\n\n\n\nbook-links/gender-stats.csv\n\nThis file records the number of books with each gender resolution in each data set, for auditing and analysis purposes."
  316. },
  317. {
  318. "objectID": "data/gender.html#limitations",
  319. "href": "data/gender.html#limitations",
  320. "title": "Book Author Gender",
  321. "section": "",
  322. "text": "See the paper for a fuller discussion. Some known limitations include:\n\nVIAF does not record non-binary gender identities.\nRecent versions of the OpenLibrary data contain VIAF identifiers for book authors, but we do not yet make use of this information. When available, they should improve the reliability of book-author linking.\nGoodReads includes author names, but we do not yet use these for linking to gender records."
  323. },
  324. {
  325. "objectID": "data/goodreads.html",
  326. "href": "data/goodreads.html",
  327. "title": "GoodReads",
  328. "section": "",
  329. "text": "We import GoodReads data from the UCSD Book Graph for additional book and user interaction information. The source files are not automatically downloaded; you will need the following:\n\nBooks\nBook works\nAuthors\nBook genres\nBook series\nInteraction data (the full interactions JSON file, not the summary CSV)\n\nDownload the files and save them in the data/goodreads directory. Each one has a corresponding .dvc file already in the repository.\n\n\n\n\n\n\nImportant\n\n\n\nIf you use this data, cite the paper(s) documented on the data set web site.\n\n\nImported data lives in the goodreads directory.\n\n\nThe config.yaml file allows you disable the GoodReads data entirely, as well as control whether reviews are processed:\ngoodreads:\n enabled: true\n reviews: true\nIf you change the configuration, you need to update the pipeline before running.\n\n\n\nThe import is controlled by several DVC steps:\n\nscan-*\n\nThe various scan-* steps each scan a JSON file into corresponding Parquet files. They have a specific order, as scanning interactions needs book information.\n\nbook-isbn-ids\n\nMatch GoodReads ISBNs with ISBN IDs.\n\nbook-links\n\nCreates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and is cluster ID.\n\ncluster-actions\n\nExtracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.\n\ncluster-ratings\n\nExtracts cluster-level explicit feedback data. This is the ratings each user assigned to books in each cluster.\n\nwork-actions, work-ratings\n\nThe same thing as the cluster-* stages, except it groups by GoodReads work instead of by integrated cluster. 
If you are only working with the GoodReads data, and not trying to connect across data sets, this data is better to work with.\n\nwork-gender\n\nThe author gender for each GoodReads work, as near as we can tell.\n\n\n\n\n\n\n\ngoodreads/gr-book-ids.parquet\n\nIdentifiers extracted from each GoodReads book record.\nFile details\n\nSchema for goodreads/gr-book-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ngr_item\n\n\nInt32\n\n\n\n\nisbn10\n\n\nUtf8\n\n\n\n\nisbn13\n\n\nUtf8\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-info.parquet\n\nMetadata extracted from GoodReads book records.\nFile details\n\nSchema for goodreads/gr-book-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\npub_year\n\n\nUInt16\n\n\n\n\npub_month\n\n\nUInt8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-genres.parquet\n\nGoodReads book-genre associations.\nFile details\n\nSchema for goodreads/gr-book-genres.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\ngenre_id\n\n\nInt32\n\n\n\n\ncount\n\n\nInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-book-series.parquet\n\nGoodReads book series associations.\nFile details\n\nSchema for goodreads/gr-book-series.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nseries\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-genres.parquet\n\nThe genre labels to go with goodreads/gr-book-genres.parquet.\nFile details\n\nSchema for goodreads/gr-genres.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ngenre_id\n\n\nInt32\n\n\n\n\ngenre\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-link.parquet\n\nLinking identifiers (work and cluster) for GoodReads books.\nFile details\n\nSchema for goodreads/gr-book-link.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-info.parquet\n\nMetadata extracted from GoodReads work records.\nFile details\n\nSchema for goodreads/gr-work-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\npub_year\n\n\nInt16\n\n\n\n\npub_month\n\n\nUInt8\n\n\n\n\n\n\n\n\ngoodreads/gr-interactions.parquet\n\nGoodReads interaction records (from JSON).\nFile details\n\nSchema for goodreads/gr-interactions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nreview_id\n\n\nInt64\n\n\n\n\nuser_id\n\n\nInt32\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nis_read\n\n\nUInt8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nadded\n\n\nFloat32\n\n\n\n\nupdated\n\n\nFloat32\n\n\n\n\nread_started\n\n\nFloat32\n\n\n\n\nread_finished\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-author-info.parquet\n\nGoodReads author information.\nFile details\n\nSchema for goodreads/gr-author-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nauthor_id\n\n\nInt32\n\n\n\n\nname\n\n\nUtf8\n\n\n\n\n\n\n\n\n\n\n\ngoodreads/gr-cluster-actions.parquet\n\nCluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. 
This version of the table is processed from the JSON version of the full interaction log, which is only available by request.\nFile details\n\nSchema for goodreads/gr-cluster-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnactions\n\n\nUInt32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-cluster-ratings.parquet\n\nCluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.\nFile details\n\nSchema for goodreads/gr-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\n\n\n\ngoodreads/gr-work-actions.parquet\n\nWork-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.\nFile details\n\nSchema for goodreads/gr-work-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnactions\n\n\nUInt32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-ratings.parquet\n\nWork-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.\nFile details\n\nSchema for goodreads/gr-work-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-gender.parquet\n\nAuthor gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.\nFile details\n\nSchema for goodreads/gr-work-gender.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\ngender\n\n\nUtf8\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32"
  330. },
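As a hedged sketch of how these interaction tables are typically consumed, the cluster-level ratings load directly with Polars; the renames below are one common convention for recommender experiments, not something these files require:

```python
import polars as pl

ratings = pl.read_parquet("goodreads/gr-cluster-ratings.parquet")

# 'item' holds integrated cluster IDs (see the schema above).
df = ratings.select(
    pl.col("user").alias("user_id"),
    pl.col("item").alias("item_id"),
    "rating",
    pl.col("last_time").alias("timestamp"),
)
```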
  331. {
  332. "objectID": "data/goodreads.html#configuration",
  333. "href": "data/goodreads.html#configuration",
  334. "title": "GoodReads",
  335. "section": "",
  336. "text": "The config.yaml file allows you disable the GoodReads data entirely, as well as control whether reviews are processed:\ngoodreads:\n enabled: true\n reviews: true\nIf you change the configuration, you need to update the pipeline before running."
  337. },
  338. {
  339. "objectID": "data/goodreads.html#import-steps",
  340. "href": "data/goodreads.html#import-steps",
  341. "title": "GoodReads",
  342. "section": "",
  343. "text": "The import is controlled by several DVC steps:\n\nscan-*\n\nThe various scan-* steps each scan a JSON file into corresponding Parquet files. They have a specific order, as scanning interactions needs book information.\n\nbook-isbn-ids\n\nMatch GoodReads ISBNs with ISBN IDs.\n\nbook-links\n\nCreates goodreads/gr-book-link.parquet, which links each GoodReads book with its work (if applicable) and is cluster ID.\n\ncluster-actions\n\nExtracts cluster-level implicit feedback data. Each (user, cluster) pair has one record, with the number of actions (the number of times the user added a book from that cluster to a shelf) and timestamp data.\n\ncluster-ratings\n\nExtracts cluster-level explicit feedback data. This is the ratings each user assigned to books in each cluster.\n\nwork-actions, work-ratings\n\nThe same thing as the cluster-* stages, except it groups by GoodReads work instead of by integrated cluster. If you are only working with the GoodReads data, and not trying to connect across data sets, this data is better to work with.\n\nwork-gender\n\nThe author gender for each GoodReads work, as near as we can tell."
  344. },
  345. {
  346. "objectID": "data/goodreads.html#scanned-and-linking-data",
  347. "href": "data/goodreads.html#scanned-and-linking-data",
  348. "title": "GoodReads",
  349. "section": "",
  350. "text": "goodreads/gr-book-ids.parquet\n\nIdentifiers extracted from each GoodReads book record.\nFile details\n\nSchema for goodreads/gr-book-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ngr_item\n\n\nInt32\n\n\n\n\nisbn10\n\n\nUtf8\n\n\n\n\nisbn13\n\n\nUtf8\n\n\n\n\nasin\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-info.parquet\n\nMetadata extracted from GoodReads book records.\nFile details\n\nSchema for goodreads/gr-book-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\npub_year\n\n\nUInt16\n\n\n\n\npub_month\n\n\nUInt8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-genres.parquet\n\nGoodReads book-genre associations.\nFile details\n\nSchema for goodreads/gr-book-genres.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\ngenre_id\n\n\nInt32\n\n\n\n\ncount\n\n\nInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-book-series.parquet\n\nGoodReads book series associations.\nFile details\n\nSchema for goodreads/gr-book-series.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nseries\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-genres.parquet\n\nThe genre labels to go with goodreads/gr-book-genres.parquet.\nFile details\n\nSchema for goodreads/gr-genres.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ngenre_id\n\n\nInt32\n\n\n\n\ngenre\n\n\nUtf8\n\n\n\n\n\n\n\n\ngoodreads/gr-book-link.parquet\n\nLinking identifiers (work and cluster) for GoodReads books.\nFile details\n\nSchema for goodreads/gr-book-link.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-info.parquet\n\nMetadata extracted from GoodReads work records.\nFile details\n\nSchema for goodreads/gr-work-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nwork_id\n\n\nInt32\n\n\n\n\ntitle\n\n\nUtf8\n\n\n\n\npub_year\n\n\nInt16\n\n\n\n\npub_month\n\n\nUInt8\n\n\n\n\n\n\n\n\ngoodreads/gr-interactions.parquet\n\nGoodReads interaction records (from JSON).\nFile details\n\nSchema for goodreads/gr-interactions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nreview_id\n\n\nInt64\n\n\n\n\nuser_id\n\n\nInt32\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nis_read\n\n\nUInt8\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nadded\n\n\nFloat32\n\n\n\n\nupdated\n\n\nFloat32\n\n\n\n\nread_started\n\n\nFloat32\n\n\n\n\nread_finished\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-author-info.parquet\n\nGoodReads author information.\nFile details\n\nSchema for goodreads/gr-author-info.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nauthor_id\n\n\nInt32\n\n\n\n\nname\n\n\nUtf8"
  351. },
  352. {
  353. "objectID": "data/goodreads.html#cluster-level-tables",
  354. "href": "data/goodreads.html#cluster-level-tables",
  355. "title": "GoodReads",
  356. "section": "",
  357. "text": "goodreads/gr-cluster-actions.parquet\n\nCluster-level implicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.\nFile details\n\nSchema for goodreads/gr-cluster-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnactions\n\n\nUInt32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-cluster-ratings.parquet\n\nCluster-level explicit-feedback records, suitable for use in LensKit. The item column contains cluster IDs. This version of the table is processed from the JSON version of the full interaction log, which is only available by request.\nFile details\n\nSchema for goodreads/gr-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32"
  358. },
  359. {
  360. "objectID": "data/goodreads.html#work-level-tables",
  361. "href": "data/goodreads.html#work-level-tables",
  362. "title": "GoodReads",
  363. "section": "",
  364. "text": "goodreads/gr-work-actions.parquet\n\nWork-level implicit-feedback records, suitable for use in LensKit. The item column contains work IDs.\nFile details\n\nSchema for goodreads/gr-work-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnactions\n\n\nUInt32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-ratings.parquet\n\nWork-level explicit-feedback records, suitable for use in LensKit. The item column contains work IDs.\nFile details\n\nSchema for goodreads/gr-work-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt32\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat32\n\n\n\n\nlast_rating\n\n\nFloat32\n\n\n\n\nfirst_time\n\n\nInt64\n\n\n\n\nlast_time\n\n\nInt64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\ngoodreads/gr-work-gender.parquet\n\nAuthor gender for GoodReads works. This is computed by connecting works to clusters and obtaining the cluster gender information from book-links/cluster-genders.parquet.\nFile details\n\nSchema for goodreads/gr-work-gender.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\ngender\n\n\nUtf8\n\n\n\n\nbook_id\n\n\nInt32\n\n\n\n\nwork_id\n\n\nInt32"
  365. },
  366. {
  367. "objectID": "data/cluster.html",
  368. "href": "data/cluster.html",
  369. "title": "Book Clusters",
  370. "section": "",
  371. "text": "For recommendation and analysis, we often want to look at works instead of individual books or editions of those books. The same material by the same author(s) may be reprinted in many different editions, with different ISBNs, and sometimes separate ratings from the same user.\nThere are a variety of ways to deal with this. GoodReads and OpenLibrary both have the concept of a ‘work’ to group together related editions (the Library of Congress also has such a concept internally in its BIBFRAME schema, but that data is not currently available for integration).\nOther services, such as ThingISBN and OCLC’s xISBN both link together ISBNs: given a query ISBN, they will return a list of ISBNs believed to be for the same book.\nUsing the book data sources here, we have implemented comparable functionality in a manner that anyone can reproduce from public data. We call the resulting equivalence sets ‘book clusters’.\n\n\nOur clustering algorithm begins by forming an undirected graph of record identifiers. We extract records from the following:\n\nLibrary of Congress book records, with edges from records to ISBNs recorded for that record.\nOpenLibrary editions, with edges from editions to ISBNs recorded for that edition.\nOpenLibrary works, with edges from works to editions.\nGoodReads books, with edges from books to ISBNs recorded for that book.\nGoodReads works, with edges from works to books.\n\nWe then compute the connected components on this graph, and treat each connected component as a single ‘book’ (what we call a book cluster).\nThe idea is that if two ISBNs appear together on a book record, that is evidence they are for the same book; likewise, if two book records have the same ISBN, it is evidence they record the same book. Pooling this evidence across all data sources maximizes the ability to detect book clusters.\nThe isbn_cluster table maps each ISBN to its associated cluster. Individual data sources may also have an isbn_cluster table (e.g. gr.isbn_cluster); that is the result of clustering ISBNs using only the book records from that data source. However, all clustered results such as rating tables are based on the all-source book clusters.\n\n\n\nThere are a few known problems with the ISBN clustering:\n\nPublishers occasionally reuse ISBNs. They aren’t supposed to do this, but they do. This results in unrelated books having the same ISBN. This will cause a problem for any ISBN-based linking between books and ratings, not just the book clustering. We don’t yet have a good way to identify these ISBNs.\nSome book sets have ISBNs, which cause them link together books that should not be clustered. The Library of Congress identifies many of these ISBNs as set ISBNs, and we are examining the prospect of using this to exclude them from informing clustering decisions.\n\nIf you only need e.g. the GoodReads data, we recommend that you not cluster it for the purpose of ratings, and only use clusters to link to out-of-GR book or author data. 
We are open to adding additional tables that facilitate linking GoodReads works directly to other tables.\n\n\n\n\n\nbook-links/isbn-clusters.parquet\n\nThis file maps ISBN IDs to book clusters, enabling the various other book identifiers from other data sources to be mapped to clusters, since everything resolves to ISBN IDs.\nFile details\n\nSchema for book-links/isbn-clusters.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\n\n\nWe also export the book identifier graph used for clustering to support further analysis.\n\n\nbook-links/cluster-graph-nodes.parquet\n\nThe table of nodes (with attributes) from the graph used for book clustering.\nFile details\n\nSchema for book-links/cluster-graph-nodes.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_code\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nnode_type\n\n\nUtf8\n\n\n\n\nlabel\n\n\nUtf8\n\n\n\n\n\n\nbook-links/cluster-graph-edges.parquet\n\nThe table of edges from the book clustering graph.\nFile details\n\nSchema for book-links/cluster-graph-edges.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nsrc\n\n\nInt32\n\n\n\n\ndst\n\n\nInt32\n\n\n\n\n\n\nbook-links/book-graph.mp.zst\n\nThis is a serialization of the actual graph itself, using rmp-serde to serialize the Petgraph structure with MsgPack and compressing it with ZStandard. This is unlikely to be usable outside of the Rust codebase, whereas the node and edge tables could be loaded into something like igraph for further analysis.\n\n\n\n\n\n\nWith the clusters, we then extract additional information from other tables.\n\n\nbook-links/cluster-first-authors.parquet\n\nAll available first-author records for each cluster, to support linking with VIAF.\nFile details\n\nSchema for book-links/cluster-first-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nauthor_name\n\n\nUtf8\n\n\n\n\n\n\n\n\nbook-links/cluster-hashes.parquet\n\nThe MD5 checksums of the sorted sequence of ISBNs for each cluster, along with a dcode that is the least-significant bit of the checksum.\nFile details\n\nSchema for book-links/cluster-hashes.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nisbn_hash\n\n\nUtf8\n\n\n\n\nisbn_dcode\n\n\nInt8\n\n\n\n\n\n\n\n\nbook-links/cluster-stats.parquet\n\nStatistics for each cluster, useful for auditing and debugging.\nFile details\n\nSchema for book-links/cluster-stats.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nn_nodes\n\n\nUInt32\n\n\n\n\nn_isbns\n\n\nUInt32\n\n\n\n\nn_loc_recs\n\n\nUInt32\n\n\n\n\nn_ol_editions\n\n\nUInt32\n\n\n\n\nn_ol_works\n\n\nUInt32\n\n\n\n\nn_gr_books\n\n\nUInt32\n\n\n\n\nn_gr_works\n\n\nUInt32"
  372. },
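Because the node and edge tables are exported (see the cluster link tables above), the connected-components step can be re-derived for analysis outside the pipeline. This sketch uses SciPy rather than the pipeline's Rust/Petgraph implementation, and remaps the src/dst identifiers to dense indices first, in case they are sparse book codes rather than consecutive row numbers:

```python
import numpy as np
import polars as pl
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

edges = pl.read_parquet("book-links/cluster-graph-edges.parquet")
src = edges["src"].to_numpy()
dst = edges["dst"].to_numpy()

# Remap identifiers to dense 0..n-1 indices before building the graph.
codes = np.unique(np.concatenate([src, dst]))
row = np.searchsorted(codes, src)
col = np.searchsorted(codes, dst)

graph = coo_matrix(
    (np.ones(len(row), dtype=np.int8), (row, col)),
    shape=(len(codes), len(codes)),
)
n_components, labels = connected_components(graph, directed=False)
# labels[i] is the book cluster containing identifier codes[i].
```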
  373. {
  374. "objectID": "data/cluster.html#clustering-algorithm",
  375. "href": "data/cluster.html#clustering-algorithm",
  376. "title": "Book Clusters",
  377. "section": "",
  378. "text": "Our clustering algorithm begins by forming an undirected graph of record identifiers. We extract records from the following:\n\nLibrary of Congress book records, with edges from records to ISBNs recorded for that record.\nOpenLibrary editions, with edges from editions to ISBNs recorded for that edition.\nOpenLibrary works, with edges from works to editions.\nGoodReads books, with edges from books to ISBNs recorded for that book.\nGoodReads works, with edges from works to books.\n\nWe then compute the connected components on this graph, and treat each connected component as a single ‘book’ (what we call a book cluster).\nThe idea is that if two ISBNs appear together on a book record, that is evidence they are for the same book; likewise, if two book records have the same ISBN, it is evidence they record the same book. Pooling this evidence across all data sources maximizes the ability to detect book clusters.\nThe isbn_cluster table maps each ISBN to its associated cluster. Individual data sources may also have an isbn_cluster table (e.g. gr.isbn_cluster); that is the result of clustering ISBNs using only the book records from that data source. However, all clustered results such as rating tables are based on the all-source book clusters."
  379. },
  380. {
  381. "objectID": "data/cluster.html#known-problems",
  382. "href": "data/cluster.html#known-problems",
  383. "title": "Book Clusters",
  384. "section": "",
  385. "text": "There are a few known problems with the ISBN clustering:\n\nPublishers occasionally reuse ISBNs. They aren’t supposed to do this, but they do. This results in unrelated books having the same ISBN. This will cause a problem for any ISBN-based linking between books and ratings, not just the book clustering. We don’t yet have a good way to identify these ISBNs.\nSome book sets have ISBNs, which cause them link together books that should not be clustered. The Library of Congress identifies many of these ISBNs as set ISBNs, and we are examining the prospect of using this to exclude them from informing clustering decisions.\n\nIf you only need e.g. the GoodReads data, we recommend that you not cluster it for the purpose of ratings, and only use clusters to link to out-of-GR book or author data. We are open to adding additional tables that facilitate linking GoodReads works directly to other tables."
  386. },
  387. {
  388. "objectID": "data/cluster.html#cluster-link-tables",
  389. "href": "data/cluster.html#cluster-link-tables",
  390. "title": "Book Clusters",
  391. "section": "",
  392. "text": "book-links/isbn-clusters.parquet\n\nThis file maps ISBN IDs to book clusters, enabling the various other book identifiers from other data sources to be mapped to clusters, since everything resolves to ISBN IDs.\nFile details\n\nSchema for book-links/isbn-clusters.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\n\n\n{{< schema book-links/isbn-clusters.parquet >}}\n\n\nWe also export the book identifier graph used for clustering to support further analysis.\n\n\nbook-links/cluster-graph-nodes.parquet\n\nThe table of nodes (with attributes) from the graph used for book clustering.\nFile details\n\nSchema for book-links/cluster-graph-nodes.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nbook_code\n\n\nInt32\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nnode_type\n\n\nUtf8\n\n\n\n\nlabel\n\n\nUtf8\n\n\n\n\n\n\n{{< schema book-links/cluster-graph-nodes.parquet >}}\n\n\nbook-links/cluster-graph-edges.parquet\n\nThe table of edges rom the book clustering graph.\nFile details\n\nSchema for book-links/cluster-graph-edges.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nsrc\n\n\nInt32\n\n\n\n\ndst\n\n\nInt32\n\n\n\n\n\n\n{{< schema book-links/cluster-graph-edges.parquet >}}\n\nbook-links/book-graph.mp.zst\n\nThis is a serialization of the actual graph itself, using rmp-serde to serialize the Petgraph structure with MsgPack and compressing ith with ZStandard. This is unlikely to be usable outside of the Rust codebase, whereas the node and edge tables could be loaded into something like igraph for further analysis."
  393. },
  394. {
  395. "objectID": "data/cluster.html#cluster-information-tables",
  396. "href": "data/cluster.html#cluster-information-tables",
  397. "title": "Book Clusters",
  398. "section": "",
  399. "text": "With the clusters, we then extract additional information from other tables.\n\n\nbook-links/cluster-first-authors.parquet\n\nAll available first-author records for each cluster, to support linking with VIAF.\nFile details\n\nSchema for book-links/cluster-first-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nauthor_name\n\n\nUtf8\n\n\n\n\n\n\n\n\nbook-links/cluster-hashes.parquet\n\nThe MD5 checksums of the sorted sequence of ISBNs for each cluster, along with a dcode that is the least-significant bit of the checksum.\nFile details\n\nSchema for book-links/cluster-hashes.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nisbn_hash\n\n\nUtf8\n\n\n\n\nisbn_dcode\n\n\nInt8\n\n\n\n\n\n\n\n\nbook-links/cluster-stats.parquet\n\nStatistics for each cluster, useful for auditing and debugging.\nFile details\n\nSchema for book-links/cluster-stats.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\ncluster\n\n\nInt32\n\n\n\n\nn_nodes\n\n\nUInt32\n\n\n\n\nn_isbns\n\n\nUInt32\n\n\n\n\nn_loc_recs\n\n\nUInt32\n\n\n\n\nn_ol_editions\n\n\nUInt32\n\n\n\n\nn_ol_works\n\n\nUInt32\n\n\n\n\nn_gr_books\n\n\nUInt32\n\n\n\n\nn_gr_works\n\n\nUInt32"
  400. },
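The cluster-hashes definition translates almost directly into code. A sketch follows, with the caveat that the byte-level details (the separator used when joining ISBNs, and which digest byte holds the least-significant bit) are assumptions here, not confirmed from the pipeline source:

```python
import hashlib

def cluster_hash(isbns: list[str]) -> tuple[str, int]:
    # MD5 over the sorted ISBN sequence; the '|' separator is an assumption.
    digest = hashlib.md5("|".join(sorted(isbns)).encode("utf-8"))
    # dcode: least-significant bit of the checksum (assumed to be the low
    # bit of the final digest byte).
    return digest.hexdigest(), digest.digest()[-1] & 1
```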
  401. {
  402. "objectID": "data/ids.html",
  403. "href": "data/ids.html",
  404. "title": "Common Identifiers",
  405. "section": "",
  406. "text": "There are two key identifiers that are used across data sets.\n\n\nWe use ISBNs for a lot of data linking. In order to speed up ISBN-based operations, we map textual ISBNs to numeric ’ISBN IDs`.\n\n\nbook-links/all-isbns.parquet\n\nThis file manages ISBN IDs and their mappings, along with statistics about their usage in other records.\n\n\n\nColumn\nPurpose\n\n\n\n\nisbn_id\nISBN identifier\n\n\nisbn\nTextual ISBNs\n\n\n\nEach type of ISBN (ISBN-10, ISBN-13) is considered a distinct ISBN. We also consider other ISBN-like things, particularly ASINs, to be ISBNs.\nAdditional fields in this table contain the number of records from different sources that reference this ISBN.\nFile details\n\nSchema for book-links/all-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\nLOC\n\n\nUInt32\n\n\n\n\nOL\n\n\nUInt32\n\n\n\n\nGR\n\n\nInt64\n\n\n\n\nAZ14\n\n\nUInt32\n\n\n\n\nAZ18\n\n\nUInt32\n\n\n\n\n\n\nMany other tables that work with ISBNs use ISBN IDs.\n\n\n\nWe also use book codes, common identifiers for integrated ‘books’ across data sets. These are derived from identifiers in the various data sets. Each book code source is assigned to a different 100M number band (a ‘numspace’) so we can, if needed, derive the source from a book code.\n\n\n\nSource\nNamespace Object\nNumspace\n\n\n\n\nOL Work\nNS_WORK\n100M\n\n\nOL Edition\nNS_EDITION\n200M\n\n\nLOC Record\nNS_LOC_REC\n300M\n\n\nGR Work\nNS_GR_WORK\n400M\n\n\nGR Book\nNS_GR_BOOK\n500M\n\n\nLOC Work\nNS_LOC_WORK\n600M\n\n\nLOC Instance\nNS_LOC_INSTANCE\n700M\n\n\nISBN\nNS_ISBN\n900M\n\n\n\nThe bookdata::ids::codes module contains the Rust API for working with these codes (including each of the namespace objects) and converting identifiers into and out of them.\nThe LOC Work and Instance sources are not currently used; they are intended for future use when we are able to import BIBFRAME data from the Library of Congress."
  407. },
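Since each source occupies a fixed 100M band, the source of a book code can be recovered with integer division. The sketch below mirrors the numspace table above; the Rust bookdata::ids::codes module remains the authoritative implementation:

```python
# Band number -> source, following the numspace table above.
NUMSPACES = {
    1: "OL Work", 2: "OL Edition", 3: "LOC Record", 4: "GR Work",
    5: "GR Book", 6: "LOC Work", 7: "LOC Instance", 9: "ISBN",
}

def book_code_source(code: int) -> str:
    """Return the data source a book code belongs to."""
    return NUMSPACES.get(code // 100_000_000, "unknown")

assert book_code_source(100_000_042) == "OL Work"  # 100M band
assert book_code_source(912_345_678) == "ISBN"     # 900M band
```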
  408. {
  409. "objectID": "data/ids.html#sec-isbn-ids",
  410. "href": "data/ids.html#sec-isbn-ids",
  411. "title": "Common Identifiers",
  412. "section": "",
  413. "text": "We use ISBNs for a lot of data linking. In order to speed up ISBN-based operations, we map textual ISBNs to numeric ’ISBN IDs`.\n\n\nbook-links/all-isbns.parquet\n\nThis file manages ISBN IDs and their mappings, along with statistics about their usage in other records.\n\n\n\nColumn\nPurpose\n\n\n\n\nisbn_id\nISBN identifier\n\n\nisbn\nTextual ISBNs\n\n\n\nEach type of ISBN (ISBN-10, ISBN-13) is considered a distinct ISBN. We also consider other ISBN-like things, particularly ASINs, to be ISBNs.\nAdditional fields in this table contain the number of records from different sources that reference this ISBN.\nFile details\n\nSchema for book-links/all-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\nLOC\n\n\nUInt32\n\n\n\n\nOL\n\n\nUInt32\n\n\n\n\nGR\n\n\nInt64\n\n\n\n\nAZ14\n\n\nUInt32\n\n\n\n\nAZ18\n\n\nUInt32\n\n\n\n\n\n\nMany other tables that work with ISBNs use ISBN IDs."
  414. },
  415. {
  416. "objectID": "data/ids.html#sec-book-codes",
  417. "href": "data/ids.html#sec-book-codes",
  418. "title": "Common Identifiers",
  419. "section": "",
  420. "text": "We also use book codes, common identifiers for integrated ‘books’ across data sets. These are derived from identifiers in the various data sets. Each book code source is assigned to a different 100M number band (a ‘numspace’) so we can, if needed, derive the source from a book code.\n\n\n\nSource\nNamespace Object\nNumspace\n\n\n\n\nOL Work\nNS_WORK\n100M\n\n\nOL Edition\nNS_EDITION\n200M\n\n\nLOC Record\nNS_LOC_REC\n300M\n\n\nGR Work\nNS_GR_WORK\n400M\n\n\nGR Book\nNS_GR_BOOK\n500M\n\n\nLOC Work\nNS_LOC_WORK\n600M\n\n\nLOC Instance\nNS_LOC_INSTANCE\n700M\n\n\nISBN\nNS_ISBN\n900M\n\n\n\nThe bookdata::ids::codes module contains the Rust API for working with these codes (including each of the namespace objects) and converting identifiers into and out of them.\nThe LOC Work and Instance sources are not currently used; they are intended for future use when we are able to import BIBFRAME data from the Library of Congress."
  421. },
  422. {
  423. "objectID": "data/loc.html",
  424. "href": "data/loc.html",
  425. "title": "Library of Congress",
  426. "section": "",
  427. "text": "One of our sources of book data is the Library of Congress MDSConnect Books bibliography records.\nWe download and import the XML versions of these files.\nImported data lives under the loc-mds directory.\n\n\n\n\nerDiagram\n book-ids |o--|{ book-fields : contains\n book-ids ||--o{ book-isbns : \"\"\n book-ids ||--o{ book-isbn-ids : \"\"\n book-ids ||--o{ book-authors : \"\"\n\n\n\n\n\n\n\nThe import is controlled by the following DVC steps:\n\nscan-books\n\nScan the book MARC data from data/loc-books into Parquet files (described in book data).\n\nbook-isbn-ids\n\nResolve ISBNs from LOC books into ISBN IDs, producing loc-mds/book-isbn-ids.parquet.\n\nbook-authors\n\nExtract (and clean up) author names for LOC books.\n\n\n\n\n\nWhen importing MARC data, we create a “fields” file that contains the data exactly as recorded in MARC. We then process this data to produce additional files. One of these MARC field files contains the following columns (defined by FieldRecord):\n\nrec_id\n\nThe record identifier (generated at import)\n\nfld_no\n\nThe field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a fld_no with their containing field.\n\ntag\n\nThe MARC tag; either a three-digit number, or -1 for the MARC leader.\n\nind1, ind2\n\nMARC indicators. Their meanings are defined in the MARC specification.\n\nsf_code\n\nMARC subfield code.\n\ncontents\n\nThe raw textual content of the MARC field or subfield.\n\n\n\n\n\nWe extract a number of tables from the LOC MDS book data. These tables only contain information about actual “books” in the collection, as opposed to other types of materials. We consider a book to be anything that has MARC record type ‘a’ or ‘t’ (language material), and is not also classified as a government record in MARC field 008.\n\n\nloc-mds/book-fields.parquet (struct FieldRecord)\n\nThe book-fields table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.\nFile details\n\nSchema for loc-mds/book-fields.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nfld_no\n\n\nUInt32\n\n\n\n\ntag\n\n\nInt16\n\n\n\n\nind1\n\n\nUInt8\n\n\n\n\nind2\n\n\nUInt8\n\n\n\n\nsf_code\n\n\nUInt8\n\n\n\n\ncontents\n\n\nUtf8\n\n\n\n\n\n\n\n\nloc-mds/book-ids.parquet (struct BookIds)\n\nThis table includes code information for each book record.\n\nRecord ID\nMARC Control Number\nLibrary of Congress Control Number (LCCN)\nRecord status\nRecord type\nBibliographic level\n\nMore information about the last three is in the leader specification.\nFile details\n\nSchema for loc-mds/book-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nmarc_cn\n\n\nUtf8\n\n\n\n\nlccn\n\n\nUtf8\n\n\n\n\nstatus\n\n\nUInt8\n\n\n\n\nrec_type\n\n\nUInt8\n\n\n\n\nbib_level\n\n\nUInt8\n\n\n\n\n\n\n\n\nloc-mds/book-isbns.parquet (struct ISBNrec)\n\nTextual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the parser in bookdata::cleaning::isbns parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. 
This table contains the results of that process.\nFile details\n\nSchema for loc-mds/book-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\ntag\n\n\nUtf8\n\n\n\n\n\n\n\n\nloc-mds/book-isbn-ids.parquet\n\nMap book records (LOC book rec_id values) to ISBN IDs. It is produced by converting the ISBNs in loc-mds/book-isbns.parquet into ISBN IDs.\nFile details\n\nSchema for loc-mds/book-isbn-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\n\n\n\n\nloc-mds/book-authors.parquet\n\nAuthor names for book records. This only extracts the primary author name (MARC field 100 subfield ‘a’).\nFile details\n\nSchema for loc-mds/book-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nauthor_name\n\n\nUtf8"
  428. },
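Because book-fields preserves the raw MARC data, derived tables can be reconstructed with simple filters. This sketch pulls primary author names (field 100, subfield ‘a’) with Polars; it assumes sf_code stores the subfield character as its ASCII byte, and it omits the name cleanup that the real book-authors stage applies:

```python
import polars as pl

fields = pl.read_parquet("loc-mds/book-fields.parquet")

# MARC field 100, subfield 'a' holds the primary author name.
# Assumption: sf_code is the ASCII byte value of the subfield code.
authors = fields.filter(
    (pl.col("tag") == 100) & (pl.col("sf_code") == ord("a"))
).select("rec_id", pl.col("contents").alias("author_name"))
```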
  429. {
  430. "objectID": "data/loc.html#import-steps",
  431. "href": "data/loc.html#import-steps",
  432. "title": "Library of Congress",
  433. "section": "",
  434. "text": "The import is controlled by the following DVC steps:\n\nscan-books\n\nScan the book MARC data from data/loc-books into Parquet files (described in book data).\n\nbook-isbn-ids\n\nResolve ISBNs from LOC books into ISBN IDs, producing loc-mds/book-isbn-ids.parquet.\n\nbook-authors\n\nExtract (and clean up) author names for LOC books."
  435. },
  436. {
  437. "objectID": "data/loc.html#sec-marc-format",
  438. "href": "data/loc.html#sec-marc-format",
  439. "title": "Library of Congress",
  440. "section": "",
  441. "text": "When importing MARC data, we create a “fields” file that contains the data exactly as recorded in MARC. We then process this data to produce additional files. One of these MARC field files contains the following columns (defined by FieldRecord):\n\nrec_id\n\nThe record identifier (generated at import)\n\nfld_no\n\nThe field number. This corresponds to a single MARC field entry; rows in this table containing data from MARC subfields will share a fld_no with their containing field.\n\ntag\n\nThe MARC tag; either a three-digit number, or -1 for the MARC leader.\n\nind1, ind2\n\nMARC indicators. Their meanings are defined in the MARC specification.\n\nsf_code\n\nMARC subfield code.\n\ncontents\n\nThe raw textual content of the MARC field or subfield."
  442. },
  443. {
  444. "objectID": "data/loc.html#extracted-book-tables",
  445. "href": "data/loc.html#extracted-book-tables",
  446. "title": "Library of Congress",
  447. "section": "",
  448. "text": "We extract a number of tables from the LOC MDS book data. These tables only contain information about actual “books” in the collection, as opposed to other types of materials. We consider a book to be anything that has MARC record type ‘a’ or ‘t’ (language material), and is not also classified as a government record in MARC field 008.\n\n\nloc-mds/book-fields.parquet (struct FieldRecord)\n\nThe book-fields table contains the raw data imported from the MARC files, as MARC fields. The LOC book data follows the MARC 21 Bibliographic Data format; the various tags, field codes, and indicators are defined there. This table is not terribly useful on its own, but it is the source from which the other tables are derived.\nFile details\n\nSchema for loc-mds/book-fields.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nfld_no\n\n\nUInt32\n\n\n\n\ntag\n\n\nInt16\n\n\n\n\nind1\n\n\nUInt8\n\n\n\n\nind2\n\n\nUInt8\n\n\n\n\nsf_code\n\n\nUInt8\n\n\n\n\ncontents\n\n\nUtf8\n\n\n\n\n\n\n\n\nloc-mds/book-ids.parquet (struct BookIds)\n\nThis table includes code information for each book record.\n\nRecord ID\nMARC Control Number\nLibrary of Congress Control Number (LCCN)\nRecord status\nRecord type\nBibliographic level\n\nMore information about the last three is in the leader specification.\nFile details\n\nSchema for loc-mds/book-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nmarc_cn\n\n\nUtf8\n\n\n\n\nlccn\n\n\nUtf8\n\n\n\n\nstatus\n\n\nUInt8\n\n\n\n\nrec_type\n\n\nUInt8\n\n\n\n\nbib_level\n\n\nUInt8\n\n\n\n\n\n\n\n\nloc-mds/book-isbns.parquet (struct ISBNrec)\n\nTextual ISBNs as extracted from LOC records. The actual ISBN strings (tag 020 subfield ‘a’) are quite messy; the parser in bookdata::cleaning::isbns parses out ISBNs, along with additional tags or descriptors, from the ISBN strings using a number of best-effort heuristics. This table contains the results of that process.\nFile details\n\nSchema for loc-mds/book-isbns.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nisbn\n\n\nUtf8\n\n\n\n\ntag\n\n\nUtf8\n\n\n\n\n\n\n\n\nloc-mds/book-isbn-ids.parquet\n\nMap book records (LOC book rec_id values) to ISBN IDs. It is produced by converting the ISBNs in loc-mds/book-isbns.parquet into ISBN IDs.\nFile details\n\nSchema for loc-mds/book-isbn-ids.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nisbn_id\n\n\nInt32\n\n\n\n\n\n\n\n\nloc-mds/book-authors.parquet\n\nAuthor names for book records. This only extracts the primary author name (MARC field 100 subfield ‘a’).\nFile details\n\nSchema for loc-mds/book-authors.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nrec_id\n\n\nUInt32\n\n\n\n\nauthor_name\n\n\nUtf8"
  449. },
  450. {
  451. "objectID": "data/index.html",
  452. "href": "data/index.html",
  453. "title": "Data Organization",
  454. "section": "",
  455. "text": "Data Organization\nThis section describes the layout of the imported data, and the logic behind its integration.\nWe organize the data and pipelines in directories as follows:\n\ndata\n\nContains the raw import data as downloaded from its original source. Manually-downloaded files and files that can be natively downloaded by DVC are tracked with a .dvc file; the dvc.yaml pipeline contains stages to automatically download additional files. The only processing in this directory is downloading.\nData sets consisting of multiple files generally get a subdirectory under this directory.\n\nloc-mds\n\nContains the results of processing data from the Library of Congress MDSConnect Open MARC service. See LOC for details.\n\nopenlibrary\n\nContains the results of processing the OpenLibrary data.\n\nviaf\n\nContains Virtual Internet Authority File processing.\n\nbx\n\nContains the results of integrating BookCrossing.\n\naz2014\n\nContains the results of integrating the Amazon 2014 ratings data set.\n\ngoodreads\n\nContains the GoodReads processing and integration\n\nbook-links\n\nContains linking book identifiers for integrating the whole set, including the clustering and the integrated author genders.\n\n\nEach directory has a DVC pipeline for managing that directory’s outputs. Post-clustering integrations are stored in the data source directory; e.g. the goodreads directory contains both the direct tabular GoodReads data, and the conversion of ratings into ratings for book clusters based on book-links (so the flow from directory to directory is not one-directional)."
  456. },
  457. {
  458. "objectID": "data/bx.html",
  459. "href": "data/bx.html",
  460. "title": "BookCrossing",
  461. "section": "",
  462. "text": "The BookCrossing data set consists of user-provided ratings — both implicit and explicit — of books.\n\n\n\n\n\n\nNote\n\n\n\nThe BookCrossing site is no longer online, so this data cannot be obtained from its original source and the BookCrossing integration is disabled by default. If you have a copy of this data, save the BX-CSV-Dump.zip file in the data directory and enable BookCrossing in config.yaml to use it.\n\n\n\n\n\n\n\n\nImportant\n\n\n\nIf you use the BookCrossing data, cite:\n\nCai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. 2005. Improving Recommendation Lists Through Topic Diversification. Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. DOI:10.1145/1060745.1060754.\n\n\n\nImported data lives in the bx directory.\n\n\nThe import is controlled by the following DVC steps:\n\ndata/BX-CSV-Dump.zip.dvc\n\nDownload the BookCrossing zip file.\n\nclean-ratings\n\nUnpack ratings from the downloaded zip file and clean up their invalid characters.\n\ncluster-ratings\n\nCombine BookCrossing ratings with book clusters to produce (user, cluster, rating) from the explicit-feedback ratings. BookCrossing implicit feedback entries (rating of 0) are excluded. Produces bx/bx-cluster-ratings.parquet.\n\ncluster-actions\n\nCombine BookCrossing interactions with book clusters to produce (user, cluster) implicit-feedback records. These records include the BookCrossing implicit feedback entries (rating of 0). Produces bx/bx-cluster-actions.parquet.\n\n\n\n\n\nThe raw rating data, with invalid characters cleaned up, is in the bx/cleaned-ratings.csv file. It has the following columns:\n\nuser_id\n\nThe user identifier (numeric).\n\nisbn\n\nThe book ISBN (text).\n\nrating\n\nThe book rating \\(r_{ui}\\). The ratings are on a 1-10 scale, with 0 indicating an implicit-feedback record.\n\n\n\n\n\n\n\nbx/bx-cluster-ratings.parquet\n\nThe explicit-feedback ratings (\\(r_{ui} > 0\\) from {{ERR unknown file bx/cleaned-ratings.csv}}), with book clusters as the items.\nFile details\n\nSchema for bx/bx-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt64\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\nbx/bx-cluster-actions.parquet\n\nAll user-item interactions from {{ERR unknown file bx/cleaned-ratings.csv}}, with book clusters as the items.\nFile details\n\nSchema for bx/bx-cluster-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt64\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nnactions\n\n\nUInt32"
  463. },
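The rating-of-0 convention makes the explicit/implicit split a one-line filter. A minimal sketch with Polars, assuming the cleaned CSV has the three columns listed above:

```python
import polars as pl

bx = pl.read_csv("bx/cleaned-ratings.csv")  # user_id, isbn, rating

explicit = bx.filter(pl.col("rating") > 0)  # ratings on the 1-10 scale
implicit = bx                               # all records, including rating == 0
```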
  464. {
  465. "objectID": "data/bx.html#import-steps",
  466. "href": "data/bx.html#import-steps",
  467. "title": "BookCrossing",
  468. "section": "",
  469. "text": "The import is controlled by the following DVC steps:\n\ndata/BX-CSV-Dump.zip.dvc\n\nDownload the BookCrossing zip file.\n\nclean-ratings\n\nUnpack ratings from the downloaded zip file and clean up their invalid characters.\n\ncluster-ratings\n\nCombine BookCrossing ratings with book clusters to produce (user, cluster, rating) from the explicit-feedback ratings. BookCrossing implicit feedback entries (rating of 0) are excluded. Produces bx/bx-cluster-ratings.parquet.\n\ncluster-actions\n\nCombine BookCrossing interactions with book clusters to produce (user, cluster) implicit-feedback records. These records include the BookCrossing implicit feedback entries (rating of 0). Produces bx/bx-cluster-actions.parquet."
  470. },
  471. {
  472. "objectID": "data/bx.html#sec-bx-raw",
  473. "href": "data/bx.html#sec-bx-raw",
  474. "title": "BookCrossing",
  475. "section": "",
  476. "text": "The raw rating data, with invalid characters cleaned up, is in the bx/cleaned-ratings.csv file. It has the following columns:\n\nuser_id\n\nThe user identifier (numeric).\n\nisbn\n\nThe book ISBN (text).\n\nrating\n\nThe book rating \\(r_{ui}\\). The ratings are on a 1-10 scale, with 0 indicating an implicit-feedback record."
  477. },
  478. {
  479. "objectID": "data/bx.html#sec-bx-extracted",
  480. "href": "data/bx.html#sec-bx-extracted",
  481. "title": "BookCrossing",
  482. "section": "",
  483. "text": "bx/bx-cluster-ratings.parquet\n\nThe explicit-feedback ratings (\\(r_{ui} > 0\\) from {{ERR unknown file bx/cleaned-ratings.csv}}), with book clusters as the items.\nFile details\n\nSchema for bx/bx-cluster-ratings.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt64\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nrating\n\n\nFloat64\n\n\n\n\nnratings\n\n\nUInt32\n\n\n\n\n\n\n\n\nbx/bx-cluster-actions.parquet\n\nAll user-item interactions from {{ERR unknown file bx/cleaned-ratings.csv}}, with book clusters as the items.\nFile details\n\nSchema for bx/bx-cluster-actions.parquet.\n\n\n\n\n\n\n\nField\n\n\nType\n\n\n\n\n\n\nuser\n\n\nInt64\n\n\n\n\nitem\n\n\nInt32\n\n\n\n\nnactions\n\n\nUInt32"
  484. },
  485. {
  486. "objectID": "history.html",
  487. "href": "history.html",
  488. "title": "History",
  489. "section": "",
  490. "text": "This page documents the release history of the Book Data Tools. Each numbered, released version has a corresponding Git tag (e.g. v2.0).\nIf you use the Book Data Tools in published research, we ask that you do the following:\n\nCite the UMUAI paper, regardless of which version of the data set you use.\nCite the papers corresponding to the individual ratings, review, or consumption data sets you are using.\nClearly state the version of the data tools you are using in your paper.\nLet us know about your work so we can add you to the list.\n\n\n\n\nMake the pipeline configurable so individual rating datasets can be disabled.\nOnly support the full JSON GoodReads interaction data, because it is now publicly available.\nUse jsonnet to generate DVC pipelines, taking configuration settings into account.\nUpdate to newer VIAF and OpenLibrary dumps.\nExtract GoodReads author information into goodreads/gr-author-info.parquet.\nSupport full-text reviews from the GoodReads and Amazon 2018 data sets (enabled by default).\nDisable the BookCrossing data by default since the source website is offline.\nExtract 5-cores of interaction files.\nUpdate to OpenLibrary and VIAF dumps from the beginning of 2024 (OpenLibrary 2023-12-31, VIAF 2024-01-01).\n\n\n\n\n🪲 GoodReads cluster & work rating timestamps were on incorrect scale\n\n\n\n\n\nVersion 2.1 has a few updates but does not change existing data schemas when run with the full GoodReads interaction files. It does have improved book/author linking that increases coverage due to a revised and corrected name parsing & normalization flow.\nThe tools now support the GoodReads interaction CSV file, which is available without registration, and uses this by default. See the GoodReads data docs for the details. This means that, in their default configuration, the book data integration uses only data that is publicly available without special request.\n\n\n\nUpdated VIAF to May 1, 2022 dump\nUpdated OpenLibrary to March 29, 2022 dump\nAdded 2018 version of the Amazon ratings\nAdded code to extract edition and work subjects\nUpdated docs for current extraction layout\nAdded openlibrary/work-clusters.parquet to simplify OpenLibrary integration\n\n\n\n\n\nSwitched from DataFusion to Polars, to reduce volatility and improve maintainability. This also involved a switch from Arrow to Arrow2, which seems to have cleaner code (and less custom logic needed for IO).\nRewrote logic that was previously in DataFusion + custom TCL in Rust, so all integration code is in Rust for consistency (and to avoid redundancy in things like logging configuration between Rust and Python). 
The code is now in 2 languages: Rust integration and Python notebooks to report on integration statistics.\nImproved name parsing\n\nReplaced nom-based name parser for name_variants with a new one written in peg, that is both easier to read/maintain and more efficient.\nCorrected errors in name parser that emitted empty-string names for some authors.\nAdded clean_name function, used across all name formatting, to normalize whitespace and punctuation in name records from any source.\nAdded more tests for name parsing and normalization.\n\nFixed a bug in GoodReads integration, where we were not extracting ASINs.\nExtract book genres and series from GoodReads.\nUpdated various Rust dependencies, and upgraded from StructOpt to clap’s derive macros.\nBetter progress reporting for data scans.\n\n\n\n\n\nThis is the updated release of the Book Data Tools, using the same source data as 1.0 but with DataFusion and Rust-based import logic, instead of PostgreSQL. It is significantly easier to install and use.\n\n\n\nThe original release that used PostgreSQL. There were a couple of versions of this for the RecSys and UMUAI papers; the tagged 1.0 release corresponds to the data used for the UMUAI paper."
  491. },
  492. {
  493. "objectID": "history.html#book-data-3.0-in-progress",
  494. "href": "history.html#book-data-3.0-in-progress",
  495. "title": "History",
  496. "section": "",
  497. "text": "Make the pipeline configurable so individual rating datasets can be disabled.\nOnly support the full JSON GoodReads interaction data, because it is now publicly available.\nUse jsonnet to generate DVC pipelines, taking configuration settings into account.\nUpdate to newer VIAF and OpenLibrary dumps.\nExtract GoodReads author information into goodreads/gr-author-info.parquet.\nSupport full-text reviews from the GoodReads and Amazon 2018 data sets (enabled by default).\nDisable the BookCrossing data by default since the source website is offline.\nExtract 5-cores of interaction files.\nUpdate to OpenLibrary and VIAF dumps from the beginning of 2024 (OpenLibrary 2023-12-31, VIAF 2024-01-01).\n\n\n\n\n🪲 GoodReads cluster & work rating timestamps were on incorrect scale"
  498. },
  499. {
  500. "objectID": "history.html#book-data-2.1",
  501. "href": "history.html#book-data-2.1",
  502. "title": "History",
  503. "section": "",
  504. "text": "Version 2.1 has a few updates but does not change existing data schemas when run with the full GoodReads interaction files. It does have improved book/author linking that increases coverage due to a revised and corrected name parsing & normalization flow.\nThe tools now support the GoodReads interaction CSV file, which is available without registration, and uses this by default. See the GoodReads data docs for the details. This means that, in their default configuration, the book data integration uses only data that is publicly available without special request.\n\n\n\nUpdated VIAF to May 1, 2022 dump\nUpdated OpenLibrary to March 29, 2022 dump\nAdded 2018 version of the Amazon ratings\nAdded code to extract edition and work subjects\nUpdated docs for current extraction layout\nAdded openlibrary/work-clusters.parquet to simplify OpenLibrary integration\n\n\n\n\n\nSwitched from DataFusion to Polars, to reduce volatility and improve maintainability. This also involved a switch from Arrow to Arrow2, which seems to have cleaner code (and less custom logic needed for IO).\nRewrote logic that was previously in DataFusion + custom TCL in Rust, so all integration code is in Rust for consistency (and to avoid redundancy in things like logging configuration between Rust and Python). The code is now in 2 languages: Rust integration and Python notebooks to report on integration statistics.\nImproved name parsing\n\nReplaced nom-based name parser for name_variants with a new one written in peg, that is both easier to read/maintain and more efficient.\nCorrected errors in name parser that emitted empty-string names for some authors.\nAdded clean_name function, used across all name formatting, to normalize whitespace and punctuation in name records from any source.\nAdded more tests for name parsing and normalization.\n\nFixed a bug in GoodReads integration, where we were not extracting ASINs.\nExtract book genres and series from GoodReads.\nUpdated various Rust dependencies, and upgraded from StructOpt to clap’s derive macros.\nBetter progress reporting for data scans."
  505. },
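The clean_name function itself is part of the Rust codebase; as a rough Python illustration of the kind of normalization it performs (the exact rules here are assumptions, not the real implementation):

import re

def clean_name(name: str) -> str:
    # Collapse runs of whitespace, then trim stray separator punctuation.
    # Illustrative only; the real rules live in the Rust clean_name.
    name = re.sub(r'\s+', ' ', name.strip())
    return name.strip(' ,;')

clean_name('  Le Guin,\tUrsula K. ;')  # -> 'Le Guin, Ursula K.'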
  506. {
  507. "objectID": "history.html#book-data-2.0",
  508. "href": "history.html#book-data-2.0",
  509. "title": "History",
  510. "section": "",
  511. "text": "This is the updated release of the Book Data Tools, using the same source data as 1.0 but with DataFusion and Rust-based import logic, instead of PostgreSQL. It is significantly easier to install and use."
  512. },
  513. {
  514. "objectID": "history.html#book-data-1.0",
  515. "href": "history.html#book-data-1.0",
  516. "title": "History",
  517. "section": "",
  518. "text": "The original release that used PostgreSQL. There were a couple of versions of this for the RecSys and UMUAI papers; the tagged 1.0 release corresponds to the data used for the UMUAI paper."
  519. },
  520. {
  521. "objectID": "using/sources.html",
  522. "href": "using/sources.html",
  523. "title": "Source Data",
  524. "section": "",
  525. "text": "These import tools will integrate several data sets. Some of them are auto-downloaded, but others you will need to download yourself and save in the data directory. The data sources are:\n\nLibrary of Congress MDSConnect Open MARC Records (auto-downloaded).\nLoC MDSConnect Name Authorities (auto-downloaded).\nVirtual Internet Authority File MARC 21 XML data (auto-downloaded, but usually needs configuration to access current data file; see the documentation for details).\nOpenLibrary Dump (auto-downloaded).\nAmazon Ratings (2014) ‘ratings only’ data for Books (not auto-downloaded — save CSV file in data/az2014). If you use this data, cite the paper on that site.\nAmazon Ratings (2018) ‘ratings only’ data for Books (not auto-downloaded — save CSV file in data/az2014). If you use this data, cite the paper on that site.\nBookCrossing (auto-downloaded). If you use this data, cite the paper on that site.\nGoodReads data from UCSD Book Graph — the GoodReads books, works, authors, series, and interaction files (not auto-downloaded - save GZip’d JSON files in data/goodreads). If you use this data, cite the paper on that site. More information on options are in the docs.\n\nIf all files are properly downloaded, dvc status -R data will show that all files are up to date (it may also display warnings about locked files).\nSee Data Model for details on how each data source appears in the final data.\n\n\nThe pipeline is reconfigurable to use subsets of this data. To change the pipeline options:\n\nEdit config.yaml to specify the options you want, such as using full GoodReads interaction files.\nRe-render the pipeline with cargo run --release pipeline render\nCommit the updated pipeline to git (optional, but recommended prior to running)\n\nA dvc repro will now use the reconfigured pipeline."
  526. },
  527. {
  528. "objectID": "using/sources.html#configuration",
  529. "href": "using/sources.html#configuration",
  530. "title": "Source Data",
  531. "section": "",
  532. "text": "The pipeline is reconfigurable to use subsets of this data. To change the pipeline options:\n\nEdit config.yaml to specify the options you want, such as using full GoodReads interaction files.\nRe-render the pipeline with cargo run --release pipeline render\nCommit the updated pipeline to git (optional, but recommended prior to running)\n\nA dvc repro will now use the reconfigured pipeline."
  533. },
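Concretely, a reconfiguration cycle looks like the following (the git command shown is just one reasonable way to commit the re-rendered pipeline; exactly which files the render step touches is not specified here):

# 1. edit config.yaml to select options (e.g. full GoodReads interactions)
cargo run --release pipeline render
git commit -am 'render pipeline with new config'   # optional, but recommended
dvc repro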
  534. {
  535. "objectID": "using/storage.html",
  536. "href": "using/storage.html",
  537. "title": "Data Storage",
  538. "section": "",
  539. "text": "Data Storage\nOnce you have set up the software environment, the one remaining piece is to set up your data storage if you want to share the book data with collaborators or between machines. Since this project uses DVC, you will need to configure a DVC remote to store your data. This will require around 200GB of space for all of the relevant data files, in addition to the files in your local repository.\n\n\n\n\n\n\nNote\n\n\n\nIt is possible to work without a remote if you only need one copy of the data, but as soon as you want to move the data between multiple machines or use DVC’s import facilities to load it into an experiment project, you will need a remote.\n\n\nDue to data redistribution restrictions we can’t share access to the remote we use within our research group.\nWhat you need to do:\n\nAdd your remote (with dvc remote add or by editing .dvc/config). You can use any remote type supported by DVC.\nConfigure your remote as the default (with dvc remote default).\n\n\n\n\n\n\n\nTip\n\n\n\nIf you don’t want to pay for cloud storage for hte data, there are several good options for local hosting if you have a server with sufficient storage space:\n\nGarage and Minio provide S3-compatible storage APIs. Both store the data in an internal format (allowing checksums and deduplication), not in raw files on your file system, so you can only access the data through the S3 api.\nCaddy with the webdav plugin is the easiest way I have found to run a webdav server. I’ve started moving towards webdav instead of S3 for in-house remotes so that the data can be accessed directly on the server filesystem. Apache HTTPD also has good webdav support, but it is somewhat more cumbersome to configure.\n\n\n\n\n\n\n\n\n\nNote\n\n\n\nIf you are a member of our research group, or a direct collaborator, using these tools, contact Michael for access to our remote."
  540. },
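For example, a hypothetical S3-backed remote (the bucket name is a placeholder, not our actual remote):

dvc remote add book-data s3://example-bucket/book-data
dvc remote default book-data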
  541. {
  542. "objectID": "reports/LinkageStats.html",
  543. "href": "reports/LinkageStats.html",
  544. "title": "Book Data Linkage Statistics",
  545. "section": "",
  546. "text": "This notebook presents statistics of the book data integration."
  547. },
  548. {
  549. "objectID": "reports/LinkageStats.html#setup",
  550. "href": "reports/LinkageStats.html#setup",
  551. "title": "Book Data Linkage Statistics",
  552. "section": "Setup",
  553. "text": "Setup\n\nimport pandas as pd\nimport matplotlib as mpl\nimport matplotlib.pyplot as plt\nimport numpy as np"
  554. },
  555. {
  556. "objectID": "reports/LinkageStats.html#load-link-stats",
  557. "href": "reports/LinkageStats.html#load-link-stats",
  558. "title": "Book Data Linkage Statistics",
  559. "section": "Load Link Stats",
  560. "text": "Load Link Stats\nWe compute dataset linking statistics as gender-stats.csv as part of the integration. Let’s load those:\n\nlink_stats = pd.read_csv('book-links/gender-stats.csv')\nlink_stats.head()\n\n\n\n\n\n\n\n\ndataset\ngender\nn_books\nn_actions\n\n\n\n\n0\nLOC-MDS\nno-book-author\n600216\nNaN\n\n\n1\nLOC-MDS\nunknown\n1084460\nNaN\n\n\n2\nLOC-MDS\nambiguous\n73989\nNaN\n\n\n3\nLOC-MDS\nmale\n2424008\nNaN\n\n\n4\nLOC-MDS\nfemale\n743105\nNaN\n\n\n\n\n\n\n\nNow let’s define variables for our variou codes. We are first going to define our gender codes. We’ll start with the resolved codes:\n\nlink_codes = ['female', 'male', 'ambiguous', 'unknown']\n\nWe want the unlink codes in order, so the last is the first link failure:\n\nunlink_codes = ['no-author-rec', 'no-book-author', 'no-book']\n\n\nall_codes = link_codes + unlink_codes"
  561. },
  562. {
  563. "objectID": "reports/LinkageStats.html#processing-statistics",
  564. "href": "reports/LinkageStats.html#processing-statistics",
  565. "title": "Book Data Linkage Statistics",
  566. "section": "Processing Statistics",
  567. "text": "Processing Statistics\nNow we’ll pivot each of our count columns into a table for easier reference.\n\nbook_counts = link_stats.pivot('dataset', 'gender', 'n_books')\nbook_counts = book_counts.reindex(columns=all_codes)\nbook_counts.assign(total=book_counts.sum(axis=1))\n\n/var/folders/rp/hd85d1b94pd2cfs8q8h9fjx52t1n0n/T/ipykernel_15237/233082166.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.pivot will be keyword-only.\n book_counts = link_stats.pivot('dataset', 'gender', 'n_books')\n\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nno-author-rec\nno-book-author\nno-book\ntotal\n\n\ndataset\n\n\n\n\n\n\n\n\n\n\n\n\nAZ14\n248863.0\n550877.0\n24064.0\n239915.0\n155511.0\n167948.0\n870268.0\n2257446.0\n\n\nAZ18\n318004.0\n670899.0\n27977.0\n300300.0\n239917.0\n152438.0\n1144899.0\n2854434.0\n\n\nBX-E\n40256.0\n58484.0\n5596.0\n15281.0\n5692.0\n5428.0\n17481.0\n148218.0\n\n\nBX-I\n71441.0\n102756.0\n9528.0\n31440.0\n11562.0\n10861.0\n35009.0\n272597.0\n\n\nGR-E\n225840.0\n334136.0\n18516.0\n106501.0\n60515.0\n738282.0\nNaN\n1483790.0\n\n\nGR-I\n228142.0\n338411.0\n18709.0\n108333.0\n61601.0\n750118.0\nNaN\n1505314.0\n\n\nLOC-MDS\n743105.0\n2424008.0\n73989.0\n1084460.0\n306291.0\n600216.0\nNaN\n5232069.0\n\n\n\n\n\n\n\n\nact_counts = link_stats.pivot('dataset', 'gender', 'n_actions')\nact_counts = act_counts.reindex(columns=all_codes)\nact_counts.drop(index='LOC-MDS', inplace=True)\nact_counts\n\n/var/folders/rp/hd85d1b94pd2cfs8q8h9fjx52t1n0n/T/ipykernel_15237/71450322.py:1: FutureWarning: In a future version of pandas all arguments of DataFrame.pivot will be keyword-only.\n act_counts = link_stats.pivot('dataset', 'gender', 'n_actions')\n\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nno-author-rec\nno-book-author\nno-book\n\n\ndataset\n\n\n\n\n\n\n\n\n\n\n\nAZ14\n4977284.0\n7105363.0\n849025.0\n2157265.0\n1100127.0\n2359170.0\n3879190.0\n\n\nAZ18\n12377052.0\n15603235.0\n1844630.0\n4692726.0\n3312340.0\n2820794.0\n10008921.0\n\n\nBX-E\n142252.0\n183945.0\n41768.0\n24554.0\n7130.0\n7234.0\n19920.0\n\n\nBX-I\n401483.0\n468156.0\n104008.0\n69361.0\n18597.0\n18882.0\n47275.0\n\n\nGR-E\n36335167.0\n33249747.0\n13230835.0\n3570086.0\n1039410.0\n11168052.0\nNaN\n\n\nGR-I\n82889862.0\n69977512.0\n22091068.0\n10242726.0\n3545964.0\n29784689.0\nNaN\n\n\n\n\n\n\n\nWe’re going to want to compute versions of this table as fractions, e.g. the fraction of books that are written by women. 
We will use the following helper function:\n\ndef fractionalize(data, columns, unlinked=None):\n fracs = data[columns]\n fracs.columns = fracs.columns.astype('str')\n if unlinked:\n fracs = fracs.assign(unlinked=data[unlinked].sum(axis=1))\n totals = fracs.sum(axis=1)\n return fracs.divide(totals, axis=0)\n\nAnd a helper function for plotting bar charts:\n\ndef plot_bars(fracs, ax=None, cmap=mpl.cm.Dark2):\n if ax is None:\n ax = plt.gca()\n size = 0.5\n ind = np.arange(len(fracs))\n start = pd.Series(0, index=fracs.index)\n for i, col in enumerate(fracs.columns):\n vals = fracs.iloc[:, i]\n rects = ax.barh(ind, vals, size, left=start, label=col, color=cmap(i))\n for j, rec in enumerate(rects):\n if vals.iloc[j] < 0.1 or np.isnan(vals.iloc[j]): continue\n y = rec.get_y() + rec.get_height() / 2\n x = start.iloc[j] + vals.iloc[j] / 2\n ax.annotate('{:.1f}%'.format(vals.iloc[j] * 100),\n xy=(x,y), ha='center', va='center', color='white',\n fontweight='bold')\n start += vals.fillna(0)\n ax.set_xlabel('Fraction of Books')\n ax.set_ylabel('Data Set')\n ax.set_yticks(ind)\n ax.set_yticklabels(fracs.index)\n ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))"
  568. },
  569. {
  570. "objectID": "reports/LinkageStats.html#resolution-of-books",
  571. "href": "reports/LinkageStats.html#resolution-of-books",
  572. "title": "Book Data Linkage Statistics",
  573. "section": "Resolution of Books",
  574. "text": "Resolution of Books\nWhat fraction of unique books are resolved from each source?\n\nfractionalize(book_counts, link_codes + unlink_codes)\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nno-author-rec\nno-book-author\nno-book\n\n\ndataset\n\n\n\n\n\n\n\n\n\n\n\nAZ14\n0.110241\n0.244027\n0.010660\n0.106277\n0.068888\n0.074397\n0.385510\n\n\nAZ18\n0.111407\n0.235037\n0.009801\n0.105205\n0.084051\n0.053404\n0.401095\n\n\nBX-E\n0.271600\n0.394581\n0.037755\n0.103098\n0.038403\n0.036622\n0.117941\n\n\nBX-I\n0.262076\n0.376952\n0.034953\n0.115335\n0.042414\n0.039843\n0.128428\n\n\nGR-E\n0.152205\n0.225191\n0.012479\n0.071776\n0.040784\n0.497565\nNaN\n\n\nGR-I\n0.151558\n0.224811\n0.012429\n0.071967\n0.040922\n0.498313\nNaN\n\n\nLOC-MDS\n0.142029\n0.463298\n0.014141\n0.207272\n0.058541\n0.114719\nNaN\n\n\n\n\n\n\n\n\nplot_bars(fractionalize(book_counts, link_codes + unlink_codes))\n\n\n\n\n\nfractionalize(book_counts, link_codes, unlink_codes)\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nunlinked\n\n\ndataset\n\n\n\n\n\n\n\n\n\nAZ14\n0.110241\n0.244027\n0.010660\n0.106277\n0.528795\n\n\nAZ18\n0.111407\n0.235037\n0.009801\n0.105205\n0.538549\n\n\nBX-E\n0.271600\n0.394581\n0.037755\n0.103098\n0.192966\n\n\nBX-I\n0.262076\n0.376952\n0.034953\n0.115335\n0.210685\n\n\nGR-E\n0.152205\n0.225191\n0.012479\n0.071776\n0.538349\n\n\nGR-I\n0.151558\n0.224811\n0.012429\n0.071967\n0.539236\n\n\nLOC-MDS\n0.142029\n0.463298\n0.014141\n0.207272\n0.173260\n\n\n\n\n\n\n\n\nplot_bars(fractionalize(book_counts, link_codes, unlink_codes))\n\n\n\n\n\nplot_bars(fractionalize(book_counts, ['female', 'male']))"
  575. },
  576. {
  577. "objectID": "reports/LinkageStats.html#resolution-of-ratings",
  578. "href": "reports/LinkageStats.html#resolution-of-ratings",
  579. "title": "Book Data Linkage Statistics",
  580. "section": "Resolution of Ratings",
  581. "text": "Resolution of Ratings\nWhat fraction of rating actions have each resolution result?\n\nfractionalize(act_counts, link_codes + unlink_codes)\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nno-author-rec\nno-book-author\nno-book\n\n\ndataset\n\n\n\n\n\n\n\n\n\n\n\nAZ14\n0.221928\n0.316816\n0.037857\n0.096189\n0.049053\n0.105191\n0.172966\n\n\nAZ18\n0.244318\n0.308001\n0.036412\n0.092632\n0.065384\n0.055681\n0.197572\n\n\nBX-E\n0.333297\n0.430983\n0.097862\n0.057530\n0.016706\n0.016949\n0.046673\n\n\nBX-I\n0.356000\n0.415120\n0.092225\n0.061503\n0.016490\n0.016743\n0.041919\n\n\nGR-E\n0.368536\n0.337241\n0.134196\n0.036210\n0.010542\n0.113274\nNaN\n\n\nGR-I\n0.379303\n0.320217\n0.101089\n0.046871\n0.016226\n0.136295\nNaN\n\n\n\n\n\n\n\n\nplot_bars(fractionalize(act_counts, link_codes + unlink_codes))\n\n\n\n\n\nfractionalize(act_counts, link_codes, unlink_codes)\n\n\n\n\n\n\n\ngender\nfemale\nmale\nambiguous\nunknown\nunlinked\n\n\ndataset\n\n\n\n\n\n\n\n\n\nAZ14\n0.221928\n0.316816\n0.037857\n0.096189\n0.327210\n\n\nAZ18\n0.244318\n0.308001\n0.036412\n0.092632\n0.318637\n\n\nBX-E\n0.333297\n0.430983\n0.097862\n0.057530\n0.080327\n\n\nBX-I\n0.356000\n0.415120\n0.092225\n0.061503\n0.075152\n\n\nGR-E\n0.368536\n0.337241\n0.134196\n0.036210\n0.123816\n\n\nGR-I\n0.379303\n0.320217\n0.101089\n0.046871\n0.152521\n\n\n\n\n\n\n\n\nplot_bars(fractionalize(act_counts, link_codes, unlink_codes))\n\n\n\n\n\nplot_bars(fractionalize(act_counts, ['female', 'male']))"
  582. },
  583. {
  584. "objectID": "reports/LinkageStats.html#metrics",
  585. "href": "reports/LinkageStats.html#metrics",
  586. "title": "Book Data Linkage Statistics",
  587. "section": "Metrics",
  588. "text": "Metrics\nFinally, we’re going to write coverage metrics.\n\nbook_tots = book_counts.sum(axis=1)\nbook_link = book_counts['male'] + book_counts['female'] + book_counts['ambiguous']\nbook_cover = book_link / book_tots\nbook_cover\n\ndataset\nAZ14 0.364927\nAZ18 0.356246\nBX-E 0.703936\nBX-I 0.673980\nGR-E 0.389875\nGR-I 0.388797\nLOC-MDS 0.619469\ndtype: float64\n\n\n\nbook_cover.to_json('book-coverage.json')"
  589. },
  590. {
  591. "objectID": "reports/audit-cluster-stats.html",
  592. "href": "reports/audit-cluster-stats.html",
  593. "title": "ISBN Cluster Changes",
  594. "section": "",
  595. "text": "This notebook audits for significant changes in the clustering results in the book data, to allow us to detect the significance of shifts from version to version. It depends on the aligned cluster identities in isbn-version-clusters.parquet.\nData versions are indexed by month; versions corresponding to tagged versions also have the version in their name.\nWe are particularly intersted in the shift in number of clusters, and shifts in which cluster an ISBN is associated with (while cluster IDs are not stable across versions, this notebook works on an aligned version of the cluster-ISBN associations).\nimport pandas as pd\nimport matplotlib.pyplot as plt"
  596. },
  597. {
  598. "objectID": "reports/audit-cluster-stats.html#load-data",
  599. "href": "reports/audit-cluster-stats.html#load-data",
  600. "title": "ISBN Cluster Changes",
  601. "section": "Load Data",
  602. "text": "Load Data\nDefine the versions we care about:\n\nversions = ['pgsql', '2022-03-2.0', '2022-07', '2022-10', '2022-11-2.1', 'current']\n\nLoad the aligned ISBNs:\n\nisbn_clusters = pd.read_parquet('isbn-version-clusters.parquet')\nisbn_clusters.info()"
  603. },
  604. {
  605. "objectID": "reports/audit-cluster-stats.html#cluster-counts",
  606. "href": "reports/audit-cluster-stats.html#cluster-counts",
  607. "title": "ISBN Cluster Changes",
  608. "section": "Cluster Counts",
  609. "text": "Cluster Counts\nLet’s look at the # of ISBNs and clusters in each dataset:\n\nmetrics = isbn_clusters[versions].agg(['count', 'nunique']).T.rename(columns={\n 'count': 'n_isbns',\n 'nunique': 'n_clusters',\n})\nmetrics"
  610. },
  611. {
  612. "objectID": "reports/audit-cluster-stats.html#cluster-size-distributions",
  613. "href": "reports/audit-cluster-stats.html#cluster-size-distributions",
  614. "title": "ISBN Cluster Changes",
  615. "section": "Cluster Size Distributions",
  616. "text": "Cluster Size Distributions\nNow we’re going to look at how the sizes of clusters, and the distribution of cluster sizes and changes.\n\nsizes = dict((v, isbn_clusters[v].value_counts()) for v in versions)\nsizes = pd.concat(sizes, names=['version', 'cluster'])\nsizes.name = 'size'\nsizes\n\nCompute the histogram:\n\nsize_hist = sizes.groupby('version').value_counts()\nsize_hist.name = 'count'\nsize_hist\n\nAnd plot the cumulative distributions:\n\nfor v in versions:\n vss = size_hist.loc[v].sort_index()\n vsc = vss.cumsum() / vss.sum()\n plt.plot(vsc.index, vsc.values, label=v)\n\nplt.title('Distribution of Cluster Sizes')\nplt.ylabel('Cum. Frac. of Clusters')\nplt.xlabel('Cluster Size')\nplt.xscale('symlog')\nplt.legend()\nplt.show()\n\nSave more metrics:\n\nmetrics['max_size'] = pd.Series({\n v: sizes[v].max()\n for v in versions\n})\nmetrics"
  617. },
  618. {
  619. "objectID": "reports/audit-cluster-stats.html#different-clusters",
  620. "href": "reports/audit-cluster-stats.html#different-clusters",
  621. "title": "ISBN Cluster Changes",
  622. "section": "Different Clusters",
  623. "text": "Different Clusters\n\nISBN Changes\nHow many ISBNs changed cluster across each version?\n\nstatuses = ['same', 'added', 'changed', 'dropped']\nchanged = isbn_clusters[['isbn_id']].copy(deep=False)\nfor (v1, v2) in zip(versions, versions[1:]):\n v1c = isbn_clusters[v1]\n v2c = isbn_clusters[v2]\n cc = pd.Series('same', index=changed.index)\n cc = cc.astype('category').cat.set_categories(statuses, ordered=True)\n cc[v1c.isnull() & v2c.notnull()] = 'added'\n cc[v1c.notnull() & v2c.isnull()] = 'dropped'\n cc[v1c.notnull() & v2c.notnull() & (v1c != v2c)] = 'changed'\n changed[v2] = cc\n del cc\nchanged.set_index('isbn_id', inplace=True)\nchanged.head()\n\nCount number in each trajectory:\n\ntrajectories = changed.value_counts()\ntrajectories = trajectories.to_frame('count')\ntrajectories['fraction'] = trajectories['count'] / len(changed)\ntrajectories['cum_frac'] = trajectories['fraction'].cumsum()\n\n\ntrajectories\n\n\nmetrics['new_isbns'] = (changed[versions[1:]] == 'added').sum().reindex(metrics.index)\nmetrics['dropped_isbns'] = (changed[versions[1:]] == 'dropped').sum().reindex(metrics.index)\nmetrics['changed_isbns'] = (changed[versions[1:]] == 'changed').sum().reindex(metrics.index)\nmetrics\n\nThe biggest change is that the July 2022 update introduced a large number (8.2M) of new ISBNs. This update incorporated more current book data, and changed the ISBN parsing logic, so it is not surprising.\nLet’s save these book changes to a file for future re-analysis:\n\nchanged.to_parquet('isbn-cluster-changes.parquet', compression='zstd')"
  624. },
  625. {
  626. "objectID": "reports/audit-cluster-stats.html#final-saved-metrics",
  627. "href": "reports/audit-cluster-stats.html#final-saved-metrics",
  628. "title": "ISBN Cluster Changes",
  629. "section": "Final Saved Metrics",
  630. "text": "Final Saved Metrics\nNow we’re going to save this metric file to a CSV.\n\nmetrics.index.name = 'version'\nmetrics\n\n\nmetrics.to_csv('audit-metrics.csv')"
  631. },
  632. {
  633. "objectID": "implementation/dataset.html",
  634. "href": "implementation/dataset.html",
  635. "title": "Design for Datasets",
  636. "section": "",
  637. "text": "The general import philosophy is that we scan raw data from underlying data sets into a tabular form, and then integrate it with further code; import and processing stages are written in Rust, using the Polars library for data frames. We use Parquet for storing all outputs, both intermediate stages and final products; when an output is particularly small, and a CSV version would be convenient, we sometimes also produce compressed CSV.\n\n\nIn general, to add new data, you need to do a few things:\n\nAdd the source files under data, and commit them to DVC.\nImplement code to extract the source files into tabular Parquet that keeps identifiers, etc. from the original source, but is easier to process for subsequent stages. This typically includes a new Rust command to process the data, and a DVC stage to run it.\nIf the data source provides additional ISBNs, add them to src/cli/collect_isbns.rs so that they are included in ISBN indexing.\nImplement code to process the extracted source files into cluster-aggregated files, if needed (typically used for rating data).\nUpdate the analytics and statistics to include the new data.\n\nAll of the CLI tools live in bookdata::cli, with support code elsewhere in the source tree."
  638. },
  639. {
  640. "objectID": "implementation/dataset.html#adding-a-data-set",
  641. "href": "implementation/dataset.html#adding-a-data-set",
  642. "title": "Design for Datasets",
  643. "section": "",
  644. "text": "In general, to add new data, you need to do a few things:\n\nAdd the source files under data, and commit them to DVC.\nImplement code to extract the source files into tabular Parquet that keeps identifiers, etc. from the original source, but is easier to process for subsequent stages. This typically includes a new Rust command to process the data, and a DVC stage to run it.\nIf the data source provides additional ISBNs, add them to src/cli/collect_isbns.rs so that they are included in ISBN indexing.\nImplement code to process the extracted source files into cluster-aggregated files, if needed (typically used for rating data).\nUpdate the analytics and statistics to include the new data.\n\nAll of the CLI tools live in bookdata::cli, with support code elsewhere in the source tree."
  645. },
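As a sketch of what a scan stage can look like in the Python Polars API (the file paths, column names, and NDJSON input format here are assumptions for illustration, not any actual source’s schema):

import polars as pl

def scan_ratings(src: str, dest: str) -> None:
    # Scan a raw newline-delimited JSON dump into tabular Parquet,
    # keeping the source's own identifiers for later linking.
    df = (
        pl.scan_ndjson(src)
        .select(
            pl.col('book_id'),                  # original source IDs
            pl.col('user_id'),
            pl.col('rating').cast(pl.Float32),
        )
        .collect()
    )
    df.write_parquet(dest, compression='zstd')

# e.g.: scan_ratings('data/example/ratings.json', 'example/ratings.parquet')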
  646. {
  647. "objectID": "implementation/layout.html",
  648. "href": "implementation/layout.html",
  649. "title": "Code Layout",
  650. "section": "",
  651. "text": "The import code consists primarily of Rust, wired together with DVC, with data in several directories to facilitate ease of discovery. We use Python and R in Quarto documents for analytics and reporting.\n\n\nThe Rust code all lives under src, with the various command-line programs in src/cli. The Rust tools are implemented as a monolithic executable with subcommands for various operations, to save disk space and compile time. To see the help:\ncargo run help\nThe programs are run through cargo run in --release mode; the bd.cmd jsonnet function automates this, so we only need to specify the subcommand and its options in our pipeline definitions.\nFor writing new commands, there is a lot of utility code under src. Consult the Rust API documentation for further details.\nThe Rust code makes extensive use of the polars, arrow2, and parquet2 crates for data analysis and IO. arrow2_convert is used to automate converstion for Parquet serialization."
  652. },
  653. {
  654. "objectID": "implementation/layout.html#rust",
  655. "href": "implementation/layout.html#rust",
  656. "title": "Code Layout",
  657. "section": "",
  658. "text": "The Rust code all lives under src, with the various command-line programs in src/cli. The Rust tools are implemented as a monolithic executable with subcommands for various operations, to save disk space and compile time. To see the help:\ncargo run help\nThe programs are run through cargo run in --release mode; the bd.cmd jsonnet function automates this, so we only need to specify the subcommand and its options in our pipeline definitions.\nFor writing new commands, there is a lot of utility code under src. Consult the Rust API documentation for further details.\nThe Rust code makes extensive use of the polars, arrow2, and parquet2 crates for data analysis and IO. arrow2_convert is used to automate converstion for Parquet serialization."
  659. },
  660. {
  661. "objectID": "index.html",
  662. "href": "index.html",
  663. "title": "Overview",
  664. "section": "",
  665. "text": "The PIReT Book Data Tools are a set of tools for ingesting, integrating, and indexing a variety of sources of book data, created by the People and Information Research Team at Boise State University. The result of running these tools is a set of Parquet files with raw data in a more usable form, various useful extracted features, and integrated identifiers across the various data sources for cross-linking. These tools are updated from the version used to support our original paper; we have dropped PostgreSQL in favor of a pipeline using DVC to script extraction and integration tools implemented in Rust that is more efficient (integration times have dropped from 8 hours to less than 3) and requires significantly less disk space.1\nIf you use these scripts in any published research, cite our paper (PDF):\n\nMichael D. Ekstrand and Daniel Kluver. 2021. Exploring Author Gender in Book Rating and Recommendation. User Modeling and User-Adapted Interaction (February 2021) DOI:10.1007/s11257-020-09284-2.\n\nWe also ask that you contact Michael Ekstrand to let us know about your use of the data, so we can include your paper in our list of relying publications.\n\n\n\n\n\n\nWarning\n\n\n\nThe “Limitations” section of the paper contains important information about the limitations of the data these scripts compile. Do not use the gender information in this data or tools without understanding those limitations. In particular, VIAF’s gender information is incomplete and, in a number of cases, incorrect.\n\n\nIn addition, several of the data sets integrated by this project come from other sources with their own publications. If you use any of the rating or interaction data, cite the appropriate original source paper. For each data set below, we have provided a link to the page that describes the data and its appropriate citation.\nSee the Setup page to get started and for system requirements.\n\n\nI recorded a video walking through the integration as an example for my Data Science class. This discusses the PostgreSQL version of the integration, but the concepts have remained the same in terms of linking logic.\n\n\n\n\n\n\n\nThese tools are under the MIT license:\n\nCopyright 2019-2021 Boise State University\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\nTHE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\n\n\n\nThis material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278. 
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This page has not been approved by Boise State University and does not reflect official university positions."
  666. },
  667. {
  668. "objectID": "index.html#video",
  669. "href": "index.html#video",
  670. "title": "Overview",
  671. "section": "",
  672. "text": "I recorded a video walking through the integration as an example for my Data Science class. This discusses the PostgreSQL version of the integration, but the concepts have remained the same in terms of linking logic."
  673. },
  674. {
  675. "objectID": "index.html#license",
  676. "href": "index.html#license",
  677. "title": "Overview",
  678. "section": "",
  679. "text": "These tools are under the MIT license:\n\nCopyright 2019-2021 Boise State University\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\nTHE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
  680. },
  681. {
  682. "objectID": "index.html#acknowledgments",
  683. "href": "index.html#acknowledgments",
  684. "title": "Overview",
  685. "section": "",
  686. "text": "This material is based upon work supported by the National Science Foundation under Grant No. IIS 17-51278. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. This page has not been approved by Boise State University and does not reflect official university positions."
  687. },
  688. {
  689. "objectID": "index.html#footnotes",
  690. "href": "index.html#footnotes",
  691. "title": "Overview",
  692. "section": "Footnotes",
  693. "text": "Footnotes\n\n\nThe original tools are available on the before-fusion tag in the Git repository.↩︎"
  694. }
  695. ]