<h1>History<a class="headerlink" href="#history" title="Permalink to this heading">#</a></h1>
<p>This page documents the release history of the Book Data Tools. Each numbered,
released version has a corresponding Git tag (e.g. <code class="docutils literal notranslate"><span class="pre">v2.0</span></code>).</p>
<p>If you use the Book Data Tools in published research, we ask that you do the
following:</p>
<ol class="arabic simple">
<li><p>Cite the <a class="reference external" href="https://md.ekstrandom.net/pubs/bag-extended">UMUAI paper</a>,
regardless of which version of the data set you use.</p></li>
<li><p>Cite the papers corresponding to the individual ratings, review, or
consumption data sets you are using.</p></li>
<li><p>Clearly state the version of the data tools you are using in your paper.</p></li>
<li><p><a class="reference internal" href="papers.html"><span class="doc std std-doc">Let us know</span></a> about your use so we can add you to the list.</p></li>
</ol>
<section id="book-data-2-2-in-progress">
<h2>Book Data 2.2 (in progress)<a class="headerlink" href="#book-data-2-2-in-progress" title="Permalink to this heading">#</a></h2>
<ul class="simple">
<li><p>Extract GoodReads author information into <a class="reference internal" href="data/goodreads.html#file-goodreads-gr-author-info.parquet"><code class="xref std std-file docutils literal notranslate"><span class="pre">goodreads/gr-author-info.parquet</span></code></a>.</p></li>
<li><p>Extract 5-cores of interaction files.</p></li>
<li><p>🪲 Fixed GoodReads cluster and work rating timestamps, which were on an incorrect scale.</p></li>
<li><p>Use <a class="reference internal" href="implementation/pipeline.html"><span class="doc std std-doc">lightweight DSL</span></a> to generate DVC pipelines in a configurable manner</p></li>
</ul>
</section>
<section id="book-data-2-1">
<h2>Book Data 2.1<a class="headerlink" href="#book-data-2-1" title="Permalink to this heading">#</a></h2>
<p>Version 2.1 has a few updates but does not change existing data schemas when run
with the full GoodReads interaction files. Its revised and corrected name parsing and
normalization flow improves book/author linking and increases coverage.</p>
<p>The tools now support the GoodReads interaction CSV file, which is available
without registration, and use it by default. See the <a class="reference internal" href="data/goodreads.html"><span class="doc std std-doc">GoodReads data
docs</span></a> for the details. This means that, in their default
configuration, the book data integration uses only data that is publicly
available without special request.</p>
<section id="data-updates">
<h3>Data Updates<a class="headerlink" href="#data-updates" title="Permalink to this heading">#</a></h3>
<ul class="simple">
<li><p>Updated VIAF to May 1, 2022 dump</p></li>
<li><p>Updated OpenLibrary to March 29, 2022 dump</p></li>
<li><p>Added 2018 version of the Amazon ratings</p></li>
<li><p>Added code to extract edition and work subjects</p></li>
<li><p>Updated docs for current extraction layout</p></li>
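<li><p>Revised name parsing and normalization:</p>
<ul class="simple">
<li><p>Replaced the previous name parser with a new one written in <code class="docutils literal notranslate"><span class="pre">peg</span></code>, which is both easier to read and maintain and more efficient.</p></li>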
<li><p>Corrected errors in the name parser that emitted empty-string names for some authors.</p></li>
<li><p>Added <code class="docutils literal notranslate"><span class="pre">clean_name</span></code> function, used across all name formatting, to normalize whitespace and
punctuation in name records from any source.</p></li>
<li><p>Added more tests for name parsing and normalization.</p></li>
</ul>
</li>
<li><p>Fixed a bug in the GoodReads integration that prevented ASINs from being extracted.</p></li>
<li><p>Extract book genres and series from GoodReads.</p></li>
<li><p>Updated various Rust dependencies, and upgraded from StructOpt to <code class="docutils literal notranslate"><span class="pre">clap</span></code>’s derive macros.</p></li>
<li><p>Better progress reporting for data scans.</p></li>
</ul>
</section>
</section>
<section id="book-data-2-0">
<h2>Book Data 2.0<a class="headerlink" href="#book-data-2-0" title="Permalink to this heading">#</a></h2>
<p>This is the updated release of the Book Data Tools, using the same source data
as 1.0 but with DataFusion and Rust-based import logic, instead of PostgreSQL.
It is significantly easier to install and use.</p>
</section>
<section id="book-data-1-0">
<h2>Book Data 1.0<a class="headerlink" href="#book-data-1-0" title="Permalink to this heading">#</a></h2>
<p>The original release that used PostgreSQL. There were a couple of versions of
this for the RecSys and UMUAI papers; the tagged 1.0 release corresponds to the
</p>
</section>