Are you sure you want to delete this access key?
title | parent | nav_order |
---|---|---|
Status Tracking | Implementation | 1 |
{: .no_toc}
The book tools are built around Data Version Control, a tool for managing data processing pipelines. We use the software in a somewhat unusal way due to our use of PostgreSQL as our primary storage.
Import is organized into stages that are also tracked in the PostgreSQL database, so that we can get current status information from the DB about the data that has been loaded into it.
Because our primary data storage is in a database, and DVC likes to track files, our design is a little unusual for a DVC project. Each stage that produces output in the database (import, index, etc.) is implemented as two stages:
.transcript
file as output; it works just like any normal
DVC stage..transcript
file and produces a .status
file. This stage
is marked as always_run
, so it always re-runs even if the transcript is unchanged, so that it
can make sure the .status
file contains the current state of the database.Stages that take database state as input depend on the corresponding .status
file, not the
.transcript
file, so that their need to update is triggered based on the current database state.
This wires together all of the dependencies, and uses the current state in the database instead of
files that might become out-of-sync with the database to track import status.
There are a couple of holes in this design, but it's the best we can do and it works.
The stage name matches the name of the .dvc
file.
The reason for this somewhat bizarre layoutis that if we just wrote the output files, and the database was reloaded or corrupted, the DVC status-checking logic would not be ableto keep track of it. This double-file design allows us to make subsequent steps depend on the actual results of the import, not our memory of the import in the Git repository.
Import steps are tracked in the stage_status
table in the database. For completed stages, this can
include a key (checksum, UUID, or other identifier) to identify a 'version' of the stage. Stages
can also have dependencies, which are solely used for computing the status of a stage (all actual
dependency relationships are handled by DVC):
stage_deps
tracks stage-to-stage dependencies, to say that one stage used another as input.stage_file
tracks stage-to-file dependencies, to say that a stage used a file as input.The source_file
table tracks input file checksums.
Projects using the book database can also use stage_status
to obtain data version information, to
see if they are up-to-date.
Press p or to see the previous file or, n or to see the next file
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?