---
title: Implementation
nav_order: 4
has_children: true
---

Design and Implementation

{: .no_toc}

These data and integration tools are designed to support several goals:

  • Complete end-to-end reproducibility with a single command (dvc repro)
  • Self-documenting import stage dependencies
  • Automatically re-run downstream steps when a data file or integration logic changes
  • Support updates (e.g. new OpenLibrary dumps) by replacing the file and re-running
  • Efficient import and integration

1. TOC
{:toc}
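
To make the "re-run when a data file changes" goal concrete, here is a minimal Python sketch of how a stage runner can decide whether a stage is out of date: it compares content hashes of the stage's dependencies against the hashes recorded at the last successful run. This only illustrates the general idea and is not DVC's implementation; the names `file_hash`, `is_stale`, and `record_run`, and the JSON lock-file layout, are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_stale(deps, lock_file):
    """Return True if any dependency changed since the recorded run."""
    lock = Path(lock_file)
    if not lock.exists():
        return True  # the stage has never been run
    recorded = json.loads(lock.read_text())
    current = {str(d): file_hash(d) for d in deps}
    return current != recorded


def record_run(deps, lock_file):
    """Save the current dependency hashes after a successful run."""
    hashes = {str(d): file_hash(d) for d in deps}
    Path(lock_file).write_text(json.dumps(hashes, indent=2))
```

A runner built this way would call `is_stale` before each stage, run the stage only when it returns `True`, and then call `record_run`. Replacing a dump file changes its hash, so the import stage and everything downstream of it is picked up on the next run.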

Implementation Principles

These goals are realized through a few technology and design decisions:

  • Script all import steps with a tool that can track stage dependencies and check whether a stage is up-to-date (DVC).
  • Stage each data source with a schema file, a raw import, and subsequent SQL files that reformat and extract data into a more usable format.
  • Implement as much data integration as possible in declarative SQL.
  • Make SQL scripts re-runnable, so they will either refresh or delete and recreate their outputs. Deletes that cascade to downstream steps are fine, because the stage runner will re-run those stages anyway.
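
As an illustration of the re-runnable principle, the following sketch runs a drop-and-recreate SQL script twice against an in-memory SQLite database. SQLite, the table names `raw_book` and `book_title`, and the sample rows are assumptions chosen to keep the example self-contained; they are not taken from the project's actual schema or database.

```python
import sqlite3

# Hypothetical extraction step: drop the output, then recreate it from raw data.
EXTRACT_SQL = """
DROP TABLE IF EXISTS book_title;
CREATE TABLE book_title AS
SELECT book_id, TRIM(title) AS title
FROM raw_book
WHERE title IS NOT NULL;
"""


def run_stage(conn):
    """Run the extraction script and return the output row count."""
    conn.executescript(EXTRACT_SQL)
    return conn.execute("SELECT COUNT(*) FROM book_title").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_book (book_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO raw_book VALUES (?, ?)",
    [(1, " Dune "), (2, None), (3, "Emma")],
)

first = run_stage(conn)
second = run_stage(conn)  # re-running replaces the output instead of failing
```

Because the script drops its output before recreating it, the second run produces the same table as the first rather than an error or duplicated rows, which is what lets a stage runner re-execute it whenever an upstream dependency changes.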

Further Information

DVC Dependency Graph

[Figure: DVC dependency graph]
