Many machine learning projects begin with querying data from a large database. For time and space efficiency, it is now common practice to host such databases on a cloud-based data warehouse platform like Snowflake rather than on a local machine. This project explores how to leverage DagsHub's tracking features to version the data querying process against a Snowflake database and thereby achieve reproducibility.
To run the code in this project, you need to install the following packages:
In this project I use a public dataset (with schemas) obtained from Kaggle, which collects various features and the prices of Lego sets on the market. The data files are then uploaded to an AWS S3 bucket.
To host a database on Snowflake using datasets stored in AWS S3 buckets, you first need to configure your Snowflake account for AWS S3 integration and then load the data into Snowflake. Detailed documentation for configuration and data loading is available here for you to follow along.
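Once the storage integration from the Snowflake documentation is in place, the connection and staging step could look roughly like the sketch below. It uses the standard `snowflake-connector-python` package; the account, bucket, and integration names are all placeholders, not values from this project.

```python
# Minimal sketch: connect to Snowflake and point an external stage at the
# S3 bucket holding the CSV files. All angle-bracketed names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="<user>",
    password="<password>",
    account="<account_identifier>",
)
cur = conn.cursor()

cur.execute("CREATE DATABASE IF NOT EXISTS LEGO")
cur.execute("USE DATABASE LEGO")
cur.execute("USE SCHEMA PUBLIC")

# External stage backed by the S3 bucket, via the storage integration
# created by following the Snowflake docs
cur.execute("""
    CREATE STAGE IF NOT EXISTS lego_stage
      URL = 's3://<bucket>/lego/'
      STORAGE_INTEGRATION = <s3_integration>
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""")
```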
After loading the data into Snowflake, you can create a database and start managing your data using SQL statements. For this project, I create a table for each CSV file and build the schemas based on what is provided in the Kaggle dataset.
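Continuing with the cursor from the sketch above, creating one table per CSV and bulk-loading it might look like this. The `SETS` table and its columns are my guess at a typical Lego-set schema, not the exact schema from the Kaggle dataset.

```python
# Sketch of building one table per CSV, using sets.csv as an example;
# column names are illustrative, not the project's actual schema.
cur.execute("""
    CREATE TABLE IF NOT EXISTS SETS (
        SET_NUM   VARCHAR,
        NAME      VARCHAR,
        YEAR      INTEGER,
        THEME_ID  INTEGER,
        NUM_PARTS INTEGER
    )
""")

# Bulk-load the staged CSV into the table
cur.execute("COPY INTO SETS FROM @lego_stage/sets.csv")
```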
The goal of this machine learning project is to predict the price of a published Lego set from the numerous features provided in the database.
The first step is to query the relevant information from the Snowflake database. This involves writing SQL queries and saving the query output as CSV files. With DagsHub's tracking features, we can track the SQL queries as text with Git and the output tables as CSV files with DVC. To achieve reproducibility and optimize model performance, it is essential to track every data querying attempt thoroughly. In this project, tracking SQL queries with Git and result tables with DVC makes it easier to run different data preprocessing attempts in parallel and locate the optimal one for modeling in later steps.
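One tracked querying attempt could look like the following sketch, reusing the connection from the earlier sketch; the file paths are illustrative, not the project's actual layout.

```python
# The SQL lives in a Git-tracked file; the result table goes to a
# DVC-tracked CSV.
with open("queries/features.sql") as f:
    query = f.read()

cur.execute(query)
df = cur.fetch_pandas_all()   # requires snowflake-connector-python[pandas]
df.to_csv("data/features.csv", index=False)

# Then, in the shell:
#   git add queries/features.sql
#   dvc add data/features.csv && git add data/features.csv.dvc
#   git commit -m "Query attempt: baseline feature set"
```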
In the model training process, I collect the category, theme, publishing year, number of parts, and average set color (in RGB format) of each Lego set, and model these features against the price. Using a simple Lasso regression model, we achieve an R2 score of 0.487 on an independent test set. We then make multiple querying attempts with different SQL queries and compare the corresponding model performance. Versioning the data querying process makes it possible to revert to the version of the SQL queries and output tables with the best model performance using a single Git command.
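A minimal sketch of the modeling step with scikit-learn, assuming the queried features were saved to `data/features.csv` with the categorical columns already encoded numerically and a `PRICE` column as the target (both assumptions, not details from the project):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the DVC-tracked query output; PRICE is the assumed target column
df = pd.read_csv("data/features.csv")
X = df.drop(columns=["PRICE"])
y = df["PRICE"]

# Hold out an independent test set for the reported R2 score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Lasso(alpha=1.0)
model.fit(X_train, y_train)
print("R2 on held-out test set:", r2_score(y_test, model.predict(X_test)))
```

To roll back to the best-performing attempt, the usual DVC workflow is a `git checkout` of the commit holding the winning query (which restores both the `.sql` file and the `.dvc` pointer file), followed by `dvc checkout` to bring back the matching CSV.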
This project demonstrates the importance of version control in achieving reproducibility in a machine learning pipeline when working with a cloud database hosted on Snowflake.