
Lego Set Price Prediction with Snowflake

Introduction

Many machine learning projects begin by querying data from a large database. Rather than keeping such databases on local machines, it is now common practice, for time and space efficiency, to host them on a cloud-based data warehouse platform like Snowflake. This project explores how to use DagsHub's tracking features to version the data querying process against a Snowflake database, making work with that database reproducible.

Prerequisites

To run the code in this project, you need to install the following packages:

  • mlflow
  • scikit-learn
  • requests
  • pillow
  • snowflake-connector-python: see the installation guide in Snowflake's documentation.
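
Assuming you use pip, all of these can be installed in one go with `pip install mlflow scikit-learn requests pillow "snowflake-connector-python[pandas]"`; the `[pandas]` extra is optional, but the querying sketch later in this README uses it to fetch results as dataframes.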

Building a Snowflake Database

This project uses a public dataset obtained from Kaggle, together with its table schemas, which collects various features and the price of Lego sets on the market. The data files are first uploaded to an AWS S3 bucket.

To host a database on Snowflake using datasets stored in AWS S3 buckets, you first need to configure your Snowflake account for AWS S3 integration and then load the data into Snowflake. Snowflake provides detailed documentation for both the configuration and the data loading steps.
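
As a rough sketch of the Snowflake side, the snippet below uses the Python connector to create an external stage pointing at the S3 bucket. All account identifiers, bucket paths, and credentials are placeholders; a production setup would normally use a storage integration rather than inline keys.

```python
import snowflake.connector

# Connect with the Snowflake Python connector (all identifiers are placeholders).
conn = snowflake.connector.connect(
    user="MY_USER",
    password="MY_PASSWORD",
    account="MY_ACCOUNT",
    warehouse="COMPUTE_WH",
)
cur = conn.cursor()

# Create a database to hold the Lego tables.
cur.execute("CREATE DATABASE IF NOT EXISTS LEGO")
cur.execute("USE DATABASE LEGO")

# An external stage pointing at the S3 bucket that holds the CSV files.
cur.execute("""
    CREATE OR REPLACE STAGE lego_stage
      URL = 's3://my-lego-bucket/data/'
      CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
```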

After loading the data into Snowflake, you can create a database and start managing your data with SQL statements. For this project, I create a table for each CSV file and build the schemas based on what the Kaggle dataset provides, as in the sketch below.
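
Continuing with the cursor from the previous snippet, a table for one CSV file might be created and bulk-loaded like this; the column names are illustrative rather than the exact Kaggle schema:

```python
# Create a table matching one of the CSV schemas, then bulk-load it from the stage.
cur.execute("""
    CREATE OR REPLACE TABLE sets (
        set_num   VARCHAR,
        name      VARCHAR,
        year      INTEGER,
        theme_id  INTEGER,
        num_parts INTEGER
    )
""")
cur.execute("COPY INTO sets FROM @lego_stage/sets.csv")
```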

Querying Data from the Snowflake Database

The goal of this machine learning project is to predict the price of a published Lego set from the many features available in the database.

The first step is to query the relevant information from the Snowflake database. This involves writing SQL queries and saving the query output as CSV files. With DagsHub's tracking features, the SQL queries are tracked as text by Git, while the query outputs are tracked as CSV files by DVC. Tracking every data querying attempt thoroughly is essential for reproducibility and for optimizing model performance: versioning the queries and result tables side by side makes it easy to explore different data preprocessing attempts in parallel and pick the best one for modeling in later steps. A minimal sketch of this pattern is shown below.
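
Assuming the query lives in a Git-tracked file such as queries/lego_features.sql and the connector was installed with its pandas extra (both assumptions about this repo's layout), the querying step could look like:

```python
from pathlib import Path

import snowflake.connector

conn = snowflake.connector.connect(
    user="MY_USER", password="MY_PASSWORD", account="MY_ACCOUNT",
    warehouse="COMPUTE_WH", database="LEGO",
)
cur = conn.cursor()

# The SQL text is a plain file tracked by Git (path is illustrative).
query = Path("queries/lego_features.sql").read_text()

# Run the query and save the result table as a CSV for DVC to track.
df = cur.execute(query).fetch_pandas_all()
df.to_csv("data/lego_features.csv", index=False)
```

Afterwards, `dvc add data/lego_features.csv` puts the result table under DVC control; committing the `.sql` file together with the generated `.dvc` file pins each query to the exact table it produced.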

Training a Model on the Queried Dataframes

In the model training process, I collect each Lego set's category, theme, publishing year, number of parts, and average set color in RGB, and model these features against the price. Using a simple Lasso regression model, we achieve an R² score of 0.487 on an independent test set. We then make multiple querying attempts with different SQL queries and compare the corresponding model performance. Because the data querying process is versioned, reverting to the version of the SQL queries and output tables with the best model performance takes a single `git checkout`, followed by `dvc checkout` to restore the matching data files.
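
The exact features and hyperparameters live in this repo's source, but a minimal sketch of the training step, assuming the categorical columns have already been encoded numerically and using illustrative column names, looks like:

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the DVC-tracked query output (column names are illustrative).
df = pd.read_csv("data/lego_features.csv")
X = df[["category", "theme", "year", "num_parts", "avg_r", "avg_g", "avg_b"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Lasso(alpha=0.1).fit(X_train, y_train)
print("R2 on the held-out test set:", r2_score(y_test, model.predict(X_test)))
```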

Conclusion

This project demonstrates the importance of version control for achieving reproducibility in a machine learning pipeline built around a cloud database hosted on Snowflake.
