Lego Set Price Prediction - Use DagsHub with Snowflake

Introduction

Many machine learning projects begin by querying data from a large database. For time and space efficiency, it is now common practice to host databases with huge amounts of information on a cloud-based data warehouse platform like Snowflake instead of on a local machine. This project explores how to integrate DagsHub's tracking and data versioning capabilities with data queried from a Snowflake database when building a machine learning project on DagsHub.

Prerequisites

To run the code in this project, you need to install the following packages:

  • mlflow
  • scikit-learn
  • requests
  • pillow
  • Snowflake Python connector (snowflake-connector-python): the installation guide can be found here.
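
Assuming a standard Python environment, they can all be installed with pip:

```
pip install mlflow scikit-learn requests pillow snowflake-connector-python
```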

Building a Snowflake Database

In this project I use a public dataset, with schemas, obtained from Kaggle, which collects various features and the prices of Lego sets on the market. The data files are then uploaded to an AWS S3 bucket.

To host a database on Snowflake from an AWS S3 bucket, you first need to configure your Snowflake account with AWS S3, and then load the data into Snowflake. Detailed documentation for configuration and data loading is available here for you to follow along.

After loading the data into Snowflake, you can create a database and start managing your data with SQL statements. For this project, I create a table for each CSV file and build the schemas based on those provided in the Kaggle dataset.
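
As a rough sketch of this step with the Snowflake Python connector (the connection parameters, stage name, and table schema below are placeholders, not the project's actual setup), creating a table and loading the staged S3 data might look like:

```python
import snowflake.connector

# Hypothetical connection parameters; use your own account details
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="COMPUTE_WH",
    database="LEGO",
    schema="PUBLIC",
)
cur = conn.cursor()

# One table per CSV file, with a schema mirroring the Kaggle dataset
cur.execute("""
    CREATE TABLE IF NOT EXISTS sets (
        set_num   VARCHAR,
        name      VARCHAR,
        year      INTEGER,
        theme_id  INTEGER,
        num_parts INTEGER
    )
""")

# Load the staged S3 data into the table (the external stage is
# configured beforehand, as described in the documentation above)
cur.execute("""
    COPY INTO sets FROM @lego_stage/sets.csv
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
```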

How to Version Data Queried from a Snowflake Database?

The goal of this machine learning project is to predict the price of a published Lego set from the numerous features provided in the database.

The first step is to query the relevant information from the Snowflake database. This involves writing SQL queries and saving the query output as CSV files. With DagsHub's tracking features, we can track the SQL queries as text with Git and the query outputs as CSV files with DVC or DDA. This way, it becomes easy to revert to a previous version of the SQL queries and output tables, a great benefit for reproducibility.
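
A minimal sketch of this step, assuming the query text lives in the Git-tracked src/sql_queries.py and the output goes to a DVC-tracked CSV (the query, file names, and paths here are hypothetical):

```python
# The query text is kept in src/sql_queries.py and tracked by Git,
# so any change to it is versioned alongside the code.
QUERY = "SELECT set_num, name, year, theme_id, num_parts FROM sets"

# `conn` is the Snowflake connection from the sketch above;
# fetch_pandas_all() requires the connector's pandas extras
df = conn.cursor().execute(QUERY).fetch_pandas_all()
df.to_csv("data/sets.csv", index=False)

# The CSV output is then tracked by DVC rather than Git:
#   dvc add data/sets.csv
#   git add data/sets.csv.dvc
```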

Train a Model on the Queried Dataframes

In the model training process, I collect the category, theme, publishing year, number of parts, and average set color in RGB format of each Lego set, and model these features against the price. Using a simple Lasso regression model, we achieve an R² score of 0.488 on an independent test set. This is merely a baseline model for the task. Since the primary objective of this project is to illustrate how to integrate data access from a Snowflake database with DagsHub's tracking features, we will not pursue a better-performing model.
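
As an illustrative sketch of such a baseline (the feature file, column names, and hyperparameters are assumptions, not the project's exact code):

```python
import mlflow
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical feature table assembled from the queried CSVs;
# categorical columns are assumed to be numerically encoded already
df = pd.read_csv("data/lego_features.csv")
X = df[["category", "theme_id", "year", "num_parts", "avg_r", "avg_g", "avg_b"]]
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run():
    model = Lasso(alpha=1.0)
    model.fit(X_train, y_train)
    r2 = r2_score(y_test, model.predict(X_test))
    mlflow.log_metric("r2_score", r2)  # metrics logged via MLflow, viewable on DagsHub
```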

Conclusion

This project shows a standard workflow for building an ML model on top of a cloud database hosted on Snowflake, using Git, DVC, and MLflow to track the SQL queries, the tabular data queried from the Snowflake database, and the experiment metrics.
