🩺 Predicting Heart Disease: An MLOps Approach

A hands-on project leveraging MLOps best practices to build, deploy, and monitor a heart disease prediction model.

🧐 Problem Description

Heart disease remains one of the leading causes of death worldwide, with key risk factors including high blood pressure, high cholesterol, obesity, smoking, and lack of physical activity. Early detection and prediction of heart disease can play a crucial role in preventive healthcare and patient outcomes.

🔹 Project Motivation & Objectives

This project leverages machine learning techniques to predict the likelihood of heart disease based on self-reported health indicators collected via telephonic surveys. The dataset, sourced from the Centers for Disease Control and Prevention (CDC), is part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual health-related surveys across the United States.

📌 Project Scope

The goal of this project is to develop an end-to-end MLOps pipeline that automates the training, deployment, and monitoring of a machine learning model capable of predicting heart disease risk. This includes:

  • Data Preprocessing: Handling missing values, encoding categorical variables, and feature engineering.
  • Model Training & Evaluation: Applying classification algorithms such as logistic regression, random forests, and gradient boosting.
  • Deployment & MLOps Integration: Serving the trained model via a REST API using FastAPI and Docker, while leveraging CI/CD pipelines and model monitoring to ensure performance and reliability.
  • Scalability & Reproducibility: Utilizing cloud-based storage and MLflow for model tracking and versioning.

📊 Dataset Information

The dataset consists of 40 features derived from nearly 300 original variables, carefully curated to represent key indicators of heart disease. It includes factors such as BMI, smoking status, alcohol consumption, diabetes history, and physical activity levels. Given its real-world nature, the dataset presents challenges such as class imbalance, requiring thoughtful model selection and evaluation strategies.

🚀 Why This Matters

With the increasing availability of health-related data, applying MLOps principles to healthcare prediction models ensures not only the accuracy and reliability of predictions but also the ability to seamlessly deploy, maintain, and monitor models in production. This project demonstrates the power of machine learning in solving real-world healthcare challenges while emphasizing best practices in MLOps.

This repository serves as an educational portfolio project and was developed as part of the MLOps Zoomcamp by DataTalks.Club.


🏆 Modeling

The model's performance is evaluated using the F1 score, which is particularly effective for handling imbalanced datasets. Given that heart disease is relatively rare in the dataset, the F1 score ensures a balanced assessment between precision and recall, helping to optimize the model for real-world use cases.
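
For reference, this is roughly how the F1 score can be computed with scikit-learn; the labels below are made up purely for illustration:

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical ground-truth labels and model predictions (1 = heart disease).
y_true = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0, 1, 0]

# F1 is the harmonic mean of precision and recall, which matters when
# positive cases are rare.
print(f"F1 score: {f1_score(y_true, y_pred):.3f}")
print(classification_report(y_true, y_pred, digits=3))
```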

🔍 Overview

This project integrates multiple tools and frameworks to streamline the machine learning lifecycle. The image below highlights the core technologies used:

[Image: Tools used in this project]

📊 Exploratory Data Analysis (EDA)

Before model development, a comprehensive Exploratory Data Analysis (EDA) is conducted to gain insights into the dataset's structure, distributions, and feature correlations. EDA helps identify missing values, outliers, and potential data transformations to improve model performance.

The notebooks/ directory contains a dedicated EDA notebook where the dataset is thoroughly analyzed. This step lays the foundation for feature selection and engineering.
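
As an illustration (not the notebook's exact code), a first pass over the data might look like the following; the file name and target column are assumptions:

```python
import pandas as pd

# Assumed file and column names -- adjust to the actual dataset.
df = pd.read_csv("data/heart_disease.csv")

print(df.shape)                                                # rows x columns
print(df.isna().sum().sort_values(ascending=False).head(10))   # missing values
print(df["HeartDisease"].value_counts(normalize=True))         # class imbalance
print(df.describe(include="all").T.head(15))                   # summary statistics
```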

🧪 Experiment Tracking & Model Registry

To ensure reproducibility and effective model management, this project utilizes MLflow for experiment tracking and model registry. MLflow facilitates:

  • Logging and comparing multiple model runs.
  • Tracking hyperparameters and performance metrics.
  • Registering and versioning models for deployment.

📌 Key Resources:

DagsHub integrates DVC, MLflow, and Git, providing a unified environment for managing experiments, model artifacts, and version control.
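
A minimal sketch of a tracked run, assuming the MLflow tracking server is the one DagsHub hosts for the repository (the URI, experiment name, and hyperparameters below are placeholders, and synthetic data stands in for the real features):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Placeholder URI -- on DagsHub this usually has the form
# https://dagshub.com/<user>/<repo>.mlflow, with credentials set via environment variables.
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("heart-disease")

# Synthetic, imbalanced stand-in for the real features and target.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=42)

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 10}
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

    mlflow.log_params(params)                                       # hyperparameters
    mlflow.log_metric("f1", f1_score(y_val, model.predict(X_val)))  # performance metric
    mlflow.sklearn.log_model(model, artifact_path="model")          # model artifact
```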

🔄 Workflow Orchestration

This project employs Data Version Control (DVC) to orchestrate the workflow. DVC provides:

  • Version control for datasets, models, and intermediate files.
  • Seamless integration with Git, ensuring that data and code are versioned together.
  • Reproducibility by enabling easy rollback to previous states.

⚙️ Model Deployment

The trained model is deployed as a REST API using FastAPI, making it accessible for real-time predictions. The deployment pipeline includes:

  • Containerization with Docker to ensure portability.
  • Scalability for serving multiple inference requests.
  • Endpoint exposure via app/main.py for seamless integration (a minimal serving sketch is shown below).
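
A minimal sketch in the spirit of app/main.py; the endpoint path, input fields, and model file location are assumptions rather than the repository's exact code:

```python
import pickle

import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Heart Disease Prediction API")

# Load the trained model fetched via `make save_model` (path assumed).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


class Patient(BaseModel):
    # Illustrative subset of features -- the real schema has many more fields.
    BMI: float
    Smoking: int
    AlcoholDrinking: int
    PhysicalActivity: int


@app.post("/predict")
def predict(patient: Patient):
    features = pd.DataFrame([patient.model_dump()])  # pydantic v2
    return {"heart_disease": int(model.predict(features)[0])}
```

Such an app is typically started with uvicorn (for example, uvicorn app.main:app); in this project the make serve_local target is the documented way to run it.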

The containerized image is publicly available and can be pulled from Docker Hub:

📈 Model Monitoring

To track model performance over time, the project integrates Evidently, PostgreSQL, and Grafana for interactive monitoring and analytics.

📌 Monitoring Components:

  1. Evidently – Provides insights into model drift, data drift, and feature importance.
  2. PostgreSQL – Stores model predictions and performance metrics for historical analysis.
  3. Grafana – Visualizes key metrics and trends, helping to detect performance degradation.
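
As a rough sketch of the first component, using Evidently's Report API (available in the 0.2 through 0.4 releases) with assumed file paths, a data-drift report can be produced like this:

```python
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

# Reference data (e.g. the training set) versus the current daily batch (paths assumed).
reference = pd.read_csv("data/reference.csv")
current = pd.read_csv("data/current_batch.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

result = report.as_dict()                    # metrics as a dict, e.g. for PostgreSQL
report.save_html("reports/data_drift.html")  # human-readable report
```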

📊 Simulation & Monitoring Process:

  • Model inference is simulated on multiple data batches (500 samples per batch).
  • A daily batch processing scenario is simulated, where new data is processed each day.
  • Metrics are stored in PostgreSQL and visualized in Grafana dashboards.
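
A simplified sketch of that daily-batch loop, assuming a local PostgreSQL instance with a pre-created metrics table (connection details, table, and column names are illustrative):

```python
import datetime

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, dbname="monitoring", user="postgres", password="example"
)

data = pd.read_csv("data/validation.csv")  # assumed path
batch_size = 500

with conn:
    with conn.cursor() as cur:
        for day, start in enumerate(range(0, len(data), batch_size)):
            batch = data.iloc[start:start + batch_size]
            # In the real pipeline the model scores each batch and Evidently computes
            # drift metrics here; a placeholder value keeps the sketch self-contained.
            drift_share = 0.0
            cur.execute(
                "INSERT INTO batch_metrics (batch_date, drift_share) VALUES (%s, %s)",
                (datetime.date.today() + datetime.timedelta(days=day), drift_share),
            )
```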

Below is an example of the real-time monitoring dashboard:

[Image: Grafana monitoring dashboard]

🖥️ Reproducibility

To ensure reproducibility, the entire pipeline is defined in dvc.yaml. Running the pipeline will automatically execute necessary steps, ensuring consistency in data processing, model training, and evaluation.

🚀 Running the Pipeline

To execute the full pipeline, run:

make dvc

This command checks which stages have already been completed and only runs the remaining ones. Upon completion, the trained model and metrics will be stored in the MLflow server.

Once the model is trained, it can be downloaded with:

make save_model

🛠️ Step-by-Step Workflow

Follow these steps to set up and execute the project:

1️⃣ Installation


Clone the repository:

git clone https://github.com/Danselem/heart_disease_mlflow.git

Navigate into the project directory:

cd heart_disease_mlflow

2️⃣ Set Up the Environment


Install uv according to your platform, then install dependencies:

make install

Set up environment variables:

make env

Then, update .env with the required credentials and DagsHub repository details.
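
The exact variable names come from the generated .env template; as a hedged illustration, project scripts typically pick the credentials up along these lines (MLFLOW_TRACKING_URI is a standard MLflow variable, the rest depends on the template):

```python
import os

import mlflow
from dotenv import load_dotenv  # python-dotenv

# Read the values defined in .env into the process environment.
load_dotenv()

# Point MLflow at the DagsHub-hosted tracking server configured in .env.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
```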

3️⃣ Load and Prepare Data


Split the dataset:

make spdata

Clean the data:

make cleandata

4️⃣ Train and Optimize the Model


Train the model:

make model

Modify params.yaml to experiment with different hyperparameters, then retrain the model using the command above.
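
For context, a DVC-managed training stage typically reads these values at run time; a minimal sketch (the keys below are examples, not necessarily the project's actual params.yaml layout):

```python
import yaml

# Hyperparameters tracked by DVC live in params.yaml (key names are illustrative).
with open("params.yaml") as f:
    params = yaml.safe_load(f)

n_estimators = params["train"]["n_estimators"]
max_depth = params["train"]["max_depth"]
print(f"Training with n_estimators={n_estimators}, max_depth={max_depth}")
```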

Once satisfied with the performance, fetch the best model:

make save_model

This downloads the best-performing model as model.pkl for deployment.

5️⃣ Model Serving


Generate a sample input JSON:

make sample

Run the model locally:

make serve_local
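
Once the local server is running, the generated sample can be posted to it. A hedged example with requests, assuming the default uvicorn port and a /predict route:

```python
import json

import requests

# Load the sample produced by `make sample` (file name assumed).
with open("sample.json") as f:
    payload = json.load(f)

response = requests.post("http://localhost:8000/predict", json=payload)
print(response.status_code, response.json())
```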

6️⃣ Deploy with Docker


To containerize the model, build a Docker image:

make build

Run the Docker container:

make run

Once the container is running, generate predictions by executing:

make serve

This ensures the model is deployed and can be accessed via API for real-world inference.

🪖 Best Practices

This project follows best practices to ensure code quality, maintainability, and smooth deployment.

✅ Continuous Integration & Code Quality Checks

Every commit triggers a CI/CD pipeline that performs static code analysis using flake8. If any errors are detected, the pipeline fails, ensuring that only high-quality code is merged.

To enforce code quality checks locally, pre-commit hooks are configured in .pre-commit-config.yaml. These hooks can be installed and executed before committing changes, avoiding delays caused by waiting for CI/CD validation.

🛠️ Setting Up Pre-Commit Locally

Install pre-commit hooks:

pre-commit install

Run pre-commit checks on all files:

pre-commit run --all-files

📦 Makefile Automation

A Makefile is provided to streamline development tasks, including:

  • Running tests
  • Checking code quality
  • Building and pushing Docker images

To execute all necessary checks and publish the Docker image, simply run:

make publish

📝 License

This project is licensed under the MIT License. See the LICENSE file for full details.
