A hands-on project leveraging MLOps best practices to build, deploy, and monitor a heart disease prediction model.
Heart disease remains one of the leading causes of death worldwide, with key risk factors including high blood pressure, high cholesterol, obesity, smoking, and lack of physical activity. Early detection and prediction of heart disease can play a crucial role in preventive healthcare and patient outcomes.
This project leverages machine learning techniques to predict the likelihood of heart disease based on self-reported health indicators collected via telephonic surveys. The dataset, sourced from the Centers for Disease Control and Prevention (CDC), is part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual health-related surveys across the United States.
The goal of this project is to develop an end-to-end MLOps pipeline that automates the training, deployment, and monitoring of a machine learning model capable of predicting heart disease risk. This includes data versioning and pipeline orchestration, experiment tracking, deployment of the model as a REST API, and ongoing performance monitoring.
The dataset consists of 40 features derived from nearly 300 original variables, carefully curated to represent key indicators of heart disease. It includes factors such as BMI, smoking status, alcohol consumption, diabetes history, and physical activity levels. Given its real-world nature, the dataset presents challenges such as class imbalance, requiring thoughtful model selection and evaluation strategies.
With the increasing availability of health-related data, applying MLOps principles to healthcare prediction models ensures not only the accuracy and reliability of predictions but also the ability to seamlessly deploy, maintain, and monitor models in production. This project demonstrates the power of machine learning in solving real-world healthcare challenges while emphasizing best practices in MLOps.
This repository serves as an educational portfolio project and was developed as part of the MLOps Zoomcamp by DataTalks.Club.
The model's performance is evaluated using the F1 score, which is particularly effective for handling imbalanced datasets. Given that heart disease is relatively rare in the dataset, the F1 score ensures a balanced assessment between precision and recall, helping to optimize the model for real-world use cases.
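For illustration, here is a minimal sketch of computing the F1 score with scikit-learn (the labels below are made up; in the project they would come from the test split):

```python
from sklearn.metrics import f1_score

# Illustrative labels; in the project, y_true and y_pred come from the test split.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# F1 is the harmonic mean of precision and recall:
# F1 = 2 * (precision * recall) / (precision + recall)
print(f1_score(y_true, y_pred))  # 0.67 here: precision = 2/3, recall = 2/3
```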
This project integrates multiple tools and frameworks to streamline the machine learning lifecycle. The image below highlights the core technologies used:
Before model development, a comprehensive Exploratory Data Analysis (EDA) is conducted to gain insights into the dataset's structure, distributions, and feature correlations. EDA helps identify missing values, outliers, and potential data transformations to improve model performance.
The `notebooks/` directory contains a dedicated EDA notebook where the dataset is thoroughly analyzed. This step lays the foundation for feature selection and engineering.
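A minimal sketch of the kind of checks such an EDA covers (the file path and target column name are assumptions; the real analysis lives in the notebook):

```python
import pandas as pd

# Path and column name are assumptions; see the EDA notebook in notebooks/.
df = pd.read_csv("data/heart_disease.csv")

print(df.shape)                        # expect roughly 40 features
print(df.isna().sum())                 # missing values per column
print(df["HeartDisease"].value_counts(normalize=True))  # check class imbalance
```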
To ensure reproducibility and effective model management, this project utilizes MLflow for experiment tracking and model registry. MLflow facilitates logging of parameters, metrics, and artifacts for every run, along with versioning and staging of trained models in a central registry.
DagsHub integrates DVC, MLflow, and Git, providing a unified environment for managing experiments, model artifacts, and version control.
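As a rough sketch, tracking a run against a DagsHub-hosted MLflow server looks like the following (the tracking URI and logged values are placeholders; the real credentials come from `.env`):

```python
import mlflow

# Placeholder URI; in this project it is read from .env (DagsHub repository).
mlflow.set_tracking_uri("https://dagshub.com/<user>/heart_disease_mlflow.mlflow")
mlflow.set_experiment("heart-disease")

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)   # illustrative hyperparameter
    mlflow.log_metric("f1", 0.42)      # illustrative metric value
    # mlflow.sklearn.log_model(model, "model")  # log/register the trained model
```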
This project employs Data Version Control (DVC) to orchestrate the workflow. DVC provides versioning of data and models, reproducible stage definitions, and automatic detection of which stages need to be re-run when their inputs change.
The trained model is deployed as a REST API using FastAPI, making it accessible for real-time predictions. The deployment pipeline includes a FastAPI application defined in `app/main.py` for seamless integration. The containerized image is publicly available and can be pulled from Docker Hub.
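To give a feel for the serving layer, here is a heavily simplified sketch of a FastAPI prediction endpoint (field names and the response shape are assumptions; the actual implementation is in `app/main.py`):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class HealthIndicators(BaseModel):
    # Field names are illustrative; the real schema mirrors the BRFSS features.
    BMI: float
    Smoking: int
    PhysicalActivity: int

@app.post("/predict")
def predict(payload: HealthIndicators):
    # In app/main.py, the loaded model.pkl produces this prediction.
    return {"heart_disease_risk": 0}  # placeholder response
```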
To track model performance over time, the project integrates Evidently, PostgreSQL, and Grafana for interactive monitoring and analytics.
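As a sketch of the monitoring step, computing drift metrics with Evidently might look like this (assuming the `Report` API from evidently 0.4; file paths are illustrative, and in this project the metrics feed PostgreSQL and Grafana rather than an HTML report):

```python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Paths are illustrative; reference = training data, current = recent inputs.
reference = pd.read_csv("data/train.csv")
current = pd.read_csv("data/current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # the project writes to PostgreSQL instead
```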
Below is an example of the real-time monitoring dashboard:
To ensure reproducibility, the entire pipeline is defined in `dvc.yaml`. Running the pipeline automatically executes the necessary steps, ensuring consistency in data processing, model training, and evaluation.
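For orientation, a `dvc.yaml` stage definition has roughly this shape (the stage names, scripts, and paths below are illustrative, not the repository's actual configuration):

```yaml
stages:
  split:
    cmd: python src/split_data.py   # hypothetical script path
    deps:
      - data/raw.csv
    outs:
      - data/train.csv
      - data/test.csv
  train:
    cmd: python src/train.py
    deps:
      - data/train.csv
    outs:
      - models/model.pkl
```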
To execute the full pipeline, run:

```
make dvc
```
This command checks which stages have already been completed and only runs the remaining ones. Upon completion, the trained model and metrics will be stored in the MLflow server.
Once the model is trained, it can be downloaded with:

```
make save_model
```
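For reference, fetching a registered model from MLflow in Python looks roughly like this (the registry name and stage are assumptions; `make save_model` wraps the project's actual logic):

```python
import mlflow.sklearn

# Registry name and stage are assumptions; make save_model handles this step.
mlflow.set_tracking_uri("https://dagshub.com/<user>/heart_disease_mlflow.mlflow")
model = mlflow.sklearn.load_model("models:/heart-disease/Production")
```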
Follow these steps to set up and execute the project:
Clone the repository:

```
git clone https://github.com/Danselem/heart_disease_mlflow.git
```
Navigate into the project directory:

```
cd heart_disease_mlflow
```
Install `uv` according to your platform, then install dependencies:

```
make install
```
Set up environment variables:

```
make env
```

Then, update `.env` with the required credentials and DagsHub repository details.
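For orientation, the variables typically look like the following (the names match MLflow's standard environment variables; the values are placeholders to replace with your own DagsHub details):

```
# Placeholder values; DagsHub shows the correct ones under the repository's
# remote settings, and MLflow reads these standard variables.
MLFLOW_TRACKING_URI=https://dagshub.com/<user>/heart_disease_mlflow.mlflow
MLFLOW_TRACKING_USERNAME=<dagshub-username>
MLFLOW_TRACKING_PASSWORD=<dagshub-token>
```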
Split the dataset:

```
make spdata
```
Clean the data:

```
make cleandata
```
Train the model:

```
make model
```
Modify `params.yaml` to experiment with different hyperparameters, then retrain the model using the command above.
Once satisfied with the performance, fetch the best model:

```
make save_model
```
This downloads the best-performing model as `model.pkl` for deployment.
Generate a sample input JSON:

```
make sample
```
Run the model locally:

```
make serve_local
```
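With the local server running, a request can be sent from Python along these lines (the port, endpoint, and feature names are assumptions; `make sample` produces a valid payload):

```python
import requests

# Endpoint, port, and feature names are assumptions; make sample generates
# a valid example payload for the deployed schema.
sample = {"BMI": 28.5, "Smoking": 1, "PhysicalActivity": 0}

resp = requests.post("http://localhost:8000/predict", json=sample)
print(resp.json())
```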
To containerize the model, build a Docker image:

```
make build
```
Run the Docker container:

```
make run
```
Once the container is running, generate predictions by executing:

```
make serve
```
This confirms that the model is deployed and can be accessed via the API for real-world inference.
This project follows best practices to ensure code quality, maintainability, and smooth deployment.
Every commit triggers a CI/CD pipeline that performs static code analysis using `flake8`. If any errors are detected, the pipeline fails, ensuring that only high-quality code is merged.
To enforce code quality checks locally, pre-commit hooks are configured in `.pre-commit-config.yaml`. These hooks can be installed and executed before committing changes, avoiding delays caused by waiting for CI/CD validation.
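As a rough idea of its shape, a `.pre-commit-config.yaml` with a flake8 hook looks like this (the pinned revision is illustrative; the repository's actual file may configure additional hooks):

```yaml
repos:
  - repo: https://github.com/pycqa/flake8
    rev: 7.0.0   # illustrative pin
    hooks:
      - id: flake8
```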
Install pre-commit hooks:

```
pre-commit install
```
Run pre-commit checks on all files:

```
pre-commit run --all-files
```
A Makefile is provided to streamline development tasks, including dependency installation, data preparation, model training, serving, and image publishing.
To execute all necessary checks and publish the Docker image, simply run:

```
make publish
```
This project is licensed under the MIT License. See the LICENSE file for full details.