Annalie Kruseman 194ffa6f03 Rename src folder 1 month ago
.dvc 674ea84e1f Exploratory Analysis 1 month ago
data ce5b1d1bcf Update pipeline. 2 months ago
dist 922975ba09 Package repository 1 month ago
figures 674ea84e1f Exploratory Analysis 1 month ago
model 83b01fdd0f Improve code layout. 1 month ago
predicted 9245860c13 Update files 2 months ago
reports 83b01fdd0f Improve code layout. 1 month ago
star_types 194ffa6f03 Rename src folder 1 month ago
tests b7402823c5 Initialize git. 2 months ago
.DS_Store 194ffa6f03 Rename src folder 1 month ago
.dvcignore f01580a46a Initialize DVC 2 months ago
.gitignore 194ffa6f03 Rename src folder 1 month ago
.python-version b7402823c5 Initialize git. 2 months ago
LICENCE 922975ba09 Package repository 1 month ago
README.md 1748e64de5 Update readme 1 month ago
dvc.lock 83b01fdd0f Improve code layout. 1 month ago
dvc.yaml f65b8cbffd Add makefile. 2 months ago
makefile 9245860c13 Update files 2 months ago
params.yaml 27b7a0ff07 Add reports/ folder 2 months ago
poetry.lock 922975ba09 Package repository 1 month ago
pyproject.toml 922975ba09 Package repository 1 month ago
requirements.txt 9245860c13 Update files 2 months ago

README.md

Star Type Predictions

Description

This repository shows an example of how to run experiments with the data version control package DVC. The objective is to predict the type of a star using only a handful of variables. Exploratory analysis shows that, when plotted on just these variables, stars fall into a distinct pattern. A representation of this pattern is called the Hertzsprung-Russell diagram, or HR diagram. Consequently, we can classify a star by where its features place it on that diagram.

Getting Started

This repository is organized as follows:

In the src/ folder you will find a file to prepare the data for training the model, a file to train and save the model, and a file to evaluate the results of the model. These results are written to the metrics/ folder, which contains the scores and model output as well as a confusion matrix and ROC curve. The src/ folder also contains the full code in the file star_type_predictions.py. This file runs a hyperparameter tuning function on four classification models, stores each estimator, outputs the accuracy score of each estimator, and writes the predictions of the best-performing model to the corresponding file in the predicted/ folder.
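As a rough illustration of what the evaluation step reports, the sketch below computes an accuracy score and a confusion matrix with the standard library only. It is not the code from this repository: the function names and the toy labels are made up for the example, and the actual evaluation file may compute these differently.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of predictions")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy labels, made up for illustration (not taken from the dataset).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
labels = [1, 2, 3]

score = accuracy(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred, labels)

print(f"accuracy: {score:.3f}")
for label, row in zip(labels, matrix):
    print(label, row)
```

Each row of the matrix shows how the stars of one true type were distributed over the predicted types, so off-diagonal counts point at the types a model confuses.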

The exploratory analysis is also located in the src/ folder.

To run experiments with various classification models, change the model parameter in the params.yaml file to the model of your interest (all lower case). The classification models tested for this project are: kneighbors, logistic regression, support vector machine, and random forest.
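One way such a model parameter could be turned into an estimator is a small lookup table. The sketch below is an assumption, not the code used in this repository: the key strings, the top-level "model" key, and the helper names are hypothetical, and the dictionary stands in for the parsed params.yaml. The scikit-learn class paths are real, but scikit-learn is only imported when build_model is called.

```python
import importlib

# Hypothetical keys: the exact strings accepted in params.yaml are assumptions.
MODEL_REGISTRY = {
    "kneighbors": "sklearn.neighbors.KNeighborsClassifier",
    "logistic_regression": "sklearn.linear_model.LogisticRegression",
    "support_vector_machine": "sklearn.svm.SVC",
    "random_forest": "sklearn.ensemble.RandomForestClassifier",
}


def resolve_model_path(params):
    """Return the dotted class path for the model named in the params dict."""
    name = params["model"].strip().lower()
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(f"unknown model {name!r}; expected one of: {known}") from None


def build_model(params):
    """Import the class lazily and instantiate it with default settings."""
    module_name, _, class_name = resolve_model_path(params).rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)()


# Stands in for the parsed params.yaml (assumed layout).
params = {"model": "random_forest"}
selected = resolve_model_path(params)
print(selected)
```

Keeping the mapping in one place means that an unsupported or misspelled model name fails early with a readable message instead of somewhere inside the training stage.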

Prerequisites

This project uses DVC for data version control. The raw data is stored in a personal AWS S3 bucket. To replicate this project, first download the raw data from the Kaggle project website and store it in the data/ folder.

Installation

To get started, first download this repository and create your virtual environment. If you use Poetry to create your virtual environment, install the dependencies with:

poetry install

Otherwise install the dependencies with:

env/bin/pip install -r requirements.txt

To run experiments, initialize the directory as a DVC project inside a Git repository:

dvc init

To replicate an experiment, run the command below. It runs the pipeline 'prepare - train - evaluate' as described in dvc.yaml. The --no-commit flag tells DVC not to store the outputs of this run in its cache.

dvc repro --no-commit

To predict the type of a star and store the results in a file, run the makefile in the root folder. It runs two stages: the data preparation phase and the run phase, which outputs a file with predictions. You do not need to install the dependencies beforehand for this step, because the makefile takes care of that.

make

Usage

An example of the Hertzsprung-Russell diagram is shown in the figure below, where the yellow dot denotes our Sun as a reference point. We clearly see that stars of the same type are grouped together in terms of their temperature and absolute magnitude.

H-R diagram

The figure below shows a correlation matrix of the numerical variables used in the dataset. From this matrix we find that the absolute magnitude is highly correlated with the type of the star, where on a scale of 1-6, 1 denotes a dwarf star and 6 denotes a hypergiant star.

Correlation Matrix
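For readers who want to see what a single cell of such a matrix represents, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The values are toy numbers made up for illustration and are not taken from the dataset, and the helper name pearson is hypothetical. The exploratory analysis in this repository may well use a library function instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values, made up for illustration (not taken from the dataset).
star_type = [1, 2, 3, 4, 5, 6]
absolute_magnitude = [17.0, 12.0, 11.0, 4.0, -6.0, -9.0]

r = pearson(star_type, absolute_magnitude)
print(f"correlation: {r:.2f}")
```

With values like these the coefficient is strongly negative, because brighter stars have a lower absolute magnitude. A strong correlation can therefore show up in the matrix with a negative sign.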

Authors and acknowledgment

Annalie Kruseman

While I'm not schooled in astronomy, astrophysics caught my interest while I was recovering from a recent surgery. During this project I learned a lot more about the features of stars that are used in this dataset. In the meantime I really enjoyed reading the book 'Reality Is Not What It Seems' by Carlo Rovelli. With the 5 years of physics and chemistry I had in high school, it was really fun to revisit this knowledge and understand more about the concepts of general relativity, quantum theory, and quantum gravity.

The dataset can be found on Kaggle.

Feel free to contact me with any questions at annaliakruseman@gmail.com