|Annalie Kruseman 194ffa6f03 Rename src folder||1 month ago|
This repository shows an example of how to run experiments with the data version control package DVC. The objective is to predict the type of a star using only a handful of variables. Exploratory analysis shows that, when plotted against just these variables, stars fall into a distinct pattern. A representation of this pattern is called the Hertzsprung-Russell diagram, or HR diagram. Consequently, we can classify stars by where their features place them on that diagram.
This repository is organized as follows:
The src/ folder contains a file to prepare the data for training the model, a file to train and save the model, and a file to evaluate the results of the model. These results are written to the metrics/ folder, which contains the scores and model output as well as a confusion matrix and a ROC AUC curve.
The src/ folder also contains the full code in the file star_type_predictions.py. This file runs a hyperparameter tuning function on four classification models, stores each estimator, outputs the accuracy score of each estimator, and writes the predictions of the best-performing model to the corresponding file in the predicted/ folder.
The exploratory analysis is also located in the
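As a rough illustration of the evaluation outputs mentioned above, the sketch below computes a confusion matrix and an accuracy score using only the standard library. It is not the repository's evaluation code; the function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Row i, column j counts samples with true label labels[i]
    that were predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only; these are not results from this project.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
score = accuracy(y_true, y_pred)
```

Rows correspond to the true star type and columns to the predicted one, so off-diagonal entries show which types the model confuses with each other.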
To run experiments with various classification models, change the model parameter in the params.yaml file to the model of your interest (all lower case). The classification models tested for this project are: kneighbors, logistic regression, support vector machine, and random forest.
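One way such a model parameter could be turned into an estimator is a small lookup table. The sketch below assumes the models come from scikit-learn. Only the key "kneighbors" is taken from the text above; the other three keys are assumed spellings, and the function names are hypothetical, so check params.yaml and the training script for the values the project actually accepts.

```python
import importlib

# Maps the `model` value to a scikit-learn class path.
# "kneighbors" comes from the README; the other keys are assumed spellings.
MODEL_REGISTRY = {
    "kneighbors": "sklearn.neighbors.KNeighborsClassifier",
    "logistic_regression": "sklearn.linear_model.LogisticRegression",
    "svm": "sklearn.svm.SVC",
    "random_forest": "sklearn.ensemble.RandomForestClassifier",
}


def resolve_class_path(model_name):
    """Look up the class path for a model name, rejecting unknown names."""
    key = model_name.strip().lower()
    if key not in MODEL_REGISTRY:
        supported = ", ".join(sorted(MODEL_REGISTRY))
        raise ValueError(
            f"Unknown model {model_name!r}; expected one of: {supported}"
        )
    return MODEL_REGISTRY[key]


def build_model(model_name, **kwargs):
    """Import the estimator class lazily and instantiate it."""
    module_path, class_name = resolve_class_path(model_name).rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, class_name)(**kwargs)
```

With scikit-learn installed, `build_model("kneighbors", n_neighbors=5)` would return a `KNeighborsClassifier`. Failing early on an unknown name makes a typo in params.yaml visible before any training starts.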
This project uses DVC for data version control. The raw data is stored in a personal AWS S3 bucket. To replicate this project, first download the raw data from the Kaggle project website and store it in the
To get started, first download this repository and create your virtual environment. When using Poetry to create your virtual environment, install the dependencies with:
Otherwise, install the dependencies with:
env/bin/pip install -r requirements.txt
To run experiments, initialize the directory as a DVC folder inside a Git project.
To replicate an experiment, run the line of code below. It runs the pipeline 'prepare - train - evaluate' as described in dvc.yaml.
dvc repro --no-commit
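For scripted runs, the steps above could be wrapped in a small Python helper. This is a minimal sketch: `dvc repro --no-commit` is the command given above, while `git init` and `dvc init` are assumed to be the intended initialization commands. The function name and its parameters are hypothetical.

```python
import subprocess

# `git init` and `dvc init` are assumed here; the repro command is the
# one given in the README.
INIT_COMMANDS = [["git", "init"], ["dvc", "init"]]
REPRO_COMMAND = ["dvc", "repro", "--no-commit"]


def replicate(project_dir=".", initialize=False, runner=subprocess.run):
    """Optionally initialize Git and DVC, then reproduce the pipeline.

    `runner` is injectable so the function can be dry-run without DVC
    installed. Returns the commands that were executed, in order.
    """
    commands = (INIT_COMMANDS if initialize else []) + [REPRO_COMMAND]
    for command in commands:
        # check=True raises CalledProcessError if a command fails.
        runner(command, cwd=project_dir, check=True)
    return commands
```

Calling `replicate()` from the repository root runs only the pipeline. Pass `initialize=True` only in a directory that is not yet a Git and DVC project.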
To predict the type of a star and store the results in a file, run the makefile in the root folder. It runs two stages: the data preparation phase and the run phase, which outputs a file with predictions. The dependencies do not need to be installed beforehand for this step; the makefile takes care of that.
An example of the Hertzsprung-Russell diagram is shown in the figure below, where the yellow dot denotes our Sun as a reference point. We clearly see that star types form distinct groups according to their temperature and absolute magnitude.
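A diagram of this kind could be drawn roughly as follows. This is a sketch, not the code that produced the project's figure: it assumes matplotlib is available, that star types are encoded as numbers, and that temperatures are in kelvin. The function names are hypothetical, and the Sun's values are approximate reference figures.

```python
# Approximate reference values for the Sun: effective temperature in
# kelvin and absolute visual magnitude.
SUN_TEMPERATURE_K = 5772
SUN_ABSOLUTE_MAGNITUDE = 4.83


def hr_axis_limits(temperatures, magnitudes, pad=0.05):
    """Return (xlim, ylim) with both axes reversed, as is conventional:
    hot stars on the left, bright (low-magnitude) stars at the top."""
    def padded(values):
        low, high = min(values), max(values)
        margin = (high - low) * pad
        return low - margin, high + margin

    t_low, t_high = padded(temperatures)
    m_low, m_high = padded(magnitudes)
    return (t_high, t_low), (m_high, m_low)


def plot_hr_diagram(temperatures, magnitudes, star_types):
    """Scatter temperature against absolute magnitude, colored by type."""
    import matplotlib.pyplot as plt  # lazy import; only needed for plotting

    fig, ax = plt.subplots()
    ax.scatter(temperatures, magnitudes, c=star_types, cmap="viridis", s=12)
    # Mark the Sun as a yellow reference point.
    ax.scatter([SUN_TEMPERATURE_K], [SUN_ABSOLUTE_MAGNITUDE],
               color="yellow", edgecolors="black", s=80, label="Sun")
    xlim, ylim = hr_axis_limits(
        list(temperatures) + [SUN_TEMPERATURE_K],
        list(magnitudes) + [SUN_ABSOLUTE_MAGNITUDE],
    )
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    ax.set_xlabel("Temperature (K)")
    ax.set_ylabel("Absolute magnitude")
    ax.legend()
    return fig
```

Reversing both axes follows the usual HR diagram convention, which puts hot, luminous stars in the upper left.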
The figure below shows a correlation matrix of the numerical variables in the dataset. From this matrix we find that the absolute magnitude is highly correlated with the type of the star, where on a scale of 1-6, 1 denotes a dwarf star and 6 denotes a hypergiant star.
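To make the notion of correlation concrete, here is a dependency-free sketch of a Pearson correlation matrix. The column names and the toy values are made up for illustration and are not taken from the dataset; in practice a library routine would probably be used instead.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns maps a name to a list of numbers; returns a nested dict
    with the pairwise Pearson correlations."""
    names = list(columns)
    return {
        a: {b: pearson_correlation(columns[a], columns[b]) for b in names}
        for a in names
    }


# Made-up toy values, not taken from the dataset.
toy = {
    "star_type": [1, 2, 3, 4, 5, 6],
    "absolute_magnitude": [18.0, 12.0, 11.0, 4.0, -6.0, -9.0],
}
result = correlation_matrix(toy)
```

In this toy data the coefficient between `star_type` and `absolute_magnitude` is strongly negative, because brighter stars have lower magnitudes. A strong correlation can be negative as well as positive.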
While I'm not schooled in astronomy, astrophysics caught my interest while I was recovering from a recent surgery. During this project I learned a lot more about the features of stars that are used in this dataset. In the meantime I really enjoyed reading the book 'Reality Is Not What It Seems' by Carlo Rovelli. With the 5 years of physics and chemistry I had in high school, it was really fun to refresh this knowledge and understand more about the concepts of general relativity, quantum theory, and quantum gravity.
The dataset can be found on Kaggle.
Feel free to contact me with any questions at email@example.com