Model Evaluation

Machine learning models are powerful tools for identifying patterns, making predictions, and automating tasks. Before relying on a model, you must confirm that it performs well on new data samples it has never seen during training. That is where model evaluation comes in.

Model evaluation is the process of assessing a machine learning model’s performance on unseen data. It involves using various metrics to quantify how well the model generalizes to new examples and achieves its intended purpose. In this post, you will walk through the complete process of evaluating machine learning models.

Importance of Model Evaluation

Evaluation plays a crucial role in the machine learning lifecycle for several reasons:

  • Ensuring Accuracy: Evaluation helps identify whether the model is learning the underlying relationships in the data or simply memorizing the training examples.
  • Assessing Performance: Different metrics provide insights into various aspects of model performance, such as its ability to identify true positives and negatives or minimize prediction errors.
  • Enhancing Model Reliability: Evaluation exposes biases, weaknesses, and limitations. By addressing these issues, you can improve the model’s overall reliability and predictability.
  • Supporting Decision-Making: Evaluation results inform decisions about deploying the model in real-world applications. You can determine if the model is accurate and trustworthy enough for practical use.
  • Identifying Model Weaknesses: Evaluation helps pinpoint areas where the model struggles. This knowledge allows you to refine and improve the model and direct further development efforts.

Types of Model Evaluation Metrics

The choice of ML model evaluation metrics depends on the type of machine learning task (classification, regression, clustering) and the specific goals of the model. You will now delve deeper into various evaluation metrics with mathematical formulas and examples.

Classification Metrics

Accuracy (ACC)

Accuracy represents the overall proportion of correctly classified instances in the entire dataset.

Example: Imagine a spam filter classifying emails: 

Accuracy = (correctly classified spam + correctly classified non-spam) / total emails.

Precision (P)

Precision measures the proportion of true positives among all instances predicted as positive, indicating how well the model avoids false positives.

Example: In the spam filter example: 

Precision = correctly classified spam emails / all emails predicted as spam.

Recall (Sensitivity) (R)

Recall measures the proportion of actual positive instances that get identified correctly, indicating how well the model captures all relevant cases.

Example: In the spam filter example:

Recall = correctly classified spam emails / all actual spam emails.

F1-Score

F1-Score is the harmonic mean of Precision and Recall, balancing the trade-off between the two.

Example: The F1 score considers both the number of correctly classified spam emails and the number of missed spam emails, providing a more comprehensive picture.

F1-Score = 2 x [(Precision x Recall) / (Precision + Recall)]
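
To make these metrics concrete, here is a minimal sketch using scikit-learn’s metrics functions; the spam labels (1 = spam, 0 = not spam) are made up purely for illustration.

# Minimal sketch: accuracy, precision, recall, and F1 with scikit-learn.
# The labels below are made up for illustration (1 = spam, 0 = not spam).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))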

Specificity (True Negative Rate)

Specificity measures the proportion of actual negative instances that are correctly identified as negative. This metric is especially useful for imbalanced datasets.

Example: In the spam filter example: 

Specificity = correctly classified non-spam emails / all actual non-spam emails.
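
Scikit-learn has no dedicated specificity scorer, but as a rough sketch you can derive it from the confusion matrix (specificity = TN / (TN + FP)); the labels below are made up.

# Minimal sketch: deriving specificity from the confusion matrix.
# The labels are made up for illustration (1 = spam, 0 = not spam).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)
print("Specificity:", specificity)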

Area Under the ROC Curve (AUC-ROC)

The AUC-ROC metric represents the probability that the model ranks a random positive instance higher than a random negative instance. It’s a threshold-independent metric, making it useful for comparing models with different classification thresholds.

Example: Imagine a model predicting customer churn. A higher AUC indicates that the model can effectively differentiate customers likely to churn from those who will remain.
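
As a minimal sketch, AUC-ROC can be computed with scikit-learn’s roc_auc_score, which expects predicted probabilities or scores rather than hard labels; the churn labels and probabilities below are made up.

# Minimal sketch: AUC-ROC from predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                 # 1 = customer churned
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]  # model's churn probabilities

print("AUC-ROC:", roc_auc_score(y_true, y_scores))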

Regression Metrics

Mean Absolute Error (MAE)

MAE represents the average of the absolute differences between the predicted values and the actual values.

Example: For a model predicting house prices, MAE represents the average absolute difference between the estimated price and the actual selling price.

MAE = (1/n) Σ(i=1 to n) |y_i – ŷ_i|

Where:

  • n is the number of observations in the dataset.
  • y_i is the true value.
  • ŷ_i is the predicted value.

Mean Squared Error (MSE)

MSE represents the average of the squared differences between the predicted values and the actual values. It is more sensitive to outliers compared to MAE.

Example: Similar to MAE, significant errors (squared differences) are penalized more heavily, making it sensitive to outliers in the data.

MSE = (1/n) Σ(i=1 to n) (y_i – p_i)²

Where: 

  • y_i is the true value.
  • p_i is the corresponding predicted value for y_i.
  • n is the number of observations.

Root Mean Squared Error (RMSE) 

RMSE is the square root of the MSE, expressed in the same units as the original data.

Example: RMSE is the square root of the average squared difference between predicted and actual prices.

RMSE = sqrt[(Σ(i=1 to n) (P_i – O_i)²) / n]

Where:

  • P_i denotes the predicted value.
  • O_i represents the observed (actual) value.
  • n is the total number of observations or data points.

R-squared (Coefficient of Determination)

R-squared represents the proportion of variance in the dependent variable explained by the independent variable. It ranges from 0 (no explanatory power) to 1 (perfect fit).

Example: In a regression model predicting customer spending based on income, R-squared indicates how well the income explains the variations in customer spending.

R² = 1 − (sum of squared residuals (SSR) / total sum of squares (SST))
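
The regression metrics above can all be computed with scikit-learn, as in this minimal sketch; the house prices are made up, and RMSE is taken as the square root of MSE.

# Minimal sketch: MAE, MSE, RMSE, and R-squared with scikit-learn.
# The house prices below are made up for illustration.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [300_000, 250_000, 420_000, 310_000]  # actual selling prices
y_pred = [290_000, 265_000, 400_000, 305_000]  # predicted prices

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # RMSE is the square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.0f}  MSE: {mse:.0f}  RMSE: {rmse:.0f}  R^2: {r2:.3f}")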

Clustering Metrics

Silhouette Score (S)

The Silhouette Score measures how similar data points within a cluster are to each other compared to data points in other clusters. It ranges from -1 (worst clustering) to 1 (perfect clustering).

Example: Imagine clustering customer data based on purchase history. A higher average Silhouette Score indicates that customers within each cluster have similar buying patterns compared to customers in other clusters.

Davies-Bouldin Index (DBI)

DBI evaluates the within-cluster scatter compared to the between-cluster separation. Lower values indicate better clustering.

Example: In the customer segmentation example, a lower DBI indicates the clusters are well-separated, and customers within each cluster share similar characteristics.

Adjusted Rand Index (ARI)

ARI measures the similarity between the model’s clustering and a reference clustering (ground truth). It ranges from -1 (worst agreement) to 1 (perfect agreement).

Example: Evaluating image segmentation performance, ARI compares the model’s segmentation of an image with a manually labeled ground truth, indicating how well the model captures the intended object boundaries.
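
Here is a minimal sketch of the three clustering metrics with scikit-learn; the toy blobs dataset and KMeans model stand in for real customer or image data, purely for illustration.

# Minimal sketch: Silhouette Score, Davies-Bouldin Index, and ARI.
# The toy dataset and KMeans model are placeholders for illustration.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("Silhouette Score    :", silhouette_score(X, y_pred))
print("Davies-Bouldin Index:", davies_bouldin_score(X, y_pred))
print("Adjusted Rand Index :", adjusted_rand_score(y_true, y_pred))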

Other Metrics

Log Loss

Log loss measures the performance of a classification model based on the probability estimates it assigns to classes. It is often used as a loss function during model training, aiming to minimize the log loss. Lower log loss indicates better model performance.

Example: In a spam classification model, log loss considers the model’s assigned probabilities for an email being spam. Lower log loss indicates that the model is confident in its correct classifications.
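
A minimal sketch of log loss with scikit-learn; the spam labels and predicted probabilities are made up for illustration.

# Minimal sketch: log loss from predicted class probabilities.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]           # 1 = spam
y_proba = [0.9, 0.1, 0.6, 0.3]  # predicted probability of spam

print("Log loss:", log_loss(y_true, y_proba))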

Confusion Matrix

A confusion matrix is a table that summarizes the model’s performance on a classification task. It depicts the number of true positives, true negatives, false positives, and false negatives for each class.
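
The sketch below builds and labels a confusion matrix for the spam example; the labels are made up, and pandas is used only to print the matrix with readable row and column names.

# Minimal sketch: building and labeling a confusion matrix.
import pandas as pd
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = actual class, columns = predicted class
print(pd.DataFrame(cm,
                   index=["actual: not spam", "actual: spam"],
                   columns=["predicted: not spam", "predicted: spam"]))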

Model Evaluation Techniques

Now that you understand the evaluation metrics, let’s explore the various techniques used to assess a model’s performance on unseen data:

Train-Test Split

This is a simple approach where you divide the data into two sets: a training set used to build the model and a test set used to evaluate its performance. It is easy to implement; in Python, the scikit-learn library provides train_test_split for this purpose. Its main shortcoming is that the measured performance can be sensitive to the specific split chosen.
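
A minimal sketch of a train-test split with scikit-learn; the iris dataset and logistic regression model are placeholders for your own data and model.

# Minimal sketch: a simple train-test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))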

Cross-Validation

Cross-validation is a more robust approach that involves splitting the data into multiple folds. The model is trained on a combination of folds (e.g., k-1 folds) and evaluated on the remaining fold (validation fold). This process is repeated for each fold, providing a more comprehensive evaluation.

K-Fold Cross-Validation

It splits the data into k equal folds. The model is trained on k-1 folds and evaluated on the remaining fold. This process is repeated k times, ensuring each data point is used for validation once. A common choice for k is 5 or 10.

Stratified k-Fold Cross-Validation

It is handy for imbalanced datasets where classes have unequal representation. It ensures that each fold maintains the same class distribution as the original data, providing a fairer evaluation.

Leave-One-Out Cross-Validation (LOOCV)

It uses each data point as the validation set once, resulting in n folds for n data points. While providing a thorough evaluation, it can be computationally expensive for large datasets.

Cross-validation provides a more robust estimate of model performance than a simple train-test split and reduces the risk of overfitting to a particular split. Its shortcomings are that it can be computationally expensive, especially for LOOCV, and that k-fold cross-validation requires choosing an appropriate number of folds (k).
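
Here is a minimal sketch of k-fold, stratified k-fold, and leave-one-out cross-validation using cross_val_score; the iris dataset and logistic regression model are placeholders for illustration.

# Minimal sketch: three cross-validation schemes with cross_val_score.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42))
strat_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # expensive on large datasets

print("5-fold mean accuracy           :", kfold_scores.mean())
print("Stratified 5-fold mean accuracy:", strat_scores.mean())
print("LOOCV mean accuracy            :", loo_scores.mean())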

Bootstrapping

Bootstrapping creates multiple random samples (with replacement) from the original data. A model is trained on each sample, and the evaluation metrics are averaged across all models. This technique is useful when the dataset is limited, and it is well suited to smaller datasets where a dedicated validation set might be too small. It also provides information about the variability in the model’s performance.

One shortcoming of bootstrapping is that sampling with replacement can introduce bias. It may also be less reliable than cross-validation for larger datasets.
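
Scikit-learn has no single bootstrap-evaluation helper, so here is a minimal hand-rolled sketch: each round trains on a sample drawn with replacement and scores on the out-of-bag points left out of that sample. The dataset, model, and 50 rounds are illustrative choices.

# Minimal sketch: bootstrap evaluation with out-of-bag scoring.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
scores = []

for _ in range(50):  # number of bootstrap rounds (illustrative choice)
    idx = rng.integers(0, len(X), size=len(X))  # sample indices with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)  # out-of-bag indices
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))

print("Bootstrap accuracy: mean=%.3f, std=%.3f" % (np.mean(scores), np.std(scores)))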

Holdout Method

The holdout method is similar to the train-test split: it involves holding out a portion of the data for testing (e.g., 20% or 30%). The key difference is that the train-test split is a straightforward, specific implementation of the broader holdout method, whereas the holdout method can also encompass more complex procedures, such as setting aside a separate validation set for model tuning. This method is also less data-efficient than cross-validation techniques.
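
A minimal sketch of a holdout scheme with a separate validation set, built from two chained train_test_split calls; the dataset and the roughly 60/20/20 proportions are illustrative choices, not a rule.

# Minimal sketch: train / validation / test holdout.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the final test set, then carve a validation set
# out of the remaining data for model tuning.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), "train /", len(X_val), "validation /", len(X_test), "test samples")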

Other Techniques

Time Series Validation

Time series validation evaluates models on time-series data where the order of data points is essential. It involves splitting the data into contiguous folds that preserve the temporal sequence.
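
A minimal sketch of time series validation with scikit-learn’s TimeSeriesSplit, which always trains on earlier observations and validates on later ones; the 20-point series is made up.

# Minimal sketch: time-ordered folds with TimeSeriesSplit.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations

for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train:", train_idx.min(), "-", train_idx.max(),
          "| test:", test_idx.min(), "-", test_idx.max())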

Nested Cross-Validation

Nested cross-validation uses an inner and an outer loop and is useful when hyperparameter tuning is part of the evaluation. The inner loop performs cross-validation to select the best hyperparameter combination, while the outer loop evaluates the model with the chosen hyperparameters on the held-out outer folds.
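
A minimal sketch of nested cross-validation: GridSearchCV supplies the inner loop and cross_val_score the outer loop; the SVC model and its parameter grid are illustrative choices.

# Minimal sketch: nested cross-validation with GridSearchCV inside cross_val_score.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)  # inner loop: tuning
outer_scores = cross_val_score(inner, X, y, cv=5)                  # outer loop: evaluation

print("Nested CV accuracy:", outer_scores.mean())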
