Credit Card Default Prediction with Machine Learning: A Benchmarking Study on Imbalance Handling and Model Interpretability

Authors

  • Shuangyi Liu

Keywords

credit card default prediction; logistic regression (LR); random forest (RF); eXtreme Gradient Boosting (XGBoost)

Abstract

Credit card default risk prediction is crucial for financial institutions to mitigate potential losses and ensure regulatory compliance. This paper addresses the challenges of imbalanced data and model interpretability in predicting default using the University of California, Irvine (UCI) Credit Card dataset. Experiments were conducted on a dataset of 30,000 clients, and feature engineering was applied to create a 33-dimensional feature space through logarithmic transformations and ratio feature construction. Logistic regression (LR), random forest (RF), and eXtreme Gradient Boosting (XGBoost) models were trained using stratified five-fold cross-validation and the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance, and were evaluated with accuracy, recall, and area under the receiver operating characteristic curve (ROC-AUC). On a held-out 20% test set, LR achieved an accuracy of 0.7896, a recall of 0.2266, and an ROC-AUC of 0.7255; RF achieved an accuracy of 0.7861, a recall of 0.5156, and an ROC-AUC of 0.7568; XGBoost achieved an accuracy of 0.8174, a recall of 0.5078, and an ROC-AUC of 0.7611. Shapley additive explanations (SHAP) analysis identified recent payment status, first-month bill amount, and payment-to-bill ratio as key predictors, thereby enhancing interpretability and supporting transparent model evaluation for financial risk management.
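The training protocol described above (oversampling the minority class with SMOTE inside stratified five-fold cross-validation, then scoring with ROC-AUC) can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's code: the dataset, classifier choice, and the simplified SMOTE-style interpolation below are assumptions for demonstration, and oversampling is applied only to each training fold so the evaluation folds stay untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def smote_like_oversample(X, y, k=5, seed=0):
    """Illustrative SMOTE-style oversampling: synthesize minority-class
    points by interpolating between a minority sample and one of its k
    nearest minority neighbours (simplified, not the reference SMOTE)."""
    rng = np.random.default_rng(seed)
    minority = X[y == 1]
    n_needed = int((y == 0).sum() - (y == 1).sum())
    # pairwise distances among minority samples; exclude self-matches
    d = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest minority neighbours
    base = rng.integers(0, len(minority), n_needed)
    neigh = minority[nn[base, rng.integers(0, k, n_needed)]]
    gap = rng.random((n_needed, 1))            # interpolation factor in [0, 1)
    synth = minority[base] + gap * (neigh - minority[base])
    return np.vstack([X, synth]), np.concatenate([y, np.ones(n_needed, dtype=int)])

# imbalanced toy data standing in for the UCI credit-card features (assumption)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
aucs = []
for train_idx, test_idx in cv.split(X, y):
    # oversample the training fold only; the test fold keeps the true imbalance
    X_tr, y_tr = smote_like_oversample(X[train_idx], y[train_idx])
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))

print(f"mean cross-validated ROC-AUC: {np.mean(aucs):.3f}")
```

In a production pipeline, `imblearn.over_sampling.SMOTE` inside an imbalanced-learn `Pipeline` achieves the same fold-wise oversampling without the hand-rolled interpolation shown here.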

Published

2025-10-24

Section

Articles