The U.S. Department of Justice (DOJ) estimates healthcare fraud drains about $100 billion annually, approximately 10% of U.S. healthcare spending.
The project analyzed historical claims data to flag potentially fraudulent healthcare providers, with the aim of cutting fraud losses by billions of dollars.
The analysis was based on Kaggle's Medicare fraud dataset using tools like Python, Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn.
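A minimal loading sketch is shown below. The file and column names (provider labels with a PotentialFraud flag, plus inpatient and outpatient claim files) are assumptions about the Kaggle dataset layout and should be adjusted to match the actual download.

```python
import pandas as pd

# Hypothetical file names; the Kaggle dataset ships several CSVs
# (provider labels, inpatient claims, outpatient claims, beneficiary data).
providers = pd.read_csv("Train_labels.csv")           # Provider ID + PotentialFraud flag
inpatient = pd.read_csv("Train_Inpatientdata.csv")    # inpatient claims
outpatient = pd.read_csv("Train_Outpatientdata.csv")  # outpatient claims

# Check the class balance of the target
print(providers["PotentialFraud"].value_counts(normalize=True))
```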
Fraudulent providers, while only 9.35% of the providers in the dataset, accounted for over half of the total reimbursements.
Key features for identifying fraud included the total treatment time billed by a provider and the number of claims submitted.
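A hedged feature-engineering sketch, continuing from the loading step above: it derives per-claim treatment time from admission and discharge dates and aggregates claims per provider. Column names such as AdmissionDt, DischargeDt, ClaimID, and InscClaimAmtReimbursed are assumptions about the dataset schema.

```python
# Derive treatment time per claim (assumed date columns)
inpatient["AdmissionDt"] = pd.to_datetime(inpatient["AdmissionDt"])
inpatient["DischargeDt"] = pd.to_datetime(inpatient["DischargeDt"])
inpatient["TreatmentDays"] = (inpatient["DischargeDt"] - inpatient["AdmissionDt"]).dt.days

# Aggregate to one row per provider
provider_features = (
    inpatient.groupby("Provider")
    .agg(
        TotalTreatmentDays=("TreatmentDays", "sum"),       # total treatment time billed
        NumClaims=("ClaimID", "count"),                    # number of claims submitted
        TotalReimbursed=("InscClaimAmtReimbursed", "sum"), # total reimbursements
    )
    .reset_index()
)

# Attach the fraud label for modeling
data = provider_features.merge(providers, on="Provider", how="left")
```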
Initial modeling with Logistic Regression showed that the class imbalance had to be addressed, so SMOTE (Synthetic Minority Over-sampling Technique) was adopted to oversample the minority class.
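One way to wire this up is an imbalanced-learn pipeline, sketched below under the assumption that the target column holds "Yes"/"No" labels and that the provider-level features built above are available; placing SMOTE inside the pipeline keeps the synthetic samples confined to the training data.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = data.drop(columns=["Provider", "PotentialFraud"])
y = (data["PotentialFraud"] == "Yes").astype(int)  # assumed label encoding

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE lives inside the pipeline so synthetic samples are generated
# only from the training folds, never from held-out data.
model = Pipeline([
    ("scale", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
```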
SMOTE combined with Logistic Regression improved recall to 0.86 and the ROC AUC to 0.9611, reflecting the priority of catching fraudulent providers even at the cost of some false alarms.
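Evaluation can then focus on the metrics that matter for the minority class rather than overall accuracy. A short sketch, reusing the fitted pipeline and test split from above:

```python
from sklearn.metrics import classification_report, recall_score, roc_auc_score

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("Recall :", recall_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred, target_names=["Not Fraud", "Fraud"]))
```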
Future work includes exploring additional models such as Random Forest and LightGBM, as well as real-time fraud detection scenarios.
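As a starting point for such a comparison, a Random Forest with class weighting can be cross-validated on the same provider-level features; this is a sketch, not part of the original analysis, and LightGBM would slot in the same way if the package is installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# from lightgbm import LGBMClassifier  # optional alternative, requires lightgbm

rf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
scores = cross_val_score(rf, X, y, cv=5, scoring="roc_auc")
print("Random Forest ROC AUC: %.4f +/- %.4f" % (scores.mean(), scores.std()))
```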
The project highlighted the significance of feature engineering, model tuning, and addressing imbalanced datasets in developing effective fraud detection models.
When working with real-world, imbalanced datasets, it is crucial to look beyond basic metrics such as accuracy in order to build meaningful fraud detection solutions.
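To see why accuracy misleads here, consider a baseline that always predicts "not fraud": with roughly 9.35% fraudulent providers it scores about 91% accuracy while catching no fraud at all. A quick illustration, reusing the train/test split from above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# A baseline that always predicts the majority class ("not fraud")
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_base = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_base))  # ~0.91 given a 9.35% fraud rate
print("Recall  :", recall_score(y_test, y_base))    # 0.0 -- misses every fraudulent provider
```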
The project showcased the importance of critical thinking in problem-solving and the continuous exploration of advanced techniques for fraud detection in the healthcare sector.