Fraud Detection in Electricity and Gas Consumption
The Tunisian Company of Electricity and Gas (STEG) is a public, non-administrative company responsible for delivering electricity and gas across Tunisia. The company has suffered losses amounting to 200 million Tunisian Dinars due to fraudulent manipulation of meters by consumers. The aim of this challenge is to use clients' billing history to identify those involved in fraudulent activities. A working solution would help the company recover revenue and reduce the losses caused by such fraud.
The challenge is to build a model that can accurately identify fraudulent behavior based on clients' billing history and other relevant features. This task is crucial for STEG to mitigate financial losses and ensure fair service provision to all customers.
Data Preparation
The dataset consists of clients' billing history along with features such as consumption levels, meter statuses, customer tenure, customer residence, and agent remarks. For more details about the challenge and dataset, refer to the original challenge description.
Feature Engineering
Feature engineering played a crucial role in improving the model's ability to detect fraudulent activities. We created an extensive set of features from the raw data to capture various aspects of customer behavior and consumption patterns. Key steps in our feature engineering process included:
1. Data Preprocessing:
Renamed columns for consistency and clarity
Converted appropriate columns to categorical data types
Handled date/time conversions
2. Consumption-based Features:
Calculated total consumption per billing cycle
Created aggregated features (sum, min, max, mean, std, range) for each consumption level (1-4)
Computed overall energy consumption statistics
3. Time-based Features:
Derived features from invoice dates (e.g., month, year)
Calculated time deltas between invoices
Computed contract duration ('coop_time')
4. Meter and Billing Features:
Aggregated counter status, agent remarks, counter coefficient, and counter code
Created transaction count feature
5. Normalized Features:
Calculated invoices per cooperation period
6. Categorical Encoding:
Applied appropriate encoding techniques for categorical variables
This comprehensive feature set aimed to capture various patterns and anomalies that might indicate fraudulent behavior. The full feature engineering code is available in our GitHub repository for those interested in the implementation details.
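To make the aggregation steps more concrete, here is a minimal pandas sketch of the per-client consumption and time-delta features. It is an illustration only, not the code from the repository: the column names (client_id, invoice_date, consumption_level_1, consumption_level_2) and the toy values are assumptions, and only two of the four consumption levels are shown.

```python
import pandas as pd

# Hypothetical invoice-level data: one row per invoice, several rows per client.
invoices = pd.DataFrame({
    "client_id": ["a", "a", "a", "b", "b"],
    "invoice_date": pd.to_datetime(
        ["2019-01-10", "2019-03-12", "2019-05-09", "2019-02-01", "2019-06-01"]
    ),
    "consumption_level_1": [120, 90, 150, 300, 280],
    "consumption_level_2": [0, 10, 0, 50, 40],
})

# Total consumption per billing cycle: sum across the consumption levels.
level_cols = ["consumption_level_1", "consumption_level_2"]
invoices["total_consumption"] = invoices[level_cols].sum(axis=1)

# Time delta (in days) between consecutive invoices of the same client.
invoices = invoices.sort_values(["client_id", "invoice_date"])
invoices["days_since_prev"] = (
    invoices.groupby("client_id")["invoice_date"].diff().dt.days
)

def value_range(series):
    """Spread between the largest and smallest value in a group."""
    return series.max() - series.min()

# Collapse the invoice history into one feature row per client.
features = invoices.groupby("client_id").agg(
    total_sum=("total_consumption", "sum"),
    total_mean=("total_consumption", "mean"),
    total_std=("total_consumption", "std"),
    total_range=("total_consumption", value_range),
    mean_days_between=("days_since_prev", "mean"),
    n_invoices=("invoice_date", "count"),
)
print(features)
```

The resulting table has one row per client, which is the shape a classifier expects when the fraud label is defined at the client level.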
Model Building and Training
We implemented two models for comparison: logistic regression as a baseline and an XGBoost classifier as the primary model. We ultimately chose XGBoost for its ability to handle imbalanced datasets, its robustness to outliers, and its feature importance rankings. The model was trained on a combination of original and engineered features, with hyperparameters tuned through grid search to optimize performance. For more details on model training and tuning, visit the project's GitHub repo.
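As a rough sketch of the grid-search workflow, the example below tunes the logistic regression baseline with scikit-learn's GridSearchCV. Everything in it is an assumption made for illustration: the data is synthetic (generated with make_classification, roughly 10% positives), the parameter grid is arbitrary, and ROC AUC is used as the selection metric. XGBoost's XGBClassifier follows the same scikit-learn estimator interface, so the pattern should carry over with a different parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the per-client feature table.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Baseline: scaled features plus class-weighted logistic regression,
# so the rare fraud class is not drowned out by the majority class.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Illustrative grid over the regularization strength only.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(baseline, param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

best_c = search.best_params_["logisticregression__C"]
# Predicted fraud probabilities on the held-out clients.
test_scores = search.predict_proba(X_test)[:, 1]
print("best C:", best_c)
```

Stratifying the train/test split keeps the fraud rate similar in both parts, which matters when positives are scarce.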
Model Evaluation
Confusion matrix
We evaluated our model using a confusion matrix and the ROC curve with its AUC. The resulting confusion matrix is visualised below:
Interpretation:
The model successfully detected 80% of fraudulent cases (high recall).
However, it also misclassified 34% of non-fraudulent activities as fraudulent (high false positive rate).
This trade-off suggests that while the model is effective at catching fraud, it may lead to some unnecessary investigations of legitimate customers.
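The two rates above come straight from the cells of the confusion matrix. The short sketch below shows the arithmetic with made-up counts, chosen only so that they reproduce the 80% recall and 34% false positive rate; they are not the actual counts from our matrix. It also computes precision, which shows how much of the flagged population would be genuine fraud under that assumed class balance.

```python
def confusion_rates(tp, fn, fp, tn):
    """Return (recall, false_positive_rate, precision) from confusion matrix counts."""
    recall = tp / (tp + fn)              # share of fraud cases that were caught
    false_positive_rate = fp / (fp + tn) # share of legitimate clients flagged
    precision = tp / (tp + fp)           # share of flagged clients that are fraud
    return recall, false_positive_rate, precision

# Illustrative counts only: 100 fraudulent and 1,000 legitimate clients.
recall, fpr, precision = confusion_rates(tp=80, fn=20, fp=340, tn=660)
print(f"recall={recall:.2f}, fpr={fpr:.2f}, precision={precision:.2f}")
```

With these assumed counts, most flagged clients would be legitimate even though recall is high, which is why the false positive rate deserves as much attention as recall when fraud is rare.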
Feature importance
The top features for fraud detection were:
Total consumption levels
Number of invoices
Meter status changes
Specific agent remarks
Conclusion
Our XGBoost model demonstrates promising results in detecting fraudulent activities in electricity and gas consumption. Key findings include:
Feature engineering was crucial for improving model performance.
The model achieves a high recall rate but struggles with false positives.
Boosting methods (XGBoost, LightGBM, AdaBoost) outperformed traditional approaches like Random Forest for this problem.
To further improve the model, we suggest:
Collecting more granular data on consumption patterns.
Incorporating external data sources (e.g., weather, holidays) to account for legitimate consumption spikes.
Experimenting with anomaly detection techniques to reduce false positives.
Implementing a two-stage model: an initial screening followed by a more detailed investigation.
By refining this approach, utility companies can significantly enhance their fraud detection capabilities, leading to substantial cost savings and improved service delivery.
We hope this post provides valuable insights into the process of building and evaluating a fraud detection model using machine learning. Stay tuned for more data science and analytics topics!
References
For the complete code and additional resources, please visit our GitHub repository.