Csaba Burger, DPhil, Data Science Advisor
Error spotting with gradient boosting: anomaly detection using supervised learning for data quality at the Central Bank of Hungary (MNB)
Central banks have long collected data from the banks they supervise. Despite the best efforts of all participants, the collected data may contain errors. In this talk, we introduce a novel anomaly-detection algorithm based on supervised learning that identifies potential data errors in granular data sets collected by the Central Bank of Hungary (MNB).
In a nutshell, we rely on the tenet that the data contain a ‘ground truth’: the reported values are assumed to be correct in the majority of cases, so the relationships between variables can be learned from the data themselves. Points that deviate from these relationships, i.e. outliers, are flagged as potential data errors. More specifically, our outlier-based error-spotting algorithm uses extreme gradient boosting (xgboost), and we illustrate our experience with data from the Credit Registry (Hitelregiszter). The novelty of our approach compared with unsupervised methods such as the isolation forest or clustering algorithms is that we first look for inherent structures within the data and flag anomalies only as a second step.
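The idea of learning a relationship first and flagging deviations second can be illustrated with a minimal sketch. It is not the MNB implementation: a hand-rolled gradient boosting on decision stumps stands in for xgboost so that the example needs only the standard library, and the data, field meanings, and function names (`fit_stump`, `boost`) are hypothetical. In practice a library model such as `xgboost.XGBRegressor` would take the place of `boost`.

```python
from statistics import mean

def fit_stump(xs, residuals):
    """Find the single threshold on x that best fits the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # candidate splits: x < threshold
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        lm, rm = mean(left), mean(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Gradient boosting with squared loss: each stump is fitted to current residuals."""
    preds = [mean(ys)] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, lm, rm = fit_stump(xs, residuals)
        preds = [p + learning_rate * (lm if x < threshold else rm)
                 for x, p in zip(xs, preds)]
    return preds

# Hypothetical data: the target is 10 * x, with 8 rows per value of x.
rows = [(x, 10.0 * x) for x in (1, 2, 3) for _ in range(8)]
error_index = 10                              # a row with x == 2
x_err, y_true = rows[error_index]
rows[error_index] = (x_err, y_true * 10)      # simulated decimal-shift error

xs = [x for x, _ in rows]
ys = [y for _, y in rows]

# Step 1: learn the structure. Step 2: score each row by its deviation from it.
preds = boost(xs, ys)
scores = [abs(y - p) for y, p in zip(ys, preds)]
flagged = max(range(len(scores)), key=lambda i: scores[i])
print("most anomalous row:", flagged)
```

Because most rows with the same `x` report the correct value, the fitted prediction stays close to them and the corrupted row keeps by far the largest residual. The model is fitted on data that still contain the error, which is exactly where the majority-correct assumption matters.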
Our talk provides a blueprint for such undertakings, covering pre-processing, feature engineering and modelling choices (e.g. the choice of loss function), as well as the interpretation of results. Finally, we discuss not only how various model specifications performed in re-identifying a set of synthetic errors, but also the feedback we received during the pilot project from the business side and from supervised banks. This includes the best practices we found for presenting our results to a non-AI-literate audience.
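Evaluating a detector on synthetic errors could look roughly like the sketch below. Everything in it is an assumption made for illustration: the error type (a decimal shift), the metric (the share of injected errors among the top-k scores), and the placeholder `stand_in_detector`, which would be replaced by the model specification under test.

```python
import random
from statistics import median

def inject_errors(values, n_errors, rng):
    """Return a corrupted copy of values and the set of corrupted row indices."""
    corrupted = list(values)
    injected = set(rng.sample(range(len(values)), n_errors))
    for i in injected:
        corrupted[i] *= 10  # hypothetical error type: decimal shift
    return corrupted, injected

def recall_at_k(scores, injected):
    """Share of injected errors among the k highest-scoring rows (k = number injected)."""
    k = len(injected)
    top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return len(injected.intersection(top_k)) / k

def stand_in_detector(groups, values):
    """Placeholder scorer: absolute deviation from the median of the row's group."""
    by_group = {}
    for g, v in zip(groups, values):
        by_group.setdefault(g, []).append(v)
    medians = {g: median(vs) for g, vs in by_group.items()}
    return [abs(v - medians[g]) for g, v in zip(groups, values)]

# Hypothetical clean data: 100 rows per group, value fully determined by the group.
groups = [g for g in (1, 2, 3) for _ in range(100)]
clean = [10.0 * g for g in groups]

corrupted, injected = inject_errors(clean, n_errors=15, rng=random.Random(0))
recall = recall_at_k(stand_in_detector(groups, corrupted), injected)
print(f"re-identified {recall:.0%} of the injected errors")
```

Running the same harness with different detectors, error types, or error rates gives a like-for-like comparison of model specifications, since the set of injected rows is known in advance.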
BIO
Csaba Burger, DPhil, CFA, has been leading machine learning projects at the MNB Directorate Statistics. His current tasks stretch from developing novel methods to measure financial and climate change risks, to data-centric machine learning applications that support further modelling tasks.