Gianluca Bontempi

Université Libre de Bruxelles

[intermediate/advanced] Big Data Analytics in Fraud Detection and Churn Prevention: from Prediction to Causal Inference

Summary

Designing machine learning algorithms for real-world big data raises a number of research challenges that deserve specific attention, and not only from an applied perspective. This lecture will focus on two real business analytics cases to illustrate a number of recent research contributions of my group.

Credit-card fraud detection: the design of efficient fraud detection algorithms is key to reducing the billions of dollars lost every year to fraudulent credit card transactions. A growing number of detection systems rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is, however, particularly challenging because of the non-stationary distribution of the data, the highly imbalanced class distributions and the continuous streams of transactions. At the same time, public data are scarce for confidentiality reasons, which leaves many questions unanswered about the best strategy to deal with these issues. In this talk we will discuss a number of lessons learned during our long-standing collaboration with the R&D team of Worldline. In particular, we will focus on best practices for the assessment of credit card fraud detection models and we will discuss the impact of class imbalance and non-stationarity on the resulting accuracy. More recent research directions, including big data infrastructure, active learning and transfer learning, will be sketched as well.
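To make the effect of class imbalance on probability estimates concrete, here is a minimal Python sketch. It assumes the majority (genuine) class is undersampled by keeping each genuine transaction with probability beta, while all frauds are kept. Under that assumption, Bayes' rule relates the fraud probability learned on the undersampled data to the one on the original data. The function names and the numbers are illustrative assumptions, not taken from the lecture material.

```python
def undersampled_posterior(p: float, beta: float) -> float:
    """Fraud probability seen by a model trained on undersampled data.

    p    : fraud probability P(y=1 | x) on the original data
    beta : probability of keeping a genuine (negative) transaction
    """
    return p / (p + beta * (1.0 - p))


def correct_posterior(p_s: float, beta: float) -> float:
    """Map a probability learned on undersampled data back to the original scale."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


if __name__ == "__main__":
    p, beta = 0.01, 0.1  # hypothetical values: 1% fraud risk, keep 10% of genuine rows
    p_s = undersampled_posterior(p, beta)
    print("inflated probability after undersampling:", p_s)
    print("probability after correction:", correct_posterior(p_s, beta))
```

The sketch shows that undersampling inflates the estimated fraud probability, so scores from a model trained on rebalanced data should be mapped back before being read as probabilities or compared against a cost-based threshold.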

Churn detection: this is an important issue for telecommunication companies operating in a highly competitive market where attracting new customers is much more expensive than retaining existing ones. Retention campaigns can be used to prevent customer churn, but their effectiveness depends on the availability of accurate prediction models. Churn prediction shares a number of issues with fraud detection, notably the large amount of data, non-linearity, class imbalance and low separability between churners and non-churners. However, the design of retention campaigns raises a number of research issues that go beyond prediction and concern causal inference, notably uplift and counterfactuals. The uplift measures the causal effect of some action, or treatment, on the outcome of an individual. Counterfactual reasoning is crucial in the design of retention campaigns, since customers can be stratified according to four counterfactual behaviours: (i) Sure thing: the customer does not churn regardless of the action. (ii) Persuadable: the customer churns only if not contacted. (iii) Do-not-disturb: the customer churns only if contacted. (iv) Lost cause: the customer churns regardless of the action. The last part of the lecture will present recently published results on bounds on the probability of counterfactuals and their assessment on a large real-world customer data set provided by Orange Belgium.
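The four behaviours above can be related to quantities that a randomised campaign can estimate. The following is a minimal Python sketch, assuming a randomised experiment gives the churn rate without contact (s0) and with contact (s1). It computes the uplift as the reduction in churn probability, together with the classical Fréchet bounds on the share of each counterfactual category. These generic bounds are used for illustration only and are not necessarily the bounds presented in the lecture. All names and numbers are hypothetical.

```python
def uplift(s0: float, s1: float) -> float:
    """Reduction in churn probability caused by contacting the customer.

    s0 : P(churn | not contacted), estimated on the control group
    s1 : P(churn | contacted), estimated on the treated group
    """
    return s0 - s1


def counterfactual_bounds(s0: float, s1: float) -> dict:
    """Fréchet (lower, upper) bounds on the probability of each behaviour.

    Only the two marginal churn rates are used, so the joint distribution of
    the two potential outcomes is bounded rather than identified.
    """
    return {
        "sure_thing": (max(0.0, 1.0 - s0 - s1), min(1.0 - s0, 1.0 - s1)),
        "persuadable": (max(0.0, s0 - s1), min(s0, 1.0 - s1)),
        "do_not_disturb": (max(0.0, s1 - s0), min(1.0 - s0, s1)),
        "lost_cause": (max(0.0, s0 + s1 - 1.0), min(s0, s1)),
    }


if __name__ == "__main__":
    s0, s1 = 0.20, 0.15  # hypothetical churn rates in control and treated groups
    print("uplift:", uplift(s0, s1))
    for name, (low, high) in counterfactual_bounds(s0, s1).items():
        print(f"{name}: between {low:.2f} and {high:.2f}")
```

Note that the uplift equals the probability of being persuadable minus the probability of being do-not-disturb, so a positive uplift alone does not pin down the share of persuadable customers; this is why bounds are of interest.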

Syllabus

Slot 1:

Introduction to fraud detection systems
Introduction to churn detection
Formalisation of the detection tasks in terms of machine learning: unsupervised vs supervised classification
From prediction to causal inference in big data

Slot 2: Research challenges in fraud detection:

The unbalancedness issue
Nonstationarity
Transfer learning
Scalable computing
Reproducibility of results

Slot 3: Research challenges in churn detection:

Uplift modeling
Counterfactuals
Theoretical results

References

Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A.; Bontempi, Gianluca. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Aël; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41, 10, 4915-4928, 2014.

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29, 8, 3784-3797, IEEE, 2018.

Dal Pozzolo, Andrea. Adaptive machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018.

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5, 4, 285-300, 2018.

Lebichot, Bertrand; Le Borgne, Yann-Aël; He, Liyun; Oblé, Frederic; Bontempi, Gianluca. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection. INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, 78-88, 2019.

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Oblé, Frederic; Bontempi, Gianluca. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection. Information Sciences, 2019.

Le Borgne, Yann-Aël; Bontempi, Gianluca. Reproducible machine learning for credit card fraud detection: practical handbook. https://fraud-detection-handbook.github.io/

Verhelst, Théo; Caelen, Olivier; Dewitte, Jean-Christophe; Lebichot, Bertrand; Bontempi, Gianluca. Understanding telecom customer churn with machine learning: from prediction to causal inference. Artificial Intelligence and Machine Learning: 31st Benelux AI Conference, BNAIC 2019, and 28th Belgian-Dutch Machine Learning Conference, BENELEARN 2019, Brussels, Belgium, Springer International Publishing.

Verhelst, Théo; Shrestha, Jeevan; Mercier, Denis; Dewitte, Jean-Christophe; Bontempi, Gianluca. Predicting Reach To Find Persuadable Customers: Improving Uplift Models for Churn Prevention. Discovery Science: 24th International Conference, DS 2021, Halifax, NS, Canada, October 11–13, 2021, 44-54, 2021, Springer International Publishing.

Verhelst, Théo; Mercier, Denis; Shrestha, Jeevan; Bontempi, Gianluca. Partial counterfactual identification and uplift modeling: theoretical results and real-world assessment. arXiv preprint arXiv:2211.07264, 2022. To appear in Machine Learning Journal.

Gianluca Bontempi. Statistical foundations of machine learning: the book. https://leanpub.com/statisticalfoundationsofmachinelearning

Pre-requisites

Basic knowledge of machine learning and classification.

Short bio

Gianluca Bontempi is Full Professor in the Computer Science Department at the Université Libre de Bruxelles (ULB), Brussels, Belgium, and co-head of the ULB Machine Learning Group (mlg.ulb.ac.be). He was Director of (IB)2, the ULB/VUB Interuniversity Institute of Bioinformatics in Brussels (ibsquare.be), from 2013 to 2017. His main research interests are big data mining, machine learning, bioinformatics, causal inference, predictive modeling and their application to complex tasks in engineering (time series forecasting, fraud detection) and life science (network inference, gene signature extraction). He was a Marie Curie fellow researcher, has won awards in two international data analysis competitions and has taken part in many research projects in collaboration with universities and private companies all over Europe. He is the author of more than 250 scientific publications and his h-index is 64. He is an associate editor of the International Journal of Forecasting and an IEEE Senior Member. He was the Belgian (French Community) national contact point of the CLAIRE network and co-leader of the CLAIRE COVID19 Task Force. He is also co-author of several open-source software packages for bioinformatics, data mining and prediction.
