François HU

About me, my research, my teaching and my experiments


About me

I am a senior data scientist (PhD) and lecturer in machine learning and computational statistics at ENSAE, EPITA and Institut des Actuaires. In a nutshell…

Since 2022: Postdoctoral Researcher in the Department of Mathematics and Statistics at Université de Montréal (UdeM) with Arthur Charpentier (UQAM) and Manuel Morales (UdeM). TL;DR: My expertise lies in statistical learning and mathematics, primarily in insurance, and also extends to biostatistics and finance.

Short description: I am currently engaged in collaborative research in the following areas:

  1. Statistics & Statistical learning: I specialize in algorithmic fairness for statistical (machine/deep) learning and mathematical modeling, and have published multiple papers on the topic in top journals and conferences.
  2. Epidemiology & Biostatistics: I apply spatiotemporal and NLP techniques to biostatistics. In particular, I am involved in developing and implementing an ‘Early Warning System for Infectious Diseases’, a project that is part of the Mathematics for Public Health (MfPH) initiative.
  3. Finance: I take part in collaborative research on algorithmic fairness and NLP techniques in finance. The objective is to identify Environmental, Social, and Governance (ESG) concepts within Canadian companies. This joint initiative, conducted in partnership with Algora Lab (affiliated with UdeM and Mila), aims to advance sustainable finance and bring ethical considerations into AI and ML.

2019 - 2022: Award for the best thesis in actuarial science in France.

Three-year PhD thesis in Machine Learning and Insurance at Institut Polytechnique de Paris (CREST-ENSAE), titled "Semi-supervised learning in insurance: Fairness and Labeling", under the supervision of Caroline HILLAIRET and Romuald ELIE.

Short description: Insurance and financial institutions collect large volumes of unstructured data every day. Managing this data raises several challenges for machine learning and model transparency: (1) manual tagging by experts is too slow for large volumes and near-real-time information; and (2) the data may contain biased information, which raises ethical concerns and can make it unsuitable for use. Addressing these issues requires a learning system for the insurance and finance sector that is precise (in terms of prediction), cost-effective (in terms of labeling), and ethical (in terms of transparency and fairness). This thesis is dedicated to these challenges and proposes a comprehensive solution to improve data management and ensure compliance.

2018 - 2022: Over four years of industry experience in Data Science in the Datalab of Société Générale Insurance with Marc JUILLARD.

Short description: I specialize in textual data analysis, with an emphasis on fairness and transparency in machine learning models. My responsibilities include overseeing statistical projects on online learning, semi-supervised learning, transparency in deep learning models, and computer vision. I played a key role in deploying machine learning models for insurance scoring, covering home (MRH) and motor (Auto) insurance. I also contributed to implementing ETL tools and remain involved in building a streamlined end-to-end ML orchestration pipeline with MLOps tools such as Git (GitHub/GitLab), MLflow, Kedro, CI/CD / automated tests …


My topics of interest are the following:

  • ML in Insurance & Finance
  • ML Fairness, Interpretability and Privacy
  • NLP-related topics, for example ESG reporting for sustainable finance, news and posts about infectious diseases, …
  • Semi-supervised learning & sampling methods

Recent papers

Recent talks


EPITA (2020 - …)

Master of Science:

Master of Science in Artificial Intelligence Systems:

  • Numerical Algorithms (and optimization for Machine Learning) by François HU
    • Lecture 1: Calculus refresher [Lecture] [Notebook]
    • Lecture 2: Unconstrained optimization [Lecture] [Notebook]
    • Lecture 3: Constrained optimization [Lecture]
    • Lecture 4: Numerical methods in linear algebra [Lecture]
    • Lecture 5: Machine learning applications
    • Lecture 6: [Oral presentations] -> Titanic challenge
    • Practical work: Linear/Logistic regression, PCA (and SVM) [Notebook]
  • Bayesian Machine Learning by François HU
    • Lecture 1: Bayesian statistics [Lecture]
    • Lecture 2: Latent Variable Models and EM algorithm [Lecture]
    • Lecture 3: Variational Inference and intro to NLP [Lecture]
    • Lecture 4: Markov Chain Monte Carlo (& Gaussian Process) [Lecture]
    • Lecture 5: [Oral presentations] -> Topic models, Bayesian optimization, Uncertainty and t-SNE
    • Practical work 1: Naive Bayes Classifier [Notebook]
    • Practical work 2: GMM, Probabilistic K-means and PCA [Notebook]
    • Practical work 3: Topic Modeling with LDA [Notebook]
    • Practical work 4: Sampling posteriors with MCMC [Notebook]
    • Practical work 5: Bayesian Linear Regression [Notebook]
    • Bonus points: p85-ex1 Lec1 (0.5pt); PW1 (1pt); PW2 (2pt); PW3 (0.5pt); PW4 (1pt); PW5 (1pt)

Institut des Actuaires (2019 - …)

Formation Data Science pour l’Actuaire:

Teaching assistant

Institut Polytechnique de Paris (ENSAE, Polytechnique) (2019 - …)

  • 1A - semester 1 (2019 - 2020): Algorithme et programmation (Algorithms and Programming) by Xavier Dupré
  • 2A - semester 2 (2019 - 2021): Simulation et Monte Carlo (Simulation and Monte Carlo) by Nicolas Chopin
    • TD1: Uniform distribution and Monte Carlo (ex1) [Corr in R] [Written corr]
    • TD2: Rejection method and improved Box-Muller (ex2 and ex3) [Corr in R]
    • TD3: Geometric distribution, control variates and antithetic variates (ex4 and ex5) [Corr in R]
    • TD4: MCMC and importance sampling (ex6 and ex7) [Corr in R]
    • TD5: Cross-entropy method and quasi-Monte Carlo (ex9 and ex5-falc) [1-Corr-R] and [2-Corr-R]
    • TD6: Oral defense
  • 2A - semester 2 (2019 - 2020): Theoretical foundations of Machine Learning by Vianney Perchet
    • Directed work 1: Plug-in methods and over/under-fitting [Written corr]
    • Practical work 1: Linear/polynomial regression and k-NN [Corr in Python]
    • Directed work 2: Selection and penalization of models [Written corr]
    • Practical work 2: k-NN, Perceptron and Cross-Validation [Corr in Python]
    • Practical work 3: SVM, Decision Tree and Random Forest [Corr in Python]
  • 3A - semester 1 (2020 - 2021): Advanced Machine Learning by Vianney Perchet
    • This course covers ERM, SVM, boosting, neural networks and optimization
    • Directed work: VC-dimension and ERM (correction soon available)
    • Practical work: Python, Linear Regression and SVM [Corr in Python]
    • Practical work: RKHS, optimization and neural networks [Written corr] [Neural Nets in Python]
  • 3A - semester 2 (2019 - 2020): Machine Learning for Finance by Romuald Elie
    • Guest speaker on NLP