REGISTRATION is FREE but MANDATORY
Conformal prediction is a statistical technique that provides calibrated prediction sets (i.e., confidence intervals for predictions) in various prediction tasks such as classification and regression. It works by inverting so-called "conformal" p-values, which can be constructed from any ML model without relying on distributional assumptions beyond the exchangeability of the data points. In this talk we consider the problem of using conformal prediction to control the false discovery rate (FDR) in the link prediction problem. We will start by considering the problem of FDR control in a simple novelty detection task (without any network structure); then, we will see that the proposed procedure can be extended to the link prediction problem. In particular, we will see that theoretical guarantees can still be derived in such a setting, despite the graph structure inducing intricate dependence in the data.
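For concreteness, the sketch below follows the generic split-conformal recipe for novelty detection mentioned above (no network structure): conformal p-values are computed from an arbitrary scoring model and then passed to the Benjamini-Hochberg procedure. The data, the IsolationForest scorer and the level alpha are illustrative assumptions, not the specific procedure of the talk.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Illustrative data: nominal training and calibration points, plus a test
# batch that may contain novelties (here: a shifted Gaussian cluster).
X_train = rng.normal(size=(500, 5))
X_cal = rng.normal(size=(500, 5))
X_test = np.vstack([rng.normal(size=(80, 5)),
                    rng.normal(loc=3.0, size=(20, 5))])

# Any ML model can supply nonconformity scores; IsolationForest is one choice.
model = IsolationForest(random_state=0).fit(X_train)
s_cal = -model.score_samples(X_cal)    # higher score = more "novel"
s_test = -model.score_samples(X_test)

# Conformal p-value of each test point: rank of its score among calibration scores.
n = len(s_cal)
pvals = (1 + np.sum(s_cal[None, :] >= s_test[:, None], axis=1)) / (n + 1)

# Benjamini-Hochberg at level alpha applied to the conformal p-values.
alpha = 0.1
order = np.argsort(pvals)
thresholds = alpha * np.arange(1, len(pvals) + 1) / len(pvals)
below = np.nonzero(pvals[order] <= thresholds)[0]
rejected = order[:below[-1] + 1] if below.size else np.array([], dtype=int)
print(f"{rejected.size} test points flagged as novelties")
```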
Variable importance measures (VIMs) aim to quantify the contribution of each input covariate to the predictability of a given output. With the growing interest in explainable AI, numerous VIMs have been proposed, many of which are heuristic in nature. This is often justified by the inherent subjectivity of the notion of importance. This raises important questions regarding usage: What makes a good VIM? How can we compare different VIMs?
In this paper, we address these questions by: (1) proposing an axiomatic framework that bridges the gap between variable importance and variable selection. This framework formalizes the intuitive principle that features providing no additional information should not be assigned importance. It helps avoid false positives due to spurious correlations, which can arise with popular methods such as Shapley values; and (2) introducing a general pipeline for constructing VIMs, which clarifies the objective of various VIMs and thus facilitates meaningful comparisons. This approach is natural in statistics, but the literature has diverged from it.
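As a rough illustration of the principle that a feature providing no additional information should receive no importance, here is a minimal leave-one-covariate-out (LOCO) style sketch; the data-generating process, the random-forest learner, and the helper loco_importance are hypothetical and are not the axiomatic framework or pipeline of the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Illustrative data: X1 drives the output, X2 is a noisy copy of X1 (a spurious
# correlate adding no extra information), X3 is pure noise.
n = 2000
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.5 * rng.normal(size=n), rng.normal(size=n)])
y = np.sin(x1) + 0.1 * rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def loco_importance(j):
    """LOCO importance: increase in test loss when covariate j is dropped."""
    full = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    reduced = RandomForestRegressor(random_state=0).fit(np.delete(X_tr, j, axis=1), y_tr)
    loss_full = mean_squared_error(y_te, full.predict(X_te))
    loss_reduced = mean_squared_error(y_te, reduced.predict(np.delete(X_te, j, axis=1)))
    return loss_reduced - loss_full

# X2 and X3 carry no information beyond X1, so their LOCO importance is near zero;
# a marginal measure could still credit X2 because of its correlation with X1.
for j in range(X.shape[1]):
    print(f"X{j + 1}: LOCO importance = {loco_importance(j):.4f}")
```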
Causal machine learning holds promise for estimating individual treatment effects from complex data. For successful real-world applications of machine learning methods, it is of paramount importance to obtain reliable insights into which variables drive heterogeneity in the response to treatment. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) reference method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference in the limited-data regime common to biomedical applications. We empirically demonstrate the benefits of PermuCATE in simulated and real-world health datasets, including settings with up to hundreds of correlated variables.
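To make the CPI idea concrete, below is a simplified sketch (no cross-fitting, no variance estimate or test) of conditional permutation importance applied to a pseudo-outcome fit of the CATE; the T-learner, the linear conditional model and all names are illustrative assumptions, and this is not the PermuCATE algorithm itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Illustrative data: binary treatment T, effect heterogeneity driven by X1 only,
# X2 correlated with X1 but not modulating the effect, X3 pure noise.
n = 3000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2, rng.normal(size=n)])
T = rng.binomial(1, 0.5, size=n)
y = X[:, 0] * T + X[:, 1] + rng.normal(scale=0.5, size=n)

# A simple T-learner produces pseudo CATE values, which a second model explains.
mu1 = RandomForestRegressor(random_state=0).fit(X[T == 1], y[T == 1])
mu0 = RandomForestRegressor(random_state=0).fit(X[T == 0], y[T == 0])
tau_hat = mu1.predict(X) - mu0.predict(X)
cate_model = RandomForestRegressor(random_state=0).fit(X, tau_hat)
base_loss = np.mean((tau_hat - cate_model.predict(X)) ** 2)

def conditional_permutation_importance(j, n_perm=20):
    """CPI: permute only the part of X_j not explained by the other covariates,
    then measure the increase in the CATE-model loss."""
    X_minus_j = np.delete(X, j, axis=1)
    cond_mean = LinearRegression().fit(X_minus_j, X[:, j]).predict(X_minus_j)
    resid = X[:, j] - cond_mean
    losses = []
    for _ in range(n_perm):
        X_perm = X.copy()
        X_perm[:, j] = cond_mean + rng.permutation(resid)
        losses.append(np.mean((tau_hat - cate_model.predict(X_perm)) ** 2))
    return np.mean(losses) - base_loss

for j in range(X.shape[1]):
    print(f"X{j + 1}: CPI = {conditional_permutation_importance(j):.4f}")
```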
In high-dimensional statistical tasks, the primary applied objective is often to identify a small set of relevant candidates (e.g., regions or genes) among a large number of potential variables. While causal inference offers a principled framework to refine this selection by distinguishing true causal drivers from mere correlations, its practical implementation faces critical challenges. Traditional causal methods rely on strong, often untestable assumptions (such as the absence of unmeasured confounding or the correctness of a specified causal graph) to guarantee that identified variables will reliably induce changes in outcomes when intervened upon. Yet real-world applications demand well-posed problems with precise, often simplified, knowledge of variable relationships, which is rarely available in complex biological or clinical systems.
In this presentation, we will explore key research questions arising from practical applications, examining how variables assume distinct causal roles (e.g., exposures, mediators, colliders, confounders, or outcomes) and the associated identifiability challenges at the individual variable level. We will then review emerging directions from causal representation learning, a field at the intersection of high-dimensional causal graphs and causal discovery with latent variables, to address the inherent complexities of observed data structures. Specifically, we will highlight promising results that integrate causal knowledge into neural architectures (e.g., through structured variational autoencoders or invariant causal prediction), adapt traditional statistical approaches, such as regularization to enforce sparsity and interpretability, to the causal framework, and leverage novel testing approaches (e.g., conditional independence tests in high dimensions) to robustly infer causal relationships.
I will talk about non-parametric classification in the presence of multiple sources of data. Considering various forms of risk, in the spirit of distributionally robust optimization/learning, I will describe the form of the optimal classifiers. Then, relying on the semi-plug-in approach, I will present non-parametric upper bounds for various choices of the risk. It will then be shown that the bounds are (up to log factors) minimax optimal over standard smoothness classes. These results demonstrate an intricate dependency between the obtained rates and the choice of the risk.
This is based on joint work with Tom Berrett.
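As one standard example of the kind of risk alluded to above, the worst-case (distributionally robust) misclassification risk over K data sources can be written as below; the precise families of risks considered in the talk may differ.

```latex
% Illustrative worst-case risk over K sources P_1, ..., P_K and its minimizer.
\[
  R_{\max}(f) \;=\; \max_{1 \le k \le K} \; \mathbb{P}_{(X,Y) \sim P_k}\bigl(f(X) \neq Y\bigr),
  \qquad
  f^{\star} \in \operatorname*{arg\,min}_{f} \, R_{\max}(f).
\]
```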
One appealing feature of random forests is the possibility to efficiently compute variable importance. Mean Decrease in Impurity (MDI), which sums the impurity reductions of all nodes split along a given variable, is often used to assess the relevance of a variable to a given predictive task. In this talk, we will show that this measure should not be used to rank variables. By analyzing MDI, we prove that in the very specific setting of independent inputs with no interactions, MDI provides a variance decomposition of the output in which the contribution of each variable is clearly identified. However, for models exhibiting dependence between input variables or interactions, MDI is intrinsically ill-defined. Since such models are the most common in practice, there are no theoretical guarantees supporting a correct behavior of MDI for detecting important variables.
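For reference, the quantity analyzed here is the usual MDI of a forest of M trees, which sums the weighted impurity decreases of all nodes split on a given covariate (notation chosen for illustration):

```latex
\[
  \mathrm{MDI}(X_j) \;=\; \frac{1}{M} \sum_{m=1}^{M}
  \sum_{\substack{t \in \mathcal{T}_m \\ v(t) = j}} p(t)\, \Delta i(t),
\]
% where \mathcal{T}_m is the set of internal nodes of tree m, v(t) the covariate
% split at node t, p(t) the fraction of samples reaching t, and \Delta i(t) the
% impurity decrease produced by the split at t.
```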
Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. The two works I will present introduce refinements to these methods for classification tasks: one is specifically tailored to long-tailed classification, and the other to scenarios where multiple observations (multi-inputs) of a single instance are available at prediction time. Our approach is particularly motivated by applications in citizen science such as Pl@ntnet, where species distributions are long-tailed and where multiple images of the same plant or animal are captured by individuals at test time. These works were done in collaboration with Joseph Salmon (UM), Tiffany Ding (UC Berkeley) and Mohamed Hebiri (Université Gustave Eiffel).
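As background for both refinements, the snippet below shows the vanilla split-conformal construction of prediction sets for classification; the synthetic dataset, the logistic-regression scorer and the 1 - softmax score are illustrative assumptions and do not reflect the long-tailed or multi-input adaptations of these works.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data split into proper training, calibration and test sets.
X, y = make_classification(n_samples=3000, n_classes=4, n_informative=8, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_te, y_cal, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Nonconformity score: one minus the predicted probability of the true class.
alpha = 0.1
s_cal = 1.0 - clf.predict_proba(X_cal)[np.arange(len(y_cal)), y_cal]
n = len(s_cal)
q_hat = np.quantile(s_cal, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction set: every class whose score is below the calibrated threshold.
sets = (1.0 - clf.predict_proba(X_te)) <= q_hat
coverage = sets[np.arange(len(y_te)), y_te].mean()
print(f"empirical coverage: {coverage:.3f}, average set size: {sets.sum(axis=1).mean():.2f}")
```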