based on joint work with: G. Blanchard, E. Roquain
A. Blain, N. Enjalbert Courrech, B. Thirion
CNRS & Institut de Mathématiques de Toulouse
PCI Statistics and Machine Learning
Computo: a journal in statistics and Machine Learning promoting reproducibility
Find voxels whose average activity differs between two groups of samples
Find genes whose average activity differs between two groups of samples
Strategy: one test for each feature (gene/voxel) + choose significance threshold
State of the art: False Discovery Rate control (Benjamini and Hochberg (1995))
\(\mathrm{FDR} = \mathbb{E}(\mathrm{FDP}),\) where FDP = proportion of false discoveries (random)
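The "one test per feature + threshold" strategy with FDR control can be sketched as follows. This is a minimal illustration of the Benjamini-Hochberg step-up rule, not the code used for the talk; the function names, the toy p-values and the set of true nulls are assumptions made for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the BH step-up procedure at target FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_hat = 0
    # Largest rank k such that the k-th smallest p-value is <= q * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k_hat = rank
    return sorted(order[:k_hat])


def false_discovery_proportion(rejected, true_nulls):
    """FDP = fraction of rejected hypotheses that are true nulls (0 if none rejected)."""
    if not rejected:
        return 0.0
    return sum(1 for i in rejected if i in true_nulls) / len(rejected)


# Toy data (hypothetical): 10 p-values; indices in `true_nulls` are true nulls.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
true_nulls = {1, 4, 5, 6, 7, 8, 9}
rejected = benjamini_hochberg(p, q=0.05)
fdp = false_discovery_proportion(rejected, true_nulls)
```

On this toy draw the realized FDP is 0.5 although the target FDR level is 0.05: the FDR is only the expectation of this random quantity.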
FDR control is not FDP control
FDR: not robust to post hoc selection
Definition: post hoc bound
For a given \(\alpha\) in \((0,1)\), find \(V_\alpha\) such that
\[\mathbb{P} \left( \textcolor{red}{\forall S \subset \mathcal{H}},\:\:\: |S \cap \mathcal{H}_0| \leq V_\alpha(S) \right) \geq 1-\alpha\]
Important example: Simes bound
\[V_\alpha(S) = \min_{1\leq k \leq |S|} \left\{ (k-1) + \sum_{i \in S} \mathbb{1}\{p_i \geq \alpha k /m\} \right\}\]
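A minimal sketch of how the Simes bound can be evaluated for an arbitrary selection \(S\), reading the formula as a minimum over \(k\) of the whole expression \((k-1) + \sum_{i \in S} \mathbb{1}\{p_i \geq \alpha k/m\}\). The function name and the toy p-values are assumptions made for the example.

```python
def simes_post_hoc_bound(p_values, selection, alpha=0.05):
    """Upper bound on the number of false positives in `selection`.

    p_values: all m p-values; selection: indices of the chosen subset S.
    """
    m = len(p_values)
    s = len(selection)
    best = s  # trivial bound: at most |S| false positives
    for k in range(1, s + 1):
        t_k = alpha * k / m
        n_above = sum(1 for i in selection if p_values[i] >= t_k)
        best = min(best, (k - 1) + n_above)
    return best


# Toy data (hypothetical): m = 10 p-values, alpha = 0.1.
p = [0.0001, 0.0004, 0.002, 0.004, 0.035, 0.2, 0.5, 0.7, 0.8, 0.9]
v_top5 = simes_post_hoc_bound(p, [0, 1, 2, 3, 4], alpha=0.1)
v_all = simes_post_hoc_bound(p, list(range(10)), alpha=0.1)
```

Here `v_top5` is 1, so the five smallest p-values contain at least 4 true discoveries (TDP \(\geq 0.8\)), and `v_all` is 6. The same call can be repeated on any other subset chosen after looking at the data.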
FDR control \(\approx\) post hoc inference with \(\alpha=1/2\)
True Discovery Proportion (TDP): TDP = 1-FDP
\[\mathrm{TDP} \geq 0.4\]
\[\mathrm{TDP} \geq 0.6\]
Blanchard, Neuvial, and Roquain (2020)
\[\textrm{Goal: find } V_\alpha \textrm{ such that} \quad \mathbb{P} \left( \textcolor{red}{\forall S \subset \mathcal{H}},\:\:\: |S \cap \mathcal{H}_0| \leq V_\alpha(S) \right) \geq 1-\alpha\]
Joint Error Rate (JER)
Let \(\mathbf{t} = (t_k)_{k}\) be a non-decreasing sequence of thresholds and \(R_k = \{i \in \mathcal{H} : p_i < t_k\}\)
\[JER(\mathbf{t}) := \mathbb{P} \big(\exists k \in\{1,\dots,|\mathcal{H}_0|\} \::\: p_{(k:\mathcal{H}_0)} < t_k \big)\]
JER control yields valid post hoc bounds
\(JER(\mathbf{t}) \leq \alpha \quad\quad \Leftrightarrow \quad\quad \mathbb{P} \left( \textcolor{red}{\forall k},\:\:\: |R_k \cap \mathcal{H}_0| \leq k-1 \right) \geq 1-\alpha\)
This yields a \((1-\alpha)\)-level post hoc bound: \(V_\alpha(S) = \min_{1\leq k \leq |S|} \left\{ (k-1) + \sum_{i \in S} \mathbb{1}\{p_i \geq t_k\} \right\}\)
Recovers Simes post hoc bound for \(t_k = \alpha k / m\)
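Why JER control gives the bound, as a one-line sketch (\(R_k^c\) denotes the complement of \(R_k\) in \(\mathcal{H}\), a notation introduced here): on the event \(\{\forall k,\ |R_k \cap \mathcal{H}_0| \leq k-1\}\), for every \(S \subset \mathcal{H}\) and every \(k\),

\[|S \cap \mathcal{H}_0| \leq |R_k \cap \mathcal{H}_0| + |S \cap R_k^c| \leq (k-1) + \sum_{i \in S} \mathbb{1}\{p_i \geq t_k\}\]

Taking the minimum over \(k\) gives \(V_\alpha(S)\). Since the event does not depend on \(S\), the bound holds simultaneously for all \(S\), which is what makes it valid after data-driven selection.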
Find \(\mathbf{t} = (t_k)_{k}\) such that \(\mathbb{P}\left( \exists k, p_{(k:\mathcal{H}_0)} \leq t_k \right) \leq \alpha\)
Blain et al. (2023)
Goal: select a subset of variables significantly associated with \(Y\).
How to quantify “association” between \(X_j\) and \(Y\)?
Standard approach to conditional association: Knockoffs (Candès et al. 2018)
Knockoffs provide non-asymptotic FDR control
For irrelevant variables, the signs of the \(W_j\) are independent coin flips, conditional on \(|W|\)
FDR control for KO in terms of “\(\pi\)-statistics” (Nguyen et al. 2020)
\(\pi_{j}= \frac{1+Z_j}{p} \mathbb{1}_{W_{j} > 0} + \mathbb{1}_{W_{j} \leq 0}\), where \(Z_j = \left\vert\left\{k: W_{k} \leq-W_{j}\right\}\right\vert\).
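The definition of the \(\pi\)-statistics translates directly into code. A minimal sketch, assuming the knockoff statistics \(W_j\) are already available as a list; the function name and the toy values are made up for the example.

```python
def pi_statistics(W):
    """pi_j = (1 + Z_j) / p if W_j > 0, else 1, with Z_j = #{k : W_k <= -W_j}."""
    p = len(W)
    pi = []
    for w_j in W:
        if w_j > 0:
            z_j = sum(1 for w_k in W if w_k <= -w_j)
            pi.append((1 + z_j) / p)
        else:
            pi.append(1.0)
    return pi


# Toy knockoff statistics (hypothetical values), p = 6 variables.
W = [5.0, -1.0, 3.0, 0.5, -4.0, 2.0]
pi = pi_statistics(W)
```

Variables with a large positive \(W_j\) and few comparably large negative statistics get a small \(\pi_j\), so the \(\pi_j\) play the role that p-values play in the Simes bound.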
Associated JER: \(JER(\mathbf{t}) = \mathbb{P} \big(\exists k \in\{1,\dots,p_0\} \::\: \pi_{(k:\mathcal{H}_0)} < t_k \big)\)
Key idea (already in Candès et al. (2018))
\(\pi^0_{j}= \frac{1+Z_j}{p} \mathbb{1}_{\chi^0_{j} = 1} + \mathbb{1}_{\chi^0_{j} = -1}\)
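Because null signs are coin flips given \(|W|\), draws of \(\pi^0\) can be simulated without touching the data again. The sketch below is one possible reading of the formula, not the authors' implementation: it assumes that \(\chi^0\) is a vector of i.i.d. uniform signs and that \(Z_j\) is recomputed from the sign-flipped statistics \(\chi^0_j |W_j|\). All function names are hypothetical.

```python
import random


def pi_statistics(W):
    """pi_j = (1 + Z_j) / p if W_j > 0, else 1, with Z_j = #{k : W_k <= -W_j}."""
    p = len(W)
    return [
        (1 + sum(1 for w_k in W if w_k <= -w_j)) / p if w_j > 0 else 1.0
        for w_j in W
    ]


def sample_null_pi(abs_W, n_draws, seed=0):
    """Draw n_draws vectors of pi-statistics under random sign flips of |W|."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Assumption: each sign is an independent fair coin flip.
        signs = [rng.choice((-1, 1)) for _ in abs_W]
        W0 = [s * a for s, a in zip(signs, abs_W)]
        draws.append(pi_statistics(W0))
    return draws


# Toy magnitudes (hypothetical values).
draws = sample_null_pi([5.0, 1.0, 3.0, 0.5, 4.0, 2.0], n_draws=100, seed=0)
```

Such draws can then be used to estimate the JER of a candidate threshold family empirically.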
Contributions (Blain et al. 2023)
Empirical FDP and power for 42 contrast pairs (only CT aims for post hoc FDP control)
Detections for the SOCIAL contrast (Social motion vs random motion)
Only classical KO (FDR \(< 0.2\)) and KOPI (FDP \(< 0.2\) with probability 0.9) yield discoveries
Blain et al. (2025)
Assuming that \(X \sim \mathcal{N}(0, \Sigma)\), Candès et al. (2018) provide a valid construction of knockoffs such that:
\[ [X, \tilde{X}] \sim \mathcal{N}(0, \mathbf{G}), \quad \text { where } \mathbf{G}=\left(\begin{array}{cc} \boldsymbol{\Sigma} & \boldsymbol{\Sigma}-\operatorname{diag}(s) \\ \boldsymbol{\Sigma}-\operatorname{diag}(s) & \boldsymbol{\Sigma} \end{array}\right) \]
This requires knowledge of \(\Sigma\)
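To see where \(\Sigma\) enters, standard Gaussian conditioning applied to \(\mathbf{G}\) gives the law from which a knockoff is sampled. This is a sketch that writes \(D = \operatorname{diag}(s)\) (notation introduced here), treats \(X\) as a column vector and assumes \(\Sigma\) is invertible:

\[\tilde{X} \mid X \sim \mathcal{N}\left( X - D \boldsymbol{\Sigma}^{-1} X,\; 2D - D \boldsymbol{\Sigma}^{-1} D \right)\]

The mean follows from \((\boldsymbol{\Sigma} - D)\boldsymbol{\Sigma}^{-1} X\) and the covariance from \(\boldsymbol{\Sigma} - (\boldsymbol{\Sigma} - D)\boldsymbol{\Sigma}^{-1}(\boldsymbol{\Sigma} - D)\). Both involve \(\boldsymbol{\Sigma}^{-1}\), and \(s\) must be chosen so that \(\mathbf{G}\) remains positive semidefinite.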
Simulation setup:
No obvious violation of exchangeability for nonparametric knockoffs!
Tool: Classifier Two-Sample Test
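The idea of a classifier two-sample test can be sketched generically: train a classifier to tell the two samples apart, then test whether its held-out accuracy exceeds chance. The example below is an illustration only. It swaps in a nearest-centroid classifier and a normal approximation for the accuracy under the null, assumes samples of equal size, and uses hypothetical names and toy Gaussian data. How the test is applied to knockoffs is not shown here.

```python
import math
import random


def classifier_two_sample_test(sample_a, sample_b, train_fraction=0.5, seed=0):
    """Return (held-out accuracy, one-sided p-value) for H0: same distribution."""
    rng = random.Random(seed)
    data = [(x, 0) for x in sample_a] + [(x, 1) for x in sample_b]
    rng.shuffle(data)
    n_train = int(train_fraction * len(data))
    train, test = data[:n_train], data[n_train:]
    dim = len(data[0][0])

    # Nearest-centroid classifier fitted on the training half.
    centroids = []
    for label in (0, 1):
        points = [x for x, y in train if y == label]
        centroids.append([sum(x[d] for x in points) / len(points) for d in range(dim)])

    def predict(x):
        dists = [sum((x[d] - c[d]) ** 2 for d in range(dim)) for c in centroids]
        return 0 if dists[0] <= dists[1] else 1

    accuracy = sum(predict(x) == y for x, y in test) / len(test)
    # Normal approximation: under H0, accuracy is roughly N(0.5, 0.25 / n_test).
    z = (accuracy - 0.5) / math.sqrt(0.25 / len(test))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return accuracy, p_value


# Toy data (hypothetical): two identical Gaussians, then a shifted one.
rng = random.Random(1)
same_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
same_b = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
shifted_b = [[rng.gauss(1.5, 1), rng.gauss(1.5, 1)] for _ in range(200)]
acc_same, p_same = classifier_two_sample_test(same_a, same_b)
acc_shift, p_shift = classifier_two_sample_test(same_a, shifted_b)
```

An accuracy close to 0.5 means the classifier cannot distinguish the two samples, which is the pattern expected when exchangeability is not visibly violated.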
\[V_\alpha(S) = \min_{1\leq k \leq |S|} \left\{ (k-1) + \sum_{i \in S} \mathbb{1}\{ p_i \geq t_k \} \right\}\]
Given a family of functions \((t_k)_k\) (e.g. \(t_k(\lambda) = \lambda k/m\)), estimate from the data the largest \(\lambda\) such that
\[\mathbb{P}\left( \exists k, p_{(k:\mathcal{H}_0)} \leq t_k(\lambda) \right) \leq \alpha\]
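For the linear template \(t_k(\lambda) = \lambda k/m\), the event \(\{\exists k,\ p_{(k)} \leq \lambda k/m\}\) is the same as \(\{\min_k m\, p_{(k)}/k \leq \lambda\}\), so \(\lambda\) can be calibrated as an empirical quantile of that minimum over randomized data sets. The sketch below assumes the randomized p-values are already computed (for instance by permuting group labels) and uses all \(m\) hypotheses in place of the unknown \(\mathcal{H}_0\). The function name and the simulated uniform p-values are assumptions made for the example.

```python
import math
import random


def calibrate_lambda(null_p_values, alpha):
    """Largest pivot value lambda with empirical P(min_k m*p_(k)/k <= lambda) <= alpha.

    null_p_values: list of B lists, each holding the m p-values of one randomization.
    """
    m = len(null_p_values[0])
    pivots = []
    for p_vals in null_p_values:
        ordered = sorted(p_vals)
        pivots.append(min(m * p_k / k for k, p_k in enumerate(ordered, start=1)))
    pivots.sort()
    # Take the floor(alpha * B)-th smallest pivot; return 0.0 if that rank is 0.
    j = math.floor(alpha * len(pivots))
    return pivots[j - 1] if j >= 1 else 0.0


# Toy example (hypothetical): B = 1000 draws of m = 50 independent uniform p-values.
rng = random.Random(0)
null_p = [[rng.random() for _ in range(50)] for _ in range(1000)]
lam = calibrate_lambda(null_p, alpha=0.1)
```

With independent uniform p-values the calibrated `lam` lands close to \(\alpha\), which matches the Simes choice \(t_k = \alpha k/m\). Under positive dependence the calibrated value is typically larger, which is where the gain over Simes comes from.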
Estimate from the data the “largest” \(\mathbf{t} = (t_k)_{k}\) such that
\[\mathbb{P}\left( \exists k, p_{(k:\mathcal{H}_0)} \leq t_k \right) \leq \alpha\]
Randomization assumption (Rand): there exists a finite group of transformations \(\mathcal{G}\) such that
\[\forall g \in\mathcal{G},\:\: (p_{\mathcal{H}_0}(g'.X))_{g'\in \mathcal{G}} \sim (p_{\mathcal{H}_0}(g'.g.X))_{g'\in \mathcal{G}}\]
Theorem (Blanchard, Neuvial, and Roquain (2020))
Under (Rand), \(R_k := \{i: p_i \leq t_k(\lambda(\alpha))\}\) is a JER-controlling family at level \(\alpha\), where \(\lambda(\alpha)\) denotes the calibrated value of \(\lambda\)