
Probabilistic Model Incorporating Auxiliary Covariates to Control FDR
Lin Qiu
lin.qiu.stats@gmail.com
The Pennsylvania State University
State College, PA, USA
Nils Murrugarra-Llerena
nmurrugarrallerena@weber.edu
Weber State University
Ogden, UT, USA
Vítor Silva
vitor.silva.sousa@gmail.com
Snap Inc.
Santa Monica, CA, USA
Lin Lin
l.lin@duke.edu
Duke University
Durham, NC, USA
Vernon M. Chinchilli
vchinchi@psu.edu
The Pennsylvania State University
Hershey, PA, USA
ABSTRACT
Controlling the False Discovery Rate (FDR) while leveraging side
information in multiple hypothesis testing is an emerging research
topic in modern data science. Existing methods rely on test-level
covariates while ignoring metrics about test-level covariates.
This strategy may not be optimal for complex large-scale problems,
where indirect relations often exist among test-level covariates and
auxiliary metrics or covariates. We incorporate auxiliary covariates
alongside test-level covariates in a deep black-box framework
(named NeurT-FDR), which boosts statistical power and controls
FDR for multiple hypothesis testing. Our method parametrizes the
test-level covariates as a neural network and adjusts the auxiliary
covariates through a regression framework, which enables flexible
handling of high-dimensional features as well as efficient end-to-end
optimization. We show that NeurT-FDR makes substantially
more discoveries on three real datasets than competitive
baselines.
CCS CONCEPTS
• Mathematics of computing → Probabilistic algorithms.
KEYWORDS
Social Media Content Understanding, Multiple Hypothesis Testing,
FDR Control
ACM Reference Format:
Lin Qiu, Nils Murrugarra-Llerena, Vítor Silva, Lin Lin, and Vernon M.
Chinchilli. 2022. Probabilistic Model Incorporating Auxiliary Covariates to
Control FDR. In Proceedings of the 31st ACM International Conference on
Information and Knowledge Management (CIKM ’22), October 17–21, 2022,
Atlanta, GA, USA. ACM, New York, NY, USA, 5 pages.
https://doi.org/10.1145/3511808.3557672
CIKM ’22, October 17–21, 2022, Atlanta, GA, USA
©2022 Association for Computing Machinery.
ACM ISBN 978-1-4503-9236-5/22/10. . . $15.00
https://doi.org/10.1145/3511808.3557672
1 INTRODUCTION
In modern statistics, from genetics and neuroimaging to online
advertising, researchers routinely test thousands or millions of
hypotheses at a time [11] to discover unique data instances. Current
approaches [3] solve this problem via Multiple Hypothesis Testing
(MHT). MHT aims to maximize the number of discoveries while
controlling the False Discovery Rate (FDR). For example, in social
media, we may want to identify posts that are more popular than
normal ones. Similarly, in biology, we may want to discover which
cancer cells respond positively to treatment with a new drug.
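To make the goal of MHT concrete, the sketch below shows the classical Benjamini-Hochberg step-up procedure, a standard covariate-free way to control the FDR (the expected fraction of false discoveries among all discoveries) at a target level. It is included only as background and is not the NeurT-FDR method; the function name `benjamini_hochberg`, the `alpha` parameter, and the example p-values are illustrative assumptions, not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Return a list of booleans marking which hypotheses are rejected.

    Classical Benjamini-Hochberg step-up rule: sort the m p-values,
    find the largest rank k with p_(k) <= alpha * k / m, and reject
    the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    k = 0  # largest rank that satisfies the step-up condition
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, one per hypothesis (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(p_values, alpha=0.05)
print(sum(discoveries), "discoveries")
```

This baseline treats every hypothesis identically. Covariate-adaptive procedures aim to make more discoveries at the same target FDR by letting side information about each hypothesis influence the rejection rule.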
Existing MHT approaches [8, 9, 11] only use covariate-adaptive
FDR procedures on top of test-level covariates to improve
detection power while maintaining the target FDR. Test-level covariates
only provide characteristics of the samples in the dataset, such as
metadata of social media posts or genomic profiles for each cell.
However, depending on the domain, we can access complementary
information besides test-level covariates that can facilitate the work
of MHT approaches. For example, as shown in Figure 1, in the social
media domain, the goal is to find engaging content, and a post
can be represented by visual tags and metadata. Additionally,
content consumption metrics, such as the number of views
and content view time, are available. These metrics encapsulate
information that facilitates MHT. This additional information
is called auxiliary covariates and corresponds to the samples in
the dataset. More specifically, content consumption metrics do not
describe characteristics of the sample, i.e., the posted content,
but how users interact with the platform to access this content.
Typically, such auxiliary covariates are of lower dimension than the
test-level covariates (e.g., visual tags) and are more structured.
In this paper, we present NeurT-FDR, a hierarchical probabilistic
black-box method that incorporates test-level and auxiliary covariates
to control the FDR. Our main contributions can be
summarized as follows:
• We pioneer the use of both auxiliary and test-level
covariates for multiple hypothesis testing problems.
• We develop a novel MHT model that jointly learns from test-level
and auxiliary covariates through a neural network,
which enables efficient optimization and gracefully handles
high-dimensional hypothesis covariates.
arXiv:2210.03178v1 [stat.ML] 6 Oct 2022