
tion explicitly depending on instance features. In particular, a predictive Gaussian process (GP) is adopted to learn the distribution of mixture coefficients, connecting the correlation patterns with instance features. However, the categorical distribution of mixture coefficients is non-conjugate to the Gaussian prior, hindering the use of efficient Bayesian inference algorithms, e.g., variational inference. To overcome this, we introduce a number of auxiliary variables that augment the likelihood function and yield the desired conjugate representation of our model. Note that two recently proposed models (Ren et al., 2020; Rühling Cachay et al., 2021) also leverage instance features, but via neural networks. We include them as baselines for comparison and highlight that these neural network-based models typically require a gold validation set for hyperparameter tuning and early stopping to be performant, in contrast to a statistical model like FABLE.
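To make the role of the GP concrete, below is a minimal, illustrative sketch, not FABLE's actual model or inference: latent functions drawn from a GP prior over instance features are mapped to per-instance mixture coefficients through a softmax link. The kernel choice, the number of mixture components M, and the softmax link here are illustrative assumptions; the softmax/categorical link is precisely what breaks conjugacy with the Gaussian prior.

```python
# Hedged sketch: GP prior on instance features -> per-instance mixture coefficients.
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on instance features X of shape (N, D)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)

rng = np.random.default_rng(0)
N, D, M = 100, 5, 3              # instances, feature dim, mixture components (assumed)
X = rng.normal(size=(N, D))      # instance features

K_xx = rbf_kernel(X) + 1e-6 * np.eye(N)   # jitter for numerical stability
# One latent GP function per mixture component.
F = np.stack([rng.multivariate_normal(np.zeros(N), K_xx) for _ in range(M)], axis=1)

# Softmax link: feature-dependent categorical mixture coefficients (non-conjugate to the GP prior).
pi = np.exp(F - F.max(axis=1, keepdims=True))
pi /= pi.sum(axis=1, keepdims=True)       # shape (N, M), each row on the simplex
```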
We conduct extensive experiments on synthetic datasets of varying size and 11 benchmark datasets. Compared with state-of-the-art baselines, FABLE achieves the highest average performance and ranking. More importantly, to help understand when FABLE works well and to verify our arguments, we measure the correlation between instance features and LF correctness, i.e., Corr(X, LFs). We then compute the Pearson correlation coefficient between Corr(X, LFs) and the gain of FABLE over EBCC on the synthetic datasets, which is 0.496 with p-value < 0.01, indicating that leveraging instance features is more beneficial when LF correctness indeed depends on the features.
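For concreteness, this diagnostic reduces to a standard Pearson correlation between two per-dataset quantities. The sketch below uses placeholder arrays, not the paper's measurements, purely to show the computation.

```python
# Hedged sketch: correlate a per-dataset feature-dependence measure with the
# per-dataset gain of FABLE over EBCC. The arrays are dummy values.
import numpy as np
from scipy.stats import pearsonr

corr_x_lfs = np.array([0.10, 0.25, 0.40, 0.55, 0.70])  # Corr(X, LFs) per dataset (dummy)
fable_gain = np.array([0.01, 0.03, 0.02, 0.06, 0.08])  # gain of FABLE over EBCC (dummy)

r, p_value = pearsonr(corr_x_lfs, fable_gain)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```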
2 RELATED WORKS
In PWS, researchers have developed a number of statistical label models. Ratner et al. (2016) model the joint distribution between LF outputs and ground-truth labels, describing the distribution in terms of pre-defined factor functions. Ratner et al. (2019) model the distribution via a Markov network and recover the parameters with a matrix completion-style approach, while Fu et al. (2020) model the distribution via a binary Ising model and recover the parameters with the triplet method. Other statistical models are designed for extended PWS settings (Shin et al., 2021) or for extended definitions of LFs, e.g., partial labeling functions (Yu et al., 2022), indirect labeling functions (Zhang et al., 2021a), and positive-only labeling functions (Zhang et al., 2022b). Besides statistical label models, researchers have recently proposed neural network-based models that leverage instance features (Ren et al., 2020; Rühling Cachay et al., 2021), whereas in this work we aim to incorporate instance features into a purely statistical model.
Prior to PWS, statistical models for label aggregation were developed separately in the field of crowdsourcing. Dawid and Skene (1979) used confusion matrix parameters to generatively model LF labels conditioned on an item's true annotation, for clinical diagnostics. Kim and Ghahramani (2012) formulated a Bayesian generalization with Dirichlet priors and inference by Gibbs sampling, while Li et al. (2019) incorporated subtypes as mixture components and decoupled the confusion matrix, turning the Bayesian generative process into a mixture model. Their analysis of clustering inferred worker confusion matrices is a natural precursor to modeling worker correlation.
3 PRELIMINARIES
In this section, we first introduce the setup and notation of programmatic weak supervision (PWS), then discuss two representative Bayesian models that can be used in PWS. We also discuss multi-class Gaussian process classification, which is related to our proposed method.
3.1 Notation
Let $X = \{\vec{x}_1, \ldots, \vec{x}_N\}$ denote a training set with $N$ featured data samples. Assume that there are $L$ labeling functions (LFs) indexed by $j \in [L]$, each of which classifies each sample into one of $K$ categories or abstains (outputting $-1$). Let $z_i$ be the latent true label of sample $i$, $y_{ij}$ the label that LF $j$ assigns to item $i$, $\vec{y}_i = [y_{i1}, \ldots, y_{iL}]$ the vector of LF labels for sample $i$, and $Y_i$ the set of LFs that have labeled item $i$.
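As an illustration only, the notation above can be represented with simple arrays; the shapes and random values below are placeholders, not data from the paper.

```python
# Hedged sketch of the PWS notation: N samples with D-dimensional features,
# L labeling functions, K classes, and -1 marking an abstention.
import numpy as np

rng = np.random.default_rng(0)
N, D, L, K = 6, 4, 3, 2

X = rng.normal(size=(N, D))           # instance features x_1, ..., x_N
Y = rng.integers(-1, K, size=(N, L))  # y_ij in {-1, 0, ..., K-1}; -1 = abstain
z = rng.integers(0, K, size=N)        # latent true labels z_i (unobserved in practice)

# Y_i: the set of LFs that actually labeled item i (did not abstain).
Y_sets = [np.flatnonzero(Y[i] != -1) for i in range(N)]
```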
3.2 Bayesian Classifier Combination (BCC) Models
Independent BCC. The iBCC model (Kim and Ghahramani, 2012) is a directed graphical model and a popular extension of the Dawid-Skene (DS) model (Dawid and Skene, 1979) that makes a conditional independence assumption between LFs. The iBCC model assumes that, given the true label $z_i$ of $x_i$, the LF labels for $x_i$ are generated independently by the different LFs:
$$p(y_{i1}, \ldots, y_{iL} \mid z_i) = \prod_{j=1}^{L} p(y_{ij} \mid z_i). \tag{1}$$
This is referred to as the LFs' conditional independence assumption. However, this independence assumption prevents the model from capturing correlations between labels from different LFs.
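To make Eq. (1) concrete, the following minimal sketch evaluates the factorized likelihood given per-LF confusion matrices (a common DS/iBCC-style parameterization). The uniform confusion matrices are placeholders rather than learned parameters, and skipping abstentions is a simplifying assumption of this sketch.

```python
# Hedged sketch of the conditional-independence likelihood in Eq. (1).
import numpy as np

def ibcc_likelihood(y_i, z_i, confusion):
    """p(y_i1, ..., y_iL | z_i) under the iBCC independence assumption.

    y_i       : (L,) observed LF labels, -1 for abstain
    z_i       : true class index
    confusion : (L, K, K) per-LF matrices, confusion[j, k, k'] = p(y_ij = k' | z_i = k)
    """
    prob = 1.0
    for j, y_ij in enumerate(y_i):
        if y_ij == -1:  # abstention contributes no likelihood term in this sketch
            continue
        prob *= confusion[j, z_i, y_ij]
    return prob

L_, K_ = 3, 2
confusion = np.full((L_, K_, K_), 1.0 / K_)  # placeholder (uniform) confusion matrices
print(ibcc_likelihood(np.array([0, 1, -1]), z_i=0, confusion=confusion))
```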