Leveraging Instance Features for Label Aggregation
in Programmatic Weak Supervision
Jieyu Zhang* Linxin Song* Alexander Ratner
University of Washington Waseda University University of Washington
*Equal Contribution.
Abstract
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process. In this paper, we attempt to incorporate the instance features into a statistical label model via the proposed FABLE. In particular, it is built on a mixture of Bayesian label models, each corresponding to a global pattern of correlation, and the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features. We adopt an auxiliary variable-based variational inference algorithm to tackle the non-conjugacy between the Gaussian Process and the Bayesian label models. Extensive empirical comparison on eleven benchmark datasets shows that FABLE achieves the highest averaged performance over nine baselines. Our implementation of FABLE can be found at https://github.com/JieyuZ2/wrench/blob/main/wrench/labelmodel/fable.py.
1 INTRODUCTION
The deployment of machine learning models typically relies on large-scale labeled data to regularly train and evaluate the models. To collect labels, practitioners have increasingly resorted to Programmatic Weak Supervision (PWS) (Ratner et al., 2016; Zhang et al., 2022a), a paradigm in which labels are generated cheaply and efficiently. Specifically, in PWS, users develop weak supervision sources abstracted as simple programs called labeling functions (LFs), rather than making individual annotations. These LFs can efficiently produce noisy votes on the true label, or abstain from voting, based on external knowledge bases, heuristic rules, etc. To infer the true labels, various statistical label models (Ratner et al., 2016, 2019; Fu et al., 2020) have been developed to aggregate the labels output by the LFs.
One of the major technical challenges in PWS is how to infer the true labels given the noisy and potentially conflicting labels of multiple LFs. While diverse in assumptions and modeling techniques, existing statistical label models typically rely solely on the LFs' labels (Ratner et al., 2016; Bach et al., 2017; Varma et al., 2017; Cachay et al., 2021). In this paper, we argue that incorporating instance features into a statistical label model has significant potential to improve the inferred truth. Intuitively, statistical label models aim to recover the pattern of correlation between the LF labels and the ground truth; it is natural to assume that similar instances share a similar pattern, and therefore the instance features can be indicative of the pattern of each instance. When ignoring the instance features, statistical label models have to assume that the patterns, or the correctness of the LFs, are instance-independent, which is unlikely to hold for real-world datasets.
To attack this problem, we propose FABLE (Feature-Aware laBeL modEl), which exploits the instance features to help identify the correlation pattern of each instance. We build FABLE upon a recent model named EBCC (Li et al., 2019), which is a mixture model where each mixture component is a popular Bayesian extension of the DS model (Dawid and Skene, 1979) and aims to capture one pattern of correlation between the LFs and the true label. To incorporate instance features, we propose to make the mixture coefficients a categorical distribution
that explicitly depends on instance features. In particular, a predictive Gaussian process (GP) is adopted to learn the distribution of the mixture coefficients, connecting the correlation patterns with the instance features. However, the categorical distribution of the mixture coefficients is non-conjugate to the Gaussian prior, hindering the use of efficient Bayesian inference algorithms, e.g., variational inference. To overcome this, we introduce a number of auxiliary variables that augment the likelihood function to achieve the desired conjugate representation of our model. Note that a couple of recently proposed neural network-based models (Ren et al., 2020; Rühling Cachay et al., 2021) also leverage instance features, but via neural networks. We include them as baselines for comparison and highlight that these neural network-based models typically require a gold validation set for hyperparameter tuning and early stopping to be performant, in contrast to a statistical model like FABLE.
We conduct extensive experiments on synthetic datasets of varying size and 11 benchmark datasets. Compared with state-of-the-art baselines, FABLE achieves the highest averaged performance and ranking. More importantly, to help understand when FABLE works well and to verify our arguments, we measure the correlation between instance features and LF correctness, i.e., Corr(X, LFs). We then calculate the Pearson correlation coefficient between Corr(X, LFs) and the gain of FABLE over EBCC on the synthetic datasets, which is 0.496 with p-value < 0.01, indicating that leveraging instance features is more beneficial when the LF correctness indeed depends on the features.
2 RELATED WORK
In PWS, researchers have developed a variety of statistical label models. Ratner et al. (2016) model the joint distribution of LF outputs and ground-truth labels in terms of pre-defined factor functions. Ratner et al. (2019) model the distribution via a Markov network and recover the parameters with a matrix completion-style approach, while Fu et al. (2020) model the distribution via a binary Ising model and recover the parameters with triplet methods. Other statistical models have been designed for extended PWS settings (Shin et al., 2021) or for extended definitions of LFs, e.g., partial labeling functions (Yu et al., 2022), indirect labeling functions (Zhang et al., 2021a), and positive-only labeling functions (Zhang et al., 2022b). Besides the statistical label models, researchers have recently proposed neural network-based models to leverage instance features (Ren et al., 2020; Rühling Cachay et al., 2021), while in this work we aim to incorporate instance features into a purely statistical model.
Prior to PWS, statistical models for label aggregation were developed separately in the field of crowdsourcing. Dawid and Skene (1979) used a confusion-matrix parameter to generatively model LF labels conditioned on the item's true annotation, for clinical diagnostics. Kim and Ghahramani (2012) formulated a Bayesian generalization with Dirichlet priors and inference by Gibbs sampling, while Li et al. (2019) introduced subtypes as mixture components to capture correlation and decoupled the confusion matrix, turning the Bayesian generative process into a mixture model. Their analysis of clustering inferred worker confusion matrices is a natural precursor to modelling worker correlation.
3 PRELIMINARIES
In this section, we first introduce the setup and notation of programmatic weak supervision (PWS), then discuss two representative Bayesian models that can be used in PWS. We also discuss multi-class Gaussian process classification, which is related to our proposed method.
3.1 Notation
Let $X = \{\vec{x}_1, \ldots, \vec{x}_N\}$ denote a training set of $N$ featured data samples. Assume that there are $L$ labeling functions (LFs) with outputs $\vec{y}_i = [y_{i1}, \ldots, y_{iL}]$, $j \in [L]$, each of which classifies each sample into one of $K$ categories or abstains (outputting $-1$). Let $z_i$ be the latent true label of sample $i$, $y_{ij}$ the label that LF $j$ assigns to item $i$, and $Y_i$ the set of LFs that have labelled item $i$.
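For concreteness, the following is a minimal sketch (not from the paper) of this data layout, assuming the common convention, as in Wrench, that an abstention is encoded as $-1$; all values are hypothetical.

```python
import numpy as np

# Toy PWS data layout: N = 4 samples, L = 3 labeling functions, K = 2 classes.
# An entry of -1 marks an abstention (assumed convention, as in Wrench).
N, L, K = 4, 3, 2
X = np.random.randn(N, 5)             # instance features ~x_i (5-dim, arbitrary)
Y = np.array([[ 0,  1, -1],           # y_ij: label that LF j assigns to sample i
              [ 1,  1,  1],
              [-1,  0,  0],
              [ 1, -1,  1]])
# Y_i: the set of LFs that actually labelled item i (i.e., did not abstain)
Y_sets = [np.flatnonzero(Y[i] != -1) for i in range(N)]
print(Y_sets)
```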
3.2 Bayesian Classifier Combination (BCC) Models
Independent BCC. The iBCC model (Kim and Ghahramani, 2012) is a directed graphical model and a popular extension of the Dawid-Skene (DS) model (Dawid and Skene, 1979) that makes a conditional independence assumption between LFs. The iBCC model assumes that, given the true label $z_i$ of $x_i$, the LF labels for $x_i$ are generated independently by the different LFs,
$$p(y_{i1}, \ldots, y_{iL} \mid z_i) = \prod_{j=1}^{L} p(y_{ij} \mid z_i). \qquad (1)$$
This is referred to as the LFs' conditional independence assumption. However, this underlying independence assumption prevents the model from capturing correlations between labels from different LFs.
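As an illustration only (not the paper's code), the sketch below evaluates the right-hand side of Eq. (1) from hypothetical per-LF confusion matrices:

```python
import numpy as np

# Evaluate Eq. (1): given z_i = k, the joint probability of the LF votes is a
# product of per-LF confusion-matrix entries. All parameters are hypothetical.
L, K = 3, 2
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(K), size=(L, K))   # theta[j, k, c] = p(y_ij = c | z_i = k)
votes = np.array([0, 1, 0])                      # one sample's LF labels (no abstentions)

def ibcc_joint(votes, k, theta):
    """p(y_i1, ..., y_iL | z_i = k) under the conditional independence assumption."""
    return np.prod([theta[j, k, votes[j]] for j in range(len(votes))])

print([ibcc_joint(votes, k, theta) for k in range(K)])
```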
Enhanced BCC. The EBCC model (Li et al., 2019) is an extension of iBCC which introduces $M$ subtypes to capture the correlation between LFs and aggregates the captured correlation via tensor rank decomposition. The joint distribution over the outputs of multiple LFs can be approximated by a linear combination of rank-1 tensors, known as tensor rank decomposition (Hitchcock, 1927), i.e.,
$$p(y_1, \ldots, y_L \mid z = k) \approx \sum_{m=1}^{M} \pi_{km}\, \vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}, \qquad (2)$$
where $\otimes$ is the tensor product. EBCC interprets the tensor decomposition as a mixture model, where $\vec{v}_{1km}, \ldots, \vec{v}_{Lkm}$ are mixture components shared by all the data samples, and $\pi_{km}$ is the mixture coefficient. It follows that
$$p(y_1, \ldots, y_L \mid z) = \sum_{m=1}^{M} p(g = m \mid z) \prod_{j=1}^{L} p(y_j \mid z, g = m),$$
where $g$ is an auxiliary latent variable used for indexing mixture components. All the mixture components result from a categorical distribution governed by the parameter $\beta_k$, where $\beta_{kk} = a$ and $\beta_{kk'} = b$, which is equivalent to assuming that every LF has correctly labelled $a$ items under every class and has made each kind of mistake $b$ times. The $M$ components under class $k$ can be seen as $M$ subtypes, each of which can be used to explain the correlation between LF labels given class $k$ (Li et al., 2019).
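To make the mixture in Eq. (2) concrete, here is a small sketch (not the paper's code) that evaluates the rank-1 mixture for hypothetical parameters:

```python
import numpy as np

# Evaluate the rank-1 mixture of Eq. (2) for hypothetical parameters:
# p(y_1, ..., y_L | z = k) ~= sum_m pi_km * prod_j v_jkm[y_j].
L, K, M = 3, 2, 2
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(M), size=K)           # pi[k, m]: mixture coefficients
v = rng.dirichlet(np.ones(K), size=(L, K, M))    # v[j, k, m, c] = p(y_j = c | z = k, g = m)
votes = np.array([1, 1, 0])

def ebcc_joint(votes, k, pi, v):
    """Sum over subtypes m of pi_km times the product of per-LF factors."""
    return sum(pi[k, m] * np.prod([v[j, k, m, votes[j]] for j in range(len(votes))])
               for m in range(pi.shape[1]))

print([ebcc_joint(votes, k, pi, v) for k in range(K)])
```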
3.3 Multi-class Gaussian Process Classification
The multi-class Gaussian process (GP) classification model places a latent GP prior on each class, $\vec{f}_i = (f_{i1}, \ldots, f_{iK})$, where $f_i \sim \mathcal{GP}(m, \Sigma)$, $m$ is the mean over samples, and $\Sigma$ is the kernel function. The conditional distribution is modeled by a categorical likelihood,
$$p(y_i = k \mid x_i, \vec{f}_i) = h^{(k)}(\vec{f}_i(x_i)), \qquad (3)$$
where $h^{(k)}(\cdot)$ is a function that maps the real vector of GP values to a probability vector. For $h(\cdot)$, the most common way to form a categorical likelihood is through the softmax transformation
$$p(y_i = k \mid \vec{f}_i) = \frac{\exp(f_{ik})}{\sum_{k'=1}^{K} \exp(f_{ik'})}, \qquad (4)$$
where $f_{ik}$ denotes $f_k(x_i)$ and, for clarity, we omit the conditioning on $x_i$.
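The sketch below (illustrative only; the RBF kernel and all values are our assumptions, not the paper's choices) draws latent GP values and maps them to class probabilities as in Eq. (4):

```python
import numpy as np

# Draw latent GP values for each class and map them to class probabilities
# with the softmax of Eq. (4). The RBF kernel and all values are illustrative.
rng = np.random.default_rng(2)
N, K = 5, 3
X = rng.normal(size=(N, 2))

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

Sigma = rbf_kernel(X, X) + 1e-6 * np.eye(N)                  # kernel matrix plus jitter
F = rng.multivariate_normal(np.zeros(N), Sigma, size=K).T    # f_ik: one latent GP per class

def softmax(F):
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

print(softmax(F))    # row i gives p(y_i = k | f_i); each row sums to 1
```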
4 METHODS
In this section, we introduce the proposed FABLE model. In a nutshell, it connects the mixture coefficients of the EBCC model with the instance features via a predictive Gaussian process (GP). We then introduce a set of auxiliary variables to handle the non-conjugacy in the model and ensure efficient variational inference. Finally, we present the generative process, the joint distribution, and the inference procedure of the FABLE model.
4.1 Leveraging Instance Features via Mixture Coefficients
In this work, we aim to explicitly incorporate instance features into a statistical label model built upon EBCC. To attack this problem, we leverage Gaussian process (GP) classification. Specifically, we model the mixture coefficients of EBCC as the output of a GP classifier, which takes the instance features as input. We generate $N \times K \times M$ latent function values, one for each data sample, class, and subtype, and apply the logistic-softmax over subtypes and classes to acquire the mixture coefficients of each data sample. In particular, we rewrite Equation 2 as
$$p(y_{i1}, \ldots, y_{iL} \mid z_i = k) = \sum_{m=1}^{M} \pi_{ikm} \left[\vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}\right] \approx \sum_{m=1}^{M} h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) \left[\vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}\right],$$
where $\sigma(\cdot)$ is the sigmoid function and $\pi_{ikm} = p(g_i = m \mid z_i)$. $\vec{f}_i$ are the GP latent functions for sample $x_i$, with $f_i = f(\vec{x}_i)$ and $f_i \sim \mathcal{GP}(m_i, \Sigma)$. We discuss the details and advantages of our usage of the GP classifier in the sequel.
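As a small sketch (not the paper's code), the following computes instance-dependent mixture coefficients from hypothetical GP latents using the logistic-softmax of Eq. (5) below; shapes and values are assumptions.

```python
import numpy as np

# Instance-dependent mixture coefficients: pass per-instance GP latents f_ikm
# through the logistic-softmax of Eq. (5). Shapes and values are hypothetical.
rng = np.random.default_rng(3)
N, K, M = 4, 2, 3
F = rng.normal(size=(N, K, M))                             # latent GP values f_ikm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

S = sigmoid(F)
pi = S / S.reshape(N, -1).sum(axis=1)[:, None, None]       # pi_ikm = sigma(f_ikm) / sum_jn sigma(f_ijn)
print(pi.reshape(N, -1).sum(axis=1))                       # each instance's coefficients sum to 1
```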
4.2 Handling the Non-conjugate Prior
Given the proposed model, we would like to infer the true labels via standard mean-field variational inference, following prior work (Li et al., 2019). However, a key challenge that prevents us from performing variational inference is that, as a categorical likelihood function, the softmax is non-conjugate to the Gaussian prior, so the variational posterior $q(f_{ikm})$ cannot be derived analytically. Inspired by Polson et al. (2013) and Galy-Fajou et al. (2020), we propose to resolve the non-conjugate mapping function in the complete data likelihood by introducing a number of auxiliary latent variables, such that the augmented complete data likelihood falls into the exponential family, which is conjugate to the Gaussian prior.
In the following, we (1) decouple the GP latent variables $f_{ikm}$ in the denominator by introducing a set of auxiliary $\lambda$-variables and the logistic-softmax function, (2) simplify the model likelihood by introducing Poisson random variables, and (3) use a Pólya-Gamma representation of the sigmoid function to achieve the desired conjugate representation of our model.
Decouple GP latent variables. Following Galy-Fajou et al. (2020), we first replace the softmax likelihood with the logistic-softmax likelihood,
$$\pi_{ikm} = h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) = \frac{\sigma(f_{ikm})}{\sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})}, \qquad (5)$$
where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the logistic function. To remedy the intractable normalizer term $\sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})$, we use the integral identity $\frac{1}{x} = \int_0^{\infty} e^{-\lambda x}\, d\lambda$ and express the likelihood (5) as
$$h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) = \sigma(f_{ikm}) \int_0^{\infty} \exp\Big(-\lambda_i \sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})\Big)\, d\lambda_i. \qquad (6)$$
By interpreting $\lambda_i$ as an additional latent variable, we obtain the augmented likelihood
$$p(\pi_{ikm} \mid f_{ikm}, \lambda_i) = \sigma(f_{ikm}) \prod_{j=1}^{K} \prod_{n=1}^{M} \exp(-\lambda_i \sigma(f_{ijn})), \qquad (7)$$
where we impose the improper prior $p(\lambda_i) \propto \mathbf{1}_{[0, \infty)}$, $\forall i \in [1, N]$. The improper prior is not problematic since it leads to a proper complete conditional distribution, as we will see at the end of the section.
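As a quick sanity check (ours, not the paper's), the integral identity used in Eq. (6) can be verified numerically:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the identity 1/x = int_0^inf exp(-lambda * x) d lambda,
# which Eq. (6) uses to pull the logistic-softmax normalizer out of the denominator.
x = 2.7                                        # any positive normalizer value
approx, _ = quad(lambda lam: np.exp(-lam * x), 0, np.inf)
print(approx, 1.0 / x)                         # both ~0.370
```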
Poisson augmentation. We leverage the moment generating function of the Poisson distribution $\mathrm{Po}(\lambda)$,
$$\exp(\lambda(z - 1)) = \sum_{n=0}^{\infty} z^{n}\, \mathrm{Po}(n \mid \lambda).$$
Using $z = \sigma(-f)$ together with $\sigma(f) = 1 - \sigma(-f)$, we rewrite the exponential factors as
$$\exp(-\lambda_i \sigma(f_{ijn})) = \exp\big(\lambda_i(\sigma(-f_{ijn}) - 1)\big) = \sum_{\upsilon_{ijn}=0}^{\infty} \big(\sigma(-f_{ijn})\big)^{\upsilon_{ijn}}\, \mathrm{Po}(\upsilon_{ijn} \mid \lambda_i),$$
which leads to the augmented likelihood
$$p(\pi_{ikm} \mid f_{ikm}, \upsilon_{ikm}, \lambda_i) = \sigma(f_{ikm}) \cdot \prod_{j=1}^{K} \prod_{n=1}^{M} \big(\sigma(-f_{ijn})\big)^{\upsilon_{ijn}}, \qquad (8)$$
where $\upsilon_{ikm} \sim \mathrm{Po}(\lambda_i)$.
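The Poisson identity above can also be checked numerically; the snippet below is our own verification (not the paper's code), with arbitrary values of $\lambda$ and $f$:

```python
import numpy as np
from scipy.stats import poisson

# Numerical check of the Poisson MGF identity exp(lambda*(z - 1)) = sum_n z^n Po(n | lambda),
# with z = sigmoid(-f) as in the augmentation; lambda and f are arbitrary here.
lam, f = 1.7, 0.4
z = 1.0 / (1.0 + np.exp(f))                                    # sigmoid(-f)
series = sum(z ** n * poisson.pmf(n, lam) for n in range(60))  # truncated series
print(series, np.exp(lam * (z - 1.0)))                         # the two values agree
```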
Complete with Pólya-Gamma. In the last step, we aim for a Gaussian representation of the sigmoid function. The Pólya-Gamma representation allows us to rewrite a power of the sigmoid function as a scale mixture of Gaussians,
$$\sigma(z)^{\upsilon} = \int_0^{\infty} 2^{-\upsilon} \exp\Big(\frac{\upsilon z}{2} - \frac{z^2}{2}\omega\Big)\, \mathrm{PG}(\omega \mid \upsilon, 0)\, d\omega, \qquad (9)$$
where $\mathrm{PG}(\omega \mid \upsilon, b)$ is a Pólya-Gamma distribution. By applying this augmentation to Equation 8 we obtain
$$p(\pi_{ikm} \mid f_{ikm}, \upsilon_{ikm}, \omega_{ikm}) = 2^{-(\pi_{ikm} + \upsilon_{ikm})} \exp\Big(\frac{(\pi_{ikm} - \upsilon_{ikm}) f_{ikm}}{2}\Big) \exp\Big(-\frac{(f_{ikm})^2}{2}\omega_{ikm}\Big), \qquad (10)$$
where $\omega_{ikm} \sim \mathrm{PG}(\omega_{ikm} \mid \upsilon_{ikm}, 0)$ are Pólya-Gamma variables.
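As a closed-form check of Eq. (9) (ours, not the paper's), integrating out $\omega \sim \mathrm{PG}(\upsilon, 0)$ uses the Pólya-Gamma Laplace transform $\mathbb{E}[\exp(-\omega z^2/2)] = \cosh^{-\upsilon}(z/2)$, which recovers $\sigma(z)^{\upsilon}$ exactly:

```python
import numpy as np

# Closed-form check of Eq. (9): integrating out omega ~ PG(upsilon, 0) uses the
# Polya-Gamma Laplace transform E[exp(-omega * z^2 / 2)] = cosh(z/2)^(-upsilon),
# and the right-hand side then collapses back to sigma(z)^upsilon exactly.
z, upsilon = 0.8, 3
lhs = (1.0 / (1.0 + np.exp(-z))) ** upsilon
rhs = 2.0 ** (-upsilon) * np.exp(upsilon * z / 2) * np.cosh(z / 2) ** (-upsilon)
print(lhs, rhs)    # both ~0.329
```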
Finally, the complete conditionals of the GP latents $f_{ikm}$ are
$$p(f_{ikm} \mid \pi_{ikm}, \omega_{ikm}, \upsilon_{ikm}) = \mathcal{N}\Big(f_{ikm} \,\Big|\, \tfrac{1}{2}\hat{\Sigma}_{km}\big(\mathbb{E}[\pi_{ikm}] - \mathbb{E}[\upsilon_{ikm}]\big),\, \hat{\Sigma}_{km}\Big), \qquad (11)$$
where $\hat{\Sigma}_{km} = \big(\Sigma_{km}^{-1} + \mathrm{diag}(\mathbb{E}[\omega_{ikm}])\big)^{-1}$. For the conditional distribution of $\lambda_i$ we have
$$p(\lambda_i \mid \vec{\upsilon}_i) = \mathrm{Ga}\Big(1 + \sum_{j=1}^{K} \sum_{n=1}^{M} \gamma_{ijn},\, K\Big), \qquad (12)$$
where $\mathrm{Ga}(\cdot \mid a, b)$ denotes a gamma distribution with parameters $a$ and $b$, and $\gamma_{ijn}$ is the parameter of the joint distribution $p(\omega_{ikm}, \upsilon_{ikm})$, detailed in Appendix B.1.
In summary, by introducing the three auxiliary random variables $\lambda_i$, $\upsilon_{ikm}$, and $\omega_{ikm}$, we turn the posterior of $f_{ikm}$ from a non-conjugate softmax form into an exponential-family form, namely a Gaussian distribution, which is easy to infer by adopting variational inference with $q(f_{ikm}) \sim \mathcal{N}(\hat{m}, \hat{\Sigma})$.
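For illustration, here is a minimal sketch of the conjugate update in Eq. (11), under the assumption that the required expectations are already available from the other variational factors; all numbers are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

# Conjugate Gaussian update of Eq. (11) for one (k, m) pair, assuming the
# expectations E[pi], E[upsilon], E[omega] come from the other variational
# factors; all numbers here are hypothetical placeholders.
rng = np.random.default_rng(4)
N = 5
A = np.eye(N) + 0.3 * rng.standard_normal((N, N))
Sigma_km = A @ A.T                                # a valid prior covariance over the N samples
E_pi, E_ups = rng.random(N), rng.random(N)        # E[pi_ikm], E[upsilon_ikm] per sample
E_omega = rng.random(N)                           # E[omega_ikm] per sample

Sigma_hat = np.linalg.inv(np.linalg.inv(Sigma_km) + np.diag(E_omega))
m_hat = 0.5 * Sigma_hat @ (E_pi - E_ups)          # posterior mean of the GP latents f_km
print(m_hat)                                      # q(f_km) = N(m_hat, Sigma_hat)
```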
4.3 The Generative Process and Joint Distribution
Here, we summarize the generative process of the proposed model. We use the GP latent functions $F$ and the corresponding auxiliary variables $\Omega$ and $\Upsilon$ to generate the mixture coefficients $\Pi$. There are $K \times M$ subtypes in total, and we assume that item $i$ belongs to the $g_i$-th subtype of its class $z_i$, as in EBCC. The proposed model is shown in Figure 1 and its generative process is: