Leveraging Instance Features for Label Aggregation
in Programmatic Weak Supervision
Jieyu Zhang* Linxin Song* Alexander Ratner
University of Washington Waseda University University of Washington
*Equal Contribution.
Abstract
Programmatic Weak Supervision (PWS) has emerged as a widespread paradigm to synthesize training labels efficiently. The core component of PWS is the label model, which infers true labels by aggregating the outputs of multiple noisy supervision sources abstracted as labeling functions (LFs). Existing statistical label models typically rely only on the outputs of the LFs, ignoring the instance features when modeling the underlying generative process. In this paper, we attempt to incorporate the instance features into a statistical label model via the proposed FABLE. In particular, it is built on a mixture of Bayesian label models, each corresponding to a global pattern of correlation, and the coefficients of the mixture components are predicted by a Gaussian Process classifier based on instance features. We adopt an auxiliary variable-based variational inference algorithm to tackle the non-conjugacy between the Gaussian Process and the Bayesian label models. Extensive empirical comparison on eleven benchmark datasets shows that FABLE achieves the highest averaged performance over nine baselines. Our implementation of FABLE can be found at https://github.com/JieyuZ2/wrench/blob/main/wrench/labelmodel/fable.py.
1 INTRODUCTION
The deployment of machine learning models typically relies on large-scale labeled data to regularly train and evaluate the models. To collect labels, practitioners have increasingly resorted to Programmatic Weak Supervision (PWS) (Ratner et al., 2016; Zhang et al., 2022a), a paradigm in which labels are generated cheaply and efficiently. Specifically, in PWS, users develop weak supervision sources abstracted as simple programs called labeling functions (LFs), rather than making individual annotations. These LFs can efficiently produce noisy votes on the true label, or abstain from voting, based on external knowledge bases, heuristic rules, etc. To infer the true labels, various statistical label models (Ratner et al., 2016, 2019; Fu et al., 2020) have been developed to aggregate the labels output by the LFs.
One of the major technical challenges in PWS is how to infer the true labels given the noisy and potentially conflicting labels of multiple LFs. While diverse in assumptions and modeling techniques, existing statistical label models typically rely solely on the LFs' labels (Ratner et al., 2016; Bach et al., 2017; Varma et al., 2017; Cachay et al., 2021). In this paper, we argue that incorporating instance features into a statistical label model has significant potential to improve the inferred truth. Intuitively, statistical label models aim to recover the pattern of correlation between the LF labels and the ground truth; it is natural to assume that similar instances share a similar pattern, and therefore the instance features can be indicative of the pattern of each instance. When ignoring the instance features, statistical label models have to assume that the patterns, or the correctness of the LFs, are instance-independent, which is unlikely to hold for real-world datasets.
To attack this problem, we propose FABLE (Feature-Aware laBeL modEl), which exploits the instance features to help identify the correlation pattern of each instance. We build FABLE upon a recent model named EBCC (Li et al., 2019), which is a mixture model where each mixture component is a popular Bayesian extension of the DS model (Dawid and Skene, 1979) and aims to capture one pattern of correlation between the LFs and the true label. To incorporate instance features, we propose to make the mixture coefficients a categorical distribution
that explicitly depends on instance features. In particular, a predictive Gaussian process (GP) is adopted to learn the distribution of the mixture coefficients, connecting the correlation patterns with the instance features. However, the categorical distribution of the mixture coefficients is non-conjugate to the Gaussian prior, hindering the use of efficient Bayesian inference algorithms, e.g., variational inference. To overcome this, we introduce a number of auxiliary variables that augment the likelihood function to achieve the desired conjugate representation of our model. Note that a couple of recently proposed neural network-based models (Ren et al., 2020; Rühling Cachay et al., 2021) also leverage instance features, but via neural networks. We include them as baselines for comparison and highlight that these neural network-based models typically require a gold validation set for hyperparameter tuning and early stopping to be performant, in contrast to a statistical model like FABLE.
We conduct extensive experiments on synthetic datasets of varying size and 11 benchmark datasets. Compared with state-of-the-art baselines, FABLE achieves the highest averaged performance and ranking. More importantly, to help understand when FABLE works well and to verify our arguments, we measure the correlation between instance features and LF correctness, i.e., Corr(X, LFs). We then calculate the Pearson correlation coefficient between Corr(X, LFs) and the gain of FABLE over EBCC on the synthetic datasets, which is 0.496 with p-value < 0.01, indicating that leveraging instance features is more beneficial when the LF correctness indeed depends on the features.
2 RELATED WORK
In PWS, researchers have developed a variety of statistical label models. Ratner et al. (2016) model the joint distribution of LF outputs and ground-truth labels in terms of pre-defined factor functions. Ratner et al. (2019) model the distribution via a Markov network and recover the parameters with a matrix completion-style approach, while Fu et al. (2020) model the distribution via a binary Ising model and recover the parameters with triplet methods. Other statistical models have been designed for extended PWS settings (Shin et al., 2021) or for extended definitions of LFs, e.g., partial labeling functions (Yu et al., 2022), indirect labeling functions (Zhang et al., 2021a), and positive-only labeling functions (Zhang et al., 2022b). Besides the statistical label models, researchers have recently proposed neural network-based models to leverage instance features (Ren et al., 2020; Rühling Cachay et al., 2021), while in this work we aim to incorporate instance features into a purely statistical model.
Prior to PWS, statistical models for label aggregation were developed separately in the field of crowdsourcing. Dawid and Skene (1979) used a confusion-matrix parameter to generatively model LF labels conditioned on the item's true annotation, for clinical diagnostics. Kim and Ghahramani (2012) formulated a Bayesian generalization with Dirichlet priors and inference by Gibbs sampling, while Li et al. (2019) introduced subtypes as mixture components to capture correlation and decoupled the confusion matrix, turning the Bayesian generative process into a mixture model. Their analysis of clustering inferred worker confusion matrices is a natural precursor to modelling worker correlation.
3 PRELIMINARIES
In this section, we first introduce the setup and notation of programmatic weak supervision (PWS), then discuss two representative Bayesian models that can be used in PWS. We also discuss multi-class Gaussian process classification, which is related to our proposed method.
3.1 Notation
Let $X = \{\vec{x}_1, \ldots, \vec{x}_N\}$ denote a training set of $N$ featured data samples. Assume that there are $L$ labeling functions (LFs) with outputs $\vec{y}_i = [y_{i1}, \ldots, y_{iL}]$, $j \in [L]$, each of which classifies each sample into one of $K$ categories or abstains (outputting $-1$). Let $z_i$ be the latent true label of sample $i$, $y_{ij}$ the label that LF $j$ assigns to item $i$, and $Y_i$ the set of LFs that have labelled item $i$.
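For concreteness, the following is a minimal sketch (not from the paper) of this data layout, assuming the common convention, as in Wrench, that an abstention is encoded as $-1$; all values are hypothetical.

```python
import numpy as np

# Toy PWS data layout: N = 4 samples, L = 3 labeling functions, K = 2 classes.
# An entry of -1 marks an abstention (assumed convention, as in Wrench).
N, L, K = 4, 3, 2
X = np.random.randn(N, 5)             # instance features ~x_i (5-dim, arbitrary)
Y = np.array([[ 0,  1, -1],           # y_ij: label that LF j assigns to sample i
              [ 1,  1,  1],
              [-1,  0,  0],
              [ 1, -1,  1]])
# Y_i: the set of LFs that actually labelled item i (i.e., did not abstain)
Y_sets = [np.flatnonzero(Y[i] != -1) for i in range(N)]
print(Y_sets)
```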
3.2 Bayesian Classifier Combination (BCC) Models
Independent BCC. The iBCC model (Kim and Ghahramani, 2012) is a directed graphical model and a popular extension of the Dawid-Skene (DS) model (Dawid and Skene, 1979) that makes a conditional independence assumption between LFs. The iBCC model assumes that, given the true label $z_i$ of $x_i$, the LF labels for $x_i$ are generated independently by the different LFs,
$$p(y_{i1}, \ldots, y_{iL} \mid z_i) = \prod_{j=1}^{L} p(y_{ij} \mid z_i). \qquad (1)$$
This is referred to as the LFs' conditional independence assumption. However, this underlying independence assumption prevents the model from capturing correlations between labels from different LFs.
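As an illustration only (not the paper's code), the sketch below evaluates the right-hand side of Eq. (1) from hypothetical per-LF confusion matrices:

```python
import numpy as np

# Evaluate Eq. (1): given z_i = k, the joint probability of the LF votes is a
# product of per-LF confusion-matrix entries. All parameters are hypothetical.
L, K = 3, 2
rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(K), size=(L, K))   # theta[j, k, c] = p(y_ij = c | z_i = k)
votes = np.array([0, 1, 0])                      # one sample's LF labels (no abstentions)

def ibcc_joint(votes, k, theta):
    """p(y_i1, ..., y_iL | z_i = k) under the conditional independence assumption."""
    return np.prod([theta[j, k, votes[j]] for j in range(len(votes))])

print([ibcc_joint(votes, k, theta) for k in range(K)])
```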
Enhanced BCC. The EBCC model (Li et al., 2019) is an extension of iBCC which introduces $M$ subtypes to capture the correlation between LFs and aggregates the captured correlation via tensor rank decomposition. The joint distribution over the outputs of multiple LFs can be approximated by a linear combination of rank-1 tensors, known as tensor rank decomposition (Hitchcock, 1927), i.e.,
$$p(y_1, \ldots, y_L \mid z = k) \approx \sum_{m=1}^{M} \pi_{km}\, \vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}, \qquad (2)$$
where $\otimes$ is the tensor product. EBCC interprets the tensor decomposition as a mixture model, where $\vec{v}_{1km}, \ldots, \vec{v}_{Lkm}$ are mixture components shared by all the data samples, and $\pi_{km}$ is the mixture coefficient. It follows that
$$p(y_1, \ldots, y_L \mid z) = \sum_{m=1}^{M} p(g = m \mid z) \prod_{j=1}^{L} p(y_j \mid z, g = m),$$
where $g$ is an auxiliary latent variable used for indexing mixture components. All the mixture components result from a categorical distribution governed by the parameter $\beta_k$, where $\beta_{kk} = a$ and $\beta_{kk'} = b$, which is equivalent to assuming that every LF has correctly labelled $a$ items under every class and has made each kind of mistake $b$ times. The $M$ components under class $k$ can be seen as $M$ subtypes, each of which can be used to explain the correlation between LF labels given class $k$ (Li et al., 2019).
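To make the mixture in Eq. (2) concrete, here is a small sketch (not the paper's code) that evaluates the rank-1 mixture for hypothetical parameters:

```python
import numpy as np

# Evaluate the rank-1 mixture of Eq. (2) for hypothetical parameters:
# p(y_1, ..., y_L | z = k) ~= sum_m pi_km * prod_j v_jkm[y_j].
L, K, M = 3, 2, 2
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(M), size=K)           # pi[k, m]: mixture coefficients
v = rng.dirichlet(np.ones(K), size=(L, K, M))    # v[j, k, m, c] = p(y_j = c | z = k, g = m)
votes = np.array([1, 1, 0])

def ebcc_joint(votes, k, pi, v):
    """Sum over subtypes m of pi_km times the product of per-LF factors."""
    return sum(pi[k, m] * np.prod([v[j, k, m, votes[j]] for j in range(len(votes))])
               for m in range(pi.shape[1]))

print([ebcc_joint(votes, k, pi, v) for k in range(K)])
```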
3.3 Multi-class Gaussian Process Classification
The multi-class Gaussian process (GP) classification model places a latent GP prior on each class, $\vec{f}_i = (f_{i1}, \ldots, f_{iK})$, where $f_i \sim \mathcal{GP}(m, \Sigma)$, $m$ is the mean over samples, and $\Sigma$ is the kernel function. The conditional distribution is modeled by a categorical likelihood,
$$p(y_i = k \mid x_i, \vec{f}_i) = h^{(k)}(\vec{f}_i(x_i)), \qquad (3)$$
where $h^{(k)}(\cdot)$ is a function that maps the real vector of GP values to a probability vector. For $h(\cdot)$, the most common way to form a categorical likelihood is through the softmax transformation
$$p(y_i = k \mid \vec{f}_i) = \frac{\exp(f_{ik})}{\sum_{k'=1}^{K} \exp(f_{ik'})}, \qquad (4)$$
where $f_{ik}$ denotes $f_k(x_i)$ and, for clarity, we omit the conditioning on $x_i$.
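The sketch below (illustrative only; the RBF kernel and all values are our assumptions, not the paper's choices) draws latent GP values and maps them to class probabilities as in Eq. (4):

```python
import numpy as np

# Draw latent GP values for each class and map them to class probabilities
# with the softmax of Eq. (4). The RBF kernel and all values are illustrative.
rng = np.random.default_rng(2)
N, K = 5, 3
X = rng.normal(size=(N, 2))

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

Sigma = rbf_kernel(X, X) + 1e-6 * np.eye(N)                  # kernel matrix plus jitter
F = rng.multivariate_normal(np.zeros(N), Sigma, size=K).T    # f_ik: one latent GP per class

def softmax(F):
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

print(softmax(F))    # row i gives p(y_i = k | f_i); each row sums to 1
```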
4 METHODS
In this section, we introduce the proposed FABLE model. In a nutshell, it connects the mixture coefficients of the EBCC model with the instance features via a predictive Gaussian process (GP). We then introduce a set of auxiliary variables to handle the non-conjugacy in the model and ensure efficient variational inference. Finally, we present the generative process, the joint distribution, and the inference procedure of the FABLE model.
4.1 Leveraging Instance Features via Mixture Coefficients
In this work, we aim to explicitly incorporate instance features into a statistical label model built upon EBCC. To attack this problem, we leverage Gaussian process (GP) classification. Specifically, we model the mixture coefficients of EBCC as the output of a GP classifier, which takes the instance features as input. We generate $N \times K \times M$ latent function values, one for each data sample, class, and subtype, and apply the logistic-softmax over subtypes and classes to acquire the mixture coefficients of each data sample. In particular, we rewrite Equation 2 as
$$p(y_{i1}, \ldots, y_{iL} \mid z_i = k) = \sum_{m=1}^{M} \pi_{ikm} \left[\vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}\right] \approx \sum_{m=1}^{M} h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) \left[\vec{v}_{1km} \otimes \cdots \otimes \vec{v}_{Lkm}\right],$$
where $\sigma(\cdot)$ is the sigmoid function and $\pi_{ikm} = p(g_i = m \mid z_i)$. $\vec{f}_i$ are the GP latent functions for sample $x_i$, with $f_i = f(\vec{x}_i)$ and $f_i \sim \mathcal{GP}(m_i, \Sigma)$. We discuss the details and advantages of our usage of the GP classifier in the sequel.
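As a small sketch (not the paper's code), the following computes instance-dependent mixture coefficients from hypothetical GP latents using the logistic-softmax of Eq. (5) below; shapes and values are assumptions.

```python
import numpy as np

# Instance-dependent mixture coefficients: pass per-instance GP latents f_ikm
# through the logistic-softmax of Eq. (5). Shapes and values are hypothetical.
rng = np.random.default_rng(3)
N, K, M = 4, 2, 3
F = rng.normal(size=(N, K, M))                             # latent GP values f_ikm

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

S = sigmoid(F)
pi = S / S.reshape(N, -1).sum(axis=1)[:, None, None]       # pi_ikm = sigma(f_ikm) / sum_jn sigma(f_ijn)
print(pi.reshape(N, -1).sum(axis=1))                       # each instance's coefficients sum to 1
```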
4.2 Handling the Non-conjugate Prior
Given the proposed model, we would like to infer the true labels via standard mean-field variational inference, following prior work (Li et al., 2019). However, a key challenge that prevents us from performing variational inference is that, as a categorical likelihood function, the softmax is non-conjugate to the Gaussian prior, so the variational posterior $q(f_{ikm})$ cannot be derived analytically. Inspired by Polson et al. (2013) and Galy-Fajou et al. (2020), we propose to resolve the non-conjugate mapping function in the complete data likelihood by introducing a number of auxiliary latent variables, such that the augmented complete data likelihood falls into the exponential family, which is conjugate to the Gaussian prior.
In the following, we (1) decouple the GP latent variables $f_{ikm}$ in the denominator by introducing a set of auxiliary $\lambda$-variables and the logistic-softmax function, (2) simplify the model likelihood by introducing Poisson random variables, and (3) use a Pólya-Gamma representation of the sigmoid function to achieve the desired conjugate representation of our model.
Decouple GP latent variables. Following Galy-Fajou et al. (2020), we first replace the softmax likelihood with the logistic-softmax likelihood,
$$\pi_{ikm} = h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) = \frac{\sigma(f_{ikm})}{\sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})}, \qquad (5)$$
where $\sigma(z) = (1 + \exp(-z))^{-1}$ is the logistic function. To remedy the intractable normalizer term $\sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})$, we use the integral identity $\frac{1}{x} = \int_0^{\infty} e^{-\lambda x}\, d\lambda$ and express the likelihood (5) as
$$h^{(k,m)}_{\mathrm{softmax}}(\sigma(\vec{f}_i)) = \sigma(f_{ikm}) \int_0^{\infty} \exp\Big(-\lambda_i \sum_{j=1}^{K} \sum_{n=1}^{M} \sigma(f_{ijn})\Big)\, d\lambda_i. \qquad (6)$$
By interpreting $\lambda_i$ as an additional latent variable, we obtain the augmented likelihood
$$p(\pi_{ikm} \mid f_{ikm}, \lambda_i) = \sigma(f_{ikm}) \prod_{j=1}^{K} \prod_{n=1}^{M} \exp(-\lambda_i \sigma(f_{ijn})), \qquad (7)$$
where we impose the improper prior $p(\lambda_i) \propto \mathbf{1}_{[0, \infty)}$, $\forall i \in [1, N]$. The improper prior is not problematic since it leads to a proper complete conditional distribution, as we will see at the end of the section.
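As a quick sanity check (ours, not the paper's), the integral identity used in Eq. (6) can be verified numerically:

```python
import numpy as np
from scipy.integrate import quad

# Numerical check of the identity 1/x = int_0^inf exp(-lambda * x) d lambda,
# which Eq. (6) uses to pull the logistic-softmax normalizer out of the denominator.
x = 2.7                                        # any positive normalizer value
approx, _ = quad(lambda lam: np.exp(-lam * x), 0, np.inf)
print(approx, 1.0 / x)                         # both ~0.370
```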
Poisson augmentation. We leverage the moment generating function of the Poisson distribution $\mathrm{Po}(\lambda)$,
$$\exp(\lambda(z - 1)) = \sum_{n=0}^{\infty} z^{n}\, \mathrm{Po}(n \mid \lambda).$$
Using $z = \sigma(-f)$ together with $\sigma(f) = 1 - \sigma(-f)$, we rewrite the exponential factors as
$$\exp(-\lambda_i \sigma(f_{ijn})) = \exp\big(\lambda_i(\sigma(-f_{ijn}) - 1)\big) = \sum_{\upsilon_{ijn}=0}^{\infty} \big(\sigma(-f_{ijn})\big)^{\upsilon_{ijn}}\, \mathrm{Po}(\upsilon_{ijn} \mid \lambda_i),$$
which leads to the augmented likelihood
$$p(\pi_{ikm} \mid f_{ikm}, \upsilon_{ikm}, \lambda_i) = \sigma(f_{ikm}) \cdot \prod_{j=1}^{K} \prod_{n=1}^{M} \big(\sigma(-f_{ijn})\big)^{\upsilon_{ijn}}, \qquad (8)$$
where $\upsilon_{ikm} \sim \mathrm{Po}(\lambda_i)$.
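The Poisson identity above can also be checked numerically; the snippet below is our own verification (not the paper's code), with arbitrary values of $\lambda$ and $f$:

```python
import numpy as np
from scipy.stats import poisson

# Numerical check of the Poisson MGF identity exp(lambda*(z - 1)) = sum_n z^n Po(n | lambda),
# with z = sigmoid(-f) as in the augmentation; lambda and f are arbitrary here.
lam, f = 1.7, 0.4
z = 1.0 / (1.0 + np.exp(f))                                    # sigmoid(-f)
series = sum(z ** n * poisson.pmf(n, lam) for n in range(60))  # truncated series
print(series, np.exp(lam * (z - 1.0)))                         # the two values agree
```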
Complete with Pólya-Gamma. In the last step, we aim for a Gaussian representation of the sigmoid function. The Pólya-Gamma representation allows us to rewrite a power of the sigmoid function as a scale mixture of Gaussians,
$$\sigma(z)^{\upsilon} = \int_0^{\infty} 2^{-\upsilon} \exp\Big(\frac{\upsilon z}{2} - \frac{z^2}{2}\omega\Big)\, \mathrm{PG}(\omega \mid \upsilon, 0)\, d\omega, \qquad (9)$$
where $\mathrm{PG}(\omega \mid \upsilon, b)$ is a Pólya-Gamma distribution. By applying this augmentation to Equation 8 we obtain
$$p(\pi_{ikm} \mid f_{ikm}, \upsilon_{ikm}, \omega_{ikm}) = 2^{-(\pi_{ikm} + \upsilon_{ikm})} \exp\Big(\frac{(\pi_{ikm} - \upsilon_{ikm}) f_{ikm}}{2}\Big) \exp\Big(-\frac{(f_{ikm})^2}{2}\omega_{ikm}\Big), \qquad (10)$$
where $\omega_{ikm} \sim \mathrm{PG}(\omega_{ikm} \mid \upsilon_{ikm}, 0)$ are Pólya-Gamma variables.
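As a closed-form check of Eq. (9) (ours, not the paper's), integrating out $\omega \sim \mathrm{PG}(\upsilon, 0)$ uses the Pólya-Gamma Laplace transform $\mathbb{E}[\exp(-\omega z^2/2)] = \cosh^{-\upsilon}(z/2)$, which recovers $\sigma(z)^{\upsilon}$ exactly:

```python
import numpy as np

# Closed-form check of Eq. (9): integrating out omega ~ PG(upsilon, 0) uses the
# Polya-Gamma Laplace transform E[exp(-omega * z^2 / 2)] = cosh(z/2)^(-upsilon),
# and the right-hand side then collapses back to sigma(z)^upsilon exactly.
z, upsilon = 0.8, 3
lhs = (1.0 / (1.0 + np.exp(-z))) ** upsilon
rhs = 2.0 ** (-upsilon) * np.exp(upsilon * z / 2) * np.cosh(z / 2) ** (-upsilon)
print(lhs, rhs)    # both ~0.329
```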
Finally, the complete conditionals of the GP latents $f_{ikm}$ are
$$p(f_{ikm} \mid \pi_{ikm}, \omega_{ikm}, \upsilon_{ikm}) = \mathcal{N}\Big(f_{ikm} \,\Big|\, \tfrac{1}{2}\hat{\Sigma}_{km}\big(\mathbb{E}[\pi_{ikm}] - \mathbb{E}[\upsilon_{ikm}]\big),\, \hat{\Sigma}_{km}\Big), \qquad (11)$$
where $\hat{\Sigma}_{km} = \big(\Sigma_{km}^{-1} + \mathrm{diag}(\mathbb{E}[\omega_{ikm}])\big)^{-1}$. For the conditional distribution of $\lambda_i$ we have
$$p(\lambda_i \mid \vec{\upsilon}_i) = \mathrm{Ga}\Big(1 + \sum_{j=1}^{K} \sum_{n=1}^{M} \gamma_{ijn},\, K\Big), \qquad (12)$$
where $\mathrm{Ga}(\cdot \mid a, b)$ denotes a gamma distribution with parameters $a$ and $b$, and $\gamma_{ijn}$ is the parameter of the joint distribution $p(\omega_{ikm}, \upsilon_{ikm})$, detailed in Appendix B.1.
In summary, by introducing the three auxiliary random variables $\lambda_i$, $\upsilon_{ikm}$, and $\omega_{ikm}$, we turn the posterior of $f_{ikm}$ from a non-conjugate softmax form into an exponential-family form, namely a Gaussian distribution, which is easy to infer by adopting variational inference with $q(f_{ikm}) \sim \mathcal{N}(\hat{m}, \hat{\Sigma})$.
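For illustration, here is a minimal sketch of the conjugate update in Eq. (11), under the assumption that the required expectations are already available from the other variational factors; all numbers are hypothetical placeholders, not the paper's implementation.

```python
import numpy as np

# Conjugate Gaussian update of Eq. (11) for one (k, m) pair, assuming the
# expectations E[pi], E[upsilon], E[omega] come from the other variational
# factors; all numbers here are hypothetical placeholders.
rng = np.random.default_rng(4)
N = 5
A = np.eye(N) + 0.3 * rng.standard_normal((N, N))
Sigma_km = A @ A.T                                # a valid prior covariance over the N samples
E_pi, E_ups = rng.random(N), rng.random(N)        # E[pi_ikm], E[upsilon_ikm] per sample
E_omega = rng.random(N)                           # E[omega_ikm] per sample

Sigma_hat = np.linalg.inv(np.linalg.inv(Sigma_km) + np.diag(E_omega))
m_hat = 0.5 * Sigma_hat @ (E_pi - E_ups)          # posterior mean of the GP latents f_km
print(m_hat)                                      # q(f_km) = N(m_hat, Sigma_hat)
```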
4.3 The Generative Process and Joint Distribution
Here, we summarize the generative process of the proposed model. We use the GP latent functions $F$ and the corresponding auxiliary variables $\Omega$ and $\Upsilon$ to generate the mixture coefficients $\Pi$. There are $K \times M$ subtypes in total, and we assume that item $i$ belongs to the $g_i$-th subtype of its class $z_i$, as in EBCC. The proposed model is shown in Figure 1 and its generative process is: