
To evaluate its performance, we conduct experiments on the text classification tasks of the Wrench benchmark (Zhang et al., 2021). Our model achieves state-of-the-art performance compared to standalone models, as well as when combined with and compared to the self-improvement method Cosine (Yu et al., 2021). An ablation study shows the importance of each information routing strategy. The experiments further show that, in addition to its task performance, the model is able to memorize the labeling function information.
The contributions can be summarized in three parts: 1) We introduce a new intuition about the information provided by labeling functions and turn it into a method, SepLL, which reflects this intuition in the latent space. 2) We provide an analysis through experiments on the Wrench benchmark, an ablation study, and an in-depth analysis of the two latent spaces. 3) We provide the code and a suitably transformed version of the input data at https://github.com/AndSt/sepll.
2 Related Work
Weak Supervision.
A main concern in machine learning is that a large amount of labeled data is needed to train models that achieve state-of-the-art performance. Among others, the field of weak supervision aims to address this issue. The idea is to formalize human knowledge or intuitions into weak supervision sources, called labeling functions, which can be used to produce weak labels. Examples of labeling functions are heuristic rules, e.g., keyword matches or regular expressions, other pre-trained classifiers, or knowledge bases as used in distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009; Hoffmann et al., 2011; Takamatsu et al., 2012).
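As an illustration, the following is a minimal sketch of a keyword-based labeling function in the common convention where a function returns a class label when its heuristic fires and an abstain value otherwise; the keywords, label names, and abstain value are hypothetical and not taken from the paper.

```python
# Minimal sketch of a keyword-based labeling function (hypothetical example).
# Convention: return a class id when the heuristic fires, ABSTAIN (-1) otherwise.
ABSTAIN = -1
SPAM = 1

def lf_contains_free(text: str) -> int:
    """Weakly label a message as SPAM if it mentions 'free'."""
    return SPAM if "free" in text.lower() else ABSTAIN

print(lf_contains_free("Claim your FREE prize now!"))  # -> 1 (SPAM)
print(lf_contains_free("Meeting moved to 3pm."))       # -> -1 (ABSTAIN)
```

Applying a set of such functions to an unlabeled corpus yields a matrix of labeling function matches, which the methods discussed next aggregate into weak labels.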
A main challenge in a weak supervision setting is how to create accurate labeling functions and how to unify and denoise their outputs. Majority vote, Snorkel (Ratner et al., 2017), which is based on data programming, and Flying Squid (Fu et al., 2020) are methods that compute weak labels based on generative models over the labeling function matches and the unknown true labels. These models are referred to as label models. Subsequently, so-called end models, e.g., BERT-style classifiers (Devlin et al., 2019) or methods dedicated to noisy training labels, are used to train a final model.
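To make the label model step concrete, the following is a minimal sketch of the simplest such aggregation, majority vote, over a matrix of labeling function matches; the variable names and the abstain convention are illustrative assumptions, not code from the paper or from the cited systems.

```python
import numpy as np

ABSTAIN = -1  # labeling functions abstain with -1 (assumed convention)

def majority_vote(lf_matches: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate an (n_samples, n_lfs) matrix of weak labels into one weak
    label per sample by taking the most frequent non-abstaining vote."""
    weak_labels = np.full(lf_matches.shape[0], ABSTAIN)
    for i, row in enumerate(lf_matches):
        votes = row[row != ABSTAIN]
        if votes.size > 0:
            counts = np.bincount(votes, minlength=n_classes)
            weak_labels[i] = counts.argmax()
    return weak_labels

# Three samples, two labeling functions; the last sample receives no vote.
matches = np.array([[1, 1], [0, ABSTAIN], [ABSTAIN, ABSTAIN]])
print(majority_vote(matches, n_classes=2))  # -> [ 1  0 -1]
```

Snorkel and Flying Squid replace this simple vote with a generative model that estimates labeling function accuracies and correlations, but the input and output of the label model step are the same.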
Recently, neural methods, including the use of pre-trained models, have gained more traction. Cachay et al. (2021) use a classifier and a probabilistic encoder for the labeling function matches and optimize them using a noise-aware loss. Similarly, Ren et al. (2020) combine a classifier and an attention-based denoiser, but also include unlabeled samples. Yu et al. (2021) introduce Cosine, a method to self-optimize classification models. They leverage contrastive learning and confidence regularization, i.e., training on high-confidence samples, to improve a model's performance.
Other approaches use additional signals. For instance, ImplyLoss (Awasthi et al., 2020) uses access to exemplars, i.e., single correctly labeled samples, and ASTRA (Karamanolakis et al., 2021) follows an attention-based student-teacher mechanism with additional supervision from a few manually annotated samples. Zhu et al. (2022) use a meta self-refinement approach that makes use of access to the validation performance.
Our experiments are built on the Weak Supervision Benchmark (Wrench) (Zhang et al., 2021), a framework that aims to provide a unified and standardized way to run and evaluate weak supervision approaches. A wide range of tasks, datasets, and implementations of weak supervision methods is available.
Latent Variable Modelling.
Existing work regarding latent variable modelling in different areas of machine learning has influenced the rationale behind this work. Research in representation learning has focused on explicitly modelling mutually independent factors of variation, e.g., color in computer vision, in some latent space; this is often called disentanglement (Bengio et al., 2013). This is transferable to our setting, as we aim to obtain the task prediction as a disentangled factor. An important early technique is Independent Component Analysis (ICA) (Comon, 1994). Kingma and Welling (2014) introduced variational autoencoders (VAEs) for neural networks, allowing complex data distributions to be represented as simple distributions in the latent space. An extension is given by β-VAE (Higgins et al., 2017), which is more suitable for disentanglement. In addition, there has been progress on theoretical work that aims to give insight into what information is identifiable using self-supervised learning (SSL); e.g., Zimmermann et al. (2021) prove under certain assumptions that it inverts the data generation process. An interesting perspective is the separation of content and style, e.g., the animal in a picture (content) and the