SepLL: Separating Latent Class Labels from Weak Supervision Noise
Andreas Stephan1,2, Vasiliki Kougia1,2, Benjamin Roth1,3
1Research Group Data Mining and Machine Learning,
Faculty of Computer Science, University of Vienna, Vienna, Austria
2UniVie Doctoral School Computer Science, Vienna, Austria
3Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria
{andreas.stephan,vasiliki.kougia,benjamin.roth}@univie.ac.at
Abstract
In the weakly supervised learning paradigm,
labeling functions automatically assign heuris-
tic, often noisy, labels to data samples. In
this work, we provide a method for learn-
ing from weak labels by separating two types
of complementary information associated with
the labeling functions: information related to
the target label and information specific to
one labeling function only. Both types of
information are reflected to different degrees
by all labeled instances. In contrast to pre-
vious works that aimed at correcting or re-
moving wrongly labeled instances, we learn
a branched deep model that uses all data as-
is, but splits the labeling function information
in the latent space. Specifically, we propose
the end-to-end model SepLL which extends a
transformer classifier by introducing a latent
space for labeling function specific and task-
specific information. The learning signal is
only given by the labeling function matches;
no pre-processing or label model is required
for our method. Notably, the task prediction is
made from the latent layer without any direct
task signal. Experiments on Wrench text clas-
sification tasks show that our model is compet-
itive with the state-of-the-art, and yields a new
best average performance.
1 Introduction
In recent years, large language modelling ap-
proaches have proven their applicability to a wide
range of tasks, mainly due to the pre-training and
fine-tuning paradigm. This has created a need for
large labeled datasets, as training on these datasets
enables models to achieve state-of-the-art perfor-
mance. However, obtaining manually created la-
bels is expensive, tedious and often requires expert
knowledge. As a consequence, significant areas of
research are devoted to addressing this challenge
by minimizing the need for labeled data. For ex-
ample, research directions include transfer learning
(Ruder et al.,2019) or few-shot learning (Brown
et al.,2020). Another research direction to address
this challenge is weakly supervised learning. The
idea is to use human intuitions, heuristics and ex-
isting resources, e.g., related databases, to create
weak (noisy) labels.
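To make this setting concrete, consider the spam detection example shown later in Figure 1. The following is a minimal sketch of keyword-based labeling functions in Python; the function and variable names are illustrative and not tied to any specific implementation.

    # Minimal sketch: keyword heuristics acting as labeling functions.
    # Each labeling function returns a weak label or None (abstain).
    def lf_contains_won(text: str):
        return "spam" if "won" in text.lower() else None

    def lf_contains_bank_account(text: str):
        return "spam" if "bank account" in text.lower() else None

    labeling_functions = [lf_contains_won, lf_contains_bank_account]

    texts = ["You've won $1000!", "Log in to verify your bank account", "See you at lunch"]
    weak_labels = [[lf(t) for lf in labeling_functions] for t in texts]
    # [["spam", None], [None, "spam"], [None, None]] -> noisy, incomplete supervision

Such weak labels are cheap to obtain but can be wrong or missing for many samples, which motivates the approaches discussed next.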
Several approaches have been proposed to in-
crease the quality of the resulting labels. For exam-
ple, Ratner et al. (2017) use generative modeling
to learn a probability distribution over the labeling
function matches, i.e., weak labels, and unknown
true labels in order to denoise the labels and subse-
quently train a classifier. Recently, several works
use student-teacher schemes that use knowledge in-
herent to pre-trained models (Karamanolakis et al.,
2021;Cachay et al.,2021;Ren et al.,2020). Usu-
ally a summary statistic of weak labels, such as ma-
jority vote, is used as ground truth and iteratively
updated during training, for example by employing
a regularization based on the prediction confidence
of the model (Yu et al.,2021). Thus, most meth-
ods share the property that the weak labels, i.e.,
the learning signals, are transformed or updated
throughout the learning process.
Instead of updating the weak labels, we want to
keep them as-is and make use of a different intu-
ition. Each labeling function provides information
relevant to the prediction task but also information
only related to the function itself. Our idea is to
view these two types of information as complemen-
tary and build a model which separates them.
To this end, we propose SepLL, an end-to-end
model that stacks two branched latent layers, rep-
resenting target-task-related and labeling-function-
related information, on top of a transformer en-
coder and recombines them for predicting labeling
function occurrences (Figure 1). Thereby, the learn-
ing signal is only given by the weak labels. No-
tably, the task prediction is performed from the
latent space without any direct supervision. Multi-
ple information routing strategies are employed to
improve the separation.
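As a rough sketch of this branched architecture, the head on top of the encoder could look as follows; the layer types, the additive re-combination, and all names below are our illustrative assumptions, not the exact SepLL implementation.

    import torch
    import torch.nn as nn

    class BranchedLFHead(nn.Module):
        # Sketch: one linear path for task-specific and one for LF-specific
        # information, re-combined in the space of labeling functions.
        def __init__(self, hidden_dim, n_lfs, n_classes, T):
            super().__init__()
            self.task_head = nn.Linear(hidden_dim, n_classes)   # latent class logits
            self.lf_head = nn.Linear(hidden_dim, n_lfs)          # LF-specific logits
            self.register_buffer("T", T.float())                 # (n_lfs, n_classes) LF-to-class map

        def forward(self, z):                      # z: encoder output, (batch, hidden_dim)
            y_logits = self.task_head(z)           # latent task prediction (no direct supervision)
            lf_logits = self.lf_head(z)            # information tied to individual LFs
            l_hat = lf_logits + y_logits @ self.T.t()   # map classes back to LF space and re-combine
            return l_hat, y_logits

    # The only training signal is a cross-entropy loss between softmax(l_hat) and the
    # observed distribution of labeling function matches; y_logits is used for classification.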
In order to evaluate the performance, experi-
ments on the text classification tasks of the Wrench
benchmark (Zhang et al.,2021) are performed. Our
model achieves state-of-the-art performance when
compared to standalone models as well as when
combined and compared with the self-improvement
method Cosine (Yu et al.,2021). An ablation study
shows the importance of each information routing
strategy. The experiments show that in addition to
its task performance, the model is able to memorize
the labeling function information.
The contributions can be summarized in three
parts: 1) We introduce a new intuition about the in-
formation provided by labeling functions and turn
it into a method, SepLL, reflecting the intuition in
the latent space. 2) We provide an analysis through
experiments on the Wrench benchmark, an abla-
tion study and an in-depth analysis of the two latent spaces. 3) We provide the code and a suitably transformed version of the input data (https://github.com/AndSt/sepll).
2 Related Work
Weak Supervision.
A main concern in machine
learning is that a large amount of labeled data is
needed in order to train models that achieve state-
of-the-art performance. Among others, the field
of weak supervision aims to address this issue.
The idea is to formalize human knowledge or in-
tuitions into weak supervision sources, called la-
beling functions, which can be used to produce
weak labels. Examples of labeling functions are
heuristic rules, e.g., keywords, regular expressions,
other pre-trained classifiers or knowledge bases in
distant supervision (Craven and Kumlien,1999;
Mintz et al.,2009;Hoffmann et al.,2011;Taka-
matsu et al.,2012).
A main challenge that appears in a weak super-
vision setting is how to create accurate labeling
functions and how to unify and denoise them. Ma-
jority vote, Snorkel (Ratner et al.,2017) (based
on data programming) and Flying Squid (Fu et al.,
2020) are methods that compute weak labels based
on generative models over the labeling function
matches and unknown true labels. These models
are referred to as label models. Subsequently, so-
called end-models, e.g., BERT-style classifiers (De-
vlin et al.,2019), or methods dedicated to noisy
training labels are used to train a final model.
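For illustration, the simplest such label model, majority vote, can be sketched as follows; the variable names are ours, and this is not the generative model of Snorkel or Flying Squid.

    import numpy as np

    def majority_vote(L, lf_to_class, n_classes):
        # L: (n_samples, n_lfs) binary matrix of labeling function matches
        # lf_to_class: (n_lfs,) class id that each labeling function votes for
        votes = np.zeros((L.shape[0], n_classes))
        for j, c in enumerate(lf_to_class):
            votes[:, c] += L[:, j]              # each matching LF votes for its class
        labels = votes.argmax(axis=1)           # most-voted class becomes the weak label
        labels[votes.sum(axis=1) == 0] = -1     # -1 marks samples without any match
        return labels

The end-model is then trained on these aggregated weak labels.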
Recently, neural methods, including the use of
pre-trained models, gained more traction. Cachay
et al. (2021) use a classifier and a probabilistic
encoder for the labeling function matches and opti-
mize them using a noise-aware loss. Similarly, Ren
et al. (2020) combine a classifier and an attention-
based denoiser, but also include unlabeled sam-
ples. Yu et al. (2021) introduced Cosine, which
is a method to self-optimize classification models.
They leverage contrastive learning and confidence
regularization, i.e., high-confidence samples, to op-
timize a model’s performance.
Other approaches use additional signals. For
instance, ImplyLoss (Awasthi et al.,2020) uses
access to exemplars, i.e., single, correctly labeled
samples, and ASTRA (Karamanolakis et al., 2021)
follows an attention-based student-teacher mechanism with additional supervision from a few manually annotated samples. Zhu et al. (2022) use a meta self-refinement approach which makes
use of access to the validation performance.
Our experiments are built on the Weak Super-
vision Benchmark (Wrench) (Zhang et al.,2021),
which is a framework that aims to provide a uni-
fied and standardized way to run and evaluate weak
supervision approaches. A wide range of tasks,
datasets and implementations of weak supervision
methods are available.
Latent Variable Modelling.
Existing work re-
garding latent variable modelling in different areas
of machine learning has influenced the rationale be-
hind this work. Research in representation learning
has focused on modelling mutually independent
factors of variation, e.g., color in computer vision,
explicitly in some latent space. Often this is called
disentanglement (Bengio et al.,2013). This is trans-
ferable to our setting as we aim to obtain the task
prediction as a disentangled factor. An important
early technique is Independent Component Anal-
ysis (ICA) (Comon,1994). Kingma and Welling
(2014) introduced variational autoencoders (VAEs)
to neural networks, allowing complex data distri-
butions to be represented as simple distributions
in the latent space. An extension is given by
$\beta$-VAE (Higgins et al., 2017), which is more suitable
for disentanglement. In addition, there has been
progress on theoretical work, which aims to give
insight into what information is identifiable by
using self-supervised learning (SSL), e.g., Zimmer-
mann et al. (2021) prove under certain assumptions
that it inverts the data generation process. An inter-
esting perspective is the separation of content and
style, e.g., the animal in a picture (content) and the
camera angle of the image (style). Under milder assumptions than those of Zimmermann et al. (2021), von Kügelgen et al. (2021) prove that this separation is achieved using SSL. These works are not directly applicable to our task, because we want to separate general aspects of labeling functions, which are useful for prediction tasks, from labeling-function-specific aspects. Another line of research models the true distribution in the latent layer (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017; Bekker and Goldberger, 2016) while training on the noisy training labels. The typical assumption is that the noise distribution only depends on the class. In weak supervision the noise depends on the input, by definition of the labeling functions, so these types of assumptions are not directly applicable.

Figure 1: Overview of SepLL. Text is embedded into $Z$ by a Transformer encoder, and then this representation is split into labeling-function-specific and task-specific information. The task-specific information is translated back into the LF space and re-combined into $\hat{L}$. A cross-entropy loss between the distribution of labeling function matches $L$ and $\hat{L}$ is minimized. The latent task prediction $\hat{Y}$ can be used for classification. The figure also illustrates a spam detection example with two keyword labeling functions (LF1: return "spam" if "won" in text; LF2: return "spam" if "bank account" in text) and the resulting binary LF-match matrix.
3 Method
The motivation of this work is that each labeling
function provides two types of information. On
the one hand, it provides information about the
target task, e.g., spam detection, and on the other
hand it provides information related to the labeling
function itself. This translates to our model, called
SepLL, which aims to separate these two types of
signals in a latent space. Figure 1 provides an
overview of SepLL.
In this section, we first introduce some notation
and then describe the architecture of SepLL. Following that, we discuss the training mechanisms that aim to support the separation of the two information types.
3.1 Problem Setup and Notation
In general, the goal is to solve classification tasks,
e.g., spam detection asks whether a text is spam or
not. The input space is denoted by $X$ and the unknown labels are denoted by $Y = \{y_1, \ldots, y_c\}$. Additionally, $m$ labeling functions $l_i : X \rightarrow \{y\} \cup \{\emptyset\}$, $i = 1, \ldots, m$, are given, where each labeling function (LF) either assigns a dedicated specific label $y \in Y$ to a sample or abstains from labeling. If a label is assigned, we say a labeling function matches a sample. The task is to use the input $X$ and the labeling functions $l_i$ to learn a mapping $X \rightarrow Y$.
We use the format of the Knodle (Sedova et al.,
2021) framework, where each labeling function is
encoded as a labeler for exactly one class. This is
in contrast to other conventions where a single la-
beling function is allowed to label multiple classes,
e.g., in Ratner et al. (2016). This convention can
easily be transformed into our setting, by splitting
multi-class LFs into multiple class-specific LFs.
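A minimal sketch of this transformation could look as follows; the variable names and the representation of multi-class LF outputs are our assumptions for illustration.

    import numpy as np

    def split_multiclass_lfs(lf_outputs, lf_classes):
        # lf_outputs: (n_samples, n_lfs) array with the class id emitted by each LF
        #             on each sample, or -1 if the LF abstains
        # lf_classes: list of sets; lf_classes[j] is the set of classes LF j may emit
        columns, lf_to_class = [], []
        for j, classes in enumerate(lf_classes):
            for c in sorted(classes):
                columns.append((lf_outputs[:, j] == c).astype(int))  # one new LF per (LF, class)
                lf_to_class.append(c)
        L = np.stack(columns, axis=1)   # binary matching matrix with one class per LF
        return L, np.array(lf_to_class)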
The matching matrix $L \in \{0,1\}^{n \times m}$ describes whether labeling function $j$ matches sample $i$ by setting $L_{ij} = 1$, otherwise $L_{ij} = 0$. The mapping matrix $T \in \{0,1\}^{m \times |Y|}$ reflects a simple mapping between labeling function $i$ and class $j$ by $T_{ij} = 1$, otherwise $T_{ij} = 0$.
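As a small illustration of these definitions, the two matrices for the spam example of Figure 1 could be built as follows; the values and names are purely illustrative.

    import numpy as np

    n, m, n_classes = 3, 2, 2            # samples, labeling functions, classes
    lf_to_class = np.array([0, 0])       # both keyword LFs label class 0 ("spam")

    L = np.array([[1, 0],                # L[i, j] = 1 iff labeling function j matches sample i
                  [0, 1],
                  [0, 0]])

    T = np.zeros((m, n_classes), dtype=int)
    T[np.arange(m), lf_to_class] = 1     # T[j, c] = 1 iff labeling function j labels class c

    class_votes = L @ T                  # (n, n_classes): LF votes per class for each sample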