
To evaluate its performance, we conduct experiments on the text classification tasks of the Wrench benchmark (Zhang et al., 2021). Our model achieves state-of-the-art performance compared to standalone models, as well as when combined with and compared to the self-improvement method Cosine (Yu et al., 2021). An ablation study shows the importance of each information routing strategy. The experiments further show that, in addition to its task performance, the model is able to memorize the labeling function information.
The contributions can be summarized in three parts: 1) We introduce a new intuition about the information provided by labeling functions and turn it into a method, SepLL, which reflects this intuition in the latent space. 2) We provide an analysis through experiments on the Wrench benchmark, an ablation study, and an in-depth analysis of the two latent spaces. 3) We provide the code and a suitably transformed version of the input data at https://github.com/AndSt/sepll.
2 Related Work
Weak Supervision.
A main concern in machine learning is that a large amount of labeled data is needed to train models that achieve state-of-the-art performance. Among others, the field of weak supervision aims to address this issue. The idea is to formalize human knowledge or intuitions into weak supervision sources, called labeling functions, which can be used to produce weak labels. Examples of labeling functions are heuristic rules, e.g., keyword matches or regular expressions, other pre-trained classifiers, or knowledge bases as used in distant supervision (Craven and Kumlien, 1999; Mintz et al., 2009; Hoffmann et al., 2011; Takamatsu et al., 2012).
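As an illustration, the following is a minimal sketch of a keyword-based labeling function in the common convention where a function returns a class label when its heuristic fires and an abstain value otherwise; the keywords, label names, and abstain value are hypothetical and not taken from the paper.

```python
# Minimal sketch of a keyword-based labeling function (hypothetical example).
# Convention: return a class id when the heuristic fires, ABSTAIN (-1) otherwise.
ABSTAIN = -1
SPAM = 1

def lf_contains_free(text: str) -> int:
    """Weakly label a message as SPAM if it mentions 'free'."""
    return SPAM if "free" in text.lower() else ABSTAIN

print(lf_contains_free("Claim your FREE prize now!"))  # -> 1 (SPAM)
print(lf_contains_free("Meeting moved to 3pm."))       # -> -1 (ABSTAIN)
```

Applying a set of such functions to an unlabeled corpus yields a matrix of labeling function matches, which the methods discussed next aggregate into weak labels.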
A main challenge in a weak supervision setting is how to create accurate labeling functions and how to unify and denoise their outputs. Majority vote, Snorkel (Ratner et al., 2017), which is based on data programming, and Flying Squid (Fu et al., 2020) are methods that compute weak labels based on generative models over the labeling function matches and the unknown true labels. These models are referred to as label models. Subsequently, so-called end models, e.g., BERT-style classifiers (Devlin et al., 2019) or methods dedicated to noisy training labels, are used to train a final model.
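To make the label model step concrete, the following is a minimal sketch of the simplest such aggregation, majority vote, over a matrix of labeling function matches; the variable names and the abstain convention are illustrative assumptions, not code from the paper or from the cited systems.

```python
import numpy as np

ABSTAIN = -1  # labeling functions abstain with -1 (assumed convention)

def majority_vote(lf_matches: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate an (n_samples, n_lfs) matrix of weak labels into one weak
    label per sample by taking the most frequent non-abstaining vote."""
    weak_labels = np.full(lf_matches.shape[0], ABSTAIN)
    for i, row in enumerate(lf_matches):
        votes = row[row != ABSTAIN]
        if votes.size > 0:
            counts = np.bincount(votes, minlength=n_classes)
            weak_labels[i] = counts.argmax()
    return weak_labels

# Three samples, two labeling functions; the last sample receives no vote.
matches = np.array([[1, 1], [0, ABSTAIN], [ABSTAIN, ABSTAIN]])
print(majority_vote(matches, n_classes=2))  # -> [ 1  0 -1]
```

Snorkel and Flying Squid replace this simple vote with a generative model that estimates labeling function accuracies and correlations, but the input and output of the label model step are the same.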
Recently, neural methods, including the use of pre-trained models, have gained more traction. Cachay et al. (2021) use a classifier and a probabilistic encoder for the labeling function matches and optimize them using a noise-aware loss. Similarly, Ren et al. (2020) combine a classifier and an attention-based denoiser, but also include unlabeled samples. Yu et al. (2021) introduce Cosine, a method to self-optimize classification models. They leverage contrastive learning and confidence regularization, i.e., training on high-confidence samples, to improve a model's performance.
Other approaches use additional signals. For instance, ImplyLoss (Awasthi et al., 2020) uses access to exemplars, i.e., single correctly labeled samples, and ASTRA (Karamanolakis et al., 2021) follows an attention-based student-teacher mechanism with additional supervision from a few manually annotated samples. Zhu et al. (2022) use a meta self-refinement approach that makes use of access to the validation performance.
Our experiments are built on the Weak Supervision Benchmark (Wrench) (Zhang et al., 2021), a framework that aims to provide a unified and standardized way to run and evaluate weak supervision approaches. A wide range of tasks, datasets, and implementations of weak supervision methods is available.
Latent Variable Modelling.
Existing work regarding latent variable modelling in different areas of machine learning has influenced the rationale behind this work. Research in representation learning has focused on explicitly modelling mutually independent factors of variation, e.g., color in computer vision, in some latent space; this is often called disentanglement (Bengio et al., 2013). This is transferable to our setting, as we aim to obtain the task prediction as a disentangled factor. An important early technique is Independent Component Analysis (ICA) (Comon, 1994). Kingma and Welling (2014) introduced variational autoencoders (VAEs) for neural networks, allowing complex data distributions to be represented as simple distributions in the latent space. An extension is given by β-VAE (Higgins et al., 2017), which is more suitable for disentanglement. In addition, there has been progress on theoretical work that aims to give insight into what information is identifiable using self-supervised learning (SSL); e.g., Zimmermann et al. (2021) prove under certain assumptions that it inverts the data generation process. An interesting perspective is the separation of content and style, e.g., the animal in a picture (content) and the