To Softmax or not to Softmax that is the question when applying Active Learning for Transformer Models

TO SOFTMAX, OR NOT TO SOFTMAX: THAT IS THE QUESTION
WHEN APPLYING ACTIVE LEARNING FOR TRANSFORMER
MODELS
A PREPRINT
Julius Gonsior
Technische Universität Dresden
Dresden, Germany
julius.gonsior@tu-dresden.de
Christian Falkenberg
Technische Universität Dresden
Dresden, Germany
christian.falkenberg@tu-dresden.de
Silvio Magino
Technische Universität Dresden
Dresden, Germany
silvio.magino@tu-dresden.de
Anja Reusch
Technische Universität Dresden
Dresden, Germany
anja.reusch@tu-dresden.de
Maik Thiele
Hochschule für Technik und Wirtschaft Dresden
Dresden, Germany
maik.thiele@htw-dresden.de
Wolfgang Lehner
Technische Universität Dresden
Dresden, Germany
wolfgang.lehner@tu-dresden.de
October 06, 2022
ABSTRACT
Despite achieving state-of-the-art results in nearly all Natural Language Processing applications,
fine-tuning Transformer-based language models still requires a significant amount of labeled data to
work. A well-known technique to reduce the amount of human effort in acquiring a labeled dataset
is Active Learning (AL): an iterative process in which only a minimal number of samples is labeled.
AL strategies require access to a quantified confidence measure of the model predictions. A
common choice is the softmax activation function for the final layer. As the softmax function
provides misleading probabilities, this paper compares eight alternatives on seven datasets. Our almost
paradoxical finding is that most of the methods are too good at identifying the truly most uncertain
samples (outliers), and that exclusively labeling outliers therefore results in worse performance. As
a heuristic we propose to systematically ignore such samples, which results in improvements of various
methods compared to the softmax function.
Keywords Active Learning · Transformer · Softmax · Uncertainty · Calibration · Deep Neural
Networks
1 Introduction
The most common use case of Machine Learning (ML) is supervised learning, which inherently requires a labeled
dataset to demonstrate the desired outcome to the to-be-trained ML model. This initial step of acquiring a labeled
dataset can only be accomplished by often rare-to-get and costly human domain experts; automation is not possible
as the automation via ML is exactly the task which should be learned. For example, the average cost for the common
labeling task of reliably segmenting a single image is 6.40 USD1. At the same time, recent advances in the field of Neural
1According to scale.ai as of December 2021
arXiv:2210.03005v1 [cs.LG] 6 Oct 2022
[Figure 1: diagram of the AL cycle – the learner θ is trained on the labeled data, the query strategy poses queries using softmax or an alternative confidence measure with Uncertainty Clipping, and the oracle (gold standard / human) retrieves labels]
Figure 1: Standard Active Learning Cycle including our proposed Uncertainty Clipping (UC) to influence the uncertainty-based ranking (using the probability Pθ(y|x) of the learner model θ in predicting class y for a sample x) by ignoring the top-k results
Network (NN) such as Transformer [Va17] models for Natural Language Processing (NLP) (with BERT [De19] being
the most prominent example) or Convolutional Neural Networks (CNNs) [LBH15] for computer vision resulted in huge
Deep NNs which require even more labeled training data. Reducing the amount of labeled data is therefore a primary
objective of making ML more applicable in real-world scenarios.
Besides artificially increasing the amount of labeled data by data augmentation, or partially automating the labeling
tasks for example through programmatic labeling functions [Ra17], the process of selecting out of the large pool of
unlabeled data which samples to label first, called Active Learning (AL), is the target to be improved in this work.
AL is an iterative process, in which in each iteration a new subset of unlabeled samples is selected for labeling using
human annotators. Due to the iterative selection, the already existent knowledge of the so-far labeled data can be
leveraged to select the most promising samples to be labeled next. The goal is to reduce the amount of necessary
labeling work while keeping the same model performance.
Despite successful application in a variety of domains, AL fails to work for very deep NN such as Transformer models,
rarely beating pure random sampling. The common explanation [DD22; GI22; Ka21; SWF22] is that AL methods
favor hard-to-learn samples, often simply called outliers, which negates the potential benefits of AL. An
additional possible explanation is the potential misuse of the softmax activation function of the final output layer as a
method for computing the confidence of the NN in its predictions.
Nearly all AL strategies rely on a method measuring the certainty of the to-be-trained ML model in its predictions as
probabilities. The reasoning behind this is that those samples with high model uncertainty are the most useful ones to learn
from, and should therefore be labeled first. For NNs, typically the softmax activation function is used for the last layer,
and its output is interpreted as the confidence probability of the NN. But interpreting the softmax output as the
true model confidence is a fallacy [PBZ21]. We therefore compare eight alternative methods for AL in an extensive
end-to-end evaluation of fine-tuning Transformer models on seven common classification datasets.
Our main contributions are: 1) a comprehensive comparison and evaluation of eight alternative methods to the vanilla
softmax function for calculating NN model certainty in the context of AL for fine-tuning Transformer models, and
2) the novel and easy-to-implement method Uncertainty Clipping (UC), which mitigates the negative effect of
uncertainty-based AL methods favoring outliers.
The remainder of this paper is structured as follows: In Section 2 we briefly explain AL, the Transformer model
architecture, and the softmax function. Section 3 presents the alternative confidence measurement techniques, Section 4
describes our experimental setup. Results are discussed in Section 5. In Section 6 we present related work and
conclude in Section 7.
2 Active Learning 101
Supervised learning techniques inherently rely on an annotated dataset. AL is a well-known technique for
saving human effort by iteratively selecting exactly those unlabeled samples for expert labeling that are the most
useful ones for the overall classification task. The goal is to train a classification model θ which maps samples x ∈ X
to a respective label y ∈ Y; for the training, the labels Y have to be provided “somehow”. Figure 1 shows a standard
pool-based AL cycle: Given a small initial labeled dataset L = {(x_i, y_i)}_{i=0}^{n} of n samples x_i ∈ X with the respective
labels y_i ∈ Y, and a large unlabeled pool U = {x_i}, x_i ∉ L, an ML model called learner θ : X → Y is trained on the
labeled set.
A query strategy f : U → Q then subsequently chooses a batch of b unlabeled samples Q, which will be labeled
by the oracle (human expert) and added to the set of labeled data L. This AL cycle repeats τ times until a stopping
criterion is met.
In its most basic form, the focus of the sampling criteria is on either informativeness/uncertainty or representativeness/diversity. Informativeness in this context prefers samples which reduce the error of the underlying classification
model by minimizing the uncertainty of the model in predicting unknown samples. Representativeness aims at an
evenly distributed sampling in the vector space of the samples. As we are interested in improving the core measure
of informativeness, the confidence of the ML model in its own predictions, we concentrate in this paper on
evaluating AL strategies that rely solely on the informativeness metric.
Commonly used AL query strategies relying on informativeness use the confidence of the learner model θ
to select the AL query. The confidence is defined by the probability Pθ(y|x) of the learner in classifying a sample x
with the label y. The simplest informativeness AL strategy is Uncertainty Least Confidence (LC) [LG94], which
selects those samples the learner model is most uncertain about, i.e. where the probability Pθ(ŷ|x) of the most
probable label ŷ is the lowest:

fLC(U) = argmax_{x∈U} (1 − Pθ(ŷ|x))    (1)

As we are interested in the effectiveness of alternative methods in computing the confidence probability Pθ, we
use each method in combination with Uncertainty Least Confidence (LC), which directly relies solely on the
confidence without further processing.
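Assuming the learner exposes its class probabilities Pθ(y|x) as a row-normalized array, Equation (1) reduces to a few lines. The helper name `least_confidence_query` is illustrative, not from the paper.

```python
import numpy as np

def least_confidence_query(proba, batch_size):
    """Uncertainty Least Confidence (LC), Eq. (1): rank samples by
    1 - Pθ(ŷ|x), the complement of the most probable class's probability,
    and return the indices of the batch_size most uncertain samples.

    proba: array of shape (n_samples, n_classes), rows summing to 1.
    """
    uncertainty = 1.0 - proba.max(axis=1)      # 1 - Pθ(ŷ|x) per sample
    return np.argsort(-uncertainty)[:batch_size]
```

For example, with probabilities [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], the second sample (confidence 0.55) ranks as most uncertain.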
3 Confidence Probability Quantification Methods
The confidence probability of an ML model should represent how probable it is that the prediction is true. For
example, a confidence of 70% should mean a correct prediction in 70 out of 100 cases. We use Neural Networks
(NNs) as the ML model. As its components always sum to one, the softmax function is often used as a makeshift
probability measure for the confidence of NNs:

σ(z_i) = exp(z_i) / Σ_{j=1}^{K} exp(z_j),  for i = 1, . . . , K    (2)
The output z_i of the last neuron i before entering the activation function is called a logit; K denotes the number
of neurons in the last layer. But as other researchers have noted in the past [DD22; GI22;
LPB17; SWF22; WT22], the objective in training NNs is purely to maximize the value of the correct output
neuron, not to create a true confidence probability. An inherent limitation of the softmax function is its inability to
have – in the theoretical case – zero confidence in its prediction, as the sum over all possible outcomes always equals
1. Previous works have indicated that softmax-based confidence is often overconfident [GG16]. Especially NNs using
the popular ReLU activation function [Ag18] for the inner layers can easily be tricked into being overly confident
in any prediction by simply scaling the input x with an arbitrarily large value α > 1 to x̃ = αx [HAB19; PBZ21].
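The overconfidence effect of Equation (2) is easy to reproduce. The sketch below scales the logits directly as a stand-in for the input scaling x̃ = αx (in a ReLU network, scaling the input scales the logits roughly proportionally): the same ranking of classes yields an arbitrarily high "confidence" as α grows.

```python
import numpy as np

def softmax(z):
    """Eq. (2); subtracting max(z) is the standard numerical-stability trick
    and does not change the result."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
# Scaling the logits by alpha > 1 pushes the top softmax output toward 1,
# even though the model has learned nothing new about the sample.
confidences = [softmax(alpha * logits).max() for alpha in (1, 5, 50)]
```

The confidences grow monotonically toward 1 with α, illustrating why the raw softmax output is a poor uncertainty measure.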
We selected seven methods from the literature suitable to quantify the confidence of Deep NNs such as Transformer
models. They can be divided into four categories [Ga21]: a) single network deterministic methods, which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) [MSK18], TrustScore (TrSc) [Ji18],
and Evidential Neural Networks (Evi) [SKK18]); b) Bayesian methods, which sample from a distribution and therefore
yield non-deterministic results (Monte-Carlo Dropout (MC) [GG16]); c) ensemble methods, which combine
multiple deterministic models into a single decision (Softmax Ensemble [SOS92]); and d) test-time augmentation
methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction
for the augmented samples. The last category is a subject of future research, as we could not find a subset of data
augmentation techniques that reliably worked well for our use case across different datasets.
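As an example of category b), Monte-Carlo Dropout keeps dropout active at inference time and averages the softmax outputs over several stochastic forward passes. The sketch below applies a random (inverted) dropout mask to the input of a hypothetical logit function; the function name and setup are illustrative, not the paper's implementation.

```python
import numpy as np

def mc_dropout_confidence(logit_fn, x, n_passes=100, p_drop=0.5, seed=None):
    """Monte-Carlo Dropout [GG16]: average the softmax over n_passes
    stochastic forward passes with dropout still enabled; the spread of the
    per-pass predictions reflects model uncertainty."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        mask = rng.random(x.shape) > p_drop          # random dropout mask
        z = logit_fn(x * mask / (1.0 - p_drop))      # inverted-dropout scaling
        e = np.exp(z - z.max())                      # stable softmax
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)                    # averaged class probabilities
```

In a real Transformer, the dropout layers inside the network would stay active instead of masking the input, but the averaging principle is the same.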
In addition to the aforementioned categories, the existing softmax function can be calibrated to produce meaningful
confidence probabilities. For calibration we selected the two techniques Label Smoothing (LS) [Sz16] and Temperature
Scaling (TeSc) [ZKH20].
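Temperature Scaling in its simplest form divides the logits by a single scalar T fitted on held-out validation data; T > 1 softens overconfident predictions. A minimal grid-search sketch (the function names and the grid are assumptions for illustration; the cited papers fit T by gradient descent on the NLL):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: divide the logits by T before Eq. (2)."""
    s = z / T
    e = np.exp(s - s.max())
    return e / e.sum()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Choose the temperature T that minimizes the negative log-likelihood
    of the true labels on a held-out validation set."""
    def nll(T):
        return -sum(np.log(softmax(z, T)[y]) for z, y in zip(logits, labels))
    return min(grid, key=nll)
```

On systematically overconfident validation logits, the fitted T exceeds 1 and the calibrated confidences drop accordingly.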
More elaborate AL strategies like BALD [KAG19] or QUIRE [HJZ10] not only focus on the confidence probability
measure, but also make use of the vector space to label a diverse training set, including regions far away from the
classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction
methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.
In the following, the core ideas of the individual methods are briefly explained; more details, reasoning, and the exact
formulas can be found in the original papers. At the end, we explain our meta-strategy Uncertainty Clipping,
which significantly enhances several confidence probability quantification methods.
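If Uncertainty Clipping is read as Figure 1 describes it, i.e. simply ignoring the top-k results of the uncertainty ranking before selecting the query batch, it can be sketched as follows. This is an assumption-laden sketch of the idea, not the paper's exact formulation, and the function name is hypothetical.

```python
import numpy as np

def uncertainty_clipping_query(proba, batch_size, k=2):
    """Least Confidence with Uncertainty Clipping (UC): rank samples by
    1 - Pθ(ŷ|x), but skip the k most uncertain ones, since the very top of
    the ranking tends to be dominated by hard-to-learn outliers."""
    uncertainty = 1.0 - proba.max(axis=1)
    ranking = np.argsort(-uncertainty)     # most uncertain first
    return ranking[k:k + batch_size]       # clip the top-k, then take the batch
```

Compared to plain LC, the only change is the offset `k`, which makes UC trivially applicable on top of any uncertainty-ranking-based confidence method.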