To Softmax or not to Softmax that is the question when applying Active Learning for Transformer Models

TO SOFTMAX, OR NOT TO SOFTMAX: THAT IS THE QUESTION
WHEN APPLYING ACTIVE LEARNING FOR TRANSFORMER
MODELS
A PREPRINT
Julius Gonsior
Technische Universität Dresden
Dresden, Germany
julius.gonsior@tu-dresden.de
Christian Falkenberg
Technische Universität Dresden
Dresden, Germany
christian.falkenberg@tu-dresden.de
Silvio Magino
Technische Universität Dresden
Dresden, Germany
silvio.magino@tu-dresden.de
Anja Reusch
Technische Universität Dresden
Dresden, Germany
anja.reusch@tu-dresden.de
Maik Thiele
Hochschule für Technik und Wirtschaft Dresden
Dresden, Germany
maik.thiele@htw-dresden.de
Wolfgang Lehner
Technische Universität Dresden
Dresden, Germany
wolfgang.lehner@tu-dresden.de
October 06, 2022
ABSTRACT
Despite achieving state-of-the-art results in nearly all Natural Language Processing applications,
fine-tuning Transformer-based language models still requires a significant amount of labeled data to
work. A well-known technique to reduce the amount of human effort in acquiring a labeled dataset
is Active Learning (AL): an iterative process in which only a minimal number of samples is labeled.
AL strategies require access to a quantified confidence measure of the model predictions. A
common choice is the softmax activation function for the final layer. As the softmax function
provides misleading probabilities, this paper compares eight alternatives on seven datasets. Our almost
paradoxical finding is that most of the methods are too good at identifying the truly most uncertain
samples (outliers), and that exclusively labeling outliers therefore results in worse performance. As
a heuristic we propose to systematically ignore such samples, which results in improvements of various
methods compared to the softmax function.
Keywords Active Learning · Transformer · Softmax · Uncertainty · Calibration · Deep Neural
Networks
1 Introduction
The most common use case of Machine Learning (ML) is supervised learning, which inherently requires a labeled
dataset to demonstrate the desired outcome to the to-be-trained ML model. This initial step of acquiring a labeled
dataset can only be accomplished by often rare-to-get and costly human domain experts; automation is not possible
as the automation via ML is exactly the task which should be learned. For example, the average cost for the common
labeling task of reliably segmenting a single image is 6.40 USD1. At the same time, recent advances in the field of Neural
1According to scale.ai as of December 2021
arXiv:2210.03005v1 [cs.LG] 6 Oct 2022
[Figure 1: diagram of the AL cycle – the learner θ is trained on the labeled data, the query strategy poses queries using softmax or an alternative confidence measure with Uncertainty Clipping, and the oracle (gold standard / human) retrieves labels]
Figure 1: Standard Active Learning Cycle including our proposed Uncertainty Clipping (UC) to influence the uncertainty-based ranking (using the probability Pθ(y|x) of the learner model θ in predicting class y for a sample x) by ignoring the top-k results
Network (NN) such as Transformer [Va17] models for Natural Language Processing (NLP) (with BERT [De19] being
the most prominent example) or Convolutional Neural Networks (CNNs) [LBH15] for computer vision resulted in huge
Deep NNs which require even more labeled training data. Reducing the amount of labeled data is therefore a primary
objective of making ML more applicable in real-world scenarios.
Besides artificially increasing the amount of labeled data by data augmentation, or partially automating the labeling
tasks for example through programmatic labeling functions [Ra17], the process of selecting out of the large pool of
unlabeled data which samples to label first, called Active Learning (AL), is the target to be improved in this work.
AL is an iterative process, in which in each iteration a new subset of unlabeled samples is selected for labeling using
human annotators. Due to the iterative selection, the already existent knowledge of the so-far labeled data can be
leveraged to select the most promising samples to be labeled next. The goal is to reduce the amount of necessary
labeling work while keeping the same model performance.
Despite successful application in a variety of domains, AL fails to work for very deep NN such as Transformer models,
rarely beating pure random sampling. The common explanation [DD22; GI22; Ka21; SWF22] is that AL methods
favor hard-to-learn samples, often simply called outliers, which negates the potential benefits of AL. An
additional possible explanation is the potential misuse of the softmax activation function of the final output layer as a
method for computing the confidence of the NN in its predictions.
Nearly all AL strategies rely on a method measuring the certainty of the to-be-trained ML model in its predictions as
probabilities. The reasoning behind this is that those samples with high model uncertainty are the most useful ones to learn
from, and should therefore be labeled first. For NNs, typically the softmax activation function is used for the last layer,
and its output is interpreted as the confidence probability of the NN. But interpreting the softmax output as the
true model confidence is a fallacy [PBZ21]. We therefore compare eight alternative methods for AL in an extensive
end-to-end evaluation of fine-tuning Transformer models on seven common classification datasets.
Our main contributions are: 1) a comprehensive comparison and evaluation of eight alternative methods to the vanilla
softmax function for calculating NN model certainty in the context of AL for fine-tuning Transformer models, and
2) the novel and easy-to-implement method Uncertainty Clipping (UC), which mitigates the negative effect of
uncertainty-based AL methods favoring outliers.
The remainder of this paper is structured as follows: In Section 2 we briefly explain AL, the Transformer model
architecture, and the softmax function. Section 3 presents the alternative confidence measurement techniques, Section 4
describes our experimental setup. Results are discussed in Section 5. In Section 6 we present related work and
conclude in Section 7.
2 Active Learning 101
Supervised learning techniques inherently rely on an annotated dataset. AL is a well-known technique for
saving human effort by iteratively selecting exactly those unlabeled samples for expert labeling that are the most
useful ones for the overall classification task. The goal is to train a classification model θ which maps samples x ∈ X
to a respective label y ∈ Y; for the training, the labels Y have to be provided “somehow”. Figure 1 shows a standard
pool-based AL cycle: Given a small initial labeled dataset L = {(x_i, y_i)}_{i=0}^{n} of n samples x_i ∈ X with the respective
labels y_i ∈ Y, and a large unlabeled pool U = {x_i}, x_i ∉ L, an ML model called learner θ : X → Y is trained on the
labeled set.
A query strategy f : U → Q then subsequently chooses a batch of b unlabeled samples Q, which will be labeled
by the oracle (human expert) and added to the set of labeled data L. This AL cycle repeats τ times until a stopping
criterion is met.
In its most basic form, the focus of the sampling criteria is on either informativeness/uncertainty or representativeness/diversity. Informativeness in this context prefers samples which reduce the error of the underlying classification
model by minimizing the uncertainty of the model in predicting unknown samples. Representativeness aims at an
evenly distributed sampling in the vector space of the samples. As we are interested in improving the core measure
of informativeness, the confidence of the ML model in its own predictions, we concentrate in this paper on
evaluating AL strategies that rely solely on the informativeness metric.
Commonly used AL query strategies relying on informativeness use the confidence of the learner model θ
to select the AL query. The confidence is defined by the probability Pθ(y|x) of the learner in classifying a sample x
with the label y. The simplest informativeness AL strategy is Uncertainty Least Confidence (LC) [LG94], which
selects those samples the learner model is most uncertain about, i.e. where the probability Pθ(ŷ|x) of the most
probable label ŷ is the lowest:

fLC(U) = argmax_{x∈U} (1 − Pθ(ŷ|x))    (1)

As we are interested in the effectiveness of alternative methods in computing the confidence probability Pθ, we
use each method in combination with Uncertainty Least Confidence (LC), which directly relies solely on the
confidence without further processing.
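Assuming the learner exposes its class probabilities Pθ(y|x) as a row-normalized array, Equation (1) reduces to a few lines. The helper name `least_confidence_query` is illustrative, not from the paper.

```python
import numpy as np

def least_confidence_query(proba, batch_size):
    """Uncertainty Least Confidence (LC), Eq. (1): rank samples by
    1 - Pθ(ŷ|x), the complement of the most probable class's probability,
    and return the indices of the batch_size most uncertain samples.

    proba: array of shape (n_samples, n_classes), rows summing to 1.
    """
    uncertainty = 1.0 - proba.max(axis=1)      # 1 - Pθ(ŷ|x) per sample
    return np.argsort(-uncertainty)[:batch_size]
```

For example, with probabilities [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], the second sample (confidence 0.55) ranks as most uncertain.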
3 Confidence Probability Quantification Methods
The confidence probability of an ML model should represent how probable it is that the prediction is true. For
example, a confidence of 70% should mean a correct prediction in 70 out of 100 cases. We use Neural Networks
(NNs) as the ML model. As its components always sum to one, the softmax function is often used as a makeshift
probability measure for the confidence of NNs:

σ(z_i) = exp(z_i) / Σ_{j=1}^{K} exp(z_j),  for i = 1, . . . , K    (2)
The output z_i of the last neuron i before entering the activation function is called a logit; K denotes the number
of neurons in the last layer. But as other researchers have noted in the past [DD22; GI22;
LPB17; SWF22; WT22], the objective in training NNs is purely to maximize the value of the correct output
neuron, not to create a true confidence probability. An inherent limitation of the softmax function is its inability to
have – in the theoretical case – zero confidence in its prediction, as the sum over all possible outcomes always equals
1. Previous works have indicated that softmax-based confidence is often overconfident [GG16]. Especially NNs using
the popular ReLU activation function [Ag18] for the inner layers can easily be tricked into being overly confident
in any prediction by simply scaling the input x with an arbitrarily large value α > 1 to x̃ = αx [HAB19; PBZ21].
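The overconfidence effect of Equation (2) is easy to reproduce. The sketch below scales the logits directly as a stand-in for the input scaling x̃ = αx (in a ReLU network, scaling the input scales the logits roughly proportionally): the same ranking of classes yields an arbitrarily high "confidence" as α grows.

```python
import numpy as np

def softmax(z):
    """Eq. (2); subtracting max(z) is the standard numerical-stability trick
    and does not change the result."""
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
# Scaling the logits by alpha > 1 pushes the top softmax output toward 1,
# even though the model has learned nothing new about the sample.
confidences = [softmax(alpha * logits).max() for alpha in (1, 5, 50)]
```

The confidences grow monotonically toward 1 with α, illustrating why the raw softmax output is a poor uncertainty measure.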
We selected seven methods from the literature suitable to quantify the confidence of Deep NNs such as Transformer
models. They can be divided into four categories [Ga21]: a) single network deterministic methods, which deterministically produce the same result for each NN forward pass (Inhibited Softmax (IS) [MSK18], TrustScore (TrSc) [Ji18],
and Evidential Neural Networks (Evi) [SKK18]); b) Bayesian methods, which sample from a distribution and therefore
yield non-deterministic results (Monte-Carlo Dropout (MC) [GG16]); c) ensemble methods, which combine
multiple deterministic models into a single decision (Softmax Ensemble [SOS92]); and d) test-time augmentation
methods, which, similarly to the ensemble methods, augment the input samples and return the combined prediction
for the augmented samples. The last category is a subject of future research, as we could not find a subset of data
augmentation techniques that reliably worked well for our use case across different datasets.
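As an example of category b), Monte-Carlo Dropout keeps dropout active at inference time and averages the softmax outputs over several stochastic forward passes. The sketch below applies a random (inverted) dropout mask to the input of a hypothetical logit function; the function name and setup are illustrative, not the paper's implementation.

```python
import numpy as np

def mc_dropout_confidence(logit_fn, x, n_passes=100, p_drop=0.5, seed=None):
    """Monte-Carlo Dropout [GG16]: average the softmax over n_passes
    stochastic forward passes with dropout still enabled; the spread of the
    per-pass predictions reflects model uncertainty."""
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        mask = rng.random(x.shape) > p_drop          # random dropout mask
        z = logit_fn(x * mask / (1.0 - p_drop))      # inverted-dropout scaling
        e = np.exp(z - z.max())                      # stable softmax
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)                    # averaged class probabilities
```

In a real Transformer, the dropout layers inside the network would stay active instead of masking the input, but the averaging principle is the same.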
In addition to the aforementioned categories, the existing softmax function can be calibrated to produce meaningful
confidence probabilities. For calibration we selected the two techniques Label Smoothing (LS) [Sz16] and Temperature
Scaling (TeSc) [ZKH20].
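Temperature Scaling in its simplest form divides the logits by a single scalar T fitted on held-out validation data; T > 1 softens overconfident predictions. A minimal grid-search sketch (the function names and the grid are assumptions for illustration; the cited papers fit T by gradient descent on the NLL):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: divide the logits by T before Eq. (2)."""
    s = z / T
    e = np.exp(s - s.max())
    return e / e.sum()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Choose the temperature T that minimizes the negative log-likelihood
    of the true labels on a held-out validation set."""
    def nll(T):
        return -sum(np.log(softmax(z, T)[y]) for z, y in zip(logits, labels))
    return min(grid, key=nll)
```

On systematically overconfident validation logits, the fitted T exceeds 1 and the calibrated confidences drop accordingly.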
More elaborate AL strategies like BALD [KAG19] or QUIRE [HJZ10] not only focus on the confidence probability
measure, but also make use of the vector space to label a diverse training set, including regions far away from the
classification boundary. As the focus of this paper is purely on evaluating the influence of the confidence prediction
methods, we deliberately use only the most basic AL strategy, Uncertainty Least Confidence.
In the following, the core ideas of the individual methods are briefly explained; more details, reasoning, and the exact
formulas can be found in the original papers. At the end, we explain our meta-strategy Uncertainty Clipping,
which significantly enhances several confidence probability quantification methods.
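If Uncertainty Clipping is read as Figure 1 describes it, i.e. simply ignoring the top-k results of the uncertainty ranking before selecting the query batch, it can be sketched as follows. This is an assumption-laden sketch of the idea, not the paper's exact formulation, and the function name is hypothetical.

```python
import numpy as np

def uncertainty_clipping_query(proba, batch_size, k=2):
    """Least Confidence with Uncertainty Clipping (UC): rank samples by
    1 - Pθ(ŷ|x), but skip the k most uncertain ones, since the very top of
    the ranking tends to be dominated by hard-to-learn outliers."""
    uncertainty = 1.0 - proba.max(axis=1)
    ranking = np.argsort(-uncertainty)     # most uncertain first
    return ranking[k:k + batch_size]       # clip the top-k, then take the batch
```

Compared to plain LC, the only change is the offset `k`, which makes UC trivially applicable on top of any uncertainty-ranking-based confidence method.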