Explaining Translationese: why are Neural Classifiers Better and what do they Learn?
Kwabena Amponsah-Kaakyire*1,2, Daria Pylypenko*1, Josef van Genabith1,2,
and Cristina España-Bonet2
1Saarland University, 2German Research Center for Artificial Intelligence (DFKI)
Saarland Informatics Campus, Saarbrücken, Germany
amponsahkaakyirek@gmail.com
daria.pylypenko@uni-saarland.de
{cristinae, Josef.Van_Genabith}@dfki.de
*Equal contribution.

Abstract
Recent work has shown that neural feature- and representation-learning, e.g. BERT, achieves superior performance over traditional manual feature-engineering based approaches, with e.g. SVMs, in translationese classification tasks. Previous research did not show (i) whether the difference is because of the features, the classifiers or both, and (ii) what the neural classifiers actually learn. To address (i), we carefully design experiments that swap features between BERT- and SVM-based classifiers. We show that an SVM fed with BERT representations performs at the level of the best BERT classifiers, while BERT learning and using handcrafted features performs at the level of an SVM using handcrafted features. This shows that the performance differences are due to the features. To address (ii), we use integrated gradients and find that (a) there is indication that information captured by hand-crafted features is only a subset of what BERT learns, and (b) part of BERT's top performance results are due to BERT learning topic differences and spurious correlations with translationese.
1 Introduction
Translationese is a descriptive (non-negative) cover term for the systematic differences between translated and originally authored text in the same language (Gellerstam, 1986). Some aspects of translationese, such as source interference (Toury, 1980; Teich, 2003), are language dependent; others are presumed universal, e.g. simplification, explicitation, and overadherence to target language linguistic norms (Volansky et al., 2015) in the products of translations. While translationese effects can be subtle, especially for professional human translation, corpus-based studies (Baker et al., 1993) and, in particular, machine-learning and classifier based studies (Rabinovich and Wintner, 2015; Volansky et al., 2015; Rubino et al., 2016; Pylypenko et al., 2021) clearly reveal the differences.
While research on translationese is important from a theoretical point of view (translation universals, specific interference), it also has a direct impact on machine translation research: Kurokawa et al. (2009), Stymne (2017), Toral et al. (2018), Zhang and Toral (2019), Freitag et al. (2019), Graham et al. (2020), and Riley et al. (2020), amongst others, show that translation direction in training and test data impacts results, that already translated test data are easier to translate than original data, that machine translation and post-editing result in translationese, and that mitigating translationese in MT output can improve results. Translationese also impacts cross-lingual applications, e.g. question answering and natural language inference (Singh et al., 2019; Clark et al., 2020; Artetxe et al., 2020).
In this paper we focus on machine-learning-classifier-based research on translationese. Here, typically a classifier is trained to distinguish between original and translated texts (in the same language). Until recently, most of this research (Baroni and Bernardini, 2005; Volansky et al., 2015; Rubino et al., 2016) used manually defined, often linguistically inspired, feature-engineering based sets of features, mostly with support vector machines (SVMs). Once a classifier is trained, feature importance and ranking methods are used to reason back to which aspects of the input are responsible for (i.e. explain) the classification (and whether this accords with linguistic theorisation). More recently, a small number of papers explored feature- and representation-learning neural network based approaches to translationese classification (Sominsky and Wintner, 2019). In a systematic study, Pylypenko et al. (2021) show that feature- and representation-learning deep neural network-based approaches (in particular BERT-based, but
also other neural approaches) to translationese classification substantially outperform handcrafted feature-engineering based approaches using SVMs. However, to date, two important questions remain:
(i) it is not clear whether the substantial performance differences are due to learned vs. handcrafted features, the classifiers (the SVM, the BERT classification head, or full BERT), or the combination of both, and (ii) what the neural feature and representation learning approaches actually learn and how that explains the superior classification.
The contributions of our paper are as follows:
1. we address (i) by carefully crossing features and classifiers: feeding BERT-based learned features to feature-engineering models (SVMs), feeding the BERT classification head with hand-crafted features, making BERT architectures learn handcrafted features, as well as feeding embeddings of handcrafted features into BERT. Our experiments show that SVMs using BERT-learned features perform on a par with our best BERT translationese classifiers, while BERT using handcrafted features only performs at the level of feature-engineering-based classifiers. This shows that it is the features, and not the classifiers, that lead to the substantial (up to 20% points accuracy absolute) difference in performance.
2. we present the first steps to address (ii) using integrated gradients, an attribution-based approach, on the BERT models trained in various settings. Based on striking similarities in attributions between BERT trained from scratch and BERT pretrained on handcrafted features and fine-tuned on text data, as well as comparable classification accuracies, we find evidence that the hand-crafted features do not bring any additional information over the set learnt by BERT. It is therefore likely that the hand-crafted features are a (possibly partial) subset of the features learnt by BERT. Inspecting the most attributed tokens, we present evidence of 'Clever Hans' behaviour: at least part of the high classification accuracy of BERT is due to names of places and countries, suggesting that part of the classification is topic- and not translationese-based. Moreover, some top features suggest that there may be some punctuation-based spurious correlation in the data.
2 Related Work
Combining learned and hand-crafted features. Kaas et al. (2020), Prakash and Tayyar Madabushi (2020), and Lim and Tayyar Madabushi (2020) combine BERT-based and manual features in order to improve accuracy. Kazameini et al. (2020), Ray and Garain (2020), and Zhang and Yamana (2020) concatenate BERT pooled output embeddings with handcrafted feature vectors for classification, often using an SVM, where the handcrafted feature vector might be further encoded by a neural network or used as is. Our work differs in that we do not combine features from both models but swap them in order to decide whether it is the features, the classifiers or the combination that explains the performance difference between neural and feature-engineering based models. Additionally, our approach allows us to examine whether or not representation learning learns features similar to hand-crafted features.
Explainability for the feature-engineering approach to translationese classification. To date, explainability in translationese research has mainly focused on quantifying handcrafted feature importance. Techniques include inspecting SVM feature weights (Avner et al., 2016; Pylypenko et al., 2021), correlation (Rubino et al., 2016), information gain (Ilisei et al., 2010), chi-square (Ilisei et al., 2010), decision trees or random forests (Rubino et al., 2016; Ilisei et al., 2010), ablating features and observing the change in accuracy (Baroni and Bernardini, 2005; Ilisei et al., 2010), and training separate classifiers on each individual feature (or feature set) and comparing accuracies (Volansky et al., 2015; Avner et al., 2016). For n-grams, the difference in frequencies between the original and translationese classes (Koppel and Ordan, 2011; van Halteren, 2008), and the contribution to the symmetrized Kullback-Leibler divergence between the classes (Kurokawa et al., 2009) have been used.
Explainability for the neural approach to translationese classification. To date, explainability methods for neural networks have not been widely explored for translationese classification. Pylypenko et al. (2021) quantify the extent to which handcrafted features can explain the variance in the predictions of neural models, such as BERT, LSTMs, and a simplified Transformer, by training per-feature linear regression models to output the predicted probabilities of the neural models and computing the $R^2$ measure. They find that most of the top features are either POS-perplexity-based or bag-of-POS features. However, their method treats the neural network as a black box, whereas we use a method that accesses the internals of the model.
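To make that probing setup concrete, the following is a minimal sketch of such a per-feature linear regression analysis (our own illustration, not the authors' code; all variable names are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def feature_r2(feature_values: np.ndarray, neural_probs: np.ndarray) -> float:
    """Fit a linear regression from a single handcrafted feature to the
    neural model's predicted probabilities and return the R^2 score."""
    X = feature_values.reshape(-1, 1)             # (n_samples, 1)
    reg = LinearRegression().fit(X, neural_probs)
    return r2_score(neural_probs, reg.predict(X))
```

A high $R^2$ for a feature means that a linear function of that single feature closely tracks the neural model's output probability.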
Integrated Gradients (IG). In our work we use the Integrated Gradients method (Sundararajan et al., 2017) for explainability. This method provides attribution scores for the input with respect to a certain class. IG calculates the integral of gradients of the model F with respect to the input x (token embedding), along the path from a baseline x′ (in our case, the PAD token embedding) to the input x:
$$\mathrm{IntegratedGrads}_i(x) ::= (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\!\left(x' + \alpha \times (x - x')\right)}{\partial x_i}\, d\alpha \qquad (1)$$
The strength of the Integrated Gradients method is that it satisfies two fundamental axioms (Sensitivity and Implementation Invariance), while many other popular attribution methods, like Gradients (Simonyan et al., 2014), DeepLift (Shrikumar et al., 2017) and LRP (Bach et al., 2015), violate one or both of them. IG also satisfies the completeness axiom, that is, IG is comprehensive in accounting for attributions and does not just pick the top label (Sundararajan et al., 2017).
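As an illustration (not the paper's implementation), Eq. (1) can be approximated with a simple Riemann sum over interpolation steps; the sketch below assumes a model that maps input embeddings directly to class logits, with the PAD token embedding as baseline:

```python
import torch

def integrated_gradients(forward_fn, x, baseline, target_class, steps=50):
    """Riemann-sum approximation of Eq. (1).

    forward_fn: maps a batch of input embeddings to class logits.
    x, baseline: tensors of the same shape (baseline = PAD embedding).
    Returns per-dimension attributions with the same shape as x.
    """
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Point on the straight-line path from the baseline to the input.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = forward_fn(point)[:, target_class].sum()
        total_grad += torch.autograd.grad(score, point)[0]
    return (x - baseline) * total_grad / steps
```

In practice, library implementations such as Captum's IntegratedGradients handle batching, baselines and convergence checks.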
3 Experimental Settings
3.1 Data
For our experiments, we use the monolingual German dataset in the Multilingual Parallel Direct Europarl (MPDE) corpus (Amponsah-Kaakyire et al., 2021). The set contains 42k paragraphs, with half of the texts German originals and the other half translations into German from Spanish (see statistics in Appendix A.1). We perform paragraph-level classification with an average length of 80 tokens per training sample.
We additionally use an in-domain Europarl-based heldout corpus of around 30k paragraphs for training language models and estimating n-gram quartile distributions. This corpus consists of original German texts only.
3.2 Base Setup
We compare the traditional SVM-based feature engineering approach, which has demonstrated high performance in previous translationese research, to the BERT model, known to be very successful for various NLP tasks, including classification. As base setup, we reproduce the models from Pylypenko et al. (2021) for the two architectures and a new baseline:
1. a linear SVM on 108-dimensional handcrafted feature vectors (with surface, lexical, unigram bag-of-PoS, language modelling and n-gram frequency distribution features; see Pylypenko et al. (2021) for the detailed list of features). [handcr.-features+SVM]
2. a linear classifier (the BERT classification head, a simple linear FFN, except for the difference in input dimension) trained on the 108-dimensional handcrafted feature vectors. [handcr.-features+LinearClassifier]
3. Google's off-the-shelf pretrained BERT-base model (12 layers, 768 hidden dimensions, 12 attention heads), which we fine-tune on the MPDE corpus for translationese classification. [pretrained-BERT-ft]
4. a BERT-base model with the same settings trained from scratch on MPDE for translationese classification. [fromScratch-BERT]
For 1, we estimate n-gram language models with SRILM (Stolcke, 2002) and do POS-tagging with SpaCy (https://spacy.io/). For 3, we use multilingual BERT (Devlin et al., 2019) (bert-base-multilingual-uncased) and fine-tune with the simpletransformers library (github.com/ThilinaRajapakse/simpletransformers). We use a batch size of 32, a learning rate of 4·10⁻⁵, and the Adam optimiser with epsilon 1·10⁻⁸.
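For concreteness, a minimal sketch of this fine-tuning setup with simpletransformers might look as follows (the DataFrame layout follows simpletransformers' convention, and the example rows are placeholders, not data from the paper):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data: "labels" 0 = original German, 1 = translated.
train_df = pd.DataFrame({
    "text": ["Beispieltext eines Originals ...", "Beispiel einer Übersetzung ..."],
    "labels": [0, 1],
})

model = ClassificationModel(
    "bert",
    "bert-base-multilingual-uncased",
    args={
        "train_batch_size": 32,  # batch size reported above
        "learning_rate": 4e-5,   # learning rate reported above
        "adam_epsilon": 1e-8,    # Adam epsilon reported above
    },
    use_cuda=False,  # set True if a GPU is available
)
model.train_model(train_df)
```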
To ensure fair and comprehensive treatment, we carefully explore many experiments and variations below: we exchange input features between BERT and SVM architectures by (i) feeding BERT-learned features into SVMs (Section 3.3) and hand-crafted features into the BERT classification head, and (ii-a) letting the full BERT architecture learn handcrafted feature vectors used by SVMs and (ii-b) feeding handcrafted feature vectors as embeddings into the BERT model (Section 3.4).
3.3 SVM Classifier with BERT Features
We train an SVM with linear kernel on the features learnt by the pretrained BERT model fine-tuned on the translationese classification task. We use the output of the BERT pooler, which selects the last layer [CLS] token vector, with linear projection and tanh activation, as our feature vector. We use:
1. BERT's 768-dim pooled vector output, [pretrained-BERT-ft+SVM]

2. a 108-dim PCA projection of this vector. [pretrained-BERT-ft+PCA108+SVM]

The PCA projection allows us to match the handcrafted feature vector dimensionality.
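A minimal sketch of this step (our illustration; it assumes the 768-dim pooler outputs have already been extracted into a feature matrix X with labels y):

```python
from typing import Optional

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def fit_svm_on_bert_features(X: np.ndarray, y: np.ndarray,
                             pca_dims: Optional[int] = None) -> LinearSVC:
    """X: (n_samples, 768) BERT pooler outputs; y: 0/1 labels.
    If pca_dims is given (e.g. 108), project X down first to match the
    handcrafted feature dimensionality."""
    if pca_dims is not None:
        X = PCA(n_components=pca_dims).fit_transform(X)
    return LinearSVC().fit(X, y)
```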
3.4 BERT with Handcrafted Features
Apart from feeding hand-crafted feature vectors into a suitably adjusted BERT classification head [handcr.-features+LinearClassifier], we carefully design two strategies to force the full BERT architecture to use the handcrafted features.
Pretraining on handcrafted feature prediction. First, we train a BERT-base model from scratch on the MPDE dataset to predict the handcrafted features. This regression model [BERT-reg-full] takes unmasked text as input and predicts continuous values (the 108-dimension vectors representing the handcrafted features originally used in training the SVM). The complete feature vector is predicted at once, and the pretraining is done by minimizing the MSE loss between the predicted and the ground truth vectors. The weights of this model encode the information of the handcrafted features. With this pretrained model,
1. we freeze the weights, replace the regression head (the linear layer predicting the 108 features) with a linear classifier (a BERT classification head predicting the original or translationese label), and train the classifier on the MPDE data for translationese classification, [BERT-r2c-full-frozen] (r2c: regression-to-classification)

2. we do not freeze but fine-tune on MPDE for the translationese classification task. [BERT-r2c-full-ft]
The comparison between frozen and unfrozen weights is designed to provide us with insights into the importance of representation learning in BERT.
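The head swap itself is straightforward; a minimal PyTorch/transformers sketch (our illustration of the procedure, not the paper's code) is:

```python
import torch.nn as nn
from transformers import BertModel

class BertR2C(nn.Module):
    """Swap the 108-dim regression head of a feature-prediction BERT for a
    binary classification head; optionally freeze the encoder weights
    (the BERT-r2c-full-frozen vs. BERT-r2c-full-ft variants)."""
    def __init__(self, encoder: BertModel, freeze_encoder: bool = True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False
        # Classification head replacing the 108-dim regression head.
        self.classifier = nn.Linear(encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask=None):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(pooled)
```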
We reproduce the same approach as above with a smaller BERT model with only 6 layers instead of 12 [BERT-reg-half]. Interestingly, according to the losses when training for predicting the handcrafted features, the smaller BERT-reg-half performs comparably to BERT-reg-full (0.0041136 vs 0.0041148 MSE). We then load the weights of the small 6-layer model into the embedding layer and the first 6 layers of a 12-layer non-pretrained BERT-base model and, similarly as before:
3. we freeze the loaded weights in the first 6 layers and train the remaining 6 layers and the classifier on the translationese classification task, [BERT-r2c-half-frozen]

4. we do not freeze but fine-tune on the translationese classification task, with randomly-initialised weights for the other 6 layers. [BERT-r2c-half-ft]
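Loading the half-model weights can be done layer by layer; the sketch below is our illustration (the checkpoint path is hypothetical, and it assumes both models share the same hidden size and vocabulary):

```python
from transformers import BertConfig, BertModel

# Load the trained 6-layer BERT-reg-half encoder (hypothetical path).
small = BertModel.from_pretrained("path/to/bert-reg-half")
# Randomly initialised 12-layer BERT-base.
big = BertModel(BertConfig(num_hidden_layers=12))

# Copy the embeddings and the first 6 encoder layers from the small model.
big.embeddings.load_state_dict(small.embeddings.state_dict())
for i in range(6):
    big.encoder.layer[i].load_state_dict(small.encoder.layer[i].state_dict())

# For the BERT-r2c-half-frozen variant, freeze what was just loaded.
for module in [big.embeddings, *big.encoder.layer[:6]]:
    for p in module.parameters():
        p.requires_grad = False
```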
Mapping handcrafted features to embeddings. Even though the very low MSE results indicate that both versions of BERT-reg are able to learn the handcrafted features well, using them in the form of frozen layers for translationese classification leads to low classification performance (Section 4). This could be attributed to the fact that, not being an end-to-end approach, information losses accumulate: first, even though the MSE is low in BERT-reg, we do not have exactly the same features; and second, the features are not used directly for classification, but are encoded again by the network. This motivates us to explore an alternative way of encoding handcrafted features in an end-to-end manner.
We convert the single vector of handcrafted features of dimension D (108 in our experiments) into a sequence of embeddings in BERT's layer format, that is, the length of the feature embedding sequence L times the dimension of the hidden states H (768), while preserving the information of the single vector (Figure 1).

Figure 1: Mapping handcrafted features to embeddings.

To do this, we consider a batch of
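Purely as an illustration, one simple information-preserving mapping consistent with the description above (not necessarily the paper's exact construction) is to zero-pad the D-dimensional vector and reshape it into an L×H sequence:

```python
import torch

def features_to_embeddings(feats: torch.Tensor, hidden: int = 768) -> torch.Tensor:
    """Zero-pad a (batch, D) handcrafted-feature matrix and reshape it into
    a (batch, L, hidden) sequence of pseudo-embeddings; with D = 108 and
    hidden = 768 this gives L = 1."""
    batch, dim = feats.shape
    seq_len = -(-dim // hidden)                        # ceil(dim / hidden)
    padded = feats.new_zeros(batch, seq_len * hidden)  # same dtype/device
    padded[:, :dim] = feats
    return padded.view(batch, seq_len, hidden)
```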