Explaining Translationese: why are Neural Classifiers Better and what do they Learn?
Kwabena Amponsah-Kaakyire*1,2, Daria Pylypenko*1, Josef van Genabith1,2,
and Cristina España-Bonet2
1Saarland University, 2German Research Center for Artificial Intelligence (DFKI)
Saarland Informatics Campus, Saarbrücken, Germany
amponsahkaakyirek@gmail.com
daria.pylypenko@uni-saarland.de
{cristinae, Josef.Van_Genabith}@dfki.de
*Equal contribution.

Abstract
Recent work has shown that neural feature- and representation-learning, e.g. BERT, achieves superior performance over traditional manual feature-engineering based approaches, with e.g. SVMs, in translationese classification tasks. Previous research did not show (i) whether the difference is because of the features, the classifiers or both, and (ii) what the neural classifiers actually learn. To address (i), we carefully design experiments that swap features between BERT- and SVM-based classifiers. We show that an SVM fed with BERT representations performs at the level of the best BERT classifiers, while BERT learning and using handcrafted features performs at the level of an SVM using handcrafted features. This shows that the performance differences are due to the features. To address (ii), we use integrated gradients and find that (a) there is indication that information captured by hand-crafted features is only a subset of what BERT learns, and (b) part of BERT's top performance results are due to BERT learning topic differences and spurious correlations with translationese.
1 Introduction
Translationese is a descriptive (non-negative) cover term for the systematic differences between translated and originally authored text in the same language (Gellerstam, 1986). Some aspects of translationese, such as source interference (Toury, 1980; Teich, 2003), are language dependent; others are presumed universal, e.g. simplification, explicitation, and overadherence to target language linguistic norms (Volansky et al., 2015) in the products of translations. While translationese effects can be subtle, especially for professional human translation, corpus-based studies (Baker et al., 1993) and, in particular, machine-learning and classifier based studies (Rabinovich and Wintner, 2015; Volansky et al., 2015; Rubino et al., 2016; Pylypenko et al., 2021) clearly reveal the differences.
While research on translationese is important from a theoretical point of view (translation universals, specific interference), it also has a direct impact on machine translation research: Kurokawa et al. (2009), Stymne (2017), Toral et al. (2018), Zhang and Toral (2019), Freitag et al. (2019), Graham et al. (2020), and Riley et al. (2020), amongst others, show that translation direction in training and test data impacts results, that already translated test data are easier to translate than original data, that machine translation and post-editing result in translationese, and that mitigating translationese in MT output can improve results. Translationese also impacts cross-lingual applications, e.g. question answering and natural language inference (Singh et al., 2019; Clark et al., 2020; Artetxe et al., 2020).
In this paper we focus on machine-learning-classifier-based research on translationese. Here, typically a classifier is trained to distinguish between original and translated texts (in the same language). Until recently, most of this research (Baroni and Bernardini, 2005; Volansky et al., 2015; Rubino et al., 2016) used manually defined, often linguistically inspired, feature-engineering based sets of features, mostly with support vector machines (SVMs). Once a classifier is trained, feature importance and ranking methods are used to reason back to which aspects of the input are responsible for (i.e. explain) the classification (and whether this accords with linguistic theorisation). More recently, a small number of papers explored feature- and representation-learning neural network based approaches to translationese classification (Sominsky and Wintner, 2019). In a systematic study, Pylypenko et al. (2021) show that feature- and representation-learning deep neural network-based approaches (in particular BERT-based, but
also other neural approaches) to translationese classification substantially outperform handcrafted feature-engineering based approaches using SVMs. However, to date, two important questions remain:
(i) it is not clear whether the substantial performance differences are due to learned vs. handcrafted features, the classifiers (the SVM, the BERT classification head, or full BERT), or the combination of both, and (ii) what the neural feature and representation learning approaches actually learn and how that explains the superior classification.
The contributions of our paper are as follows:
1. we address (i) by carefully crossing features and classifiers: feeding BERT-based learned features to feature-engineering models (SVMs), feeding the BERT classification head with hand-crafted features, making BERT architectures learn handcrafted features, as well as feeding embeddings of handcrafted features into BERT. Our experiments show that SVMs using BERT-learned features perform on a par with our best BERT translationese classifiers, while BERT using handcrafted features only performs at the level of feature-engineering-based classifiers. This shows that it is the features, and not the classifiers, that lead to the substantial (up to 20% points accuracy absolute) difference in performance.
2. we present the first steps to address (ii) using integrated gradients, an attribution-based approach, on the BERT models trained in various settings. Based on striking similarities in attributions between BERT trained from scratch and BERT pretrained on handcrafted features and fine-tuned on text data, as well as comparable classification accuracies, we find evidence that the hand-crafted features do not bring any additional information over the set learnt by BERT. It is therefore likely that the hand-crafted features are a (possibly partial) subset of the features learnt by BERT. Inspecting the most attributed tokens, we present evidence of 'Clever Hans' behaviour: at least part of the high classification accuracy of BERT is due to names of places and countries, suggesting that part of the classification is topic- and not translationese-based. Moreover, some top features suggest that there may be some punctuation-based spurious correlation in the data.
2 Related Work
Combining learned and hand-crafted features. Kaas et al. (2020), Prakash and Tayyar Madabushi (2020), and Lim and Tayyar Madabushi (2020) combine BERT-based and manual features in order to improve accuracy. Kazameini et al. (2020), Ray and Garain (2020), and Zhang and Yamana (2020) concatenate BERT pooled output embeddings with handcrafted feature vectors for classification, often using an SVM, where the handcrafted feature vector might be further encoded by a neural network or used as is. Our work differs in that we do not combine features from both models but swap them in order to decide whether it is the features, the classifiers or the combination that explains the performance difference between neural and feature-engineering based models. Additionally, our approach allows us to examine whether or not representation learning learns features similar to hand-crafted features.
Explainability for the feature-engineering approach to translationese classification. To date, explainability in translationese research has mainly focused on quantifying handcrafted feature importance. Techniques include inspecting SVM feature weights (Avner et al., 2016; Pylypenko et al., 2021), correlation (Rubino et al., 2016), information gain (Ilisei et al., 2010), chi-square (Ilisei et al., 2010), decision trees or random forests (Rubino et al., 2016; Ilisei et al., 2010), ablating features and observing the change in accuracy (Baroni and Bernardini, 2005; Ilisei et al., 2010), and training separate classifiers on each individual feature (or feature set) and comparing accuracies (Volansky et al., 2015; Avner et al., 2016). For n-grams, the difference in frequencies between the original and translationese classes (Koppel and Ordan, 2011; van Halteren, 2008), and the contribution to the symmetrized Kullback-Leibler divergence between the classes (Kurokawa et al., 2009) have been used.
Explainability for the neural approach to translationese classification. To date, explainability methods for neural networks have not been widely explored for translationese classification. Pylypenko et al. (2021) quantify the extent to which handcrafted features can explain the variance in the predictions of neural models, such as BERT, LSTMs, and a simplified Transformer, by training per-feature linear regression models to output the predicted probabilities of the neural models and computing the $R^2$ measure. They find that most of the top features are either POS-perplexity-based or bag-of-POS features. However, their method treats the neural network as a black box, whereas we use a method that accesses the internals of the model.
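To make that probing setup concrete, the following is a minimal sketch of such a per-feature linear regression analysis (our own illustration, not the authors' code; all variable names are ours):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def feature_r2(feature_values: np.ndarray, neural_probs: np.ndarray) -> float:
    """Fit a linear regression from a single handcrafted feature to the
    neural model's predicted probabilities and return the R^2 score."""
    X = feature_values.reshape(-1, 1)             # (n_samples, 1)
    reg = LinearRegression().fit(X, neural_probs)
    return r2_score(neural_probs, reg.predict(X))
```

A high $R^2$ for a feature means that a linear function of that single feature closely tracks the neural model's output probability.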
Integrated Gradients (IG). In our work we use the Integrated Gradients method (Sundararajan et al., 2017) for explainability. This method provides attribution scores for the input with respect to a certain class. IG calculates the integral of gradients of the model F with respect to the input x (token embedding), along the path from a baseline x′ (in our case, the PAD token embedding) to the input x:
$$\mathrm{IntegratedGrads}_i(x) ::= (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial F\!\left(x' + \alpha \times (x - x')\right)}{\partial x_i}\, d\alpha \qquad (1)$$
The strength of the Integrated Gradients method is that it satisfies two fundamental axioms (Sensitivity and Implementation Invariance), while many other popular attribution methods, like Gradients (Simonyan et al., 2014), DeepLift (Shrikumar et al., 2017) and LRP (Bach et al., 2015), violate one or both of them. IG also satisfies the completeness axiom, that is, IG is comprehensive in accounting for attributions and does not just pick the top label (Sundararajan et al., 2017).
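As an illustration (not the paper's implementation), Eq. (1) can be approximated with a simple Riemann sum over interpolation steps; the sketch below assumes a model that maps input embeddings directly to class logits, with the PAD token embedding as baseline:

```python
import torch

def integrated_gradients(forward_fn, x, baseline, target_class, steps=50):
    """Riemann-sum approximation of Eq. (1).

    forward_fn: maps a batch of input embeddings to class logits.
    x, baseline: tensors of the same shape (baseline = PAD embedding).
    Returns per-dimension attributions with the same shape as x.
    """
    total_grad = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Point on the straight-line path from the baseline to the input.
        point = (baseline + alpha * (x - baseline)).detach().requires_grad_(True)
        score = forward_fn(point)[:, target_class].sum()
        total_grad += torch.autograd.grad(score, point)[0]
    return (x - baseline) * total_grad / steps
```

In practice, library implementations such as Captum's IntegratedGradients handle batching, baselines and convergence checks.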
3 Experimental Settings
3.1 Data
For our experiments, we use the monolingual German dataset in the Multilingual Parallel Direct Europarl (MPDE) corpus (Amponsah-Kaakyire et al., 2021). The set contains 42k paragraphs, with half of the texts German originals and the other half translations into German from Spanish (see statistics in Appendix A.1). We perform paragraph-level classification with an average length of 80 tokens per training sample.
We additionally use an in-domain Europarl-based heldout corpus of around 30k paragraphs for training language models and estimating n-gram quartile distributions. This corpus consists of original German texts only.
3.2 Base Setup
We compare the traditional SVM-based feature engineering approach, which has demonstrated high performance in previous translationese research, to the BERT model, known to be very successful for various NLP tasks, including classification. As base setup, we reproduce the models from Pylypenko et al. (2021) for the two architectures and a new baseline:
1. a linear SVM on 108-dimensional handcrafted feature vectors (with surface, lexical, unigram bag-of-PoS, language modelling and n-gram frequency distribution features; see Pylypenko et al. (2021) for the detailed list of features). [handcr.-features+SVM]
2. a linear classifier (the BERT classification head, a simple linear FFN, except for the difference in input dimension) trained on the 108-dimensional handcrafted feature vectors. [handcr.-features+LinearClassifier]
3. Google's off-the-shelf pretrained BERT-base model (12 layers, 768 hidden dimensions, 12 attention heads), which we fine-tune on the MPDE corpus for translationese classification. [pretrained-BERT-ft]
4. a BERT-base model with the same settings trained from scratch on MPDE for translationese classification. [fromScratch-BERT]
For 1, we estimate n-gram language models with SRILM (Stolcke, 2002) and do POS-tagging with SpaCy (https://spacy.io/). For 3, we use multilingual BERT (Devlin et al., 2019) (bert-base-multilingual-uncased) and fine-tune with the simpletransformers library (github.com/ThilinaRajapakse/simpletransformers). We use a batch size of 32, a learning rate of 4·10⁻⁵, and the Adam optimiser with epsilon 1·10⁻⁸.
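For concreteness, a minimal sketch of this fine-tuning setup with simpletransformers might look as follows (the DataFrame layout follows simpletransformers' convention, and the example rows are placeholders, not data from the paper):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data: "labels" 0 = original German, 1 = translated.
train_df = pd.DataFrame({
    "text": ["Beispieltext eines Originals ...", "Beispiel einer Übersetzung ..."],
    "labels": [0, 1],
})

model = ClassificationModel(
    "bert",
    "bert-base-multilingual-uncased",
    args={
        "train_batch_size": 32,  # batch size reported above
        "learning_rate": 4e-5,   # learning rate reported above
        "adam_epsilon": 1e-8,    # Adam epsilon reported above
    },
    use_cuda=False,  # set True if a GPU is available
)
model.train_model(train_df)
```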
To ensure fair and comprehensive treatment, we carefully explore many experiments and variations below: we exchange input features between BERT and SVM architectures by (i) feeding BERT-learned features into SVMs (Section 3.3) and hand-crafted features into the BERT classification head, and (ii-a) letting the full BERT architecture learn handcrafted feature vectors used by SVMs and (ii-b) feeding handcrafted feature vectors as embeddings into the BERT model (Section 3.4).
3.3 SVM Classifier with BERT Features
We train an SVM with linear kernel on the features learnt by the pretrained BERT model fine-tuned on the translationese classification task. We use the output of the BERT pooler, which selects the last layer [CLS] token vector, with linear projection and tanh activation, as our feature vector. We use:
1. BERT's 768-dim pooled vector output, [pretrained-BERT-ft+SVM]

2. a 108-dim PCA projection of this vector. [pretrained-BERT-ft+PCA108+SVM]

The PCA projection allows us to match the handcrafted feature vector dimensionality.
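A minimal sketch of this step (our illustration; it assumes the 768-dim pooler outputs have already been extracted into a feature matrix X with labels y):

```python
from typing import Optional

import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def fit_svm_on_bert_features(X: np.ndarray, y: np.ndarray,
                             pca_dims: Optional[int] = None) -> LinearSVC:
    """X: (n_samples, 768) BERT pooler outputs; y: 0/1 labels.
    If pca_dims is given (e.g. 108), project X down first to match the
    handcrafted feature dimensionality."""
    if pca_dims is not None:
        X = PCA(n_components=pca_dims).fit_transform(X)
    return LinearSVC().fit(X, y)
```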
3.4 BERT with Handcrafted Features
Apart from feeding hand-crafted feature vectors into a suitably adjusted BERT classification head [handcr.-features+LinearClassifier], we carefully design two strategies to force the full BERT architecture to use the handcrafted features.
Pretraining on handcrafted feature prediction. First, we train a BERT-base model from scratch on the MPDE dataset to predict the handcrafted features. This regression model [BERT-reg-full] takes unmasked text as input and predicts continuous values (the 108-dimension vectors representing the handcrafted features originally used in training the SVM). The complete feature vector is predicted at once, and the pretraining is done by minimizing the MSE loss between the predicted and the ground truth vectors. The weights of this model encode the information of the handcrafted features. With this pretrained model,
1. we freeze the weights, replace the regression head (the linear layer predicting the 108 features) with a linear classifier (a BERT classification head predicting the original or translationese label), and train the classifier on the MPDE data for translationese classification, [BERT-r2c-full-frozen] (r2c: regression-to-classification)

2. we do not freeze but fine-tune on MPDE for the translationese classification task. [BERT-r2c-full-ft]
The comparison between frozen and unfrozen weights is designed to provide us with insights into the importance of representation learning in BERT.
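The head swap itself is straightforward; a minimal PyTorch/transformers sketch (our illustration of the procedure, not the paper's code) is:

```python
import torch.nn as nn
from transformers import BertModel

class BertR2C(nn.Module):
    """Swap the 108-dim regression head of a feature-prediction BERT for a
    binary classification head; optionally freeze the encoder weights
    (the BERT-r2c-full-frozen vs. BERT-r2c-full-ft variants)."""
    def __init__(self, encoder: BertModel, freeze_encoder: bool = True):
        super().__init__()
        self.encoder = encoder
        if freeze_encoder:
            for p in self.encoder.parameters():
                p.requires_grad = False
        # Classification head replacing the 108-dim regression head.
        self.classifier = nn.Linear(encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask=None):
        pooled = self.encoder(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(pooled)
```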
We reproduce the same approach as above with a smaller BERT model with only 6 layers instead of 12 [BERT-reg-half]. Interestingly, according to the losses when training for predicting the handcrafted features, the smaller BERT-reg-half performs comparably to BERT-reg-full (0.0041136 vs 0.0041148 MSE). We then load the weights of the small 6-layer model into the embedding layer and the first 6 layers of a 12-layer non-pretrained BERT-base model and, similarly as before:
3. we freeze the loaded weights in the first 6 layers and train the remaining 6 layers and the classifier on the translationese classification task, [BERT-r2c-half-frozen]

4. we do not freeze but fine-tune on the translationese classification task, with randomly-initialised weights for the other 6 layers. [BERT-r2c-half-ft]
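Loading the half-model weights can be done layer by layer; the sketch below is our illustration (the checkpoint path is hypothetical, and it assumes both models share the same hidden size and vocabulary):

```python
from transformers import BertConfig, BertModel

# Load the trained 6-layer BERT-reg-half encoder (hypothetical path).
small = BertModel.from_pretrained("path/to/bert-reg-half")
# Randomly initialised 12-layer BERT-base.
big = BertModel(BertConfig(num_hidden_layers=12))

# Copy the embeddings and the first 6 encoder layers from the small model.
big.embeddings.load_state_dict(small.embeddings.state_dict())
for i in range(6):
    big.encoder.layer[i].load_state_dict(small.encoder.layer[i].state_dict())

# For the BERT-r2c-half-frozen variant, freeze what was just loaded.
for module in [big.embeddings, *big.encoder.layer[:6]]:
    for p in module.parameters():
        p.requires_grad = False
```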
Mapping handcrafted features to embeddings. Even though the very low MSE results indicate that both versions of BERT-reg are able to learn the handcrafted features well, using them in the form of frozen layers for translationese classification leads to low classification performance (Section 4). This could be attributed to the fact that, not being an end-to-end approach, information losses accumulate: first, even though the MSE is low in BERT-reg, we do not have exactly the same features; and second, the features are not used directly for classification, but are encoded again by the network. This motivates us to explore an alternative way of encoding handcrafted features in an end-to-end manner.
We convert the single vector of handcrafted features of dimension D (108 in our experiments) into a sequence of embeddings in BERT's layer format, that is, the length of the feature embedding sequence L times the dimension of the hidden states H (768), while preserving the information of the single vector (Figure 1).

Figure 1: Mapping handcrafted features to embeddings.

To do this, we consider a batch of
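Purely as an illustration, one simple information-preserving mapping consistent with the description above (not necessarily the paper's exact construction) is to zero-pad the D-dimensional vector and reshape it into an L×H sequence:

```python
import torch

def features_to_embeddings(feats: torch.Tensor, hidden: int = 768) -> torch.Tensor:
    """Zero-pad a (batch, D) handcrafted-feature matrix and reshape it into
    a (batch, L, hidden) sequence of pseudo-embeddings; with D = 108 and
    hidden = 768 this gives L = 1."""
    batch, dim = feats.shape
    seq_len = -(-dim // hidden)                        # ceil(dim / hidden)
    padded = feats.new_zeros(batch, seq_len * hidden)  # same dtype/device
    padded[:, :dim] = feats
    return padded.view(batch, seq_len, hidden)
```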