
On the Transformation of Latent Space in Fine-Tuned NLP Models
WARNING: This paper contains model outputs which may be disturbing to the reader
Nadir Durrani♢  Hassan Sajjad♣∗  Fahim Dalvi♢  Firoj Alam♢
♢Qatar Computing Research Institute, Hamad Bin Khalifa University, Qatar
♣Faculty of Computer Science, Dalhousie University, Canada
{ndurrani, faimaduddin, fialam}@hbku.edu.qa, hsajjad@dal.ca
Abstract
We study the evolution of latent space in fine-tuned NLP models. Different from the commonly used probing framework, we opt for an unsupervised method to analyze representations. More specifically, we discover latent concepts in the representational space using hierarchical clustering. We then use an alignment function to gauge the similarity between the latent space of a pre-trained model and its fine-tuned version. We use traditional linguistic concepts to facilitate our understanding and also study how the model space transforms towards task-specific information. We perform a thorough analysis, comparing pre-trained and fine-tuned models across three models and three downstream tasks. The notable findings of our work are: i) the latent space of the higher layers evolves towards task-specific concepts, ii) the lower layers retain the generic concepts acquired in the pre-trained model, iii) some concepts in the higher layers acquire polarity towards the output class, and iv) these concepts can be used for generating adversarial triggers.
1 Introduction
The revolution of deep learning models in NLP can be attributed to transfer learning from pre-trained language models. Contextualized representations learned within these models capture rich linguistic knowledge that can be leveraged towards novel tasks, e.g., classification of COVID-19 tweets (Alam et al., 2021; Valdes et al., 2021), disease prediction (Rasmy et al., 2020), or natural language understanding tasks such as SQuAD (Rajpurkar et al., 2016) and GLUE (Wang et al., 2018).
Despite their success, the opaqueness of deep neural networks remains a cause of concern and has spurred a new area of research to analyze these models. A large body of work has analyzed the knowledge learned within the representations of pre-trained models (Belinkov et al., 2017; Conneau et al., 2018; Liu et al., 2019; Tenney et al., 2019; Durrani et al., 2019; Rogers et al., 2020) and has shown the presence of core-linguistic knowledge in various parts of the network. Although transfer learning using pre-trained models has become ubiquitous, very few papers (Merchant et al., 2020; Mosbach et al., 2020; Durrani et al., 2021) have analyzed the representations of fine-tuned models. Given their widespread use, interpreting fine-tuned models and highlighting task-specific peculiarities is critical for their deployment in real-world scenarios, where it is important to ensure fairness and trust when applying AI solutions.

∗This work was carried out while the author was at QCRI.
In this paper, we focus on analyzing fine-tuned models and investigate: how does the latent space evolve in a fine-tuned model? Different from the commonly used probing framework of training a post-hoc classifier (Belinkov et al., 2017; Dalvi et al., 2019a), we opt for an unsupervised method to analyze the latent space of pre-trained models. More specifically, we cluster contextualized representations in high-dimensional space using hierarchical clustering and term these clusters the Encoded Concepts (Dalvi et al., 2022). We then analyze how these encoded concepts evolve as the models are fine-tuned towards a downstream task. Specifically, we target the following questions: i) how do the latent spaces compare between the base¹ and the fine-tuned models? ii) how does the presence of core-linguistic concepts change during transfer learning? and iii) how is the knowledge of downstream tasks structured in a fine-tuned model?
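To make the concept-discovery step concrete, the sketch below extracts contextualized token representations from one layer of a pre-trained model and groups them with agglomerative (hierarchical) clustering, treating each resulting cluster as an encoded concept. The model name, layer index, cluster count, and example sentences are illustrative assumptions, not the paper's exact experimental settings.

```python
# Minimal sketch: discover "encoded concepts" by hierarchically clustering
# contextualized token representations from one layer of a pre-trained model.
# Model, layer, and cluster count below are illustrative choices.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import AgglomerativeClustering

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

sentences = ["The movie was surprisingly good .", "Stocks fell sharply on Monday ."]
layer = 9         # layer whose latent space we inspect (illustrative)
n_clusters = 5    # tiny for this toy input; realistic runs use far more data/clusters

tokens, vectors = [], []
for sent in sentences:
    enc = tokenizer(sent, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer].squeeze(0)  # (seq_len, dim)
    for tok, vec in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), hidden):
        if tok not in ("[CLS]", "[SEP]"):
            tokens.append(tok)
            vectors.append(vec.numpy())

# Each cluster of token representations is treated as one latent encoded concept.
labels = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward").fit_predict(vectors)
concepts = {c: [t for t, l in zip(tokens, labels) if l == c] for c in sorted(set(labels))}
print(concepts)
```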
We use an alignment function (Sajjad et al., 2022) to compare the concepts encoded in the fine-tuned models with: i) the concepts encoded in their pre-trained base models, ii) the human-defined concepts (e.g., parts-of-speech tags or semantic properties), and iii) the labels of the downstream task towards which the model is fine-tuned.
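The following is a hedged sketch of such an alignment step: two concepts (token clusters) are considered aligned when their token overlap exceeds a threshold. The exact matching criterion of Sajjad et al. (2022) may differ; the Jaccard overlap and the 0.9 threshold here are illustrative assumptions, and `align_concepts` is a hypothetical helper name.

```python
# Hedged sketch of concept alignment: match clusters from two models when
# their token sets overlap sufficiently (Jaccard overlap, illustrative threshold).
from typing import Dict, List, Set, Tuple

def align_concepts(
    base_concepts: Dict[int, List[str]],
    tuned_concepts: Dict[int, List[str]],
    threshold: float = 0.9,
) -> List[Tuple[int, int]]:
    """Return (base_id, tuned_id) pairs whose token sets overlap >= threshold."""
    matches = []
    for b_id, b_tokens in base_concepts.items():
        b_set: Set[str] = set(b_tokens)
        for t_id, t_tokens in tuned_concepts.items():
            t_set = set(t_tokens)
            overlap = len(b_set & t_set) / max(len(b_set | t_set), 1)
            if overlap >= threshold:
                matches.append((b_id, t_id))
    return matches

# Toy usage: concept 3 of the fine-tuned model aligns with concept 0 of the base model.
base = {0: ["good", "great", "fine"], 1: ["Monday", "Tuesday"]}
tuned = {3: ["good", "great", "fine"], 7: ["fell", "rose"]}
print(align_concepts(base, tuned))  # -> [(0, 3)]
```

The same matching idea extends to human-defined concepts (e.g., tokens sharing a part-of-speech tag) or task labels, by treating each annotation group as a "concept" on one side of the comparison.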
¹We use "base" and "pre-trained" models interchangeably.