
On Text Style Transfer via Style Masked Language Models
Sharan Narasimhan, Pooja Shekar, Suvodip Dey, Maunendra Sankar Desarkar
Department of Computer Science and Engineering
Indian Institute of Technology Hyderabad, India
{sharan.n21, poojashekar15}@gmail.com
{cs19resch01003, maunendra}@iith.ac.in
Abstract
Text Style Transfer (TST) can be performed through approaches such as latent space disentanglement, cycle-consistency losses, prototype editing, etc. The prototype editing approach, which is known to be quite successful in TST, involves two key phases: a) masking of source-style-associated tokens and b) reconstruction of this source-style-masked sentence conditioned on the target style. We follow a similar transduction method, in which we transpose the more difficult direct source-to-target TST task into a simpler Style-Masked Language Model (SMLM) task, wherein, similar to BERT (Devlin et al., 2019a), the goal of our model is to reconstruct the source sentence from its style-masked version. We arrive at the SMLM mechanism naturally by formulating prototype editing/transduction methods in a probabilistic framework, where TST resolves into estimating a hypothetical parallel dataset from a partially observed parallel dataset, wherein each domain is assumed to share a common latent style-masked prior. To generate this style-masked prior, we use “Explainable Attention” as our choice of attribution for a more precise style-masking step, and also introduce a cost-effective and accurate “Attribution-Surplus” method of determining the positions of masks from any arbitrary attribution model in O(1) time. We empirically show that this non-generational approach suits the “content preserving” criterion of TST well, even for a complex style like Discourse Manipulation. Our model, the Style MLM, outperforms strong TST baselines and is on par with state-of-the-art TST models, which use complex architectures and orders of magnitude more parameters.
1 Introduction
Text Style Transfer (TST) can be thought of as a subset of the Controllable Language Generation Task (Hu et al., 2017) with tighter criteria. TST involves converting input text possessing some source style into an output possessing the target style, while preserving style-independent content and maintaining fluency. Style is usually defined by the class labels present in an annotated dataset, commonly considered to be Sentiment, Formality, Toxicity, etc. We consider the unsupervised setting where only non-parallel datasets are available, i.e. the style-transferred output is not available. Past work focuses on various common paradigms such as disentanglement (Hu et al., 2017; Shen et al., 2017), cycle-consistency losses (Yi et al., 2020; Luo et al., 2019; Dai et al., 2019; Liu et al., 2021), induction (Narasimhan et al., 2022; Shen et al., 2020), etc. We focus on a sentence-to-sentence “transduction” (or prototype editing) method, a solution that emerges naturally when following a probabilistic formulation consisting of a single transduction model with a latent prior given by a style-absent corpus. Our approach is also inspired by BERT’s MLM (Devlin et al., 2019a)*, modified to incorporate style information to enable TST. This notion can be seen from various viewpoints.
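To make the transduction view concrete, the following toy sketch (purely illustrative, not the paper's implementation) traces a source sentence through a style-masked prototype and a target-style-conditioned reconstruction; the hard-coded token sets and the fill_in rule are hypothetical stand-ins for a trained SMLM.

```python
# Illustrative only: a hand-crafted stand-in for the SMLM transduction pipeline.
STYLE_TOKENS = {"neg": {"bland", "terrible"}, "pos": {"amazing", "delicious"}}

def style_mask(tokens, source_style):
    """Replace source-style-bearing tokens with a [MASK] placeholder."""
    return ["[MASK]" if t in STYLE_TOKENS[source_style] else t for t in tokens]

def fill_in(masked_tokens, target_style):
    """Toy stand-in for the SMLM: fill each mask with a target-style token."""
    fillers = iter(sorted(STYLE_TOKENS[target_style]))
    return [next(fillers, "[MASK]") if t == "[MASK]" else t for t in masked_tokens]

source = "the pasta was bland and the service was terrible".split()
prototype = style_mask(source, source_style="neg")    # style-absent prototype
transferred = fill_in(prototype, target_style="pos")  # target-conditioned fill-in
print(" ".join(prototype))    # the pasta was [MASK] and the service was [MASK]
print(" ".join(transferred))  # the pasta was amazing and the service was delicious
```

Because only the masked positions change between source and output, style-independent content is carried over unchanged by construction.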
Another MLM. By making use of various text attribution models, it is possible to determine the fraction of tokens that contribute significantly to the style label. Masking out these tokens and training a bi-directional transformer encoder with self-attention on the task of reconstructing the original source-style sentence is intuitive and closely resembles BERT-based MLMs (Devlin et al., 2019a; Jin et al., 2020).
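A minimal sketch of this masking step is given below, assuming some per-token attribution scorer is available; attribute(), build_smlm_example() and the mask_frac parameter are hypothetical illustrations rather than the paper's actual procedure.

```python
# A minimal sketch, assuming per-token style-attribution scores from some classifier
# (e.g. attention- or gradient-based). attribute() is a hypothetical placeholder.
def attribute(tokens):
    lexicon = {"terrible": 0.9, "rude": 0.85, "bland": 0.8}
    return [lexicon.get(t, 0.05) for t in tokens]

def build_smlm_example(sentence, style_code, mask_frac=0.2, mask_token="[MASK]"):
    tokens = sentence.split()
    scores = attribute(tokens)
    k = max(1, round(mask_frac * len(tokens)))
    top = set(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    masked = [mask_token if i in top else t for i, t in enumerate(tokens)]
    # Encoder input: a style control code (the source style during training, the
    # target style at transfer time) prepended to the style-masked sentence;
    # the reconstruction target is the original sentence.
    return [style_code] + masked, tokens

inp, tgt = build_smlm_example("the waiter was rude and the soup was bland", "<neg>")
print(inp)  # ['<neg>', 'the', 'waiter', 'was', '[MASK]', 'and', 'the', 'soup', 'was', '[MASK]']
print(tgt)  # the original tokens, used as the reconstruction target
```

Trained on such pairs with the source-style code, the model performs standard masked-language-model reconstruction; swapping in a different control code at inference is what turns it into a rough TST model.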
Instead of random masking or perplexity-based masking (Jin et al., 2020), by incorporating style masking and concatenating control codes that inform the encoder about the required target style, the resultant MLM now resembles a rough TST model. This notion of "filling in only the style words" also does well to automatically ensure that
*A key difference from BERT: ours does not use the self-supervised approach of pre-training over large corpora and then fine-tuning for a downstream task.