
confidence of our teacher models in the predictions they generated. For this, we sampled 100 instances from each of our test sets and monitored the logit distributions of our teacher models. Specifically, for each sequence we calculated the average entropy of the token-level softmax distributions. Taking inspiration from unsupervised quality estimation of machine translation outputs through similar methods (Fomicheva et al., 2020), we hypothesised that the lower the entropy of our model, the more confident it would be in its predictions for a given sample. The intuition here was that if a model is confident about its prediction, its output distribution will be highly skewed and not resemble a uniform distribution (a near-uniform distribution would indicate indecisiveness in predicting the right token, and therefore the right sequence). Eventually, this could be used to gauge the quality of the pseudo-labels that our students were being trained on.
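Concretely, this quantity can be computed directly from the teacher's decoder logits. The sketch below is illustrative only; the tensor shapes, padding mask, and function name are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def sequence_confidence(logits: torch.Tensor, pad_mask: torch.Tensor) -> float:
    """Average entropy of the token-level softmax distributions of one sequence.

    logits:   (seq_len, vocab_size) teacher decoder logits for a generated sequence.
    pad_mask: (seq_len,) with 1.0 for real tokens and 0.0 for padding.
    Lower values indicate a more peaked (i.e., more confident) distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)                # (seq_len, vocab_size)
    token_entropy = -(log_probs.exp() * log_probs).sum(-1)   # (seq_len,)
    return (token_entropy * pad_mask).sum().item() / pad_mask.sum().item()
```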
2.4 Size Adaptation: Quantization
Quantization is a common way to reduce the computational time and memory consumption of neural networks (Wu et al., 2020). Here, a lower-bit representation of weights and activation functions is used to achieve a smaller memory footprint. In this work, we perform post-training quantization: after training the base model at full 32-bit floating-point precision (fp-32), we convert the weights and activations of the model to 8-bit integers (int-8). Note that during inference, we still preserve fp-32 precision for the input and output encoder-decoder distributions. In theory, this brings down the memory consumption of the model by nearly 4x, though we see an effective reduction of about 3x in practice. More details on the memory reductions achieved are given in Appendix A.4.
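As an illustration, one common way to realise post-training int-8 quantization in PyTorch is dynamic quantization of the linear layers, which keeps model inputs and outputs in fp-32. This is a sketch of the general technique under that assumption, not necessarily the exact procedure used here; the variable `model` is a placeholder for the fine-tuned fp-32 model.

```python
import torch

# Dynamic post-training quantization: Linear-layer weights are stored as int-8,
# activations are quantized on the fly, and inputs/outputs remain fp-32.
quantized_model = torch.quantization.quantize_dynamic(
    model,                # fine-tuned fp-32 seq2seq model (placeholder name)
    {torch.nn.Linear},    # layer types to convert to int-8
    dtype=torch.qint8,
)

# The quantized model can then be used as a drop-in replacement at inference time,
# e.g. quantized_model.generate(input_ids) for a Hugging Face-style model.
```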
3 Experimental Setup
3.1 Data
(a) Bribri and Wixarica: We use the training data (7K and 8K sentences, respectively) from Feldman and Coto-Solano (2020) and evaluate on test data from Mager et al. (2021).
(b) Gondi: We use 26K sentences from the data open-sourced by CGNET Swara (CGNET, 2019) and split it into training and test sets.²
(c) Mundari: We use a dataset of 10K sentences provided by the Indian Institute of Technology, Kharagpur³, and split it into training and test sets.
(d) Assamese, Odia, Punjabi and Gujarati: We use the training data from Ramesh et al. (2022) (with 0.14M, 1M, 2.4M and 3M sentences, respectively) and evaluate on test data from FLORES200 (Goyal et al., 2022) for Assamese and WAT2021 (Nakazawa et al., 2021) for the remaining languages. Additional details about datasets (sizes and splits) are given in Appendix A.1.

² To avoid any test-set leaks, we deduplicate the data by removing tuples (S_i, T_i), where S_i is the ith sentence in the source language and T_i is the ith sentence in the target language, that overlap between the train and the test set.
³ Data to be released soon.
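For the deduplication described in footnote 2, one straightforward reading is to drop every training pair whose (source, target) tuple also appears in the test set; the sketch below assumes that reading and uses placeholder names.

```python
def deduplicate(train_pairs, test_pairs):
    """Remove (S_i, T_i) tuples from the training data that also occur
    in the test set, to avoid test-set leakage."""
    test_set = set(test_pairs)
    return [pair for pair in train_pairs if pair not in test_set]

# train_pairs = deduplicate(train_pairs, test_pairs)  # lists of (source, target) strings
```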
3.2 Training Setup
Hyperparameters:
We use the transformer and mT5 as our model classes, as described previously in Section 2. The hyperparameters for our transformer model were optimized for fine-tuning on Odia, trained on 1M sentence pairs. For fine-tuning, we use the Adafactor optimizer (Shazeer and Stern, 2018) with a linearly decaying learning rate of 1e-3. Since training with smaller batches is known to be more effective for extremely low-resource language training (Atrio and Popescu-Belis, 2022), we tuned the training batch size for every language, varying it from 32 to 256 (with gradient accumulation of 2), though we did not see significant variation in performance from this tuning. As our stopping criterion, we fine-tuned all models for 60 epochs (which concluded with considerably overfit models) and then selected the checkpoint with the best validation BLEU (using only the 13a tokenizer, which mimics the mteval-v13a script from Moses) (Post, 2018).
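For reference, the sketch below shows how this setup could be wired together using the Hugging Face Adafactor implementation and sacreBLEU's 13a tokenization for checkpoint selection. The variable names (`model`, `num_steps`) and the specific linear-decay schedule are illustrative assumptions, not the authors' exact code.

```python
import sacrebleu
from transformers.optimization import Adafactor
from torch.optim.lr_scheduler import LinearLR

# Adafactor with an explicit learning rate of 1e-3, decayed linearly to zero.
optimizer = Adafactor(
    model.parameters(),     # model: the fine-tuned seq2seq model (assumed in scope)
    lr=1e-3,
    relative_step=False,    # required to use an explicit learning rate
    scale_parameter=False,
    warmup_init=False,
)
scheduler = LinearLR(optimizer, start_factor=1.0, end_factor=0.0,
                     total_iters=num_steps)  # num_steps: total training steps (assumed)

# Checkpoint selection: validation BLEU with the default 13a tokenizer.
def validation_bleu(hypotheses, references):
    return sacrebleu.corpus_bleu(hypotheses, [references], tokenize="13a").score
```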
We use the SentencePiece tokenizer to build tokenizers for training the baselines for each of the languages (Kudo and Richardson, 2018). We use the per-token cross-entropy loss for fine-tuning all our models. Following Xu et al. (2021), we opt for a relatively small vocabulary size with the intent of learning more meaningful subword representations for our extremely low-resource languages. Specifically, we use a vocabulary size of 8K for Gondi, Mundari, Bribri and Wixarica, compared to the 32K used for Assamese, Odia, Punjabi and Gujarati.
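For example, a SentencePiece model with the smaller 8K vocabulary could be trained as follows; the file paths and all options other than the vocabulary size are placeholders.

```python
import sentencepiece as spm

# Train an 8K-vocabulary subword model for an extremely low-resource language
# (e.g., Gondi); the 32K setting is used analogously for the larger languages.
spm.SentencePieceTrainer.train(
    input="gondi_train.txt",      # placeholder path to the training corpus
    model_prefix="spm_gondi_8k",
    vocab_size=8000,
    character_coverage=1.0,
)

tokenizer = spm.SentencePieceProcessor(model_file="spm_gondi_8k.model")
pieces = tokenizer.encode("example sentence", out_type=str)
```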
Experimental Setup for Distillation: For Mundari and Gondi, we utilize 500K Hindi sentences sampled from the Samanantar corpus (Ramesh et al., 2022); we use the corresponding English corpus to sample English sentences for