with step-by-step scripts, such as autoregressive translation (Vaswani et al., 2017), text summarization (Lewis et al., 2020), text classification (Wang et al., 2018), and extractive question answering (Rajpurkar et al., 2016). As for large-scale pretraining, PARAGEN supports BERT pretraining (Devlin et al., 2019) and multilingual translation with mBART pretraining (Liu et al., 2020). PARAGEN is now deployed to support various research and industrial applications at ByteDance.
2 Architecture Design
The overall architecture of PARAGEN is shown in Figure 1. PARAGEN consists of four main functional blocks: data, model, trainer, and evaluator. The data block focuses on data input, processing, sampling, and loading; the model block consists of the neural models used in training and inference; the trainer schedules the training process; the evaluator defines the evaluation metrics. Compared with previous frameworks, we offer 13 types of plug-ins across these blocks, which makes PARAGEN more extensible for experimenting with new ideas.
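To make the roles of the four blocks concrete, the following self-contained PyTorch sketch mirrors them on a toy classification task. It is a conceptual analogue only; none of the class or function names below is PARAGEN's actual API.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

# Conceptual analogue of PARAGEN's four blocks on a toy task (not PARAGEN code).

# -- data block: reading, processing, sampling, and loading --
class ToyDataset(Dataset):
    def __init__(self, n=256):
        self.x = torch.randn(n, 8)
        self.y = (self.x.sum(dim=1) > 0).long()
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True)

# -- model block: the neural model used in training and inference --
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# -- trainer block: schedules the training process --
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# -- evaluator block: defines the evaluation metric (here, accuracy) --
with torch.no_grad():
    correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in loader)
print("accuracy:", correct / len(loader.dataset))
```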
2.1 Data
We design the data organization block around four base concepts, namely reading, preprocessing, sampling, and loading, which derive four customizable classes or functions respectively: Dataset, Data Processing, Sampler, and DataLoader. We describe PARAGEN's data processing paradigm through two key topics: online-offline data processing and the unlimited data loading challenge.
Dataset
The Dataset instances read data and organize it into dict-format objects, regardless of the storage format on disk. Users can develop their own Dataset class for customization by implementing the load and callback functions. Currently, PARAGEN supports data stored in various formats, including raw texts, parallel texts, and JSON files. The Datasets, as well as other classes in PARAGEN, work with an underlying io module to suit different file systems, reading and writing data on a local disk or a Hadoop file system. It is worth noting that the io module is also modularized and extensible to suit data input/output under more scenarios. Besides, we also develop StreamingDataset, which reads data in a streaming way. The StreamingDataset can read extremely large-scale data with constant memory consumption, making it suitable for industrial usage.
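As a rough illustration of the difference between in-memory and streaming reading, the sketch below implements both patterns for a JSON-lines file in plain Python. The class names and structure are illustrative and do not reproduce PARAGEN's Dataset or StreamingDataset interfaces (in particular, the load and callback hooks are not modeled).

```python
import json

# Illustrative only: minimal JSON-lines readers in the spirit of Dataset and
# StreamingDataset. The class names and methods are NOT PARAGEN's API.

class InMemoryJsonDataset:
    """Loads the whole file once and keeps dict-format samples in memory."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]          # each sample is a dict

class StreamingJsonDataset:
    """Yields one dict at a time, so memory stays constant for huge files."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

# Usage: iterate over an arbitrarily large file without loading it into memory.
# for sample in StreamingJsonDataset("train.jsonl"):
#     ...
```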
Data Processing
Data preprocessing, such as Byte-Pair Encoding (Sennrich et al., 2016), is critical to sequence generation and varies from task to task. To enhance task-specific data preprocessing, PARAGEN provides interfaces within the Task class to allow customization. The data processing is roughly divided into two categories: offline data processing, as data_collate_fn, and online data processing, as collate_fn. The data_collate_fn performs offline data processing and runs before the training/inference stage starts, taking input from Dataset. Thus data processed by data_collate_fn remains unchanged during the training/inference process, which speeds up training and inference by eliminating repeated data processing. The collate_fn is designed as online processing to enhance flexibility and to allow users to adjust data processing strategies, such as batching, during training and inference. We believe the combination of offline and online data processing makes data processing more flexible and extensible.
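The following standalone PyTorch sketch illustrates the offline/online split. The two function names follow the terminology above, but the way they would be registered inside a PARAGEN Task is omitted, and the toy vocabulary exists only for demonstration.

```python
import torch
from torch.utils.data import DataLoader

# Standalone analogue of offline vs. online processing (not PARAGEN's Task wiring).

vocab = {"<pad>": 0, "hello": 1, "world": 2, "paragen": 3}

def data_collate_fn(sample):
    # Offline step: runs once per sample before training starts, so the
    # tokenized result can be reused unchanged across all epochs.
    return [vocab.get(tok, 0) for tok in sample["text"].split()]

def collate_fn(batch):
    # Online step: runs on every batch during training, e.g. dynamic padding
    # to the longest sequence in the current batch.
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [vocab["<pad>"]] * (max_len - len(ids)) for ids in batch]
    return torch.tensor(padded)

raw = [{"text": "hello world"}, {"text": "paragen"}]
offline = [data_collate_fn(s) for s in raw]             # processed once, cached
loader = DataLoader(offline, batch_size=2, collate_fn=collate_fn)
for batch in loader:
    print(batch.shape)                                   # torch.Size([2, 2])
```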
Sampler
The sampling strategy is a non-negligible part of online data processing. Although PyTorch provides a base class for sampling strategies, it is still often ignored by existing generation frameworks. PARAGEN allows users to develop their own sampling strategies by implementing a Sampler instance that decides how data are organized into batches. A technical challenge of incorporating customizable sampling strategies is their compatibility with the feature of unlimited data loading. We solve this problem in the DataLoader with a cache mechanism.
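As an analogue of such a customized sampling strategy, the sketch below implements a length-grouped batch sampler on top of PyTorch's own Sampler base class. It illustrates the idea of deciding batch composition inside a Sampler, but it is not PARAGEN's Sampler interface and does not model the streaming-data cache mechanism.

```python
from torch.utils.data import Sampler

# Illustrative sampling strategy: batch together samples of similar length to
# reduce padding. Built on PyTorch's Sampler base class, not PARAGEN's Sampler.

class LengthGroupedBatchSampler(Sampler):
    def __init__(self, lengths, batch_size):
        self.lengths = lengths          # length of each sample in the dataset
        self.batch_size = batch_size
    def __iter__(self):
        # Sort indices by sample length, then cut into consecutive batches.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        for start in range(0, len(order), self.batch_size):
            yield order[start:start + self.batch_size]
    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

# Usage with a standard DataLoader (batch_sampler consumes lists of indices):
# loader = DataLoader(dataset, batch_sampler=LengthGroupedBatchSampler(lengths, 32))
```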
DataLoader
The DataLoader is the final stage of data processing and the beginning of neural model processing, acting as a bridge connecting data and neural models. It can also be viewed as a coordinator of data processing. It first fetches a batch of samples from the memory of offline-processed data, according to the sampling strategy determined by Sampler. It then passes the data batch to the online data processing, which becomes a private object of the DataLoader instance at initialization, and obtains a batch to feed the neural network. However, the original PyTorch DataLoader is incompatible with streaming data