PARAGEN: A Parallel Generation Toolkit
Jiangtao Feng*1, Yi Zhou2, Jun Zhang*1, Xian Qian2, Liwei Wu2, Zhexi Zhang2,
Yanming Liu*3, Mingxuan Wang2, Lei Li*4, Hao Zhou*5
1Shanghai AI Laboratory
2ByteDance Inc.
3Shanghai Jiao Tong University
4University of California, Santa Barbara
5Institute for AI Industry Research, Tsinghua University
Abstract
PARAGEN is a PyTorch-based NLP toolkit for further development on parallel generation. PARAGEN provides thirteen types of customizable plugins, helping users to experiment quickly with novel ideas across model architectures, optimization, and learning strategies. We implement various features, such as unlimited data loading and automatic model selection, to enhance its industrial usage. PARAGEN is now deployed to support various research and industry applications at ByteDance. PARAGEN is available at https://github.com/bytedance/ParaGen.
1 Introduction
Recently, neural sequence generation models have achieved great success (Vaswani et al., 2017; Lewis et al., 2020; Liu et al., 2020). Among the surge of sequence generation algorithms, parallel generation, or non-autoregressive generation, methods have gained increasing attention on various tasks for their high inference speed (Gu et al., 2018; Saharia et al., 2020; Qian et al., 2021a; Gu and Kong, 2021; Huang et al., 2022) and competitive performance against the autoregressive Transformer (Gu et al., 2019; Chan et al., 2020; Qian et al., 2021b). Apart from natural language processing, parallel generation also demonstrates its superiority and scalability on text-to-speech synthesis (Ren et al., 2021) and high-resolution image synthesis (Chang et al., 2022).

Several toolkits for developing sequence generation algorithms have been presented, such as FairSeq (Ott et al., 2019), Tensor2Tensor (Vaswani et al., 2018), Transformers (Wolf et al., 2020), and OpenNMT (Klein et al., 2017). These toolkits are mostly built around autoregressive transformers trained with maximum likelihood estimation and are used mainly for research purposes.
*Work was done at ByteDance.
[Figure 1: block diagram showing the data retrieval components (Dataset for data I/O, Sampler for data sampling, DataLoader, Task.collate for data preparation), the neural network modules (Model architecture, Criterion loss function, Generator generation architecture, Search algorithm), the process schedulers (Trainer with Optimizer for training/optimization, Evaluator with Metric for inference/measurement), and the Task coordinating them.]
Figure 1: Overall framework of PARAGEN. The data retrieval block is colored green; the neural network modules are colored yellow; the process scheduling block is colored blue; and the task, which dominates all the processes, is colored red. All the white blocks here are customizable classes in PARAGEN.
In this paper, we present PARAGEN, an extensible toolkit for parallel generation, which was first developed with the Glancing Transformer for the WMT-21 competition (Qian et al., 2021b). We redesign the code architecture for easy modification of training and decoding methods, such as glancing training (Qian et al., 2021a), imitation learning (Wei et al., 2019), and inference algorithms such as noisy parallel decoding (Gu et al., 2018) and mask-predict decoding (Ghazvininejad et al., 2019), which are critical to the development of parallel generation algorithms. Besides, PARAGEN also suits industrial usage with robust implementations and attractive features, such as unlimited data loading, asynchronous input/output, plug-in Huggingface tokenizers/models (Wolf et al., 2020), and fast training/inference with LightSeq (Wang et al., 2021, 2022). Apart from parallel generation, PARAGEN also reproduces typical tasks
with step-by-step scripts, such as autoregressive translation (Vaswani et al., 2017), text summarization (Lewis et al., 2020), text classification (Wang et al., 2018), and extractive question answering (Rajpurkar et al., 2016). As for large-scale pretraining, PARAGEN supports BERT pretraining (Devlin et al., 2019) and multilingual translation with mBART pretraining (Liu et al., 2020). PARAGEN is now deployed to support various research and industrial applications at ByteDance.
2 Architecture Design
The overall architecture of PARAGEN is shown in Figure 1. PARAGEN consists of four main functional blocks: data, model, trainer, and evaluator. The data block focuses on data input, processing, sampling, and loading; the model block consists of the neural models used in training and inference; the trainer schedules the training process; and the evaluator defines the evaluation metrics. Compared with previous frameworks, we offer 13 types of plug-ins across these blocks, which makes PARAGEN more extensible for experimenting with new ideas.
2.1 Data
We design the data organization block around four base concepts, namely reading, preprocessing, sampling strategy, and loading, which derive four customizable classes or functions respectively, i.e., Dataset, data processing, Sampler, and DataLoader. We describe PARAGEN's data processing paradigm in terms of two key topics: online-offline data processing and the unlimited data loading challenge.
Dataset
The Dataset instances read data and organize it into dict-format objects, regardless of the storage format on disk. Users can develop their own Dataset class for customized usage by implementing the load and callback functions. Currently, PARAGEN supports data stored in various formats, including raw texts, parallel texts, and JSON files. The Datasets, as well as the other classes in PARAGEN, co-work with an underlying io module to suit different file systems, reading and writing data on a local disk or a Hadoop file system. It is worth noting that the io module is also modularized and extensible to support data input/output in more scenarios. Besides, we also develop StreamingDataset, which reads data in a streaming way. The StreamingDataset can read extremely large-scale data with constant memory consumption, making it suitable for industrial usage.
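To make the load/callback interface concrete, here is a minimal, self-contained sketch of a dataset that reads a JSON-lines file. It is an illustration only: it does not subclass PARAGEN's actual Dataset base class, and the class name, file format, and field names (source, target, src, tgt) are assumptions chosen for the example.

```python
# Illustrative sketch only: mirrors the load/callback style described above,
# without PARAGEN's real base class or registration machinery.
import json


class JsonlPairDataset:
    """Reads source/target pairs from a JSON-lines file into dict samples."""

    def __init__(self, path):
        self._path = path
        self._data = []

    def load(self):
        # Read every line and normalize it into a dict-format sample.
        with open(self._path, encoding="utf-8") as fin:
            for line in fin:
                self._data.append(self.callback(json.loads(line)))

    def callback(self, sample):
        # Per-sample hook: rename raw fields into the keys the task expects.
        return {"src": sample["source"], "tgt": sample["target"]}

    def __getitem__(self, idx):
        return self._data[idx]

    def __len__(self):
        return len(self._data)
```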
Data Processing
Data preprocessing, such as Byte-Pair Encoding (Sennrich et al., 2016), is critical to sequence generation and varies from task to task. To enable task-specific data preprocessing, PARAGEN provides interfaces within the Task class to allow customization. The data processing is roughly divided into two categories: offline data processing as data_collate_fn and online data processing as collate_fn. The data_collate_fn refers to offline data processing and runs before the training/inference stage starts, with input from Dataset. Thus data processed by data_collate_fn remains unchanged during the training/inference process, which speeds up training and inference by eliminating repeated data processing. The collate_fn is designed as online processing to enhance flexibility and to allow users to adjust data processing strategies, such as batching, during training and inference. We believe the combination of offline and online data processing makes data processing more flexible and extensible.
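The division of labor between the two hooks can be sketched as follows. The method names data_collate_fn and collate_fn follow the description above, but the surrounding class, the tokenizer callable, and the padding scheme are assumptions for illustration rather than PARAGEN's exact Task API.

```python
# Illustrative sketch of offline vs. online data processing; the method names
# follow the description above, but the surrounding class and tokenizer are
# assumptions made for the example.
import torch


class TranslationTaskSketch:
    def __init__(self, tokenizer, pad_id=0):
        self._tokenizer = tokenizer  # any callable mapping str -> list[int]
        self._pad_id = pad_id

    def data_collate_fn(self, sample):
        # Offline processing: runs once before training/inference starts, so
        # expensive work such as tokenization is never repeated per epoch.
        return {
            "src": self._tokenizer(sample["src"]),
            "tgt": self._tokenizer(sample["tgt"]),
        }

    def collate_fn(self, samples):
        # Online processing: runs on every batch, so batching strategies
        # (here, right-padding to the longest sequence) remain adjustable.
        def pad(seqs):
            max_len = max(len(s) for s in seqs)
            return torch.tensor(
                [s + [self._pad_id] * (max_len - len(s)) for s in seqs]
            )

        return {
            "src": pad([s["src"] for s in samples]),
            "tgt": pad([s["tgt"] for s in samples]),
        }
```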
Sampler
The sampling strategy is a non-negligible algorithm in online data processing. Although PyTorch provides a base class for sampling strategies, it is still often ignored by existing generation frameworks. PARAGEN allows users to develop their own sampling strategies by implementing a Sampler instance that decides how data are organized into batches. A technical challenge of incorporating customizable sampling strategies is their compatibility with the feature of unlimited data loading. We solve this problem in the DataLoader with a cache mechanism.
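As an illustration of a custom sampling strategy, the sketch below groups samples of similar length into batches under a token budget, a common choice in sequence generation. The class name and constructor arguments are assumptions and do not reflect PARAGEN's actual Sampler interface.

```python
# Illustrative sketch of a length-bucketing sampler that yields batches of
# sample indices under a token budget; the interface is assumed, not PARAGEN's.
class TokenBudgetSampler:
    def __init__(self, lengths, max_tokens=4096):
        self._lengths = lengths        # length of each offline-processed sample
        self._max_tokens = max_tokens  # upper bound on padded tokens per batch

    def __iter__(self):
        # Sort indices by length so each batch wastes little padding.
        order = sorted(range(len(self._lengths)), key=self._lengths.__getitem__)
        batch, longest = [], 0
        for idx in order:
            longest = max(longest, self._lengths[idx])
            if batch and longest * (len(batch) + 1) > self._max_tokens:
                yield batch
                batch, longest = [], self._lengths[idx]
            batch.append(idx)
        if batch:
            yield batch
```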
DataLoader
DataLoader is the final stage of data processing and the beginning of neural model processing, acting as a bridge that connects data and neural models. It can also be viewed as a coordinator of data processing. It first fetches a batch of samples, according to the sampling strategy determined by Sampler, from the data memory holding offline-processed data. Then it sends the data batches to online data processing, which becomes a private object of the DataLoader instance at initialization, and gets a batch to feed the neural network. However, the original PyTorch DataLoader is incompatible with streaming data, which PARAGEN addresses with the cache mechanism mentioned above.
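The coordinating role described above can be sketched as a short loop: draw index batches from the sampler, gather offline-processed samples, and apply the online collate_fn bound at initialization. This is only an illustration of the data flow under the assumptions of the previous sketches, not PARAGEN's actual DataLoader implementation.

```python
# Illustrative sketch of the DataLoader's coordinating role: draw index batches
# from the sampler, gather offline-processed samples, and apply the online
# collate_fn bound at initialization. Not PARAGEN's actual implementation.
class SimpleDataLoaderSketch:
    def __init__(self, data, sampler, collate_fn):
        self._data = data              # offline-processed samples
        self._sampler = sampler        # yields lists of sample indices
        self._collate_fn = collate_fn  # online processing, a private object

    def __iter__(self):
        for indices in self._sampler:
            samples = [self._data[i] for i in indices]
            yield self._collate_fn(samples)
```

Under those assumptions, iterating over SimpleDataLoaderSketch(samples, TokenBudgetSampler(lengths), task.collate_fn) yields padded batches ready to feed the neural network.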