with step-by-step scripts, such as autoregressive translation (Vaswani et al., 2017), text summarization (Lewis et al., 2020), text classification (Wang et al., 2018), and extractive question answering (Rajpurkar et al., 2016). As for large-scale pretraining, PARAGEN supports BERT pretraining (Devlin et al., 2019) and multilingual translation with mBART pretraining (Liu et al., 2020). PARAGEN is now deployed to support various research and industrial applications at ByteDance.
2 Architecture Design
The overall architecture of PARAGEN is shown in Figure 1. PARAGEN consists of four main functional blocks: data, model, trainer, and evaluator. The data block focuses on data input, processing, sampling, and loading; the model block consists of the neural models used in training and inference; the trainer schedules the training process; the evaluator defines the evaluation metrics. Compared with previous frameworks, we offer 13 types of plug-ins across these blocks, which makes PARAGEN more extensible for experimenting with new ideas.
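To make the roles of the four blocks concrete, the following self-contained PyTorch sketch mirrors them on a toy classification task. It is a conceptual analogue only; none of the class or function names below is PARAGEN's actual API.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

# Conceptual analogue of PARAGEN's four blocks on a toy task (not PARAGEN code).

# -- data block: reading, processing, sampling, and loading --
class ToyDataset(Dataset):
    def __init__(self, n=256):
        self.x = torch.randn(n, 8)
        self.y = (self.x.sum(dim=1) > 0).long()
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

loader = DataLoader(ToyDataset(), batch_size=32, shuffle=True)

# -- model block: the neural model used in training and inference --
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

# -- trainer block: schedules the training process --
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# -- evaluator block: defines the evaluation metric (here, accuracy) --
with torch.no_grad():
    correct = sum((model(x).argmax(dim=1) == y).sum().item() for x, y in loader)
print("accuracy:", correct / len(loader.dataset))
```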
2.1 Data
We design the data organization block around four base concepts, namely reading, preprocessing, sampling, and loading, which derive four customizable classes or functions respectively: Dataset, Data Processing, Sampler, and DataLoader. We describe PARAGEN's data processing paradigm through two key topics: online-offline data processing and the unlimited data loading challenge.
Dataset
The Dataset instances read data and organize it into dict-format objects, regardless of the storage format on disk. Users can develop their own Dataset class for customization by implementing the load and callback functions. Currently, PARAGEN supports data stored in various formats, including raw texts, parallel texts, and JSON files. The Datasets, as well as other classes in PARAGEN, work with an underlying io module to suit different file systems, reading and writing data on a local disk or a Hadoop file system. It is worth noting that the io module is also modularized and extensible to suit data input/output under more scenarios. Besides, we also develop StreamingDataset, which reads data in a streaming way. The StreamingDataset can read extremely large-scale data with constant memory consumption, making it suitable for industrial usage.
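As a rough illustration of the difference between in-memory and streaming reading, the sketch below implements both patterns for a JSON-lines file in plain Python. The class names and structure are illustrative and do not reproduce PARAGEN's Dataset or StreamingDataset interfaces (in particular, the load and callback hooks are not modeled).

```python
import json

# Illustrative only: minimal JSON-lines readers in the spirit of Dataset and
# StreamingDataset. The class names and methods are NOT PARAGEN's API.

class InMemoryJsonDataset:
    """Loads the whole file once and keeps dict-format samples in memory."""
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f]
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]          # each sample is a dict

class StreamingJsonDataset:
    """Yields one dict at a time, so memory stays constant for huge files."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

# Usage: iterate over an arbitrarily large file without loading it into memory.
# for sample in StreamingJsonDataset("train.jsonl"):
#     ...
```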
Data Processing
Data preprocessing, such as Byte-Pair Encoding (Sennrich et al., 2016), is critical to sequence generation and varies from task to task. To enhance task-specific data preprocessing, PARAGEN provides interfaces within the Task class to allow customization. The data processing is roughly divided into two categories: offline data processing, as data_collate_fn, and online data processing, as collate_fn. The data_collate_fn performs offline data processing and runs before the training/inference stage starts, taking input from Dataset. Thus data processed by data_collate_fn remains unchanged during the training/inference process, which speeds up training and inference by eliminating repeated data processing. The collate_fn is designed as online processing to enhance flexibility and to allow users to adjust data processing strategies, such as batching, during training and inference. We believe the combination of offline and online data processing makes data processing more flexible and extensible.
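The following standalone PyTorch sketch illustrates the offline/online split. The two function names follow the terminology above, but the way they would be registered inside a PARAGEN Task is omitted, and the toy vocabulary exists only for demonstration.

```python
import torch
from torch.utils.data import DataLoader

# Standalone analogue of offline vs. online processing (not PARAGEN's Task wiring).

vocab = {"<pad>": 0, "hello": 1, "world": 2, "paragen": 3}

def data_collate_fn(sample):
    # Offline step: runs once per sample before training starts, so the
    # tokenized result can be reused unchanged across all epochs.
    return [vocab.get(tok, 0) for tok in sample["text"].split()]

def collate_fn(batch):
    # Online step: runs on every batch during training, e.g. dynamic padding
    # to the longest sequence in the current batch.
    max_len = max(len(ids) for ids in batch)
    padded = [ids + [vocab["<pad>"]] * (max_len - len(ids)) for ids in batch]
    return torch.tensor(padded)

raw = [{"text": "hello world"}, {"text": "paragen"}]
offline = [data_collate_fn(s) for s in raw]             # processed once, cached
loader = DataLoader(offline, batch_size=2, collate_fn=collate_fn)
for batch in loader:
    print(batch.shape)                                   # torch.Size([2, 2])
```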
Sampler
The sampling strategy is a non-negligible part of online data processing. Although PyTorch provides a base class for sampling strategies, it is still often ignored by existing generation frameworks. PARAGEN allows users to develop their own sampling strategies by implementing a Sampler instance that decides how data are organized into batches. A technical challenge of incorporating customizable sampling strategies is their compatibility with the feature of unlimited data loading. We solve this problem in the DataLoader with a cache mechanism.
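As an analogue of such a customized sampling strategy, the sketch below implements a length-grouped batch sampler on top of PyTorch's own Sampler base class. It illustrates the idea of deciding batch composition inside a Sampler, but it is not PARAGEN's Sampler interface and does not model the streaming-data cache mechanism.

```python
from torch.utils.data import Sampler

# Illustrative sampling strategy: batch together samples of similar length to
# reduce padding. Built on PyTorch's Sampler base class, not PARAGEN's Sampler.

class LengthGroupedBatchSampler(Sampler):
    def __init__(self, lengths, batch_size):
        self.lengths = lengths          # length of each sample in the dataset
        self.batch_size = batch_size
    def __iter__(self):
        # Sort indices by sample length, then cut into consecutive batches.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        for start in range(0, len(order), self.batch_size):
            yield order[start:start + self.batch_size]
    def __len__(self):
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size

# Usage with a standard DataLoader (batch_sampler consumes lists of indices):
# loader = DataLoader(dataset, batch_sampler=LengthGroupedBatchSampler(lengths, 32))
```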
DataLoader
The DataLoader is the final stage of data processing and the beginning of neural model processing, acting as a bridge connecting data and neural models. It can also be viewed as a coordinator of data processing. It first fetches a batch of samples from the memory of offline-processed data, according to the sampling strategy determined by Sampler. It then passes the data batch to the online data processing, which becomes a private object of the DataLoader instance at initialization, and obtains a batch to feed the neural network. However, the original PyTorch DataLoader is incompatible with streaming data