Learning a Grammar Inducer from Massive Uncurated Instructional Videos

Songyang Zhang1, Linfeng Song2, Lifeng Jin2, Haitao Mi2, Kun Xu2, Dong Yu2 and Jiebo Luo1
1University of Rochester, Rochester, NY, USA
szhang83@ur.rochester.edu, jluo@cs.rochester.edu
2Tencent AI Lab, Bellevue, WA, USA
{lfsong,lifengjin,haitaomi,kxkunxu,dyu}@tencent.com
Abstract
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for the accompanying text. While previous work focuses on building systems for inducing grammars on text that is well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore, we build a new model that can better learn video-span correlation without the manually designed features adopted by previous work. Experiments show that our model, trained only on large-scale YouTube data with no text-video alignment, reports strong and robust performances across three unseen datasets, despite domain shift and noisy label issues. Furthermore, our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
1 Introduction
Grammar induction is a fundamental and long-lasting (Lari and Young, 1990; Clark, 2001; Klein and Manning, 2002) problem in computational linguistics, which aims to find hierarchical syntactic structures from plain sentences. Unlike supervised methods (Charniak, 2000; Collins, 2003; Petrov and Klein, 2007; Zhang and Clark, 2011; Cross and Huang, 2016; Kitaev and Klein, 2018) that require human-annotated treebanks, e.g., the Penn Treebank (Marcus et al., 1993), grammar inducers do not rely on any human annotations for training. Grammar induction is attractive since annotating syntactic trees by human language experts is expensive and time-consuming, while the current treebanks are limited to several major languages and domains.

*This work was done when Songyang Zhang was an intern at Tencent AI Lab.
Recently, deep learning models have achieved remarkable success across NLP tasks, and neural models have been designed (Shen et al., 2018b,a; Kim et al., 2019a,b; Jin et al., 2018) for grammar induction, which greatly advanced model performance on induction with raw text. Recent efforts have started to consider other useful information from multiple modalities, such as images (Shi et al., 2019; Jin and Schuler, 2020) and videos (Zhang et al., 2021). Specifically, Zhang et al. (2021) show that multi-modal information (e.g., motion, sound and objects) from videos can significantly improve the induction accuracy on verb and noun phrases. Such work uses curated multi-modal data publicly available on the web and assumes that the meaning of a sentence needs to be identical (e.g., being a caption) to that of the corresponding video or image. This assumption limits usable data to several small-scale benchmarks (Lin et al., 2014; Xu et al., 2016; Hendricks et al., 2017) with expensive human annotations on image/video captions.
The noisy correspondence between form and meaning is one of the main research questions in language acquisition (Akhtar and Montague, 1999; Gentner et al., 2001; Dominey and Dodane, 2004), where different proposals attempt to address this indeterminacy faced by children. There has been computational work incorporating such indeterminacy into its models (Yu and Siskind, 2013; Huang et al., 2021). For modeling empirical grammar learning with multi-modal inputs, two important questions still remain open: 1) how can a grammar inducer benefit from large-scale multi-media data (e.g., YouTube videos) with noisy text-to-video correspondence? and 2) how can a grammar inducer show robust performances across multiple domains and datasets? By using data with only weak cross-modal correspondence, such as YouTube videos and their automatically generated subtitles, we allow the computational models to face a similar indeterminacy problem, and examine how indeterminacy interacts with data size to influence the learning behavior and performance of the induction models.
In this paper, we conduct the first investigation on both questions. Specifically, we collect 2.4 million video clips and the corresponding subtitles from instructional YouTube videos (HowTo100M; Miech et al. 2019) to train multi-modal grammar inducers, instead of using the training data from a benchmark where text and video are in alignment. We then propose a novel model, named Pre-Trained Compound Probabilistic Context-Free Grammars (PTC-PCFG), that extends previous work (Shi et al., 2019; Zhang et al., 2021) by incorporating a video-span matching loss term into the Compound PCFG (Kim et al., 2019a) model.
To better capture the video-span correlation, it leverages MIL-NCE (Miech et al., 2020), a state-of-the-art model pre-trained on video-subtitle retrieval, as the encoders for both video and text. Compared with previous work (Zhang et al., 2021) that independently extracts features from each modality before merging them using a simple Transformer (Vaswani et al., 2017) encoder, the encoders of our model have been pre-trained to merge such multi-modal information, and no human effort is needed to select useful modalities from the full set.
Experiments on three benchmarks show that our model, which is trained on noisy YouTube video clips and no data from these benchmarks, produces substantial gains over the previous state-of-the-art system (Zhang et al., 2021) trained on in-domain video clips with human-annotated captions. Furthermore, our model demonstrates robust performances across all three datasets. Through analysis and discussion, we point out the limitations of our model and suggest future directions for improvement. Code will be released upon paper acceptance.
In summary, the main contributions are:

• We are the first to study training a grammar inducer with massive general-domain noisy video clips instead of benchmark data, introducing the indeterminacy problem to the induction model.

• We propose PTC-PCFG, a novel model for unsupervised grammar induction. It is simpler in design than previous models and can better capture the video-text matching information.

• Trained only on noisy YouTube videos without finetuning on benchmark data, PTC-PCFG reports stronger performances than previous models trained on benchmark data across three benchmarks.
2 Background and Motivation
2.1 Compound PCFGs
A PCFG model in Chomsky Normal Form can be defined as a tuple of six terms $(S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R}, \Pi)$, which correspond to the start symbol, the sets of non-terminals, pre-terminals, and terminals, the production rules, and their probabilities. Given pre-defined numbers of non-terminals and pre-terminals, a PCFG induction model tries to estimate the probabilities for all production rules.
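For concreteness, a minimal sketch of this tuple as a data structure; the field layout, names, and tensor shapes are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
import torch

@dataclass
class PCFG:
    """A PCFG in Chomsky Normal Form as the tuple (S, N, P, Sigma, R, Pi)."""
    start: str                     # S: the start symbol
    nonterminals: list[str]        # N: non-terminal symbols
    preterminals: list[str]        # P: pre-terminal symbols
    terminals: list[str]           # Sigma: the vocabulary
    # The rule set R is implicit in the shapes of the probability tables (Pi) below.
    root_probs: torch.Tensor       # |N|: probabilities of S -> A
    binary_probs: torch.Tensor     # |N| x (|N|+|P|) x (|N|+|P|): probabilities of A -> B C
    emit_probs: torch.Tensor       # |P| x |Sigma|: probabilities of T -> w
```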
The compound PCFG (C-PCFG) model (Kim et al., 2019a) adopts a mixture of PCFGs. Instead of a corpus-level prior used in previous work (Kurihara and Sato, 2006; Johnson et al., 2007; Wang and Blunsom, 2013; Jin et al., 2018), C-PCFG imposes a sentence-specific prior on the distribution of possible PCFGs. Specifically, in the generative story, the probability $\pi_r$ for production rule $r$ is estimated by a model $g$ that assigns a latent variable $z$ to each sentence $\sigma$, and $z$ is drawn from a prior distribution:

$$\pi_r = g(r, z; \theta), \quad z \sim p(z), \tag{1}$$

where $\theta$ represents the model parameters. The probabilities for all three types of CFG rules are defined as follows:

$$\pi_{S \to A} = \frac{\exp\big(u_A^\top f_s([w_S; z])\big)}{\sum_{A' \in \mathcal{N}} \exp\big(u_{A'}^\top f_s([w_S; z])\big)},$$
$$\pi_{A \to BC} = \frac{\exp\big(u_{BC}^\top [w_A; z]\big)}{\sum_{B', C' \in \mathcal{N} \cup \mathcal{P}} \exp\big(u_{B'C'}^\top [w_A; z]\big)},$$
$$\pi_{T \to w} = \frac{\exp\big(u_w^\top f_t([w_T; z])\big)}{\sum_{w' \in \Sigma} \exp\big(u_{w'}^\top f_t([w_T; z])\big)}, \tag{2}$$

where $A \in \mathcal{N}$, $B$ and $C \in \mathcal{N} \cup \mathcal{P}$, $T \in \mathcal{P}$, and $w \in \Sigma$. Both $w$ and $u$ are dense vectors representing words and all types of non-terminals, and $f_s$ and $f_t$ are neural encoding functions.
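As a concrete illustration, the following PyTorch sketch parameterizes these rule probabilities from the symbol embeddings and a sentence-level latent $z$. The MLP choices for $f_s$ and $f_t$, the tensor shapes, and the module name are assumptions for illustration rather than the released C-PCFG code; log-probabilities are returned for numerical stability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompoundRuleScorer(nn.Module):
    """Sketch of the rule probabilities in Eq. (2), conditioned on a sentence latent z."""

    def __init__(self, n_nt, n_pt, vocab_size, dim, z_dim):
        super().__init__()
        self.w_root = nn.Parameter(torch.randn(dim))            # w_S
        self.w_nt = nn.Parameter(torch.randn(n_nt, dim))        # w_A, A in N
        self.w_pt = nn.Parameter(torch.randn(n_pt, dim))        # w_T, T in P
        self.u_nt = nn.Parameter(torch.randn(n_nt, dim))        # u_A
        self.u_bc = nn.Parameter(torch.randn((n_nt + n_pt) ** 2, dim + z_dim))  # u_BC
        self.u_w = nn.Parameter(torch.randn(vocab_size, dim))   # u_w
        self.f_s = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.f_t = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):                                       # z: (z_dim,)
        # pi_{S -> A}: softmax over non-terminals of u_A^T f_s([w_S; z])
        root = F.log_softmax(self.u_nt @ self.f_s(torch.cat([self.w_root, z])), dim=0)
        # pi_{A -> B C}: for each A, softmax over all (B, C) pairs of u_BC^T [w_A; z]
        wa_z = torch.cat([self.w_nt, z.expand(self.w_nt.size(0), -1)], dim=-1)
        binary = F.log_softmax(wa_z @ self.u_bc.t(), dim=-1)    # (n_nt, (n_nt+n_pt)^2)
        # pi_{T -> w}: for each pre-terminal T, softmax over the vocabulary of u_w^T f_t([w_T; z])
        wt_z = torch.cat([self.w_pt, z.expand(self.w_pt.size(0), -1)], dim=-1)
        emit = F.log_softmax(self.f_t(wt_z) @ self.u_w.t(), dim=-1)  # (n_pt, vocab_size)
        return root, binary, emit
```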
Optimizing the C-PCFG model involves maximizing the marginal likelihood $p(\sigma)$ of each training sentence $\sigma$ for all possible $z$:

$$\log p_\theta(\sigma) = \log \int_z \sum_{t \in \mathcal{T}_G(\sigma)} p_\theta(t \mid z)\, p(z)\, dz, \tag{3}$$

where $\mathcal{T}_G(\sigma)$ indicates all possible parse trees for sentence $\sigma$. Since computing the integral over $z$ is intractable, this objective is optimized by maximizing its evidence lower bound $\mathrm{ELBO}(\sigma; \phi, \theta)$:

$$\mathrm{ELBO}(\sigma; \phi, \theta) = \mathbb{E}_{q_\phi(z \mid \sigma)}\big[\log p_\theta(\sigma \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid \sigma)\,\|\,p(z)\big], \tag{4}$$

where $q_\phi(z \mid \sigma)$ is the variational posterior calculated by another neural network with parameters $\phi$. Given a sampled $z$, the log-likelihood term $\log p_\theta(\sigma \mid z)$ is calculated via the inside algorithm. The KL term can be computed analytically when both the prior $p(z)$ and the variational posterior $q_\phi(z \mid \sigma)$ are Gaussian (Kingma and Welling, 2014).
2.2 Multi-Modal Compound PCFGs
Multi-Modal Compound PCFGs (MMC-PCFG) (Zhang et al., 2021) extends C-PCFG with a model to match a video $v$ with a span $c$ in a parse tree $t$ of a sentence $\sigma$. It extracts $M$ visual and audio features from a video $v$ and encodes them via a multi-modal transformer (Gabeur et al., 2020), denoted as $\Psi = \{\psi_i\}_{i=1}^{M}$. The word representation $h_i$ of the $i$-th word is computed by a BiLSTM. Given a particular span $c = w_i, \ldots, w_j$, its representation $\mathbf{c}$ is the weighted sum of all label-specific span representations:

$$\mathbf{c} = \sum_{k=1}^{|\mathcal{N}|} p(k \mid c, \sigma)\, f_k\!\left(\frac{1}{j - i + 1} \sum_{l=i}^{j} h_l\right), \tag{5}$$

where $\{p(k \mid c, \sigma) \mid 1 \le k \le |\mathcal{N}|\}$ are the phrasal label probabilities of span $c$. The representation of a span $c$ is then correspondingly projected to $M$ separate embeddings via gated embedding (Miech et al., 2018), denoted as $\Xi = \{\xi_i\}_{i=1}^{M}$. Finally, the video-text matching loss is defined as a sum over all video-span matching losses weighted by the marginal probability of a span from the parser:

$$s_{mm}(v, \sigma) = \sum_{c \in \sigma} p(c \mid \sigma)\, h_{mm}(\Xi, \Psi), \tag{6}$$
where $h_{mm}(\Xi, \Psi)$ is a hinge loss measuring the distances from video $v$ to the matched and unmatched (i.e., a span from another sentence) spans $c$ and $c'$, and the distances from span $c$ to the matched and unmatched (i.e., another video) videos $v$ and $v'$:

$$\omega_i(c) = \frac{\exp(u_i^\top \mathbf{c})}{\sum_{j=1}^{M} \exp(u_j^\top \mathbf{c})}, \tag{7}$$
$$o(\Xi, \Psi) = \sum_{i=1}^{M} \omega_i(c)\, \cos(\xi_i, \psi_i), \tag{8}$$
$$h_{mm}(\Xi, \Psi) = \mathbb{E}_{c'}\big[o(\Xi', \Psi) - o(\Xi, \Psi) + \epsilon\big]_+ + \mathbb{E}_{v'}\big[o(\Xi, \Psi') - o(\Xi, \Psi) + \epsilon\big]_+, \tag{9}$$

where $\Xi'$ is a set of unmatched span expert embeddings of $\Psi$, $\Psi'$ is a set of unmatched video representations of $\Xi$, $\epsilon$ is a positive margin, $[\cdot]_+ = \max(0, \cdot)$, $\{u_i\}_{i=1}^{M}$ are learned weights, and the expectations are approximated with one sample drawn from the training data. During training, both the ELBO and the video-text matching loss are jointly optimized.
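The following PyTorch sketch puts Eqs. (5)-(9) together, assuming the BiLSTM states, phrasal label probabilities, label-specific networks $f_k$, expert embeddings $\xi_i$ and $\psi_i$ (stacked as $M$ same-sized vectors), weight vectors $u_i$, sampled negatives, and a 0.2 margin are all given. It illustrates the structure of the loss rather than reproducing the MMC-PCFG implementation.

```python
import torch
import torch.nn.functional as F

def span_representation(h, i, j, label_probs, label_mlps):
    """c in Eq. (5): average the word states h_i..h_j, transform with each label-specific
    network f_k, and weight by the phrasal label probabilities p(k | c, sigma)."""
    mean_h = h[i:j + 1].mean(dim=0)
    return sum(p * f(mean_h) for p, f in zip(label_probs, label_mlps))

def match_score(c, xi, psi, u):
    """o(Xi, Psi) in Eq. (8): cosine similarities weighted by the expert weights of Eq. (7)."""
    weights = F.softmax(u @ c, dim=0)                     # omega_i(c), shape (M,)
    return (weights * F.cosine_similarity(xi, psi, dim=-1)).sum()

def hinge_loss(c, xi, c_neg, xi_neg, psi, psi_neg, u, margin=0.2):
    """h_mm in Eq. (9), with one sampled negative span and one sampled negative video."""
    pos = match_score(c, xi, psi, u)
    neg_span = match_score(c_neg, xi_neg, psi, u)         # span from another sentence
    neg_video = match_score(c, xi, psi_neg, u)            # another video
    return F.relu(neg_span - pos + margin) + F.relu(neg_video - pos + margin)

def video_text_loss(spans, psi, psi_neg, u, margin=0.2):
    """s_mm(v, sigma) in Eq. (6): hinge losses weighted by the parser's marginals p(c | sigma)."""
    total = 0.0
    for prob, c, xi, c_neg, xi_neg in spans:
        total = total + prob * hinge_loss(c, xi, c_neg, xi_neg, psi, psi_neg, u, margin)
    return total
```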
2.3 Limitation and Motivation
Existing work on multi-modal grammar induction aims at leveraging the strict correspondence between image/video and text for information about the syntactic categories and structures of the words and spans in the text. However, such datasets are expensive to annotate. Besides, the ambiguous correspondence between language and real-world context, observed in language acquisition, is not really reflected in such training setups.

As a result, we believe that the previous work fails to answer the following important questions: 1) how well a grammar inducer would perform when it is trained only on noisy multi-media data; and 2) how the scale of the training data would affect its performance and cross-domain robustness.
3 Training a Grammar Inducer with Massive YouTube Videos

We make the first investigation into the above questions by leveraging massive video clips from instructional YouTube videos to train our grammar inducer. Different from the benchmark data used by previous work, the YouTube video clips do not contain paired sentences. This section will first introduce the method for generating noisy training instances (video clip and sentence pairs) from YouTube videos (§3.1), before describing a novel grammar induction model (§3.2) with pre-trained text and video encoders.
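While the details of §3.1 follow later, the sketch below gives a rough picture of what such weakly aligned training pairs look like: each ASR subtitle segment is paired with the video clip spanning its timestamps. The field names and the length-filtering heuristic are hypothetical illustrations, not the exact procedure of §3.1.

```python
def make_noisy_pairs(subtitles, min_words=3, max_words=30):
    """Turn ASR subtitle segments into (clip_span, sentence) training pairs.

    `subtitles` is assumed to be a list of dicts such as
    {"start": 12.4, "end": 15.9, "text": "now we add the flour"}.
    The text and the clip only loosely correspond: ASR is noisy and the speech
    may describe something off-screen, which is exactly the indeterminacy we study.
    """
    pairs = []
    for seg in subtitles:
        words = seg["text"].split()
        if min_words <= len(words) <= max_words:       # drop fragments and run-ons
            clip_span = (seg["start"], seg["end"])     # clip aligned to the subtitle timing
            pairs.append((clip_span, " ".join(words)))
    return pairs
```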