Learning a Grammar Inducer from Massive Uncurated Instructional Videos

Songyang Zhang1, Linfeng Song2, Lifeng Jin2, Haitao Mi2, Kun Xu2, Dong Yu2 and Jiebo Luo1
1University of Rochester, Rochester, NY, USA
szhang83@ur.rochester.edu, jluo@cs.rochester.edu
2Tencent AI Lab, Bellevue, WA, USA
{lfsong,lifengjin,haitaomi,kxkunxu,dyu}@tencent.com
Abstract
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for the accompanying text. While previous work focuses on building systems for inducing grammars on text that is well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore, we build a new model that can better learn video-span correlation without the manually designed features adopted by previous work. Experiments show that our model, trained only on large-scale YouTube data with no text-video alignment, reports strong and robust performances across three unseen datasets, despite domain shift and noisy label issues. Furthermore, our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
1 Introduction
Grammar induction is a fundamental and long-lasting (Lari and Young, 1990; Clark, 2001; Klein and Manning, 2002) problem in computational linguistics, which aims to find hierarchical syntactic structures from plain sentences. Unlike supervised methods (Charniak, 2000; Collins, 2003; Petrov and Klein, 2007; Zhang and Clark, 2011; Cross and Huang, 2016; Kitaev and Klein, 2018) that require human-annotated treebanks, e.g., the Penn Treebank (Marcus et al., 1993), grammar inducers do not rely on any human annotations for training. Grammar induction is attractive since annotating syntactic trees by human language experts is expensive and time-consuming, while the current treebanks are limited to several major languages and domains.

*This work was done when Songyang Zhang was an intern at Tencent AI Lab.
Recently, deep learning models have achieved remarkable success across NLP tasks, and neural models have been designed (Shen et al., 2018b,a; Kim et al., 2019a,b; Jin et al., 2018) for grammar induction, which greatly advanced model performance on induction with raw text. Recent efforts have started to consider other useful information from multiple modalities, such as images (Shi et al., 2019; Jin and Schuler, 2020) and videos (Zhang et al., 2021). Specifically, Zhang et al. (2021) show that multi-modal information (e.g., motion, sound and objects) from videos can significantly improve the induction accuracy on verb and noun phrases. Such work uses curated multi-modal data publicly available on the web and assumes that the meaning of a sentence needs to be identical (e.g., being a caption) to that of the corresponding video or image. This assumption limits usable data to several small-scale benchmarks (Lin et al., 2014; Xu et al., 2016; Hendricks et al., 2017) with expensive human annotations on image/video captions.
The noisy correspondence between form and meaning is one of the main research questions in language acquisition (Akhtar and Montague, 1999; Gentner et al., 2001; Dominey and Dodane, 2004), where different proposals attempt to address this indeterminacy faced by children. There has been computational work incorporating such indeterminacy into its models (Yu and Siskind, 2013; Huang et al., 2021). For modeling empirical grammar learning with multi-modal inputs, two important questions still remain open: 1) how can a grammar inducer benefit from large-scale multi-media data (e.g., YouTube videos) with noisy text-to-video correspondence? and 2) how can a grammar inducer show robust performances across multiple domains and datasets? By using data with only weak cross-modal correspondence, such as YouTube videos and their automatically generated subtitles, we allow the computational models to face a similar indeterminacy problem, and examine how indeterminacy interacts with data size to influence the learning behavior and performance of the induction models.
In this paper, we conduct the first investigation on both questions. Specifically, we collect 2.4 million video clips and the corresponding subtitles from instructional YouTube videos (HowTo100M; Miech et al. 2019) to train multi-modal grammar inducers, instead of using the training data from a benchmark where text and video are in alignment. We then propose a novel model, named Pre-Trained Compound Probabilistic Context-Free Grammars (PTC-PCFG), that extends previous work (Shi et al., 2019; Zhang et al., 2021) by incorporating a video-span matching loss term into the Compound PCFG (Kim et al., 2019a) model.
To better capture the video-span correlation, it leverages MIL-NCE (Miech et al., 2020), a state-of-the-art model pre-trained on video-subtitle retrieval, as the encoders for both video and text. Compared with previous work (Zhang et al., 2021) that independently extracts features from each modality before merging them using a simple Transformer (Vaswani et al., 2017) encoder, the encoders of our model have been pre-trained to merge such multi-modal information, and no human effort is needed to select useful modalities from the full set.
Experiments on three benchmarks show that our model, which is trained on noisy YouTube video clips and no data from these benchmarks, produces substantial gains over the previous state-of-the-art system (Zhang et al., 2021) trained on in-domain video clips with human-annotated captions. Furthermore, our model demonstrates robust performances across all three datasets. Through analysis and discussion, we point out the limitations of our model and suggest future directions for improvement. Code will be released upon paper acceptance.
In summary, the main contributions are:

• We are the first to study training a grammar inducer with massive general-domain noisy video clips instead of benchmark data, introducing the indeterminacy problem to the induction model.

• We propose PTC-PCFG, a novel model for unsupervised grammar induction. It is simpler in design than previous models and can better capture the video-text matching information.

• Trained only on noisy YouTube videos without finetuning on benchmark data, PTC-PCFG reports stronger performances than previous models trained on benchmark data across three benchmarks.
2 Background and Motivation
2.1 Compound PCFGs
A PCFG model in Chomsky Normal Form can be defined as a tuple of six terms $(S, \mathcal{N}, \mathcal{P}, \Sigma, \mathcal{R}, \Pi)$, which correspond to the start symbol, the sets of non-terminals, pre-terminals, and terminals, the production rules, and their probabilities. Given pre-defined numbers of non-terminals and pre-terminals, a PCFG induction model tries to estimate the probabilities for all production rules.
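For concreteness, a minimal sketch of this tuple as a data structure; the field layout, names, and tensor shapes are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
import torch

@dataclass
class PCFG:
    """A PCFG in Chomsky Normal Form as the tuple (S, N, P, Sigma, R, Pi)."""
    start: str                     # S: the start symbol
    nonterminals: list[str]        # N: non-terminal symbols
    preterminals: list[str]        # P: pre-terminal symbols
    terminals: list[str]           # Sigma: the vocabulary
    # The rule set R is implicit in the shapes of the probability tables (Pi) below.
    root_probs: torch.Tensor       # |N|: probabilities of S -> A
    binary_probs: torch.Tensor     # |N| x (|N|+|P|) x (|N|+|P|): probabilities of A -> B C
    emit_probs: torch.Tensor       # |P| x |Sigma|: probabilities of T -> w
```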
The compound PCFG (C-PCFG) model (Kim et al., 2019a) adopts a mixture of PCFGs. Instead of a corpus-level prior used in previous work (Kurihara and Sato, 2006; Johnson et al., 2007; Wang and Blunsom, 2013; Jin et al., 2018), C-PCFG imposes a sentence-specific prior on the distribution of possible PCFGs. Specifically, in the generative story, the probability $\pi_r$ for production rule $r$ is estimated by a model $g$ that assigns a latent variable $z$ to each sentence $\sigma$, and $z$ is drawn from a prior distribution:

$$\pi_r = g(r, z; \theta), \quad z \sim p(z), \tag{1}$$

where $\theta$ represents the model parameters. The probabilities for all three types of CFG rules are defined as follows:

$$\pi_{S \to A} = \frac{\exp\big(u_A^\top f_s([w_S; z])\big)}{\sum_{A' \in \mathcal{N}} \exp\big(u_{A'}^\top f_s([w_S; z])\big)},$$
$$\pi_{A \to BC} = \frac{\exp\big(u_{BC}^\top [w_A; z]\big)}{\sum_{B', C' \in \mathcal{N} \cup \mathcal{P}} \exp\big(u_{B'C'}^\top [w_A; z]\big)},$$
$$\pi_{T \to w} = \frac{\exp\big(u_w^\top f_t([w_T; z])\big)}{\sum_{w' \in \Sigma} \exp\big(u_{w'}^\top f_t([w_T; z])\big)}, \tag{2}$$

where $A \in \mathcal{N}$, $B$ and $C \in \mathcal{N} \cup \mathcal{P}$, $T \in \mathcal{P}$, and $w \in \Sigma$. Both $w$ and $u$ are dense vectors representing words and all types of non-terminals, and $f_s$ and $f_t$ are neural encoding functions.
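As a concrete illustration, the following PyTorch sketch parameterizes these rule probabilities from the symbol embeddings and a sentence-level latent $z$. The MLP choices for $f_s$ and $f_t$, the tensor shapes, and the module name are assumptions for illustration rather than the released C-PCFG code; log-probabilities are returned for numerical stability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompoundRuleScorer(nn.Module):
    """Sketch of the rule probabilities in Eq. (2), conditioned on a sentence latent z."""

    def __init__(self, n_nt, n_pt, vocab_size, dim, z_dim):
        super().__init__()
        self.w_root = nn.Parameter(torch.randn(dim))            # w_S
        self.w_nt = nn.Parameter(torch.randn(n_nt, dim))        # w_A, A in N
        self.w_pt = nn.Parameter(torch.randn(n_pt, dim))        # w_T, T in P
        self.u_nt = nn.Parameter(torch.randn(n_nt, dim))        # u_A
        self.u_bc = nn.Parameter(torch.randn((n_nt + n_pt) ** 2, dim + z_dim))  # u_BC
        self.u_w = nn.Parameter(torch.randn(vocab_size, dim))   # u_w
        self.f_s = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.f_t = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, z):                                       # z: (z_dim,)
        # pi_{S -> A}: softmax over non-terminals of u_A^T f_s([w_S; z])
        root = F.log_softmax(self.u_nt @ self.f_s(torch.cat([self.w_root, z])), dim=0)
        # pi_{A -> B C}: for each A, softmax over all (B, C) pairs of u_BC^T [w_A; z]
        wa_z = torch.cat([self.w_nt, z.expand(self.w_nt.size(0), -1)], dim=-1)
        binary = F.log_softmax(wa_z @ self.u_bc.t(), dim=-1)    # (n_nt, (n_nt+n_pt)^2)
        # pi_{T -> w}: for each pre-terminal T, softmax over the vocabulary of u_w^T f_t([w_T; z])
        wt_z = torch.cat([self.w_pt, z.expand(self.w_pt.size(0), -1)], dim=-1)
        emit = F.log_softmax(self.f_t(wt_z) @ self.u_w.t(), dim=-1)  # (n_pt, vocab_size)
        return root, binary, emit
```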
Optimizing the C-PCFG model involves maximizing the marginal likelihood $p(\sigma)$ of each training sentence $\sigma$ for all possible $z$:

$$\log p_\theta(\sigma) = \log \int_z \sum_{t \in \mathcal{T}_G(\sigma)} p_\theta(t \mid z)\, p(z)\, dz, \tag{3}$$

where $\mathcal{T}_G(\sigma)$ indicates all possible parse trees for sentence $\sigma$. Since computing the integral over $z$ is intractable, this objective is optimized by maximizing its evidence lower bound $\mathrm{ELBO}(\sigma; \phi, \theta)$:

$$\mathrm{ELBO}(\sigma; \phi, \theta) = \mathbb{E}_{q_\phi(z \mid \sigma)}\big[\log p_\theta(\sigma \mid z)\big] - \mathrm{KL}\big[q_\phi(z \mid \sigma)\,\|\,p(z)\big], \tag{4}$$

where $q_\phi(z \mid \sigma)$ is the variational posterior calculated by another neural network with parameters $\phi$. Given a sampled $z$, the log-likelihood term $\log p_\theta(\sigma \mid z)$ is calculated via the inside algorithm. The KL term can be computed analytically when both the prior $p(z)$ and the variational posterior $q_\phi(z \mid \sigma)$ are Gaussian (Kingma and Welling, 2014).
2.2 Multi-Modal Compound PCFGs
Multi-Modal Compound PCFGs (MMC-PCFG) (Zhang et al., 2021) extends C-PCFG with a model to match a video $v$ with a span $c$ in a parse tree $t$ of a sentence $\sigma$. It extracts $M$ visual and audio features from a video $v$ and encodes them via a multi-modal transformer (Gabeur et al., 2020), denoted as $\Psi = \{\psi_i\}_{i=1}^{M}$. The word representation $h_i$ of the $i$-th word is computed by a BiLSTM. Given a particular span $c = w_i, \ldots, w_j$, its representation $\mathbf{c}$ is the weighted sum of all label-specific span representations:

$$\mathbf{c} = \sum_{k=1}^{|\mathcal{N}|} p(k \mid c, \sigma)\, f_k\!\left(\frac{1}{j - i + 1} \sum_{l=i}^{j} h_l\right), \tag{5}$$

where $\{p(k \mid c, \sigma) \mid 1 \le k \le |\mathcal{N}|\}$ are the phrasal label probabilities of span $c$. The representation of a span $c$ is then correspondingly projected to $M$ separate embeddings via gated embedding (Miech et al., 2018), denoted as $\Xi = \{\xi_i\}_{i=1}^{M}$. Finally, the video-text matching loss is defined as a sum over all video-span matching losses weighted by the marginal probability of a span from the parser:

$$s_{mm}(v, \sigma) = \sum_{c \in \sigma} p(c \mid \sigma)\, h_{mm}(\Xi, \Psi), \tag{6}$$
where $h_{mm}(\Xi, \Psi)$ is a hinge loss measuring the distances from video $v$ to the matched and unmatched (i.e., a span from another sentence) spans $c$ and $c'$, and the distances from span $c$ to the matched and unmatched (i.e., another video) videos $v$ and $v'$:

$$\omega_i(c) = \frac{\exp(u_i^\top \mathbf{c})}{\sum_{j=1}^{M} \exp(u_j^\top \mathbf{c})}, \tag{7}$$
$$o(\Xi, \Psi) = \sum_{i=1}^{M} \omega_i(c)\, \cos(\xi_i, \psi_i), \tag{8}$$
$$h_{mm}(\Xi, \Psi) = \mathbb{E}_{c'}\big[o(\Xi', \Psi) - o(\Xi, \Psi) + \epsilon\big]_+ + \mathbb{E}_{v'}\big[o(\Xi, \Psi') - o(\Xi, \Psi) + \epsilon\big]_+, \tag{9}$$

where $\Xi'$ is a set of unmatched span expert embeddings of $\Psi$, $\Psi'$ is a set of unmatched video representations of $\Xi$, $\epsilon$ is a positive margin, $[\cdot]_+ = \max(0, \cdot)$, $\{u_i\}_{i=1}^{M}$ are learned weights, and the expectations are approximated with one sample drawn from the training data. During training, both the ELBO and the video-text matching loss are jointly optimized.
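The following PyTorch sketch puts Eqs. (5)-(9) together, assuming the BiLSTM states, phrasal label probabilities, label-specific networks $f_k$, expert embeddings $\xi_i$ and $\psi_i$ (stacked as $M$ same-sized vectors), weight vectors $u_i$, sampled negatives, and a 0.2 margin are all given. It illustrates the structure of the loss rather than reproducing the MMC-PCFG implementation.

```python
import torch
import torch.nn.functional as F

def span_representation(h, i, j, label_probs, label_mlps):
    """c in Eq. (5): average the word states h_i..h_j, transform with each label-specific
    network f_k, and weight by the phrasal label probabilities p(k | c, sigma)."""
    mean_h = h[i:j + 1].mean(dim=0)
    return sum(p * f(mean_h) for p, f in zip(label_probs, label_mlps))

def match_score(c, xi, psi, u):
    """o(Xi, Psi) in Eq. (8): cosine similarities weighted by the expert weights of Eq. (7)."""
    weights = F.softmax(u @ c, dim=0)                     # omega_i(c), shape (M,)
    return (weights * F.cosine_similarity(xi, psi, dim=-1)).sum()

def hinge_loss(c, xi, c_neg, xi_neg, psi, psi_neg, u, margin=0.2):
    """h_mm in Eq. (9), with one sampled negative span and one sampled negative video."""
    pos = match_score(c, xi, psi, u)
    neg_span = match_score(c_neg, xi_neg, psi, u)         # span from another sentence
    neg_video = match_score(c, xi, psi_neg, u)            # another video
    return F.relu(neg_span - pos + margin) + F.relu(neg_video - pos + margin)

def video_text_loss(spans, psi, psi_neg, u, margin=0.2):
    """s_mm(v, sigma) in Eq. (6): hinge losses weighted by the parser's marginals p(c | sigma)."""
    total = 0.0
    for prob, c, xi, c_neg, xi_neg in spans:
        total = total + prob * hinge_loss(c, xi, c_neg, xi_neg, psi, psi_neg, u, margin)
    return total
```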
2.3 Limitation and Motivation
Existing work on multi-modal grammar induction aims at leveraging the strict correspondence between image/video and text for information about the syntactic categories and structures of the words and spans in the text. However, such datasets are expensive to annotate. Besides, the ambiguous correspondence between language and real-world context, observed in language acquisition, is not really reflected in such training setups.

As a result, we believe that the previous work fails to answer the following important questions: 1) how well a grammar inducer would perform when it is trained only on noisy multi-media data; and 2) how the scale of the training data would affect its performance and cross-domain robustness.
3 Training a Grammar Inducer with Massive YouTube Videos

We make the first investigation into the above questions by leveraging massive video clips from instructional YouTube videos to train our grammar inducer. Different from the benchmark data used by previous work, the YouTube video clips do not contain paired sentences. This section will first introduce the method for generating noisy training instances (video clip and sentence pairs) from YouTube videos (§3.1), before describing a novel grammar induction model (§3.2) with pre-trained text and video encoders.
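While the details of §3.1 follow later, the sketch below gives a rough picture of what such weakly aligned training pairs look like: each ASR subtitle segment is paired with the video clip spanning its timestamps. The field names and the length-filtering heuristic are hypothetical illustrations, not the exact procedure of §3.1.

```python
def make_noisy_pairs(subtitles, min_words=3, max_words=30):
    """Turn ASR subtitle segments into (clip_span, sentence) training pairs.

    `subtitles` is assumed to be a list of dicts such as
    {"start": 12.4, "end": 15.9, "text": "now we add the flour"}.
    The text and the clip only loosely correspond: ASR is noisy and the speech
    may describe something off-screen, which is exactly the indeterminacy we study.
    """
    pairs = []
    for seg in subtitles:
        words = seg["text"].split()
        if min_words <= len(words) <= max_words:       # drop fragments and run-ons
            clip_span = (seg["start"], seg["end"])     # clip aligned to the subtitle timing
            pairs.append((clip_span, " ".join(words)))
    return pairs
```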