
Learning a Grammar Inducer from Massive Uncurated Instructional Videos
Songyang Zhang1∗, Linfeng Song2, Lifeng Jin2, Haitao Mi2, Kun Xu2, Dong Yu2 and Jiebo Luo1
1University of Rochester, Rochester, NY, USA
szhang83@ur.rochester.edu, jluo@cs.rochester.edu
2Tencent AI Lab, Bellevue, WA, USA
{lfsong,lifengjin,haitaomi,kxkunxu,dyu}@tencent.com
Abstract
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for the accompanying text. While previous work focuses on building systems for inducing grammars on text that is well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore, we build a new model that can better learn video-span correlation without the manually designed features adopted by previous work. Experiments show that our model, trained only on large-scale YouTube data with no text-video alignment, achieves strong and robust performance across three unseen datasets despite domain shift and noisy labels. Moreover, our model yields higher F1 scores than previous state-of-the-art systems trained on in-domain data.
1 Introduction
Grammar induction is a fundamental and long-standing (Lari and Young, 1990; Clark, 2001; Klein and Manning, 2002) problem in computational linguistics, which aims to find hierarchical syntactic structures in plain sentences. Unlike supervised methods (Charniak, 2000; Collins, 2003; Petrov and Klein, 2007; Zhang and Clark, 2011; Cross and Huang, 2016; Kitaev and Klein, 2018) that require human-annotated treebanks, e.g., the Penn Treebank (Marcus et al., 1993), grammar inducers do not rely on any human annotations for training. Grammar induction is attractive because annotating syntactic trees by human language experts is expensive and time-consuming, while current treebanks are limited to a few major languages and domains.
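As a toy illustration (the sentence and labels are invented here, not drawn from any dataset used in this work), a grammar inducer given the sentence "the dog chased the cat" might recover the constituency structure
(S (NP the dog) (VP chased (NP the cat))),
where S, NP and VP denote sentence, noun phrase and verb phrase, respectively.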
∗This work was done when Songyang Zhang was an intern at Tencent AI Lab.
Recently, deep learning models have achieved remarkable success across NLP tasks, and neural models have been designed for grammar induction (Shen et al., 2018b,a; Kim et al., 2019a,b; Jin et al., 2018), greatly advancing performance on induction from raw text. Recent efforts have started to consider other useful information from multiple modalities, such as images (Shi et al., 2019; Jin and Schuler, 2020) and videos (Zhang et al., 2021). Specifically, Zhang et al. (2021) show that multi-modal information (e.g., motion, sound and objects) from videos can significantly improve induction accuracy on verb and noun phrases. Such work uses curated multi-modal data publicly available on the web and assumes that the meaning of a sentence is identical (e.g., a caption) to that of the corresponding video or image. This assumption limits the usable data to several small-scale benchmarks (Lin et al., 2014; Xu et al., 2016; Hendricks et al., 2017) that require expensive human annotation of image/video captions.
The noisy correspondence between form and meaning is one of the main research questions in language acquisition (Akhtar and Montague, 1999; Gentner et al., 2001; Dominey and Dodane, 2004), where different proposals attempt to address this indeterminacy faced by children. There has also been computational work incorporating such indeterminacy into its models (Yu and Siskind, 2013; Huang et al., 2021). For modeling empirical grammar learning with multi-modal inputs, two important questions remain open: 1) how can a grammar inducer benefit from large-scale multi-media data (e.g., YouTube videos) with noisy text-to-video correspondence? and 2) how can a grammar inducer show robust performance across multiple domains and datasets? By using data with only weak cross-modal correspondence, such as YouTube videos and their automatically generated subtitles, we allow the computational models to face a similar indeterminacy problem, and exam-