
Table A:
Name | Mode of Travel | Purpose | Destination | Day | Month | Year | Expense
Philip Duffy | Air | Regional Meeting | London | 10 | April | 2019 | 189.06
Jeremy Oppenheim | Taxi | Exchange Visit | Ottawa | 30 | Jul | 2019 | 8.08
Mark Sedwill | Air | Evening Meal | Bristol | 02 | September | 2019 | 50

Table B:
Name | Date | Destination | Purpose
Clark | 23/07 | France | Discuss EU
Gyimah | 03/09 | Belgium | Build Relations
Harrington | 05/08 | China | Discuss Productivity

Table C:
Bird Name | Scientific Name | Date | Location
Pine Siskin | Carduelis pinus | 2019 | Ottawa
American Robin | Turdus migratorius | 2019 | Ottawa
Northern Flicker | Colaptes auratus | 2019 | London

Figure 1: An example of table union search on Open Data.
approaches are supervised, they can only be applied to finding a limited set of semantic types (78 in their experiments), and while not a general solution for unionability in data lakes, they can be used in an offline fashion to find unionable tables containing the types on which they are trained.
However, there are still plenty of opportunities to further improve the performance of table union search. One important issue is to learn sufficient contextual information between the columns in a table so as to determine unionability. We illustrate this point with the following motivating example.
Example 1.1. Figure 1 shows an example of finding unionable tables. Given the query Table A, existing approaches first find unionable columns. In this example, the column Destination in Table A will be deemed more unionable with Location from Table C than with Destination from Table B. This is because the syntactic similarity score, e.g., overlap or containment Jaccard, between the two Destination columns is 0, while the average word embedding of the cities in Table A is also closer to that of the cities in Table C than to that of the nations in Table B. Similarly, if an ontology is used, Table A and Table C share the same class while the values in Table B fall into different (though related) classes. Meanwhile, looking at the tables as a whole, we observe that Table A is actually unrelated to Table C. But since existing solutions only look at pairs of single columns when calculating the column unionability score, the columns Year/Date and Destination/Location of the two tables might be wrongly aligned with each other. Even techniques that look at relationships [23] can be fooled by the value overlap in this relationship and determine the relationship Year-Destination in Table A to be unionable with Date-Location in Table C. This kind of mistake can be avoided by looking at a table's context, i.e., the information carried by the other columns within a table. Looking at the table as a whole, a method should be able to recognize that Year in Table A is part of a travel date while in Table C it is the date on which a bird was observed; that Destination in Table A refers to the cities to which the officers are traveling; whereas Location in Table C is the city where a bird was found.
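For concreteness, the overlap and containment Jaccard scores mentioned above can be checked directly against the values shown in Figure 1; the short sketch below (plain Python, using the standard set-based definitions) shows why both come out as 0 for the two Destination columns.

```python
# Set-based syntactic scores for the two Destination columns from Figure 1.
dest_a = {"London", "Ottawa", "Bristol"}   # Table A: Destination
dest_b = {"France", "Belgium", "China"}    # Table B: Destination

overlap = len(dest_a & dest_b)              # size of the intersection -> 0
jaccard = overlap / len(dest_a | dest_b)    # Jaccard similarity -> 0.0
containment = overlap / len(dest_a)         # containment Jaccard -> 0.0
# Both scores are 0, so purely syntactic measures see no unionability here.
```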
From the above example, we focus on the following challenges in proposing a new solution. First, it is essential to learn richer column semantics grounded in the natural language domain. To this end, we require a more powerful approach to learning column representations so as to capture richer information, instead of relying on simple methods such as the average over a bag of word embeddings used in previous studies [2, 13], or even the similarity of word embedding distributions [40]. Second, we argue that it is crucial to utilize the contextual information within a table to learn the representation of each column, which is ignored by previous studies. Even proposals for capturing relationship semantics do not use contextual information to learn column representations [23]. Finally, due to the large volume of data lake tables, it is also a great challenge to develop a scalable and memory-efficient solution.
We propose Starmie, an end-to-end framework for dataset discovery from data lakes, with table union search as the main use case. Starmie uses pre-trained language models (LMs) such as BERT [12] to obtain semantics-aware representations for the columns of data lake tables. While pre-trained LMs have been shown to achieve state-of-the-art results in table understanding applications [11, 29, 45], their good performance heavily relies on high-quality labeled training data. For the problem setting of table union search [39, 40], we must come up with a fully unsupervised approach in order to apply pre-trained LMs to such applications, something not yet supported by previous studies. Starmie addresses this issue by leveraging contrastive representation learning [10] to learn column representations in a self-supervised manner. An innovation of this approach is to assume that two randomly selected columns in a data lake can be used as negative training examples. For positive examples, we propose and use novel data augmentation methods. The framework defines a learning objective that brings the same or similar columns together in the representation space while separating distinct columns. As such, Starmie can apply the pre-trained representation model to downstream tasks such as table union search without requiring any labels. We also propose to combine the learning algorithm with a novel multi-column table transformer model to learn contextualized column embeddings that model column semantics depending not only on the column values, but also on their context within a table. While a recent study, SANTOS [23], can reach a similar goal by employing a knowledge base, our proposed methods automatically capture such contextual information from tables in an unsupervised manner without relying on any external knowledge or labels.
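As a rough illustration of this kind of self-supervised objective, the sketch below shows a simplified NT-Xent-style contrastive loss in PyTorch, where two augmented views of each column form a positive pair and all other columns in the batch serve as negatives. The `encoder` and `augment` names are hypothetical placeholders; this is a minimal sketch of the general technique, not Starmie's exact training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07):
    """Simplified NT-Xent loss: z1[i] and z2[i] are embeddings of two
    augmented views of column i; every other column in the batch is a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)               # (2N, d) stacked views
    sim = z @ z.t() / temperature                # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))            # exclude self-similarity
    n = z1.size(0)
    # The positive for row i is its other augmented view.
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# Hypothetical training step: `encoder` maps a batch of serialized columns
# (e.g., sampled cell values) to embeddings; `augment` produces a second view.
# z1 = encoder(batch_columns)
# z2 = encoder(augment(batch_columns))
# loss = contrastive_loss(z1, z2)
```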
Based on the proposed column encoders, we use the cosine similarity between column embeddings as the column unionability score and develop a bipartite matching based method to calculate the table unionability score. We propose a filter-and-verification framework that enables the use of different indexing and pruning techniques to reduce the number of computations of the expensive bipartite matching. While most previous studies employed an LSH index to improve search performance, we also make use of the HNSW (Hierarchical Navigable Small World) index [34] to accelerate query processing. Experimental results show that HNSW can significantly improve the query time while only slightly reducing the MAP/recall scores. Besides table union search, we further conduct two case studies to show that Starmie can also support other dataset discovery applications such as joinable table search and column clustering. We believe these results show great promise in the use of contextualized, self-supervised embeddings for many table understanding tasks.
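To illustrate the verification step, the following sketch (using NumPy/SciPy and hypothetical pre-computed column embeddings) scores a query/candidate table pair by solving a maximum-weight bipartite matching over pairwise cosine similarities. It is a simplified rendering of the general filter-and-verification idea rather than Starmie's exact scoring routine.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def table_unionability(query_cols: np.ndarray, cand_cols: np.ndarray) -> float:
    """Score two tables given their column embeddings (one row per column).

    Column unionability = cosine similarity between column embeddings;
    table unionability  = total weight of the best bipartite matching
    between the query's columns and the candidate's columns.
    """
    q = query_cols / np.linalg.norm(query_cols, axis=1, keepdims=True)
    c = cand_cols / np.linalg.norm(cand_cols, axis=1, keepdims=True)
    sim = q @ c.T                              # pairwise cosine similarities
    row, col = linear_sum_assignment(-sim)     # negate to maximize total similarity
    return float(sim[row, col].sum())

# In the filter step, an approximate nearest-neighbor index (e.g., an HNSW
# index over column embeddings) can shortlist candidate tables cheaply,
# so this exact matching is computed only for the surviving candidates.
```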
Our contributions can be summarized as follows.
• We propose Starmie, an end-to-end framework to support dataset discovery over data lakes with table union search as the main use case.
• We develop a contrastive learning framework to learn contextualized column representations for data lake tables without requiring labeled training instances. Starmie achieves an improvement of 6.8% in both MAP and recall compared with the best state-of-the-art method, reaching a MAP of 99%, a significant margin over previous studies.