Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Grace Fan, Northeastern University, United States, fan.gr@northeastern.edu
Jin Wang, Megagon Labs, United States, jin@megagon.ai
Yuliang Li, Megagon Labs, United States, yuliang@megagon.ai
Dan Zhang, Megagon Labs, United States, dan_z@megagon.ai
Renée Miller, Northeastern University, United States, miller@northeastern.edu
ABSTRACT
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8% in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
PVLDB Reference Format:
Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée Miller. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. PVLDB, 14(1): 50-60, 2021. doi:10.14778/3421424.3421431

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/megagonlabs/starmie.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:10.14778/3421424.3421431

1 INTRODUCTION

The growing number of open datasets from governments, academic institutions, and companies has brought new opportunities for innovation, economic growth, and societal benefits. To integrate and analyze such datasets, researchers in both academia and industry have built a number of dataset search engines to support the application of dataset discovery [3, 7, 16, 18, 31, 38, 43]. One popular example is Google's dataset search [3], which provides keyword search on the metadata. However, for open datasets, simple keyword search might suffer from data quality issues of incomplete and inconsistent metadata across different datasets and publishers [1, 15, 39, 40]. Thus it is essential to support table search over open datasets, and more generally data lake tables (including private enterprise data lakes), to boost dataset discovery applications, such as finding related tables, domain discovery, and column clustering.
Finding related tables from data lakes [37, 44, 55] has a wide spectrum of real application scenarios. There are two sub-tasks of finding related tables, namely table union search and joinable table search. In this paper, we mainly focus on the problem of table union search, which has been recognized as a crucial task in dataset discovery from data lakes [2, 23, 37, 39, 40, 55, 59]. Given a query table and a collection of data lake tables, table union search aims to find all tables that are unionable with the query table. To determine whether two tables are unionable, existing solutions first identify all pairs of unionable columns from the two tables based on column representations, such as bag of tokens or bag of word embeddings. They then devise some mechanism to aggregate the column-level results to compute the table unionability score.
State-of-the-art: Early work on finding unionable tables used table clustering followed by simple syntactic measures, such as the difference in column mean string length and cosine similarities, to determine if two tables are unionable [4]. Table union search [40] improved on this by applying a rich collection of column representations, including syntactic, semantic (leveraging ontologies), and natural language (based on word embeddings) column representations. Two important innovations of this work were the modeling of data lake context to create an ensemble unionability score, which models the surprisingness of a score given the score distributions within a data lake, and the use of LSH indices to make table union search fast over large data lakes [40]. More recently, D3L [2] added additional column representations based on regular expression matching, and SANTOS [23] added, to the column representations, representations of binary relationships. In parallel to these search-based approaches, the mighty hammer of deep learning has been applied to the problem of column matching (determining the semantic type of a column) [21, 54].
Table A:
| Name             | Mode of Travel | Purpose          | Destination | Day | Month     | Year | Expense |
|------------------|----------------|------------------|-------------|-----|-----------|------|---------|
| Philip Duffy     | Air            | Regional Meeting | London      | 10  | April     | 2019 | 189.06  |
| Jeremy Oppenheim | Taxi           | Exchange Visit   | Ottawa      | 30  | Jul       | 2019 | 8.08    |
| Mark Sedwill     | Air            | Evening Meal     | Bristol     | 02  | September | 2019 | 50      |

Table B:
| Name       | Date  | Destination | Purpose              |
|------------|-------|-------------|----------------------|
| Clark      | 23/07 | France      | Discuss EU           |
| Gyimah     | 03/09 | Belgium     | Build Relations      |
| Harrington | 05/08 | China       | Discuss Productivity |

Table C:
| Bird Name        | Scientific Name    | Date | Location |
|------------------|--------------------|------|----------|
| Pine Siskin      | Carduelis Pinus    | 2019 | Ottawa   |
| American Robin   | Turdus migratorius | 2019 | Ottawa   |
| Northern Flicker | Colaptes auratus   | 2019 | London   |

Figure 1: An example of table union search on Open Data.
Since these approaches are supervised, they can only be applied to finding a limited set of semantic types (78 in their experiments), and while not a general solution for unionability in data lakes, they can be used in an offline fashion to find unionable tables containing the types on which they are trained.
However, there are still plenty of opportunities to further improve the performance of table union search. One important issue is to learn sufficient contextual information between columns in tables so as to determine unionability. This point is illustrated in the following motivating example.
Example 1.1. Figure 1 shows an example of finding unionable tables. Given the query Table A, existing approaches first find unionable columns. In this example, the column Destination in Table A will be deemed more unionable with Location from Table C than with Destination from Table B. This is because the syntactic similarity score, e.g., overlap and containment Jaccard, between the two Destination columns is 0, while the average word embedding of cities (Table A) is also not as close to that of nations (Table B). Similarly, if an ontology is used, Table A and Table C share the same class, while the values in B are in different (though related) classes. Meanwhile, looking at the tables as a whole, we observe that Table A is actually irrelevant to Table C. But as existing solutions only look at the pair of single columns when calculating the column unionability score, the columns Year/Date and Destination/Location of the two tables might be wrongly aligned together. Even techniques that look at relationships [23] can be fooled by the value overlap in this relationship and determine the relationship Year-Destination in Table A to be unionable with Date-Location in Table C. This kind of mistake can be avoided by looking at a table's context, i.e., information carried by other columns within a table. Looking at the table as a whole, a method should be able to recognize that the Year in Table A is part of a travel date while in Table C it is the date of discovery of a bird; and Destination in Table A refers to the cities to which the officers are traveling, whereas Location in Table C is the city where a bird is found.
From the above example, we focus on the following challenges in proposing a new solution. Firstly, it is essential to learn richer semantics of columns based on the natural language domain. To this end, we require a more powerful approach to learn the column representation so as to capture richer information, instead of relying on simple methods like the average over a bag of word embeddings utilized in previous studies [2, 13] or even the similarity of the word embedding distributions [40]. Secondly, we argue that it is crucial to utilize the contextual information within a table to learn the representation of each column, which is ignored by previous studies. Even proposals for capturing relationship semantics do not use contextual information to learn column representations [23]. Finally, due to the large volume of data lake tables, it is also a great challenge to develop a scalable and memory-efficient solution.
We propose Starmie, an end-to-end framework for dataset discovery from data lakes with table union search as the main use case. Starmie uses pre-trained language models (LMs) such as BERT [12] to obtain semantics-aware representations for columns of data lake tables. While pre-trained LMs have been shown to achieve state-of-the-art results in table understanding applications [11, 29, 45], their good performance heavily relies on high-quality labeled training data. For the problem setting of table union search [39, 40], we must come up with a fully unsupervised approach in order to apply pre-trained LMs to such applications, something not yet supported by previous studies. Starmie addresses this issue by leveraging contrastive representation learning [10] to learn column representations in a self-supervised manner. An innovation of this approach is to assume that two randomly selected columns in a data lake can be used as negative training examples. For positive examples, we propose and use novel data augmentation methods. The framework defines a learning objective that connects the same or similar columns in the representation space while separating distinct columns. As such, Starmie can apply the pre-trained representation model in downstream tasks such as table union search without requiring any labels. We also propose to combine the learning algorithm with a novel multi-column table transformer model to learn contextualized column embeddings that model the column semantics depending on not only the column values, but also their context within a table. While a recent study, SANTOS [23], can reach a similar goal by employing a knowledge base, our proposed methods can automatically capture such contextual information from tables in an unsupervised manner without relying on any external knowledge or labels.
Based on the proposed column encoders, we use cosine similarity between column embeddings as the column unionability score and develop a bipartite matching based method to calculate the table unionability score. We propose a filter-and-verification framework that enables the use of different indexing and pruning techniques to reduce the number of computations of the expensive bipartite matching. While most previous studies employed an LSH index to improve the search performance, we also make use of the HNSW (Hierarchical Navigable Small World) index [34] to accelerate query processing. Experimental results show that HNSW can significantly improve the query time while only slightly reducing the MAP/recall scores. Besides table union search, we further conduct two case studies to show that Starmie can also support other dataset discovery applications such as joinable table search and column clustering. We believe these results show great promise in the use of contextualized, self-supervised embeddings for many table understanding tasks.
Our contributions can be summarized as follows.
• We propose Starmie, an end-to-end framework to support dataset discovery over data lakes with table union search as the main use case.
• We develop a contrastive learning framework to learn contextualized column representations for data lake tables without requiring labeled training instances. Starmie achieves an improvement of 6.8% in both MAP and recall compared with the best state-of-the-art method, with a MAP of 99%, a significant margin compared with previous studies.
• We design and implement a filter-and-verification based framework for computing the table-level unionability score, which can accommodate multiple design choices of indexing and pruning to accelerate the overall query processing. By leveraging the HNSW index, Starmie achieves up to three orders of magnitude in performance gain for query time relative to the linear scan baseline.
• We conduct an extensive set of experiments over two real-world data lake corpora. Experimental results demonstrate that the proposed Starmie framework significantly outperforms existing solutions in effectiveness. It also shows good scalability and memory efficiency.
• We further conduct case studies to show the flexibility and generality of our proposed framework in other dataset discovery applications.
2 OVERVIEW
2.1 Problem definition
A data lake consists of a collection of tables $\mathcal{T}$. Each table $T \in \mathcal{T}$ consists of several columns $\{t_1, \ldots, t_m\}$, where each column $t_i$ can be from a different domain. Here $m$ is the number of columns in table $T$ (denoted as $|T| = m$). We will use the notation $T$ to denote both the table and its set of columns if there is no ambiguity. To determine the unionability between two columns, following previous studies, we employ column encoders to generate the representations of columns. Then the column unionability score can be computed to measure the relevance between those representations. A column encoder $\mathcal{M}$ takes a column $t$ as input and outputs $\mathcal{M}(t)$ as the representation. Given two columns $t_i$ and $t_j$, the column unionability score is computed as $\mathcal{F}(\mathcal{M}(t_i), \mathcal{M}(t_j))$, where $\mathcal{F}$ is a scoring function between two column representations.
Based on the column unionability scores, we compute the table unionability score between two tables, which is obtained by aggregating the column unionability scores introduced above. Given two tables $S$ and $T$, we define a table unionability scoring mechanism as $U = \{\mathcal{F}, \mathcal{M}, \mathcal{A}\}$, where $\mathcal{M}$ and $\mathcal{F}$ are the column encoder and the scoring function for two column representations, respectively. Here $\mathcal{A}$ is a mechanism to aggregate the column unionability scores between all pairs of columns from the two tables. We will introduce the details of $\mathcal{A}$ later in Section 4.
Following the above discussion, we can formally define the table union search problem as a top-$k$ search problem in Definition 2.1:

Definition 2.1 (Table Union Search). Given a collection of data lake tables $\mathcal{T}$ and a query table $S$, top-$k$ table union search aims at finding a subset $\mathcal{S} \subseteq \mathcal{T}$ where $|\mathcal{S}| = k$, such that $\forall T \in \mathcal{S}$ and $\forall T' \in \mathcal{T} - \mathcal{S}$, we have $U(S, T) \geq U(S, T')$.
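To make the search problem concrete, here is a minimal brute-force sketch in Python under our own assumptions: tables are represented as lists of column embedding vectors produced by some encoder $\mathcal{M}$, $\mathcal{F}$ is cosine similarity, and the aggregator $\mathcal{A}$ is a simple best-match average used purely as a placeholder (Starmie's actual aggregator, based on bipartite matching, is introduced in Section 4).

```python
import heapq
import numpy as np

def cosine(u, v):
    # F: scoring function between two column embeddings
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def table_unionability(S_cols, T_cols):
    # A (placeholder): average over each query column's best-matching
    # score; the paper's aggregator uses weighted bipartite matching.
    return float(np.mean([max(cosine(s, t) for t in T_cols) for s in S_cols]))

def table_union_search(S_cols, lake, k):
    # lake: dict mapping table name -> list of column embedding vectors.
    # Returns the k tables with the highest unionability score U(S, T).
    scores = {name: table_unionability(S_cols, cols)
              for name, cols in lake.items()}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```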
2.2 System architecture
Figure 2 shows the overall architecture of Starmie, which solves table union search in two stages: offline and online.

Figure 2: During the offline phase, Starmie pre-trains a multi-column table encoder using contrastive learning and stores the embeddings of data lake columns in vector indices like HNSW. During online processing, Starmie retrieves candidate tables with similar contextualized column embeddings, then verifies their table-level unionability scores using column alignment algorithms.

During the offline stage, Starmie pre-trains a column representation model that encodes columns of data lake tables into dense high-dimensional vectors (i.e., column embeddings). Then, we apply the trained model to all data lake tables to obtain the column embeddings via model inference. We store the embedding vectors in efficient vector indices for online retrieval. A key challenge for the offline stage is to train high-quality column encoders that capture the semantics of tabular data. In Starmie, we follow a recent trend [11, 29, 45] of table representation learning that encodes tabular data using pre-trained language models (LMs). Pre-trained LMs have achieved state-of-the-art performance on table understanding tasks such as column type and relation type annotation [45]. However, the good performance of pre-trained LMs requires fine-tuning on high-quality labeled datasets, which are often not available in table search applications such as table union search. Using pre-trained LMs off-the-shelf is also problematic, as the column embeddings cannot capture (ir-)relevance between columns or the contextual information within tables. To this end, in Section 3, we propose a contrastive learning framework for learning high-dimensional column representations in a fully unsupervised manner. We combine the framework with a multi-column table model that captures column semantics from the column values while taking the table context into account. Then we apply the column encoder to all tables to convert each table into a collection of embedding vectors.
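To illustrate how the column embeddings can be stored for retrieval, the following sketch builds an HNSW index with the hnswlib library; the library choice and all parameter values (dim, M, ef_construction, ef) are our assumptions for illustration rather than settings prescribed by Starmie.

```python
import hnswlib
import numpy as np

dim = 768  # e.g., BERT-base embedding size (assumption)
# Stand-in for the column embeddings produced by the trained encoder.
embeddings = np.random.rand(10_000, dim).astype(np.float32)

# Offline: build an HNSW index over all data lake column embeddings.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(embeddings), M=16, ef_construction=200)
index.add_items(embeddings, ids=np.arange(len(embeddings)))

# Online: retrieve the top-k most similar data lake columns per query column.
index.set_ef(100)  # higher ef -> better recall, slower queries
query_col = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query_col, k=10)
```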
During the online stage, given an input query table, we retrieve a set of candidate tables from the vector indices by searching for data lake column embeddings of high column-level similarity with the input columns. Starmie then applies a verification step for checking and ranking the candidates for the top-$k$ tables with the highest table-level unionability scores. The first challenge for the online stage is how to efficiently search for unionable columns. This is not a trivial task due to the massive size of data lakes. We address this challenge by allowing different design choices of state-of-the-art high-dimensional vector indices. Yet another challenge is designing a table unionability function that can effectively aggregate the column unionability scores. As in other studies, we employ weighted bipartite graph matching. To address its limitation of high computational complexity, we introduce a novel algorithm to reduce the number of expensive calls to the exact matching algorithm by deducing lower and upper bounds of the matching score (Section 4).
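As a concrete reference for the verification step, the sketch below computes the table unionability score by maximum-weight bipartite matching over pairwise cosine similarities, using scipy's linear_sum_assignment as the exact matcher; the lower/upper-bound pruning that Section 4 introduces to avoid most of these exact calls is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def table_unionability(S_emb, T_emb):
    """Aggregate column scores via maximum-weight bipartite matching.

    S_emb, T_emb: arrays of shape (num_columns, dim) holding the
    column embeddings of tables S and T.
    """
    # Edge weights: pairwise cosine similarities between columns.
    S = S_emb / np.linalg.norm(S_emb, axis=1, keepdims=True)
    T = T_emb / np.linalg.norm(T_emb, axis=1, keepdims=True)
    weights = S @ T.T
    # linear_sum_assignment minimizes total cost, so negate to maximize.
    rows, cols = linear_sum_assignment(-weights)
    return float(weights[rows, cols].sum())
```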
Figure 3: Contrastive learning with single-column input. (The figure shows two batches of serialized columns, $X$ and $Y$, e.g., "<s> 9/14/2009 12/14/2009 10/31/2009 ..."; two views $X_{\mathrm{ori}}$ and $X_{\mathrm{aug}}$ of the same batch are generated, e.g., via sampling, and encoded by $\mathcal{M}$ into $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$, which the loss connects, while both are separated from $\mathcal{M}(Y)$.)
3 LEARNING CONTEXTUALIZED COLUMN EMBEDDINGS

We now describe the offline stage for training high-quality column encoders. The encoder pre-processes tables into sequenced inputs and uses a pre-trained LM to encode each column into a high-dimensional vector. We first introduce background knowledge in Section 3.1. We describe a novel contrastive learning approach for table encoders in Section 3.2 and generalize it to multi-column encoders for contextualized embeddings in Section 3.3. Finally, we describe the table pre-processing approaches to generate the input for such learning processes in Section 3.4.
3.1 Background
Contrastive learning is a self-supervision approach that learns data representations in which similar data items are close while distinct data items are far apart. In Starmie, we adopt SimCLR [10], which was recently shown to be effective in vision and NLP applications. Figure 3 illustrates the high-level idea of the algorithm. The goal is to learn an encoder $\mathcal{M}$ (e.g., a column encoder) that takes a data item (e.g., a column) as input and encodes it into a high-dimensional vector. To train the encoder in a self-supervised manner without labels, SimCLR relies on (1) a data augmentation operator generating semantics-preserving views of the same data item (in our context, this means $X_{\mathrm{ori}}$ and $X_{\mathrm{aug}}$ that are unionable) and (2) a sampling method (e.g., uniform sampling from a large collection) that returns pairs of data items (i.e., $X$ and $Y$) that are distinct (meaning non-unionable) with high probability. SimCLR then applies a contrastive loss function that connects the representations of the semantics-preserving (unionable) views while separating those of the sampled distinct (non-unionable) items. Next, we illustrate how we apply the algorithm for training a single-column encoder.
3.2 Contrastive Learning Framework
The goal is to connect representations of the same or unionable columns in their representation space while separating representations of distinct columns. To achieve the first goal, Algorithm 1 leverages a data augmentation operator op (Line 5). Given a batch of columns $X = \{x_1, \ldots, x_N\}$, where $N$ is the batch size, op transforms $X$ into a semantics-preserving view $X_{\mathrm{aug}}$. We design the augmentation operator to be uniform sampling of the values from the original column. By doing so, we can generate diverse views of the same column while all views preserve the original semantic types.
Algorithm 1: SimCLR pre-training
Input: A collection $D$ of data lake columns
Variables: Number of training epochs n_epoch; data augmentation operator op; learning rate $\eta$
Output: An embedding model $\mathcal{M}$

1  Initialize $\mathcal{M}$ using a pre-trained LM;
2  for ep = 1 to n_epoch do
3      Randomly split $D$ into batches $\{B_1, \ldots, B_n\}$;
4      for $B \in \{B_1, \ldots, B_n\}$ do
           /* augment and encode every item */
5          $B_{\mathrm{ori}}, B_{\mathrm{aug}} \leftarrow$ augment($B$, op);
6          $\vec{Z}_{\mathrm{ori}}, \vec{Z}_{\mathrm{aug}} \leftarrow \mathcal{M}(B_{\mathrm{ori}}), \mathcal{M}(B_{\mathrm{aug}})$;
           /* Equations (1) and (2) */
7          $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{contrast}}(\vec{Z}_{\mathrm{ori}}, \vec{Z}_{\mathrm{aug}})$;
           /* Back-propagate to update $\mathcal{M}$ */
8          $\mathcal{M} \leftarrow$ back-propagate($\mathcal{M}, \eta, \partial\mathcal{L}/\partial\mathcal{M}$);
9  return $\mathcal{M}$;
Then $\mathcal{M}$ can encode the batches $X$ (also $X_{\mathrm{ori}}$, which is a copy of $X$ in the figure) and $X_{\mathrm{aug}}$ into column embedding vectors $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$, respectively. Note that $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ are both matrices of size $N$ times the dimension of the embedding vector (e.g., 768 for BERT).
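For concreteness, here is a minimal sketch of the augmentation operator op under our own assumptions (a column is a list of cell values, and the sampling ratio is a hypothetical hyper-parameter); uniform sampling yields a view that differs in its exact contents but preserves the column's semantic type.

```python
import random

def augment(column, ratio=0.5):
    # op: uniformly sample a subset of the column's values.
    # `ratio` is an illustrative choice, not a value from the paper.
    n = max(1, int(len(column) * ratio))
    return random.sample(column, n)

batch = [["London", "Ottawa", "Bristol", "Paris"],
         ["2019", "2018", "2020", "2017"]]
batch_ori = batch                         # original view X_ori
batch_aug = [augment(c) for c in batch]   # semantics-preserving view X_aug
```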
Next, the algorithm leverages a contrastive loss function to connect the semantics-preserving views of columns and separate representations of distinct columns (Line 6). More specifically, let $\vec{Z} = \{\vec{z}_i\}_{1 \le i \le 2N}$ be the concatenation of the two encoded views $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ of batch $X$ introduced above. Here $\vec{z}_i$ is the $i$-th element of $\vec{Z}_{\mathrm{ori}}$ for $i \le N$ and the $(i - N)$-th element of $\vec{Z}_{\mathrm{aug}}$ for $i > N$. We first define a single-pair loss $\ell(i, j)$ for an element pair $(\vec{z}_i, \vec{z}_j)$ as Equation 1:

$$\ell(i, j) = -\log \frac{\exp\left(\operatorname{sim}(\vec{z}_i, \vec{z}_j) / \tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i,\, k \ne j]} \exp\left(\operatorname{sim}(\vec{z}_i, \vec{z}_k) / \tau\right)} \qquad (1)$$

where $\operatorname{sim}$ is a similarity function such as cosine and $\tau$ is a temperature hyper-parameter in the range $(0, 1]$. We fix $\tau$ to be 0.07 empirically. Intuitively, by minimizing this loss for a pair $(\vec{z}_i, \vec{z}_j)$ that are views of the same column, we (i) maximize the similarity score $\operatorname{sim}(\vec{z}_i, \vec{z}_j)$ in the numerator and (ii) minimize $\vec{z}_i$'s similarities with all the other elements in the denominator.
Next, we can obtain the contrastive loss by averaging over all matching pairs, as shown in Equation 2 (Line 7):

$$\mathcal{L}_{\mathrm{contrast}} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(k, k+N) + \ell(k+N, k) \right] \qquad (2)$$

where each term $\ell(k, k+N)$ and $\ell(k+N, k)$ refers to a pair of views generated from the same column.
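For reference, the following PyTorch sketch implements Equations 1 and 2, assuming z_ori and z_aug are the $N \times d$ embedding matrices $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ produced by the encoder; the mask excludes both $k = i$ and $k = j$ from the denominator, matching Equation 1.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_ori, z_aug, tau=0.07):
    # Concatenate the two views: z[i] pairs with z[i + N] and vice versa.
    N = z_ori.size(0)
    z = F.normalize(torch.cat([z_ori, z_aug], dim=0), dim=1)  # (2N, d)
    sim = z @ z.t() / tau              # cosine similarities scaled by tau
    pos = torch.cat([torch.arange(N) + N, torch.arange(N)])   # j for each i
    # Exclude k == i (diagonal) and k == j (positive) from the denominator.
    mask = torch.eye(2 * N, dtype=torch.bool)
    mask[torch.arange(2 * N), pos] = True
    numerator = sim[torch.arange(2 * N), pos]
    denominator = sim.masked_fill(mask, float("-inf")).logsumexp(dim=1)
    # Mean over all 2N single-pair losses equals Equation 2.
    return (denominator - numerator).mean()
```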
3.3 Multi-column Table Encoder
While the method shown in Algorithm 1 learns column representations based on values within a column itself, it cannot take the contextual information of a table into account. For example, the single-column model can understand that a column consisting of values "1997 1998 . . ." is a column about years, but depending on the context of other columns present in the same table, the same