Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning

Grace Fan, Northeastern University, United States, fan.gr@northeastern.edu
Jin Wang, Megagon Labs, United States, jin@megagon.ai
Yuliang Li, Megagon Labs, United States, yuliang@megagon.ai
Dan Zhang, Megagon Labs, United States, dan_z@megagon.ai
Renée Miller, Northeastern University, United States, miller@northeastern.edu
ABSTRACT
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8% in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
PVLDB Reference Format:
Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, and Renée Miller. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. PVLDB, 14(1): 50-60, 2021. doi:10.14778/3421424.3421431

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/megagonlabs/starmie.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:10.14778/3421424.3421431

1 INTRODUCTION

The growing number of open datasets from governments, academic institutions, and companies has brought new opportunities for innovation, economic growth, and societal benefits. To integrate and analyze such datasets, researchers in both academia and industry have built a number of dataset search engines to support the application of dataset discovery [3, 7, 16, 18, 31, 38, 43]. One popular example is Google's dataset search [3], which provides keyword search on the metadata. However, for open datasets, simple keyword search might suffer from data quality issues of incomplete and inconsistent metadata across different datasets and publishers [1, 15, 39, 40]. Thus it is essential to support table search over open datasets, and more generally data lake tables (including private enterprise data lakes), to boost dataset discovery applications, such as finding related tables, domain discovery, and column clustering.
Finding related tables from data lakes [37, 44, 55] has a wide spectrum of real application scenarios. There are two sub-tasks of finding related tables, namely table union search and joinable table search. In this paper, we mainly focus on the problem of table union search, which has been recognized as a crucial task in dataset discovery from data lakes [2, 23, 37, 39, 40, 55, 59]. Given a query table and a collection of data lake tables, table union search aims to find all tables that are unionable with the query table. To determine whether two tables are unionable, existing solutions first identify all pairs of unionable columns from the two tables based on column representations, such as bag of tokens or bag of word embeddings. They then devise some mechanism to aggregate the column-level results to compute the table unionability score.
State-of-the-art: Early work on finding unionable tables used table clustering followed by simple syntactic measures, such as the difference in column mean string length and cosine similarities, to determine if two tables are unionable [4]. Table union search [40] improved on this by applying a rich collection of column representations, including syntactic, semantic (leveraging ontologies), and natural language (based on word embeddings) column representations. Two important innovations of this work were the modeling of data lake context to create an ensemble unionability score, which models the surprisingness of a score given the score distributions within a data lake, and the use of LSH indices to make table union search fast over large data lakes [40]. More recently, D3L [2] added additional column representations based on regular expression matching, and SANTOS [23] added, to the column representations, representations of binary relationships. In parallel to these search-based approaches, the mighty hammer of deep learning has been applied to the problem of column matching (determining the semantic type of a column) [21, 54].
Table A:
| Name             | Mode of Travel | Purpose          | Destination | Day | Month     | Year | Expense |
|------------------|----------------|------------------|-------------|-----|-----------|------|---------|
| Philip Duffy     | Air            | Regional Meeting | London      | 10  | April     | 2019 | 189.06  |
| Jeremy Oppenheim | Taxi           | Exchange Visit   | Ottawa      | 30  | Jul       | 2019 | 8.08    |
| Mark Sedwill     | Air            | Evening Meal     | Bristol     | 02  | September | 2019 | 50      |

Table B:
| Name       | Date  | Destination | Purpose              |
|------------|-------|-------------|----------------------|
| Clark      | 23/07 | France      | Discuss EU           |
| Gyimah     | 03/09 | Belgium     | Build Relations      |
| Harrington | 05/08 | China       | Discuss Productivity |

Table C:
| Bird Name        | Scientific Name    | Date | Location |
|------------------|--------------------|------|----------|
| Pine Siskin      | Carduelis Pinus    | 2019 | Ottawa   |
| American Robin   | Turdus migratorius | 2019 | Ottawa   |
| Northern Flicker | Colaptes auratus   | 2019 | London   |

Figure 1: An example of table union search on Open Data.
Since these approaches are supervised, they can only be applied to finding a limited set of semantic types (78 in their experiments), and while not a general solution for unionability in data lakes, they can be used in an offline fashion to find unionable tables containing the types on which they are trained.
However, there are still plenty of opportunities to further improve the performance of table union search. One important issue is to learn sufficient contextual information between columns in tables so as to determine unionability. This point is illustrated in the following motivating example.
Example 1.1. Figure 1 shows an example of finding unionable tables. Given the query Table A, existing approaches first find unionable columns. In this example, the column Destination in Table A will be deemed more unionable with Location from Table C than with Destination from Table B. This is because the syntactic similarity score, e.g., overlap and containment Jaccard, between the two Destination columns is 0, while the average word embedding of cities (Table A) is also not as close to that of nations (Table B). Similarly, if an ontology is used, Table A and Table C share the same class, while the values in B are in different (though related) classes. Meanwhile, looking at the tables as a whole, we observe that Table A is actually irrelevant to Table C. But as existing solutions only look at the pair of single columns when calculating the column unionability score, the columns Year/Date and Destination/Location of the two tables might be wrongly aligned together. Even techniques that look at relationships [23] can be fooled by the value overlap in this relationship and determine the relationship Year-Destination in Table A to be unionable with Date-Location in Table C. This kind of mistake can be avoided by looking at a table's context, i.e., information carried by other columns within a table. Looking at the table as a whole, a method should be able to recognize that the Year in Table A is part of a travel date while in Table C it is the date of discovery of a bird; and Destination in Table A refers to the cities to which the officers are traveling, whereas Location in Table C is the city where a bird is found.
From the above example, we focus on the following challenges in proposing a new solution. Firstly, it is essential to learn richer semantics of columns based on the natural language domain. To this end, we require a more powerful approach to learn the column representation so as to capture richer information, instead of relying on simple methods like the average over a bag of word embeddings utilized in previous studies [2, 13] or even the similarity of the word embedding distributions [40]. Secondly, we argue that it is crucial to utilize the contextual information within a table to learn the representation of each column, which is ignored by previous studies. Even proposals for capturing relationship semantics do not use contextual information to learn column representations [23]. Finally, due to the large volume of data lake tables, it is also a great challenge to develop a scalable and memory-efficient solution.
We propose Starmie, an end-to-end framework for dataset discovery from data lakes with table union search as the main use case. Starmie uses pre-trained language models (LMs) such as BERT [12] to obtain semantics-aware representations for columns of data lake tables. While pre-trained LMs have been shown to achieve state-of-the-art results in table understanding applications [11, 29, 45], their good performance heavily relies on high-quality labeled training data. For the problem setting of table union search [39, 40], we must come up with a fully unsupervised approach in order to apply pre-trained LMs to such applications, something not yet supported by previous studies. Starmie addresses this issue by leveraging contrastive representation learning [10] to learn column representations in a self-supervised manner. An innovation of this approach is to assume that two randomly selected columns in a data lake can be used as negative training examples. For positive examples, we propose and use novel data augmentation methods. The framework defines a learning objective that connects the same or similar columns in the representation space while separating distinct columns. As such, Starmie can apply the pre-trained representation model in downstream tasks such as table union search without requiring any labels. We also propose to combine the learning algorithm with a novel multi-column table transformer model to learn contextualized column embeddings that model the column semantics depending on not only the column values, but also their context within a table. While a recent study, SANTOS [23], can reach a similar goal by employing a knowledge base, our proposed methods can automatically capture such contextual information from tables in an unsupervised manner without relying on any external knowledge or labels.
Based on the proposed column encoders, we use cosine similarity between column embeddings as the column unionability score and develop a bipartite matching based method to calculate the table unionability score. We propose a filter-and-verification framework that enables the use of different indexing and pruning techniques to reduce the number of computations of the expensive bipartite matching. While most previous studies employed an LSH index to improve the search performance, we also make use of the HNSW (Hierarchical Navigable Small World) index [34] to accelerate query processing. Experimental results show that HNSW can significantly improve the query time while only slightly reducing the MAP/recall scores. Besides table union search, we further conduct two case studies to show that Starmie can also support other dataset discovery applications such as joinable table search and column clustering. We believe these results show great promise in the use of contextualized, self-supervised embeddings for many table understanding tasks.
Our contributions can be summarized as follows.
• We propose Starmie, an end-to-end framework to support dataset discovery over data lakes with table union search as the main use case.
• We develop a contrastive learning framework to learn contextualized column representations for data lake tables without requiring labeled training instances. Starmie achieves an improvement of 6.8% in both MAP and recall compared with the best state-of-the-art method, with a MAP of 99%, a significant margin compared with previous studies.
• We design and implement a filter-and-verification based framework for computing the table-level unionability score, which can accommodate multiple design choices of indexing and pruning to accelerate the overall query processing. By leveraging the HNSW index, Starmie achieves up to three orders of magnitude in performance gain for query time relative to the linear scan baseline.
• We conduct an extensive set of experiments over two real-world data lake corpora. Experimental results demonstrate that the proposed Starmie framework significantly outperforms existing solutions in effectiveness. It also shows good scalability and memory efficiency.
• We further conduct case studies to show the flexibility and generality of our proposed framework in other dataset discovery applications.
2 OVERVIEW
2.1 Problem definition
A data lake consists of a collection of tables $\mathcal{T}$. Each table $T \in \mathcal{T}$ consists of several columns $\{t_1, \ldots, t_m\}$, where each column $t_i$ can be from a different domain. Here $m$ is the number of columns in table $T$ (denoted as $|T| = m$). We will use the notation $T$ to denote both the table and its set of columns if there is no ambiguity. To determine the unionability between two columns, following previous studies, we employ column encoders to generate the representations of columns. Then the column unionability score can be computed to measure the relevance between those representations. A column encoder $\mathcal{M}$ takes a column $t$ as input and outputs $\mathcal{M}(t)$ as the representation. Given two columns $t_i$ and $t_j$, the column unionability score is computed as $\mathcal{F}(\mathcal{M}(t_i), \mathcal{M}(t_j))$, where $\mathcal{F}$ is a scoring function between two column representations.
Based on the column unionability scores, we compute the table unionability score between two tables, which is obtained by aggregating the column unionability scores introduced above. Given two tables $S$ and $T$, we define a table unionability scoring mechanism as $U = \{\mathcal{F}, \mathcal{M}, \mathcal{A}\}$, where $\mathcal{M}$ and $\mathcal{F}$ are the column encoder and the scoring function for two column representations, respectively. Here $\mathcal{A}$ is a mechanism to aggregate the column unionability scores between all pairs of columns from the two tables. We will introduce the details of $\mathcal{A}$ later in Section 4.
Following the above discussion, we can formally define the table union search problem as a top-$k$ search problem in Definition 2.1:

Definition 2.1 (Table Union Search). Given a collection of data lake tables $\mathcal{T}$ and a query table $S$, top-$k$ table union search aims at finding a subset $\mathcal{S} \subseteq \mathcal{T}$ where $|\mathcal{S}| = k$, such that $\forall T \in \mathcal{S}$ and $\forall T' \in \mathcal{T} - \mathcal{S}$, we have $U(S, T) \geq U(S, T')$.
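To make the search problem concrete, here is a minimal brute-force sketch in Python under our own assumptions: tables are represented as lists of column embedding vectors produced by some encoder $\mathcal{M}$, $\mathcal{F}$ is cosine similarity, and the aggregator $\mathcal{A}$ is a simple best-match average used purely as a placeholder (Starmie's actual aggregator, based on bipartite matching, is introduced in Section 4).

```python
import heapq
import numpy as np

def cosine(u, v):
    # F: scoring function between two column embeddings
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def table_unionability(S_cols, T_cols):
    # A (placeholder): average over each query column's best-matching
    # score; the paper's aggregator uses weighted bipartite matching.
    return float(np.mean([max(cosine(s, t) for t in T_cols) for s in S_cols]))

def table_union_search(S_cols, lake, k):
    # lake: dict mapping table name -> list of column embedding vectors.
    # Returns the k tables with the highest unionability score U(S, T).
    scores = {name: table_unionability(S_cols, cols)
              for name, cols in lake.items()}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
```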
2.2 System architecture
Figure 2 shows the overall architecture of Starmie, which solves table union search in two stages: offline and online.

Figure 2: During the offline phase, Starmie pre-trains a multi-column table encoder using contrastive learning and stores the embeddings of data lake columns in vector indices like HNSW. During online processing, Starmie retrieves candidate tables with similar contextualized column embeddings, then verifies their table-level unionability scores using column alignment algorithms.

During the offline stage, Starmie pre-trains a column representation model that encodes columns of data lake tables into dense high-dimensional vectors (i.e., column embeddings). Then, we apply the trained model to all data lake tables to obtain the column embeddings via model inference. We store the embedding vectors in efficient vector indices for online retrieval. A key challenge for the offline stage is to train high-quality column encoders that capture the semantics of tabular data. In Starmie, we follow a recent trend [11, 29, 45] of table representation learning that encodes tabular data using pre-trained language models (LMs). Pre-trained LMs have achieved state-of-the-art performance on table understanding tasks such as column type and relation type annotation [45]. However, the good performance of pre-trained LMs requires fine-tuning on high-quality labeled datasets, which are often not available in table search applications such as table union search. Using pre-trained LMs off-the-shelf is also problematic, as the column embeddings cannot capture (ir-)relevance between columns or the contextual information within tables. To this end, in Section 3, we propose a contrastive learning framework for learning high-dimensional column representations in a fully unsupervised manner. We combine the framework with a multi-column table model that captures column semantics from the column values while taking the table context into account. Then we apply the column encoder to all tables to convert each table into a collection of embedding vectors.
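To illustrate how the column embeddings can be stored for retrieval, the following sketch builds an HNSW index with the hnswlib library; the library choice and all parameter values (dim, M, ef_construction, ef) are our assumptions for illustration rather than settings prescribed by Starmie.

```python
import hnswlib
import numpy as np

dim = 768  # e.g., BERT-base embedding size (assumption)
# Stand-in for the column embeddings produced by the trained encoder.
embeddings = np.random.rand(10_000, dim).astype(np.float32)

# Offline: build an HNSW index over all data lake column embeddings.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(embeddings), M=16, ef_construction=200)
index.add_items(embeddings, ids=np.arange(len(embeddings)))

# Online: retrieve the top-k most similar data lake columns per query column.
index.set_ef(100)  # higher ef -> better recall, slower queries
query_col = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(query_col, k=10)
```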
During the online stage, given an input query table, we retrieve a set of candidate tables from the vector indices by searching for data lake column embeddings of high column-level similarity with the input columns. Starmie then applies a verification step for checking and ranking the candidates for the top-$k$ tables with the highest table-level unionability scores. The first challenge for the online stage is how to efficiently search for unionable columns. This is not a trivial task due to the massive size of data lakes. We address this challenge by allowing different design choices of state-of-the-art high-dimensional vector indices. Yet another challenge is designing a table unionability function that can effectively aggregate the column unionability scores. As in other studies, we employ weighted bipartite graph matching. To address its limitation of high computational complexity, we introduce a novel algorithm to reduce the number of expensive calls to the exact matching algorithm by deducing lower and upper bounds of the matching score (Section 4).
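As a concrete reference for the verification step, the sketch below computes the table unionability score by maximum-weight bipartite matching over pairwise cosine similarities, using scipy's linear_sum_assignment as the exact matcher; the lower/upper-bound pruning that Section 4 introduces to avoid most of these exact calls is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def table_unionability(S_emb, T_emb):
    """Aggregate column scores via maximum-weight bipartite matching.

    S_emb, T_emb: arrays of shape (num_columns, dim) holding the
    column embeddings of tables S and T.
    """
    # Edge weights: pairwise cosine similarities between columns.
    S = S_emb / np.linalg.norm(S_emb, axis=1, keepdims=True)
    T = T_emb / np.linalg.norm(T_emb, axis=1, keepdims=True)
    weights = S @ T.T
    # linear_sum_assignment minimizes total cost, so negate to maximize.
    rows, cols = linear_sum_assignment(-weights)
    return float(weights[rows, cols].sum())
```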
Figure 3: Contrastive learning with single-column input. (The figure shows two batches of serialized columns, $X$ and $Y$, e.g., "<s> 9/14/2009 12/14/2009 10/31/2009 ..."; two views $X_{\mathrm{ori}}$ and $X_{\mathrm{aug}}$ of the same batch are generated, e.g., via sampling, and encoded by $\mathcal{M}$ into $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$, which the loss connects, while both are separated from $\mathcal{M}(Y)$.)
3 LEARNING CONTEXTUALIZED COLUMN EMBEDDINGS

We now describe the offline stage for training high-quality column encoders. The encoder pre-processes tables into sequenced inputs and uses a pre-trained LM to encode each column into a high-dimensional vector. We first introduce background knowledge in Section 3.1. We describe a novel contrastive learning approach for table encoders in Section 3.2 and generalize it to multi-column encoders for contextualized embeddings in Section 3.3. Finally, we describe the table pre-processing approaches to generate the input for such learning processes in Section 3.4.
3.1 Background
Contrastive learning is a self-supervision approach that learns data representations in which similar data items are close while distinct data items are far apart. In Starmie, we adopt SimCLR [10], which was recently shown to be effective in vision and NLP applications. Figure 3 illustrates the high-level idea of the algorithm. The goal is to learn an encoder $\mathcal{M}$ (e.g., a column encoder) that takes a data item (e.g., a column) as input and encodes it into a high-dimensional vector. To train the encoder in a self-supervised manner without labels, SimCLR relies on (1) a data augmentation operator generating semantics-preserving views of the same data item (in our context, this means $X_{\mathrm{ori}}$ and $X_{\mathrm{aug}}$ that are unionable) and (2) a sampling method (e.g., uniform sampling from a large collection) that returns pairs of data items (i.e., $X$ and $Y$) that are distinct (meaning non-unionable) with high probability. SimCLR then applies a contrastive loss function that connects the representations of the semantics-preserving (unionable) views while separating those of the sampled distinct (non-unionable) items. Next, we illustrate how we apply the algorithm for training a single-column encoder.
3.2 Contrastive Learning Framework
The goal is to connect representations of the same or unionable columns in their representation space while separating representations of distinct columns. To achieve the first goal, Algorithm 1 leverages a data augmentation operator op (Line 5). Given a batch of columns $X = \{x_1, \ldots, x_N\}$, where $N$ is the batch size, op transforms $X$ into a semantics-preserving view $X_{\mathrm{aug}}$. We design the augmentation operator to be uniform sampling of the values from the original column. By doing so, we can generate diverse views of the same column while all views preserve the original semantic types.
Algorithm 1: SimCLR pre-training
Input: A collection $D$ of data lake columns
Variables: Number of training epochs n_epoch; data augmentation operator op; learning rate $\eta$
Output: An embedding model $\mathcal{M}$

1  Initialize $\mathcal{M}$ using a pre-trained LM;
2  for ep = 1 to n_epoch do
3      Randomly split $D$ into batches $\{B_1, \ldots, B_n\}$;
4      for $B \in \{B_1, \ldots, B_n\}$ do
           /* augment and encode every item */
5          $B_{\mathrm{ori}}, B_{\mathrm{aug}} \leftarrow$ augment($B$, op);
6          $\vec{Z}_{\mathrm{ori}}, \vec{Z}_{\mathrm{aug}} \leftarrow \mathcal{M}(B_{\mathrm{ori}}), \mathcal{M}(B_{\mathrm{aug}})$;
           /* Equations (1) and (2) */
7          $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{contrast}}(\vec{Z}_{\mathrm{ori}}, \vec{Z}_{\mathrm{aug}})$;
           /* Back-propagate to update $\mathcal{M}$ */
8          $\mathcal{M} \leftarrow$ back-propagate($\mathcal{M}, \eta, \partial\mathcal{L}/\partial\mathcal{M}$);
9  return $\mathcal{M}$;
Then $\mathcal{M}$ can encode the batches $X$ (also $X_{\mathrm{ori}}$, which is a copy of $X$ in the figure) and $X_{\mathrm{aug}}$ into column embedding vectors $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$, respectively. Note that $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ are both matrices of size $N$ times the dimension of the embedding vector (e.g., 768 for BERT).
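For concreteness, here is a minimal sketch of the augmentation operator op under our own assumptions (a column is a list of cell values, and the sampling ratio is a hypothetical hyper-parameter); uniform sampling yields a view that differs in its exact contents but preserves the column's semantic type.

```python
import random

def augment(column, ratio=0.5):
    # op: uniformly sample a subset of the column's values.
    # `ratio` is an illustrative choice, not a value from the paper.
    n = max(1, int(len(column) * ratio))
    return random.sample(column, n)

batch = [["London", "Ottawa", "Bristol", "Paris"],
         ["2019", "2018", "2020", "2017"]]
batch_ori = batch                         # original view X_ori
batch_aug = [augment(c) for c in batch]   # semantics-preserving view X_aug
```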
Next, the algorithm leverages a contrastive loss function to connect the semantics-preserving views of columns and separate representations of distinct columns (Line 6). More specifically, let $\vec{Z} = \{\vec{z}_i\}_{1 \le i \le 2N}$ be the concatenation of the two encoded views $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ of batch $X$ introduced above. Here $\vec{z}_i$ is the $i$-th element of $\vec{Z}_{\mathrm{ori}}$ for $i \le N$ and the $(i - N)$-th element of $\vec{Z}_{\mathrm{aug}}$ for $i > N$. We first define a single-pair loss $\ell(i, j)$ for an element pair $(\vec{z}_i, \vec{z}_j)$ as Equation 1:

$$\ell(i, j) = -\log \frac{\exp\left(\operatorname{sim}(\vec{z}_i, \vec{z}_j) / \tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \ne i,\, k \ne j]} \exp\left(\operatorname{sim}(\vec{z}_i, \vec{z}_k) / \tau\right)} \qquad (1)$$

where $\operatorname{sim}$ is a similarity function such as cosine and $\tau$ is a temperature hyper-parameter in the range $(0, 1]$. We fix $\tau$ to be 0.07 empirically. Intuitively, by minimizing this loss for a pair $(\vec{z}_i, \vec{z}_j)$ that are views of the same column, we (i) maximize the similarity score $\operatorname{sim}(\vec{z}_i, \vec{z}_j)$ in the numerator and (ii) minimize $\vec{z}_i$'s similarities with all the other elements in the denominator.
Next, we can obtain the contrastive loss by averaging over all matching pairs, as shown in Equation 2 (Line 7):

$$\mathcal{L}_{\mathrm{contrast}} = \frac{1}{2N} \sum_{k=1}^{N} \left[ \ell(k, k+N) + \ell(k+N, k) \right] \qquad (2)$$

where each term $\ell(k, k+N)$ and $\ell(k+N, k)$ refers to a pair of views generated from the same column.
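For reference, the following PyTorch sketch implements Equations 1 and 2, assuming z_ori and z_aug are the $N \times d$ embedding matrices $\vec{Z}_{\mathrm{ori}}$ and $\vec{Z}_{\mathrm{aug}}$ produced by the encoder; the mask excludes both $k = i$ and $k = j$ from the denominator, matching Equation 1.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_ori, z_aug, tau=0.07):
    # Concatenate the two views: z[i] pairs with z[i + N] and vice versa.
    N = z_ori.size(0)
    z = F.normalize(torch.cat([z_ori, z_aug], dim=0), dim=1)  # (2N, d)
    sim = z @ z.t() / tau              # cosine similarities scaled by tau
    pos = torch.cat([torch.arange(N) + N, torch.arange(N)])   # j for each i
    # Exclude k == i (diagonal) and k == j (positive) from the denominator.
    mask = torch.eye(2 * N, dtype=torch.bool)
    mask[torch.arange(2 * N), pos] = True
    numerator = sim[torch.arange(2 * N), pos]
    denominator = sim.masked_fill(mask, float("-inf")).logsumexp(dim=1)
    # Mean over all 2N single-pair losses equals Equation 2.
    return (denominator - numerator).mean()
```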
3.3 Multi-column Table Encoder
While the method shown in Algorithm 1 learns column representations based on values within a column itself, it cannot take the contextual information of a table into account. For example, the single-column model can understand that a column consisting of values "1997 1998 . . ." is a column about years, but depending on the context of other columns present in the same table, the same