Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia*

Platon Fedorov¹ [0000-0001-8838-6849], Alexey Mironov¹ [0000-0003-1828-4606], and George Chernishev¹,² [0000-0002-4265-9642]

¹ Unidata, Russia

² Saint-Petersburg State University, Russia

{platon.fedorov, alexey.mironov, georgii.chernyshev}@unidata-platform.ru

Abstract. Corpora that contain tabular data, such as WebTables, are a vital resource for the academic community. Essentially, they are the backbone of any modern research in information management. They are used for various tasks, including data extraction, knowledge base construction, question answering, column semantic type detection, and many others. Such corpora are useful not only as a source of data, but also as a base for building test datasets. So far, there have been no such corpora for the Russian language, and this has seriously hindered research in the aforementioned areas.

In this paper, we present the first corpus of Web tables created specifically out of Russian language material. It was built via a special toolkit we have developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia tables and their statistics.

Keywords: Web Tables · Wikipedia · Corpus.

1 Introduction

Tabular data is very important for the information management community, since tables are a core data structure that both academics and practitioners use in their work.

The first studies concerning tabular data appeared almost twenty years ago [12], and since then there has been a constant demand for open tabular data. Table-related studies needed to ensure repeatability, and this secured the demand for table corpora, which researchers addressed using the tables available on the Web.

Currently, Web tables are the building blocks of much modern research in information management. Numerous studies have used table corpora for various data management tasks [13]. For example, four major corpus papers [1,4,7,9] have over a thousand citations combined, according to Google Scholar.

* This paper was accepted to the XXIV International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID’22).

arXiv:2210.06353v1 [cs.CL] 3 Oct 2022

There are several such corpora for the English language [13]; they are usually based on pre-crawled data. However, we are not aware of any public dedicated table corpora for the Russian language.

In this paper, we present the first corpus of Web tables (named RWT, Russian WebTables) created specifically out of Russian language material. For this purpose, we have created a special toolkit that processes the Russian Wikipedia. In designing this toolkit, we were driven by the following considerations:

1. Light-weightness. Existing toolkits for corpus creation usually require a cloud environment just to deploy the initial data. We aim to cater to low-budget research projects, and eventually we hope to make corpus creation accessible even to individual students.

2. Full cycle. Existing toolkits rely on pre-crawled data. Our idea is to create a full-cycle system that does not depend on external data and allows better control over temporal granularity.

3. Customizability. Our goal is to provide users with the means of managing the collection process, for example, allowing them to filter tables that contain no Latin characters.

The toolkit is implemented in Python, has a modern codebase, relies on a minimal set of dependencies, and is equipped with a GUI.
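As an illustration of the customizability consideration, a user-defined filter of the kind mentioned above could look like the following minimal Python sketch. It is not taken from the RWT toolkit: the table representation (a list of rows of cell strings) and the names `has_latin`, `table_has_latin`, and `split_by_latin` are assumptions made for this example only.

```python
import unicodedata
from typing import List, Tuple

# Hypothetical table representation: a list of rows, each a list of cell strings.
Table = List[List[str]]


def has_latin(text: str) -> bool:
    """Return True if the text contains at least one Latin-script letter."""
    for ch in text:
        # Unicode character names of Latin letters start with "LATIN ...".
        if ch.isalpha() and "LATIN" in unicodedata.name(ch, ""):
            return True
    return False


def table_has_latin(table: Table) -> bool:
    """Return True if any cell of the table contains a Latin letter."""
    return any(has_latin(cell) for row in table for cell in row)


def split_by_latin(tables: List[Table]) -> Tuple[List[Table], List[Table]]:
    """Partition tables into (with Latin letters, without Latin letters)."""
    with_latin = [t for t in tables if table_has_latin(t)]
    without_latin = [t for t in tables if not table_has_latin(t)]
    return with_latin, without_latin
```

Returning both partitions leaves the decision of which group to keep to the user, which seems in the spirit of a configurable collection process.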

Moreover, in this paper we describe the collected corpus: first, we characterize it in terms of high-level metrics, then we outline statistics related to table contents, and finally, we highlight interesting tables and pages.

Overall, the contributions of this paper are the following:

– The first corpus of web tables for the Russian language, created from the Russian Wikipedia.

– A configurable, light-weight, full-cycle toolkit for creating such corpora, intended for low-budget projects.

– A study that describes tables of the Russian Wikipedia and presents their statistics.

Both the corpus³ and the toolkit⁴ are open-source and publicly available.

2 Background, Motivation, and Related Work

There are many types of Web tables and many ways to classify them [13]. One of the most popular classifications is the following:

1. Layout: navigational and formatting. The former are used for navigating within a website and the latter are used for formatting purposes.

³ https://gitlab.com/unidata-labs/ru-wiki-tables-dataset

⁴ https://gitlab.com/unidata-labs/ru-wiki-tables-backend, https://gitlab.com/unidata-labs/ru-wiki-tables-frontend

2. Content: relational, entity, and matrix. Entity tables describe a single entity, relational tables describe a set of entities and their attributes, and matrix-type tables are simply three-dimensional datasets.

Relational tables are of the greatest interest since they contain usable data. However, only a small subset of tables is relational.
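The classification above can be written down as a small data structure, which may be convenient when labelling tables for a type identification task. The following Python sketch only encodes the taxonomy as stated in the text; the names `TableCategory`, `TableType`, and `category_of` are hypothetical and are not part of the RWT toolkit.

```python
from enum import Enum


class TableCategory(Enum):
    """Top-level split of the taxonomy."""
    LAYOUT = "layout"
    CONTENT = "content"


class TableType(Enum):
    """Leaf types of the taxonomy."""
    NAVIGATIONAL = "navigational"  # used for navigating within a website
    FORMATTING = "formatting"      # used for formatting purposes
    RELATIONAL = "relational"      # a set of entities and their attributes
    ENTITY = "entity"              # a single entity
    MATRIX = "matrix"              # three-dimensional datasets


# Mapping from each leaf type to its top-level category.
_CATEGORY = {
    TableType.NAVIGATIONAL: TableCategory.LAYOUT,
    TableType.FORMATTING: TableCategory.LAYOUT,
    TableType.RELATIONAL: TableCategory.CONTENT,
    TableType.ENTITY: TableCategory.CONTENT,
    TableType.MATRIX: TableCategory.CONTENT,
}


def category_of(table_type: TableType) -> TableCategory:
    """Return the top-level category a table type belongs to."""
    return _CATEGORY[table_type]
```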

Web tables are extensively used for many tasks, namely [9,13]:

1. Table type identification, for example, according to the aforementioned classification.

2. Table interpretation, which covers 1) column type identification, 2) entity linking, and 3) relation extraction.

3. Table search, which can be either keyword-based or table-based. Recently, there has been a surge of interest in this task, related to dataset exploration problems [5,11].

4. Knowledge base augmentation and construction.

5. Table augmentation, which can be done by extending a table with rows or columns, or by completing missing data.

All these tasks require corpora to ensure the repeatability of approaches, methods, and algorithms. Moreover, with the popularization of machine learning approaches, tables are in demand as training data too. Therefore, the last decade has seen a boom in table corpus creation.

Unfortunately, this boom has bypassed the Russian database community and, consequently, the Russian language: all existing corpora either concern the English language exclusively or ignore the language issue altogether. Therefore, we have decided to create the RWT corpus, a collection of Web tables containing Russian language material. As the starting point, we have selected the Russian part of Wikipedia.

To crawl and process it, we have developed a special toolkit, the RWT toolkit. The following considerations were taken into account during its creation: light-weightness, full-cycle support, and customizability.

Existing corpora, such as the WDC Web Table Corpus [9] or the Dresden Web Table Corpus [7], are usually created from the Common Crawl data⁵. Common Crawl is a huge repository of web pages that are crawled monthly and made available to the general public in a compressed form.

For example, the latest version (June/July 2021) contains more than 100 TB of compressed data. Even unpacking such an amount of data on a single machine would be difficult.
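To give a rough sense of this scale, the following back-of-the-envelope Python sketch estimates how long it would take merely to transfer 100 TB over a single network link. The link speeds are assumptions chosen for illustration, decimal terabytes are assumed, and the costs of decompression and storage are ignored entirely.

```python
def transfer_days(size_tb: float, link_gbit_s: float) -> float:
    """Days needed to move size_tb terabytes over a link of link_gbit_s Gbit/s."""
    size_bits = size_tb * 1e12 * 8             # decimal terabytes -> bits
    seconds = size_bits / (link_gbit_s * 1e9)  # Gbit/s -> bit/s
    return seconds / 86_400                    # seconds per day


# Assumed sustained link speeds in Gbit/s (illustrative values only).
for speed in (0.1, 1.0, 10.0):
    print(f"{speed:>4} Gbit/s: {transfer_days(100, speed):.1f} days")
```

Under these assumptions, a sustained 1 Gbit/s link needs roughly nine days for the download alone, and since the text gives 100 TB as a lower bound, the actual figure would be higher.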

Therefore, relying on the Common Crawl dataset would require a multi-machine environment. For example, in the case of the WDC Web Table Corpus, the Amazon EC2 cloud services are needed⁶ to perform table extraction (and the setup is built into the source code). As a result, using the Common Crawl dataset is prohibitively expensive for low-budget projects or individual students.

⁵ http://commoncrawl.org/

⁶ http://webdatacommons.org/framework/
