Russian Web Tables: A Public Corpus of Web
Tables for Russian Language Based on Wikipedia*
Platon Fedorov¹ [0000-0001-8838-6849], Alexey Mironov¹ [0000-0003-1828-4606],
and George Chernishev¹,² [0000-0002-4265-9642]
¹ Unidata, Russia
² Saint-Petersburg State University, Russia
{platon.fedorov, alexey.mironov, georgii.chernyshev}@unidata-platform.ru
Abstract. Corpora that contain tabular data such as WebTables are
a vital resource for the academic community. Essentially, they are the
backbone of any modern research in information management. They are
used for various tasks of data extraction, knowledge base construction,
question answering, column semantic type detection, and many others.
Such corpora are useful not only as a source of data, but also as a base
for building test datasets. So far, there have been no such corpora for the
Russian language, and this has seriously hindered research in the
aforementioned areas.
In this paper, we present the first corpus of Web tables created specifi-
cally out of Russian language material. It was built via a special toolkit
we have developed to crawl the Russian Wikipedia. Both the corpus and
the toolkit are open-source and publicly available. Finally, we present a
short study that describes Russian Wikipedia tables and their statistics.
Keywords: Web Tables · Wikipedia · Corpus.
1 Introduction
Tabular data is very important for the information management community
since tables are a core data structure which both academics and practitioners
utilize in their work.
The first studies concerning tabular data appeared almost twenty years ago [12],
and since then there has been a constant demand for open tabular data.
Table-related studies needed to ensure repeatability, and this secured the
demand for table corpora, which researchers addressed using the tables
available on the Web.
Currently, Web tables are the building blocks of much modern research in
information management. Numerous studies have used table corpora for various
data management tasks [13]. For example, four major corpus papers [1,4,7,9]
have over a thousand citations combined, according to Google Scholar.
* This paper was accepted to the XXIV International Conference on Data Analytics
and Management in Data Intensive Domains (DAMDID’22).
arXiv:2210.06353v1 [cs.CL] 3 Oct 2022
There are several such corpora for the English language [13]: they are usually
based on pre-crawled data. However, there are no public dedicated table corpora
for the Russian language that we are aware of.
In this paper, we present the first corpus of Web tables (named RWT, Russian
WebTables) created specifically from Russian-language material. For this
purpose, we have created a special toolkit that processes Russian Wikipedia.
In designing this toolkit, we were driven by the following considerations:
1. Light-weightness. Existing toolkits for corpus creation usually require a cloud
environment just to deploy the initial data. We aim to support low-budget
research projects, and eventually we hope to make corpus creation accessible
even to individual students.
2. Full cycle. Existing toolkits rely on pre-crawled data. Our idea is to create a
full cycle system that will not depend on external data and will allow better
temporal granularity control.
3. Customizability. Our goal is to provide users with means of managing the
collection process, for example, by allowing them to filter tables that
contain no Latin characters.
The toolkit is implemented in Python, has a modern codebase, has a minimal set
of dependencies, and is equipped with a GUI.
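To illustrate the kind of processing such a toolkit performs, the following minimal Python sketch extracts tables from an HTML page and applies a user-supplied filter. It is not the RWT toolkit's actual code: the names TableExtractor, has_latin, and extract_tables are hypothetical, only the standard library is used, the markup is assumed to be well-formed with explicit closing tags, and nested tables are not handled.

```python
import re
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables
        self._table = None   # table currently being read (no nesting support)
        self._row = None     # row currently being read
        self._cell = None    # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


LATIN = re.compile(r"[A-Za-z]")


def has_latin(table):
    """Return True if any cell of the table contains a basic Latin letter."""
    return any(LATIN.search(cell) for row in table for cell in row)


def extract_tables(html, keep=lambda table: True):
    """Parse an HTML string and return the tables accepted by `keep`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table for table in parser.tables if keep(table)]


# Illustrative input only; the cell values are made up.
page = """
<table><tr><th>Название</th><th>Год</th></tr>
<tr><td>Пример</td><td>2020</td></tr></table>
<table><tr><td>Python</td><td>язык</td></tr></table>
"""
all_tables = extract_tables(page)
cyrillic_only = extract_tables(page, keep=lambda t: not has_latin(t))
```

Passing the filter as a predicate is one simple way to expose the kind of customizability described above: a user could swap in a different `keep` function without touching the extraction code.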
Moreover, in this paper we describe the collected corpus: first, we character-
ize it in terms of high-level metrics, then we outline statistics related to table
contents, and finally, we highlight interesting tables and pages.
Overall, the contributions of this paper are the following:
- The first corpus of Web tables for the Russian language, created from Russian
  Wikipedia.
- A configurable, light-weight, full cycle toolkit for creating such corpora,
  intended for low-budget projects.
- A study that describes tables of the Russian Wikipedia and presents their
  statistics.
Both the corpus³ and the toolkit⁴ are open-source and publicly available.
2 Background, Motivation, and Related Work
There are many types of Web tables and many ways to classify them [13]. One of
the most popular classifications is the following:
1. Layout: navigational and formatting. The former are used for navigating
within a website and the latter are used for formatting purposes.
2. Content: relational, entity, and matrix. Entity tables describe a single entity,
relational tables describe a set of entities and their attributes, and matrix-type
tables are simply three-dimensional datasets.
³ https://gitlab.com/unidata-labs/ru-wiki-tables-dataset
⁴ https://gitlab.com/unidata-labs/ru-wiki-tables-backend https://gitlab.com/unidata-labs/ru-wiki-tables-frontend
Relational tables are of the greatest interest since they contain usable data.
However, only a small subset of tables is relational.
Web tables are extensively used for many tasks, namely [9,13]:
1. Table type identification, for example, according to the aforementioned clas-
sification.
2. Table interpretation, which is one of the following: 1) column type identifi-
cation, 2) entity linking, and 3) relation extraction.
3. Table search, which can be either keyword-based or table-based search.
Recently, there has been a surge of interest in this task, related to dataset
exploration problems [5,11].
4. Knowledge base augmentation and construction.
5. Table augmentation, which can be done by adding rows, by adding columns,
or by data completion.
All these tasks require corpora to ensure repeatability of approaches, methods,
and algorithms. Moreover, with the popularization of machine learning
approaches, tables as training data are in demand too. Therefore, the last decade
saw a boom in table corpus creation.
Unfortunately, this boom has bypassed the Russian database community,
and, consequently, the Russian language: all existing corpora either concern the
English language exclusively or ignore the language issue altogether. Therefore,
we have decided to create the RWT corpus, a collection of Web tables containing
Russian-language material. As the starting point, we have selected the Russian
part of Wikipedia.
To crawl and process it, we have developed a special toolkit, the RWT
toolkit. The following considerations were taken into account during its creation:
light-weightness, full cycle support, and customizability.
Existing corpora, such as the WDC Web Table Corpus [9] or the Dresden
Web Table Corpus [7], are usually created from the Common Crawl data⁵. Common
Crawl is a huge repository of web pages that are crawled monthly and made
available to the general public in a compressed form.
For example, the latest version (June/July 2021) contains more than 100TB
of compressed data. It would be difficult even to unpack such an amount of data
on a single machine.
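To make this scale concrete, a back-of-the-envelope estimate of the download time alone can be sketched as follows. The 1 Gbit/s sustained bandwidth is our own assumption for illustration, not a figure from Common Crawl; the helper name transfer_days is hypothetical; terabytes are taken as decimal (10^12 bytes); and decompression and storage costs are ignored.

```python
def transfer_days(size_tb: float, bandwidth_gbit_s: float) -> float:
    """Days needed to move size_tb decimal terabytes over a bandwidth_gbit_s link."""
    size_bits = size_tb * 1e12 * 8                    # terabytes -> bytes -> bits
    seconds = size_bits / (bandwidth_gbit_s * 1e9)    # bits / (bits per second)
    return seconds / 86_400                           # seconds -> days


# 100 TB over an assumed, fully saturated 1 Gbit/s link.
days = transfer_days(100, 1.0)
print(f"{days:.1f} days")
```

Under these assumptions, the transfer takes a little over nine days before any unpacking or table extraction begins.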
Therefore, relying on the Common Crawl dataset would require a multi-machine
environment. For example, in the case of the WDC Web Table Corpus, the
Amazon EC2 cloud services are needed⁶ to perform table extraction (and the
setup is built into the source code). As a result, it is prohibitively expensive for
low-budget projects or individual students to use the Common Crawl dataset.
⁵ http://commoncrawl.org/
⁶ http://webdatacommons.org/framework/