Dimensional Data KNN-based Imputation*
Yuzhao Yang1 [0000-0002-6552-4812], Jérôme Darmont2 [0000-0003-1491-384X],
Franck Ravat1 [0000-0003-4820-841X], and Olivier Teste1 [0000-0003-0338-9886]
1 IRIT-CNRS (UMR 5505), Université de Toulouse, France
{Yuzhao.Yang, Franck.Ravat, Olivier.Teste}@irit.fr
2 Université de Lyon, Lyon 2, UR ERIC, France
jerome.darmont@univ-lyon2.fr
Abstract. Data Warehouses (DWs) are core components of Business
Intelligence (BI). Missing data in DWs have a great impact on data
analyses. Therefore, missing data need to be completed. Unlike other
existing data imputation methods mainly adapted for facts, we propose a
new imputation method for dimensions. This method contains two steps:
1) a hierarchical imputation and 2) a k-nearest neighbors (KNN) based
imputation. Our solution has the advantage of taking into account the
DW structure and dependency constraints. Experimental assessments
validate our method in terms of effectiveness and efficiency.
Keywords: Data Imputation · Data Warehouses · Dimensions · KNN
1 Introduction
Data warehouses (DWs) are widely used in companies and organizations as a
significant Business Intelligence (BI) tool to help them build their decision
support systems. Data in DWs are usually modelled in a multidimensional way,
which allows the user to analyse data through On-Line Analytical Processing
(OLAP). An OLAP model organizes data according to analysis subjects (facts)
associated with analysis axes (dimensions). Each fact is composed of measures.
Each dimension contains one or several analysis viewpoints (hierarchies).
Missing data may exist in a DW. There are two types of DW missing data:
dimensional missing data, which are located in the dimensions, and factual
missing data, which are located in the facts. Both affect OLAP analyses, so
completing them is important for reliable data analysis.
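To make this impact concrete, the following minimal sketch rolls up a toy measure along a toy dimension. The table contents, the attribute names (`city`, `country`) and the `roll_up` helper are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Toy dimension instances: id -> attribute values (None marks a missing value).
stores = {
    1: {"city": "Toulouse", "country": "France"},
    2: {"city": "Lyon", "country": "France"},
    3: {"city": "Toulouse", "country": None},  # dimensional missing data
}

# Toy fact rows: (store id, sales amount measure).
sales = [(1, 100.0), (2, 50.0), (3, 70.0)]


def roll_up(facts, dimension, level):
    """Sum the measure per value of the given hierarchy level."""
    totals = defaultdict(float)
    for dim_id, amount in facts:
        totals[dimension[dim_id][level]] += amount
    return dict(totals)


by_country = roll_up(sales, stores, "country")
```

In this toy case the amount attached to store 3 is aggregated under an unknown country, so the total reported for France is understated until the missing value is completed.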
Data imputation is the process of replacing missing values by plausible
values based on information available in the data [12]. Current DW data
imputation research mainly focuses on factual data [25,21,4]. Yet dimensional
missing data make aggregated data incomplete and make them hard to analyse
with respect to hierarchy levels. Therefore, imputation for DW dimensions is
also necessary. However, a DW dimension has a complex structure containing
different hierarchies with different granularity levels that are linked by
dependency relationships. When we complete dimensional missing data, we have
to take the DW structure and the dependency constraints into account. We
previously proposed a hierarchical imputation based on inter- and
intra-dimensional hierarchical dependency relationships [27] for the
imputation of dimensional missing data. To the best of our knowledge, there is
no other specific data imputation method for DW dimensions. The hierarchical
imputation is convincing because it uses accurate data based on real
functional dependency relationships. However, this method is limited by the
sparsity problem: for an instance to be completed, there may be no other
instance sharing the same value on a lower-granularity level of the hierarchy.
* This work is supported by the French National Research Agency (ANR), Project
ANR-19-CE23-0005 BI4people (Business intelligence for the people).
arXiv:2210.02237v1 [cs.DB] 5 Oct 2022
In order to complete as many values as possible, in this paper we propose
H-OLAPKNN, an imputation method for DW dimensions that extends the
hierarchical imputation with a novel dimension imputation method called
OLAPKNN. OLAPKNN is based on the K-nearest neighbors (KNN) algorithm. KNN
imputation finds the K nearest neighbors of an instance with missing data, then
fills in the missing data based on the mean or mode of the neighbors' values [23].
We choose KNN because it is a non-parametric, instance-based algorithm
that is widely applied for data imputation [3] and has been shown to have
relatively high accuracy [2,23]. Compared to basic KNN imputation, OLAPKNN
takes into account the structural complexity and the dependency constraints of
the dimension hierarchies. Moreover, dimensional data are usually qualitative,
so we focus on qualitative data in this paper.
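As a point of reference, basic KNN imputation for qualitative data can be sketched as follows. This is not OLAPKNN: the Hamming-style distance, the treatment of missing values as mismatches, and the names `hamming` and `knn_impute_mode` are assumptions made only for this illustration.

```python
from collections import Counter


def hamming(a, b, attrs):
    """Count the attributes on which two instances differ.

    Attributes missing (None) in either instance are counted as differing;
    this is a simplifying assumption of the sketch.
    """
    return sum(1 for x in attrs if a[x] is None or b[x] is None or a[x] != b[x])


def knn_impute_mode(instances, target, attr, k):
    """Return the mode of `attr` among the k nearest donors of `target`."""
    other_attrs = [x for x in target if x != attr]
    # Only instances that actually have a value for `attr` can be donors.
    donors = [i for i in instances if i is not target and i[attr] is not None]
    donors.sort(key=lambda i: hamming(target, i, other_attrs))
    votes = Counter(i[attr] for i in donors[:k])
    return votes.most_common(1)[0][0] if votes else None


rows = [
    {"city": "Toulouse", "region": "Occitanie", "type": "urban"},
    {"city": "Toulouse", "region": "Occitanie", "type": "urban"},
    {"city": "Lyon", "region": "Auvergne-Rhone-Alpes", "type": "urban"},
    {"city": "Toulouse", "region": None, "type": "urban"},
]
imputed = knn_impute_mode(rows, rows[3], "region", k=2)
```

Such a flat distance ignores which attributes are hierarchy parameters and how they depend on each other, which is the limitation of basic KNN imputation that the text above points out.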
The remainder of this paper is organized as follows. In Section 2, we review
related work on data imputation algorithms. In Section 3, we formalize the DW
dimension model. In Section 4, we propose a distance calculation method for
dimension instances. In Section 5, we explain our proposed dimension
imputation algorithm in detail. In Section 6, we validate our proposal
experimentally. In Section 7, we conclude this paper and hint at future research.
2 Related Work
There are various data imputation methods [16]: statistics-based imputation,
machine-learning-based imputation, rule-based imputation, external-source-based
imputation, and hybrid methods. Statistics-based imputation completes
missing values by applying statistical methods such as filling in the average,
the most frequent value, or the value of the most similar record; there are
also methods using regression to predict the missing values [19].
Machine-learning-based imputation methods use algorithms such as k-nearest
neighbors (KNN) [2,23,10,17], regression models [13], or Naive Bayes [9] to
predict the missing values. Rule-based imputation methods [8,22,5] complete
missing values using business rules, similarity rules, or dependency rules.
Among external-source-based methods, crowdsourcing [14] can be applied to data
imputation by submitting queries to crowdsourcing frameworks and collecting
answers to complete the missing data. Other methods perform the imputation
using web information [29,26] such as web pages, web lists, and web tables.
Finally, hybrid methods mix different imputation methods to achieve higher
performance.
Statistics-based and machine-learning-based methods mainly focus on numerical
data, which suits the imputation of facts, where the data are mostly
numerical. However, dimensions mainly contain qualitative data, which makes
imputation with such methods difficult. Rule-based and external-source-based
imputation methods may be suitable for the imputation of dimensions, but they
require time and effort to create rules or find appropriate sources. Hence we
propose H-OLAPKNN, which combines the hierarchical imputation with a KNN-based
imputation method.
3 DW Dimension
As a DW is composed of dimensions and facts and we focus on the dimension
imputation, we introduce the DW dimension concepts used in this paper [20].
Definition 1 (Dimension). In a data warehouse, a dimension, denoted by
$D$, is defined as $(A^D, H^D, I^D)$. $A^D = \{a_1, ..., a_u\} \cup \{id\}$ is a
set of attributes, where $id$ represents the dimension's identifier;
$H^D = \{H_1, ..., H_v\}$ is a set of hierarchies; $I^D$ is a matrix of dimension
instances: for a given row $r$, the row instance vector is denoted as $i_r$; for
a given row $r$ and a given attribute $a_u$, their joint instance value is
denoted as $i_{r,a_u}$.
Definition 2 (Hierarchy). A hierarchy of dimension $D$, denoted by
$H \in H^D$, is defined as $(Param^H, Weak^H)$.
$Param^H = \langle id^D, p^H_2, ..., p^H_v \rangle$ is an ordered set of
dimension attributes, called parameters, which set granularity levels along
the dimensions, $\forall k \in [1...v], p^H_k \in A^D$. Parameter $p^H_1$ rolls
up to $p^H_2$ in $H$ is denoted as $p^H_1 \preceq_H p^H_2$;
$Weak^H = Param^H \rightarrow 2^{(A^D - Param^H)}$ is a mapping possibly
associating each parameter with one or several weak attributes, which are also
dimension attributes providing additional information. All parameters and weak
attributes of $H$ constitute the hierarchy attributes of $H$, denoted by
$A^H = Param^H \cup (\bigcup_{p^H_v \in Param^H} Weak^H[p^H_v])$.
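The two definitions can be mirrored by a small data structure. The sketch below is one possible encoding, given under assumptions: the class and field names, the use of Python dicts for instance rows, and the example attributes are all hypothetical and are not prescribed by the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    # Param^H: ordered parameters, starting with the dimension identifier.
    params: list
    # Weak^H: maps a parameter to its weak attributes (possibly none).
    weak: dict = field(default_factory=dict)

    def attributes(self):
        """A^H: all parameters plus every weak attribute attached to them."""
        attrs = set(self.params)
        for p in self.params:
            attrs |= set(self.weak.get(p, ()))
        return attrs


@dataclass
class Dimension:
    attributes: set    # A^D, including the identifier
    hierarchies: list  # H^D
    instances: list    # I^D, one dict per row: attribute -> value


geo = Hierarchy(params=["id", "city", "country"], weak={"city": ["population"]})
store = Dimension(
    attributes={"id", "city", "country", "population", "segment"},
    hierarchies=[geo],
    instances=[{"id": 1, "city": "Toulouse", "country": "France",
                "population": None, "segment": "retail"}],
)
geo_attrs = geo.attributes()
```

Here `geo_attrs` contains the three parameters and the weak attribute `population`, and it is a subset of the dimension's attribute set, as the definition requires.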
There exist different types of hierarchies, but the most basic and common
one is the strict hierarchy [15], where a value at a lower-granularity level
of a hierarchy belongs to only one higher-granularity value [24]. Thus, in
this paper, we only consider the case of the strict hierarchy.
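Strictness is what makes the hierarchical imputation mentioned in Section 1 possible: if a lower-level value determines a single higher-level value, a missing higher-level value can be copied from any instance sharing the same lower-level value. The sketch below illustrates this for one pair of levels only; the function names and toy rows are hypothetical, and it does not cover the inter-dimensional dependencies used in [27].

```python
def build_roll_up_map(instances, lower, upper):
    """Map each known lower-level value to its higher-level value.

    Raises ValueError if a lower-level value maps to two different
    higher-level values, i.e. if the data violate strictness.
    """
    mapping = {}
    for row in instances:
        lo, hi = row.get(lower), row.get(upper)
        if lo is None or hi is None:
            continue
        if mapping.setdefault(lo, hi) != hi:
            raise ValueError(f"{lower}={lo!r} maps to several {upper} values")
    return mapping


def hierarchical_impute(instances, lower, upper):
    """Fill missing `upper` values in place; return how many were filled."""
    mapping = build_roll_up_map(instances, lower, upper)
    filled = 0
    for row in instances:
        if row.get(upper) is None and row.get(lower) in mapping:
            row[upper] = mapping[row[lower]]
            filled += 1
    return filled


cities = [
    {"city": "Toulouse", "country": "France"},
    {"city": "Toulouse", "country": None},
    {"city": "Nice", "country": None},  # no donor shares this city
]
filled_count = hierarchical_impute(cities, "city", "country")
```

In this toy data the second row is completed from the first, while the third row stays incomplete because no other instance shares its city, which is an instance of the sparsity problem described in Section 1.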
4 Distance Between Dimension Instances
Since KNN imputation selects the k nearest neighbors of the instance with
missing data, we need to calculate the distance between the dimension
instances containing missing data to be completed and the other instances. In a