Dimensional Data KNN-based Imputation*

Yuzhao Yang1 [0000-0002-6552-4812], Jérôme Darmont2 [0000-0003-1491-384X],
Franck Ravat1 [0000-0003-4820-841X], and Olivier Teste1 [0000-0003-0338-9886]

1 IRIT-CNRS (UMR 5505), Université de Toulouse, France
{Yuzhao.Yang, Franck.Ravat, Olivier.Teste}@irit.fr
2 Université de Lyon, Lyon 2, UR ERIC, France
jerome.darmont@univ-lyon2.fr

Abstract. Data Warehouses (DWs) are core components of Business Intelligence (BI). Missing data in DWs have a great impact on data analyses and therefore need to be completed. Unlike existing data imputation methods, which are mainly adapted for facts, we propose a new imputation method for dimensions. This method contains two steps: 1) a hierarchical imputation and 2) a k-nearest neighbors (KNN) based imputation. Our solution has the advantage of taking into account the DW structure and dependency constraints. Experimental assessments validate our method in terms of effectiveness and efficiency.

Keywords: Data Imputation · Data Warehouses · Dimensions · KNN

1 Introduction

Data warehouses (DWs) are widely used in companies and organizations as a significant Business Intelligence (BI) tool to help them build their decision support systems. Data in DWs are usually modelled in a multidimensional way, which allows the user to analyse data through On-Line Analytical Processing (OLAP). An OLAP model organizes data according to analysis subjects (facts) associated with analysis axes (dimensions). Each fact is composed of measures. Each dimension contains one or several analysis viewpoints (hierarchies).

Missing data may exist in a DW. There are two types of DW missing data: dimensional missing data, which are missing data in the dimensions, and factual missing data, which are missing data in the facts. These missing data have an impact on OLAP analyses, so it is important to complete them for the sake of better data analysis.
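To illustrate how a dimensional missing value affects an OLAP analysis, the following minimal sketch rolls a measure up to the country level. The table contents, the attribute names (`city`, `country`), the measure `amount`, and the helper `roll_up` are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical customer dimension: identifier -> attribute values.
# None marks a dimensional missing value.
customers = {
    1: {"city": "Toulouse", "country": "France"},
    2: {"city": "Lyon", "country": None},
    3: {"city": "Lyon", "country": "France"},
    4: {"city": "Berlin", "country": "Germany"},
}

# Hypothetical fact rows: (customer identifier, measure "amount").
sales = [(1, 100.0), (2, 50.0), (3, 70.0), (4, 30.0)]


def roll_up(facts, dimension, level):
    """Sum the measure per value of the given dimension attribute."""
    totals = defaultdict(float)
    for dim_id, amount in facts:
        totals[dimension[dim_id][level]] += amount
    return dict(totals)


by_country = roll_up(sales, customers, "country")
print(by_country)
```

In this toy data, the amount sold to customer 2 falls into a `None` group, so the total reported for France (170.0) is understated until the missing country is completed.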

* This work is supported by the French National Research Agency (ANR), Project ANR-19-CE23-0005 BI4people (Business intelligence for the people).

arXiv:2210.02237v1 [cs.DB] 5 Oct 2022

Data imputation is the process of replacing missing values by plausible values based on information available in the data [12]. Current DW data imputation research mainly focuses on factual data [25,21,4]. Yet dimensional missing data make aggregated data incomplete and make it hard to analyse them with respect to hierarchy levels. Therefore, imputation for DW dimensions is also necessary. However, a DW dimension has a complex structure containing different hierarchies with different granularity levels linked by dependency relationships. When we complete dimensional missing data, we have to take the DW structure and the dependency constraints into account. We previously proposed a hierarchical imputation based on the inter- and intra-dimensional hierarchical dependency relationships [27] for the imputation of dimensional missing data. To the best of our knowledge, there is no other specific data imputation method for DW dimensions. The hierarchical imputation is convincing because it uses accurate data based on real functional dependency relationships. However, this method is limited by the sparsity problem: for an instance to be completed, there may be no other instance sharing the same value on a lower-granularity level of the hierarchy.
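The idea of hierarchical imputation and its sparsity limitation can be sketched as follows. This is a simplified illustration restricted to two levels of a single hierarchy, not the method of [27]; the function `hierarchical_impute` and the attribute names are assumptions made for the example.

```python
def hierarchical_impute(rows, lower, higher):
    """Fill missing `higher` values from rows sharing the same `lower` value.

    Relies on the functional dependency lower -> higher.
    Returns the indices of the rows that could not be completed.
    """
    # Learn the mapping lower value -> higher value from complete rows.
    known = {}
    for row in rows:
        if row[lower] is not None and row[higher] is not None:
            known[row[lower]] = row[higher]

    unresolved = []
    for index, row in enumerate(rows):
        if row[higher] is None:
            if row[lower] in known:
                row[higher] = known[row[lower]]
            else:
                unresolved.append(index)  # sparsity: no matching instance
    return unresolved


rows = [
    {"city": "Lyon", "country": "France"},
    {"city": "Lyon", "country": None},
    {"city": "Nice", "country": None},
]
unresolved = hierarchical_impute(rows, "city", "country")
print(rows, unresolved)
```

The second row is completed from the first one because both share the city value, whereas the third row stays incomplete: no other instance has the same city, which is the sparsity problem described above.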

In order to complete as many values as possible, in this paper we propose H-OLAPKNN, an imputation method for DW dimensions that extends the hierarchical imputation with a novel dimension imputation method called OLAPKNN. OLAPKNN is based on the K-nearest neighbors (KNN) algorithm. KNN imputation finds the K nearest neighbors of an instance with missing data and then fills in the missing data based on the mean or mode of the neighbors' values [23]. We choose KNN because it is a non-parametric, instance-based algorithm that is widely applied for data imputation [3] and has been shown to have relatively high accuracy [2,23]. Compared to basic KNN imputation, OLAPKNN considers the structural complexity and the dependency constraints of the dimension hierarchies. Moreover, dimensional data are usually qualitative, and qualitative data are the focus of this paper.
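For reference, basic KNN imputation on qualitative data can be sketched as below. This is the plain KNN baseline, not OLAPKNN: the distance used here (share of mismatching attributes) is an assumption chosen for simplicity and ignores hierarchies and dependency constraints, and all names and data are hypothetical.

```python
from collections import Counter


def mismatch_distance(a, b, attributes):
    """Share of compared attributes on which a and b differ (both observed)."""
    compared = [x for x in attributes if a[x] is not None and b[x] is not None]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(a[x] != b[x] for x in compared) / len(compared)


def knn_impute(rows, target, attribute, k=3):
    """Return the mode of `attribute` among the k nearest donor rows."""
    features = [x for x in target if x != attribute]
    donors = [r for r in rows if r is not target and r[attribute] is not None]
    donors.sort(key=lambda r: mismatch_distance(target, r, features))
    votes = Counter(r[attribute] for r in donors[:k])
    return votes.most_common(1)[0][0] if votes else None


rows = [
    {"brand": "A", "type": "phone", "category": "electronics"},
    {"brand": "A", "type": "phone", "category": "electronics"},
    {"brand": "B", "type": "phone", "category": "electronics"},
    {"brand": "C", "type": "sofa", "category": "furniture"},
    {"brand": "A", "type": "phone", "category": None},
]
imputed = knn_impute(rows, rows[4], "category", k=3)
print(imputed)
```

Because the attribute is qualitative, the mode of the neighbors' values is used rather than the mean.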

The remainder of this paper is organized as follows. In Section 2, we review related work on data imputation algorithms. In Section 3, we formalize the DW dimension model. In Section 4, we propose a distance calculation method for dimension instances. In Section 5, we explain our proposed dimension imputation algorithm in detail. In Section 6, we validate our proposal through experiments. In Section 7, we conclude this paper and hint at future research.

2 Related Work

There are various data imputation methods [16]: statistic-based imputation, machine-learning-based imputation, rule-based imputation, external-source-based imputation, and hybrid methods. Statistic-based imputation completes missing values by applying statistical methods such as filling in the average, the most frequent value, or the value of the most similar record; there are also methods that use regression to predict the missing values [19]. Machine-learning-based imputation methods use algorithms such as k-nearest neighbors (KNN) [2,23,10,17], regression models [13], or Naive Bayes [9] to predict the missing values. Rule-based imputation methods [8,22,5] complete the missing values using business rules, similarity rules, or dependency rules. Concerning external-source-based methods, crowdsourcing [14] can be applied to data imputation by submitting queries to crowdsourcing frameworks and collecting the answers to complete the missing data. There are also methods that perform imputation using web information [29,26] such as web pages, web lists, and web tables. Finally, hybrid methods mix different imputation methods to achieve higher performance.
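As a small illustration of the simplest statistic-based strategies mentioned above, the sketch below fills a numerical column with its average and a qualitative column with its most frequent value. The helper `fill_missing` and the sample values are hypothetical.

```python
from statistics import mean, mode


def fill_missing(values, strategy):
    """Replace None by the mean or the mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]


amounts = fill_missing([10.0, None, 30.0], "mean")
cities = fill_missing(["Lyon", None, "Lyon", "Nice"], "mode")
print(amounts, cities)
```

The mean is only defined for numerical values, so for qualitative attributes only the most-frequent-value strategy applies.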

Statistic- and machine-learning-based methods mainly focus on numerical data, which suits the imputation of facts, where the data are mostly numerical. In dimensions, however, data are mainly qualitative, which makes imputation with such methods difficult. Rule-based and external-source-based imputation methods may be suitable for the imputation of dimensions, but they require time and effort to create rules or find appropriate sources. Hence, we propose H-OLAPKNN, which combines the hierarchical imputation with a KNN-based imputation method.

3 DW Dimension

As a DW is composed of dimensions and facts, and we focus on dimension imputation, we introduce the DW dimension concepts used in this paper [20].

Definition 1 (Dimension). In a data warehouse, a dimension, denoted by $D$, is defined as $(A^D, H^D, I^D)$. $A^D = \{a_1, ..., a_u\} \cup \{id\}$ is a set of attributes, where $id$ represents the dimension's identifier; $H^D = \{H_1, ..., H_v\}$ is a set of hierarchies; $I^D$ is a matrix of dimension instances: for a given row $r$, the row instance vector is denoted as $i_r$; for a given attribute $a_u$, their joint instance value is denoted as $i_{r,a_u}$.

Definition 2 (Hierarchy). A hierarchy of dimension $D$, denoted by $H \in H^D$, is defined as $(Param^H, Weak^H)$. $Param^H = \langle id^D, p^H_2, ..., p^H_v \rangle$ is an ordered set of dimension attributes, called parameters, which set granularity levels along the dimensions, $\forall k \in [1...v], p^H_k \in A^D$. That parameter $p^H_1$ rolls up to $p^H_2$ in $H$ is denoted as $p^H_1 \preceq_H p^H_2$; $Weak^H = Param^H \to 2^{(A^D - Param^H)}$ is a mapping possibly associating each parameter with one or several weak attributes, which are also dimension attributes providing additional information. All parameters and weak attributes of $H$ constitute the hierarchy attributes of $H$, denoted by $A^H = Param^H \cup (\bigcup_{p^H_v \in Param^H} Weak^H[p^H_v])$.
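The two definitions can be mirrored by a small data structure. The sketch below is one possible encoding under stated assumptions: the class and field names are hypothetical, instances are stored as dictionaries rather than as a matrix, and `rolls_up_to` interprets roll-up as direct or transitive precedence in the ordered parameter list.

```python
from dataclasses import dataclass, field


@dataclass
class Hierarchy:
    params: list                               # Param^H, ordered from the identifier upwards
    weak: dict = field(default_factory=dict)   # Weak^H: parameter -> weak attributes

    def attributes(self):
        """A^H: all parameters plus all weak attributes of the hierarchy."""
        result = set(self.params)
        for p in self.params:
            result |= set(self.weak.get(p, ()))
        return result

    def rolls_up_to(self, lower, higher):
        """True if `lower` comes before `higher` in Param^H (an interpretation)."""
        return self.params.index(lower) < self.params.index(higher)


@dataclass
class Dimension:
    attributes: set     # A^D, including the identifier
    hierarchies: list   # H^D
    instances: list     # I^D, one dict per row


geo = Hierarchy(params=["id", "city", "country"], weak={"city": ["population"]})
customer = Dimension(
    attributes={"id", "city", "population", "country"},
    hierarchies=[geo],
    instances=[{"id": 1, "city": "Lyon", "population": None, "country": "France"}],
)
print(sorted(geo.attributes()))
```

Here `population` is a weak attribute attached to the parameter `city`, so it belongs to the hierarchy attributes without defining a granularity level of its own.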

There exist different types of hierarchy, but the most basic and common one is the strict hierarchy [15], where a value at a lower-granularity level of the hierarchy belongs to only one higher-granularity value [24]. Thus, in this paper, we only consider strict hierarchies.
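Strictness between two levels can be checked directly on the dimension instances. The sketch below is a minimal check, assuming instances are dictionaries and missing values are encoded as `None`; the function name `is_strict` and the sample rows are hypothetical.

```python
def is_strict(rows, lower, higher):
    """True if every observed `lower` value maps to exactly one `higher` value."""
    parent = {}
    for row in rows:
        low, high = row[lower], row[higher]
        if low is None or high is None:
            continue  # missing values cannot violate strictness
        # setdefault stores the first parent seen and returns the stored one.
        if parent.setdefault(low, high) != high:
            return False
    return True


strict_rows = [
    {"city": "Lyon", "country": "France"},
    {"city": "Lyon", "country": None},
    {"city": "Berlin", "country": "Germany"},
]
non_strict_rows = strict_rows + [{"city": "Lyon", "country": "Italy"}]
print(is_strict(strict_rows, "city", "country"))
print(is_strict(non_strict_rows, "city", "country"))
```

The first call prints `True` and the second prints `False`, since in the second data set the city value "Lyon" is associated with two different countries.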

4 Distance Between Dimension Instances

Since KNN imputation selects the k nearest neighbors of the instance with missing data for the imputation, we need to calculate the distance between the dimension instances containing missing data to be completed and the other instances. In a
