Hierarchical Categories in Colored Searching Peyman AfshaniRasmus KillmannKasper Green Larsen Abstract

2025-04-24 0 0 401.7KB 13 页 10玖币
侵权投诉
Hierarchical Categories in Colored Searching
Peyman AfshaniRasmus KillmannKasper Green Larsen
Abstract
In colored range counting (CRC), the input is a set of points where each point is assigned a
“color” (or a “category”) and the goal is to store them in a data structure such that the number of
distinct categories inside a given query range can be counted efficiently. CRC has strong motivations
as it allows data structure to deal with categorical data.
However, colors (i.e., the categories) in the CRC problem do not have any internal structure,
whereas this is not the case for many datasets in practice where hierarchical categories exists or
where a single input belongs to multiple categories. Motivated by these, we consider variants of the
problem where such structures can be represented. We define two variants of the problem called
hierarchical range counting (HCC) and sub-category colored range counting (SCRC) and consider
hierarchical structures that can either be a DAG or a tree. We show that the two problems on some
special trees are in fact equivalent to other well-known problems in the literature. Based on these,
we also give efficient data structures when the underlying hierarchy can be represented as a tree. We
show a conditional lower bound for the general case when the existing hierarchy can be any DAG,
through reductions from the orthogonal vectors problem.
1 Introduction
Range searching is a broad area of computational geometry where the goal is to store a given set Pof
input points in a data structure such that one can efficiently report (range reporting) or count (range
counting) the points inside a geometric query region R. Sometimes, each point piPis associated with
a weight wiRand the goal could be to find the sum of the weights in R, or the maximum weight
in P∩ R (range max problem). This is a very broad area of research and there are many well-studied
variants. See a recent excellent survey by Agarwal for more information [1].
Colored (or categorical) range searching is an important variant where each input point is associated
with a category which conceptually is represented as a color; the goal of the query is then to report
or count the number of distinct colors inside the query range. Colored range searching has strong
motivations since colors allow us to represent nominal attributes such as brand, manufacturer and so on
and in practice a data set often contains a mix of nominal and ordinal attributes.
Colored range searching was introduced in 1993 by Janardan and Lopez [14] and it has received con-
siderable attention since then. However, classical colored range searching only models a “flat” categorical
structure where categories have no inherent structure. We consider variants of colored range counting
to model such structures. We show that looking at colored range searching from this angle creates a
number of interesting questions with non-trivial connections to other already existing problems.
1.1 Problem Definitions and Motivations
We begin by formally defining colored range counting.
Problem 1 (colored range counting).Given an input set Pof npoints in Rd, a set Cof colors (i.e.,
categories) and a function C:PC, store them in a data structure such that given a query range R ⊂
Rd, it can count the number of distinct colors in R, i.e., the value |CR|where CR={C(p)|pP∩ R}.
Aarhus University, Denmark peyman@cs.au.dk. Supported by Independent Research Fund Denmark (DFF) grant ID
DFF701400404.
Aarhus University, Denmark killmann@cs.au.dk. Supported by Independent Research Fund Denmark (DFF) grant
ID DFF701400404.
Aarhus University, Denmark larsen@cs.au.dk. Supported by Independent Research Fund Denmark (DFF) Sapere
Aude Research Leader grant No 9064-00068B.
arXiv:2210.05403v1 [cs.DS] 11 Oct 2022
In the weighted version, input also includes a weight function W:CRand the output of the query
is the weighted sum of the distinct colors in R, i.e., the value PcCRW(c).
Colored range searching assumes that the colors are completely unstructured and that each point
receives exactly one color. However, these assumptions can be inadequate as hierarchical categories are
quite common. For example, biological classification of living organisms are done through a tree where
inclusion in a category implies inclusion in all the ancestor categories. In fact, similar phenomena happen
with respect to most notions of classification (e.g., classification of industries). In other scenarios, a point
may have multiple categories (e.g., a car can have a “brand”, a “color”, “fuel type” and so on). While
it is possible to view the set of categories assigned to a point as one single category, doing so ignores a
lot of structure. For example, “a blue diesel car” is both a “blue car” and also a “diesel car” but by
considering “blue diesel” as a single category, this information is lost. We believe it is worthwhile to
study the notion of structures within categories from a theoretical point of view. We are not in fact the
first in trying to do so and we will shortly discuss some of the previous attempts.
A natural way to represent hierarchical categories is to assume that vertices of a DAG Grepresent
the set of categories where an edge ~e = (u, v) from (a category) uto (a category) vmeans that uis a
sub-category of v. We call Gacategory DAG (or a category tree if it is a tree). For a vertex v∈ G, we
define G(v) as the subset of vertices of Gthat can reach v(i.e., “sub-categories” of v), G(v) as the
subset of vertices of Gthat vcan reach (i.e., “super-categories” of v). Similarly, for a subset HV(G)
we define G(H) = vHG(v), and G(H) = vHG(v).
Category trees allows us to represent hierarchical categories. Category DAGs allow us to capture cases
where points can have multiple categories. Consider the car example again. We can define a category
DAG Gwhere a vertex u∈ G represents the compound category {diesel,blue}with edges to vertices
dand bthat represent “diesel” and “blue” categories respectively. Thus, the set G(d) represents all
the “diesel” cars and it includes the category u, the “blue diesel” category, and similarly, the set G(b)
represents all the “blue” cars which also includes the category u, the “blue diesel” category. Likewise,
G(u) includes both band dsince a “blue diesel” car is both a “blue car” and a “diesel car”.
We revisit colored range searching problems, using concepts of category DAGs or trees.
Problem 2 (sub-category range counting (SCRC)).Consider an input point set PRdof npoints, a
DAG Gwith O(n)edges, and a function C:P→ G. The goal is to store the input in a data structure,
such that given a query that consists of a query range R ⊂ Rdand a query vertex vq∈ G it can output
|CR∩ G(vq)|where CR={C(p)|pP∩ R}.
Problem 3 (hierarchical color counting (HCC)).Consider an input point set PRdof npoints, a
DAG Gwith O(n)edges, and a function C:P→ G. The goal is to store the input in a data structure,
such that given a query range R ⊂ Rdone can output |GR|where GRis the set of colors in R, i.e.,
GR=Sp∈R∩PG(C(p)).
In the weighted version of the problem, each vertex v(i.e., category) of Gis associated with a weight
w(v)and given the query R, the goal is to compute PvGRw(v)instead.
Related problems. Very recently, there have been other attempts to consider the structure of
“colors” within the computational geometry community, e.g., He and Kazi [12] consider a problem very
similar to SCRC on a tree G; the only difference is that instead of |CR∩ G(v)|, the query outputs
|CRπ(v, w)|where π(v, w) is a path between two query vertices v, w ∈ G. There are also other variants
available. See [12] for further references.
1.2 Previous and Other Related Results
The study of colored range counting and its variants began in 1993 [14] and since then it has received
considerable attention. See the survey on colored range searching and its variants [9]. The problem has
at least three main variants: color range counting (CRC), color range reporting, and “type 2” color range
counting (for every distinct color, count how many times it appears). Here, we only review colored range
counting results.
In 1D, one can solve the CRC problem using O(n) space and O(log n) query time by an elegant and
simple transformation [10] which turns the problem into the unweighted 2D orthogonal range counting
2
problem which can be solved within said bounds [5]. Interestingly, it is also possible to show an equiva-
lence between the two problems [16]. The problem, however, is difficult for 2D and beyond. Kaplan et
al. [15] showed that answering mqueries on a set of npoints requires Ω(nω/2o(1)) time where ωis the
boolean matrix multiplication exponent. Under some assumptions (e.g., the boolean matrix multiplica-
tion conjectures), this shows that P(n) + mQ(n)n3/2o(1) where P(·) and Q(·) are the preprocessing
time and the query time of any data structure that solves the 2D CRC problem.
Note that the equivalence between 1D CRC and 2D range counting also applies to the weighted case
of both problems, however, the status of the weighted 2D range counting is still unresolved. It can be
solved with O(nlog n/ log log n) space and O(log n/ log log n) query time [13] but it is not known if both
space and query time can be improved simultaneously (it is possible to improve one at the expense of
the other). The only available lower bound is a query time lower bound of Ω(log n/ log log n) [17].
Some other interesting problems related to our results are defined below.
Definition 1. The following problems are defined for an input that consists of a set PRdof npoints.
The goal is to build a data structure to answer the following queries.
(orthogonal range counting) Given a query axis-aligned rectangle R, count the number of points
in R. In the weighted version, the points have weights and the goal is to compute the sum of the
weights in the query.
(dominance range counting) This is a special case of orthogonal range counting where the query
rectangle has the form (−∞, q1]× ···(−∞, qd]which is also known as a dominance range. Or-
thogonal range counting and dominance range counting are known to be equivalent if subtraction of
weights are allowed (e.g., integer weights).
(3-sided color counting). The input is in the plane (d= 2) and each point is assigned a color from
a set Cand the query is a 3-sided range in the form of R= [q`, qr]×(−∞, qt]and the goal is to
count the number of colors in R. In the weighted version, the colors have weights and the goal is
to compute the sum of the weights of the colors.
(3-sided distinct coordinate counting) This is a special case of 3-sided color counting where an input
point (xi, yi)has color yi; in other words, given the query R= [q`, qr]×(−∞, qt], we would like to
count the number of distinct Y-coordinates inside it.
(range max) Given a weight function W:PRas part of the input, at the query time we would
like to find the maximum weight inside a given query range R.
(sum-max color counting) This a combination of range max and color counting queries. Assume the
points in Phave been assigned colors from a set Cand assume we have a weight function W:PR
on the points. Given a query range R, we would like to compute the output PcCXc(R)where
Xc(R)is the maximum weight of a point with color cinside R; if no point of color cexists in R,
then Xc(R) = 0.
A sum-max color counting query includes a number of the above problems as its special case: If all
points have the same weight, then it reduces to a color counting query. If all points have the same
color, then it reduces to a range max query. If all points have distinct colors, then it reduces to a
weighted range counting query.
1.3 Our Results
Clearly, sub-category range counting (SCRC) is at least as hard as CRC. We also observe that hierarchical
color counting (HCC) is also at least as hard. Thus, getting efficient results for d2 seems hopeless.
Consequently, we focus on the 1D problem but for two different important DAG’s: when Gis a tree and
also when Gis an arbitrary (sparse) DAG. Our main results are the following.
For the SCRC problem, first, we observe that the following problems are equivalent:
1. SCRC when Gis a single path on a one-dimensional input P.
2. 3-sided distinct coordinate counting (for a planar point set P).
3
摘要:

HierarchicalCategoriesinColoredSearchingPeymanAfshani*RasmusKillmann„KasperGreenLarsen…AbstractIncoloredrangecounting(CRC),theinputisasetofpointswhereeachpointisassignedacolor"(oracategory")andthegoalistostoretheminadatastructuresuchthatthenumberofdistinctcategoriesinsideagivenqueryrangecanbecount...

展开>> 收起<<
Hierarchical Categories in Colored Searching Peyman AfshaniRasmus KillmannKasper Green Larsen Abstract.pdf

共13页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:13 页 大小:401.7KB 格式:PDF 时间:2025-04-24

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 13
客服
关注