
Oset-value coding in database query processing
Goetz Graefe
Google Inc
Madison, Wisconsin, USA
GoetzG@Google.com
Thanh Do∗
Celonis Inc
Madison, Wisconsin, USA
T.Do@Celonis.com
ABSTRACT
Recent work [
9
] shows how oset-value coding speeds up data-
base query execution, not only sorting but also duplicate removal
and grouping (aggregation) in sorted streams, order-preserving ex-
change (shue), merge join, and more. It already saves thousands
of CPUs in Google’s Napa and F1 Query systems, e.g., in grouping
algorithms and in log-structured merge-forests.
In order to realize the full benet of interesting orderings, however,
query execution algorithms must not only consume and exploit
oset-value codes but also produce oset-value codes for the next
operator in the pipeline. Our research has sought ways to pro-
duce oset-value codes without comparing successive output rows
one-by-one, column-by-column. This short paper introduces a new
theorem and, based on its proof and a simple corollary, describes in
detail how order-preserving algorithms (from lter to merge join
and even shue) can compute oset-value codes for their outputs.
These computations are surprisingly simple and very ecient.
This paper is the extended version of an EDBT 2023 publication [
21
].
1 INTRODUCTION
Tree-of-losers priority queues [
13
,
25
] and oset-value coding [
7
,
23
] enable the most ecient sort algorithmsfor large records. When
sorting a database table with
𝑁
rows, the provable lower bound
for row comparisons is
𝑙𝑜𝑔2(𝑁
!
) ≈ 𝑁𝑙𝑜𝑔2(𝑁/𝑒)
with
𝑒≈
2
.
718.
External merge sort using tree-of-losers priority queues for both
run generation and merging exceeds this lower bound by only 1-2%.
Oset-value coding truncates shared row prexes and turns prex
sizes into order-preserving surrogate keys. These surrogate keys
limit the count of column value comparisons for
𝑁
rows with
𝐾
key columns to
𝑁×𝐾
[
9
]. Importantly, there is no
𝑙𝑜𝑔(𝑁)
factor.
Oset-value codes alone decide the remainder of the
𝑙𝑜𝑔2(𝑁
!
)
row
comparisons.
Recent work shows how oset-value coding complements “inter-
esting orderings” [
32
] to speed up database query processing, not
only sorting but also duplicate removal and grouping (aggregation)
in sorted streams, order-preserving exchange (shue), merge join,
and more [
9
,
10
]. For the full benet of interesting orderings, all
execution algorithms relying on sort order must also exploit oset-
value coding in their comparisons of rows and columns. Moreover,
order-preserving query execution algorithms must not only con-
sume but also produce oset-value codes, to be consumed and
exploited by the next operator in the pipeline.
Nevertheless, computing oset-value codes in order-preserving
query execution algorithms has not received any attention, and the
∗Work done at Google Inc.
only method known to-date – comparing an operator’s output row-
by-row, column-by-column – is so expensive that it renders produc-
ing oset-value across database operators much less attractive or
even worthless. In contrast, new, simple techniques eciently com-
pute oset-value codes for order-preserving algorithms in database
query execution. The new techniques do not require comparisons
row-by-row, column-by-column; instead, oset-value codes in the
output depend only on oset-value codes in the inputs. There is
no need for additional column value comparisons beyond those
required in the operation itself, e.g., column value comparisons in
the merge logic of merge join. The new techniques are implemented
in Google’s Napa [1] and F1 Query [31, 33] systems.
The rest of the paper is structured as follows. We summarize
related prior work in Section 2 and explain oset-value coding
and tree-of-losers priority queues in Section 3. In Section 4, we
rst establish a solid foundation for ecient computation of oset-
value codes by introducing a new theorem, its proof, and a pertinent
corollary. Based on this theory, we then present how oset-value
codes can be eciently derived for lter, project, segmented sorting,
duplicate removal, grouping, merge join (inner, outer, and semi
joins), nested-loops join, order-preserving exchange, and more. We
present performance results in Section 6 and nally conclude the
paper in Section 7.
2 RELATED PRIOR WORK
The context of our work are Google’s Napa [
1
] and F1 Query [
31
,
33
]
systems. Napa is a data warehouse that maintains thousands of
materialized views in log-structured merge-forests [
29
]. F1 Query
is a federated query processing platform that executes SQL queries
over tables in various Google storage systems such as Spanner [
8
],
BigTable [
6
], and Napa. Both Napa and F1 Query employ sort order
for ecient data access and data manipulation.
Pioneering work on sorting established the benets of tree-of-
losers priority queues and of oset-value coding, both for run gen-
eration and external merge steps [
7
,
13
,
23
,
25
]. Pioneering work
in the database eld established the value of sort-based query exe-
cution, notably merge join but also duplicate removal and group-
ing [
4
,
11
,
32
]. Surveys on database query evaluation [
15
,
17
] cover
sorting and oset-value coding but fail to recognize oset-value
coding as a signicant opportunity throughout sort-based query
execution. Recent work [
9
,
10
] lls that gap but fails to oer e-
cient derivation of oset-value codes for output rows. The present
short paper lls this remaining gap.
3 BACKGROUND: OFFSET-VALUE CODING
AND TREE-OF-LOSERS PRIORITY QUEUES
A tree-of-losers priority queue [
13
,
25
], also known as a tournament
tree, embeds a balanced binary tree in an array, with the tree’s unary
arXiv:2210.00034v3 [cs.DB] 17 Feb 2023