Constrained Optimal Querying Huffman Coding and Beyond Shuyuan Zhang Jichen Sun Shengkang Chen

2025-04-27 0 0 810.08KB 20 页 10玖币
侵权投诉
Constrained Optimal Querying:
Huffman Coding and Beyond
Shuyuan Zhang Jichen Sun Shengkang Chen
May 2022
Abstract
Huffman coding is well known to be useful in certain decision problems involving
minimizing the average number of (freely chosen) queries to determine an unknown
random variable. However, in problems where the queries are more constrained, the
original Huffman coding no longer works. In this paper, we proposed a general model
to describe such problems and two code schemes: one is Huffman-based, and the other
called GBSC (Greedy Binary Separation Coding). We proved the optimality of GBSC by
induction on a binary decision tree, telling us that GBSC is at least as good as Shannon
coding. We then compared the two algorithms based on these two codes, by testing
them with two problems: DNA detection and 1-player Battleship, and found both to be
decent approximating algorithms, with Huffman-based algorithm giving an expected
length 1.1 times the true optimal in DNA detection problem, and GBSC yielding an
average number of queries 1.4 times the theoretical optimal in 1-player Battleship.
Keywords: Information theory, Huffman coding, greedy algorithms, decision trees,
Battleship game, DNA exon detection
1 Introduction
Our research is inspired by a problem in Cover’s textbook[1] (see page 153).
Example 1 (Bad wine).One is given six bottles of wine. It is known that precisely one bottle has
gone bad (tastes terrible). From inspection of the bottles it is determined that the probability pithat
the ith bottle is bad is given by (p1,p2, . . . , p6) = ( 8
23 ,6
23 ,4
23 ,2
23 ,2
23 ,1
23 ). Tasting will determine the
bad wine. You can mix some of the wines in a fresh glass and sample the mixture. You proceed,
1
arXiv:2210.04013v1 [cs.IT] 8 Oct 2022
mixing and tasting, stopping when the bad bottle has been determined. What is the minimum
expected number of tastings required to determine the bad wine?
As an exercise in the textbook, this problem is well solved. In fact, it is equivalent
to Huffman coding. The process that we determine which bottle of wine is bad can be
regarded as a decision tree. At each node, we make an observation that has two possible
outcomes (in this example, good or bad) and move to the left or right child node according
to the outcome. This process continues until we reach a leaf node. If we represent each
move to left with 0 and each move to right with 1, then a move sequence can be represented
by a binary code. Minimizing the expected number of moves is equivalent to minimizing
the expected code length, so we only need to use Huffman coding.
We want to generalize this problem. A natural generalization is changing the number
of bad wines. It seems that the original Huffman coding strategy will still work, but
unfortunately, it is wrong. This is shown by the following example.
Example 2 (More bad wine).The background is similar to Example 1, but this time there are four
bottles of wine to be examined and two of them are bad. Suppose the probability pij that the ith and
jth bottles are bad is given by (p12,p13,p14,p23,p24,p34)=(0.1, 0.1, 0.15, 0.15, 0.3, 0.2). What is
the minimum expected number of tastings required to determine the two bad wines?
If we repeat the Huffman coding strategy, we will obtain the unique Huffman tree below.
1
0.4
0.2
p12 p13
p34
0.6
p24 0.3
p14 p23
Figure 1: Huffman Tree of Example 2.
The expected Huffman code length is 2.5, but it is not the answer. This is because the
decision tree we have constructed is infeasible. To see this, just look at the root of the tree.
Note that in both the left and right subtrees, every bottle of wine may be bad, so there is
no way that we can decide in which direction we should move by just one observation.
2
Example 2 shows that changing Example 1 a little bit will produce a much more difficult
problem. In fact, Example 2 is just an instance of a huge family of similar problems, and
finding the precise solution of these problems is usually extremely difficult.
In this report, we proposed a generalized model that formulates such type of problems,
and analyzed three problems related to this model. We designed two generalized approx-
imate algorithms for these problems. One algorithm is based on Huffman coding, while
the other is based on a new code scheme (at least not being researched much). We call the
new code GBSC (Greedy Binary Separation Code), and we proved an interesting property
of GBSC.
2 Models
2.1 Generalized Model
We consider this generalized problem. Given a random variable Xwith alphabet Xand its
distribution p(x). We determine the value of Xby asking questions. The i-th question we
ask is that "is Xin set Si?", where SiA,i. Suppose we need Nquestions to determine
the value of X. Our goal is to find out the minimum of EN (the expectation of N) and if
possible, its corresponding strategy.
Definition 1. The family of set Ais called a decision set on X.
Obviously, different structures of the decision set will produce a bunch of completely
different problems. In the rest of this chapter, we will introduce three specializations of
this model, which are Huffman coding, the DNA detection problem and the Battleship
problem.
2.2 Huffman Coding: the Easy Case
In Example 1, the decision set is 2X. We have seen that in this case, the problem is equiva-
lent to Huffman coding. However, the condition can be loosened slightly.
Definition 2. A decision set Aon Xis decision-complete, if S2X\ {,X }, S Aor
X \ SA.
Proposition 1. If Ais decision-complete, then any decision tree of X is feasible.
3
Proof. This is easy to demonstrate. Asking "is Xin set Si?" is equivalent to asking "is Xin
set X \ Si?", and there is no point to ask whether Xis in or X.
Corollary 1. If Ais decision-complete, then Huffman coding will produce the optimal solution.
Proposition 1 also provides a trivial but practical necessary condition for a set to be
decision-complete.
Corollary 2. If Ais decision-complete, then |A| 2|X |11.
Using Corollary 2, it is easy to prove that the decision set of Example 2 is not decision-
complete, so the Huffman coding strategy might fail.
Proposition 2. The decision set Aof Example 2 is not decision-complete.
Proof. |X | =6, so |A|=241<2|X |11.
2.3 DNA Detection Problem
To determine genomic sequences of several organisms, biological meaning needs to be
assigned to particular regions of the sequence. One of important steps in this process is
the identification of genes. An Exon is an interval of the DNA sequence and it does not
overlap with other exons and gene is a sequence of exons. In this paper, we simplified the
question by assuming that the target gene only contains one exon. We can detect whether
the target exon is in an interval in the DNA sequence or not at each detection. The position
of the exon on DNA is fixed, so we want to minimize the expected number of detection to
determine the position of the exon by choosing the intervals wisely, thus reducing the cost
of DNA detection[2][3].
Definition 3. A set S of integers is continuous, if S = [min S, max S]Z.
Example 3. {1, 2, 3}and {4}are continuous, while {1, 3}is not.
If we assume that the exon’s location we want to find on the DNA is discrete and unique,
we can use a random variable X∈ {1, 2, . . . , n}to it, and in this case, the decision set is any
continuous set in space X.
4
摘要:

ConstrainedOptimalQuerying:HuffmanCodingandBeyondShuyuanZhangJichenSunShengkangChenMay2022AbstractHuffmancodingiswellknowntobeusefulincertaindecisionproblemsinvolvingminimizingtheaveragenumberof(freelychosen)queriestodetermineanunknownrandomvariable.However,inproblemswherethequeriesaremoreconstraine...

展开>> 收起<<
Constrained Optimal Querying Huffman Coding and Beyond Shuyuan Zhang Jichen Sun Shengkang Chen.pdf

共20页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:20 页 大小:810.08KB 格式:PDF 时间:2025-04-27

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 20
客服
关注