Functional Labeled Optimal Partitioning Toby D. Hocking Toby.Hockingnau.edu Jacob M. Kaufman jmk478nau.edu

2025-05-06 0 0 661.76KB 15 页 10玖币
侵权投诉
Functional Labeled Optimal Partitioning
Toby D. Hocking, Toby.Hocking@nau.edu
Jacob M. Kaufman, jmk478@nau.edu
Alyssa J. Stenberg, ajs937@nau.edu
October 7, 2022
Abstract
Peak detection is a problem in sequential data analysis that involves
differentiating regions with higher counts (peaks) from regions with lower
counts (background noise). It is crucial to correctly predict areas that
deviate from the background noise, in both the train and test sets of labels.
Dynamic programming changepoint algorithms have been proposed to
solve the peak detection problem by constraining the mean to alternatively
increase and then decrease. The current constrained changepoint algo-
rithms only create predictions on the test set, while completely ignoring
the train set. Changepoint algorithms that are both accurate when fitting
the train set, and make predictions on the test set, have been proposed but
not in the context of peak detection models. We propose to resolve these
issues by creating a new dynamic programming algorithm, FLOPART,
that has zero train label errors, and is able to provide highly accurate
predictions on the test set. We provide an empirical analysis that shows
FLOPART has a similar time complexity while being more accurate than
the existing algorithms in terms of train and test label errors.
1 Introduction
Changepoint detection models have become a prominent solution for fields such
as genomics and medical monitoring that produce large amounts of data in
a time or space series. Many of the proposed optimal changepoint models
follow the classical dynamic programming algorithm of Auger and Lawrence
[1989], solving for the optimal
K
segments (
K
1changes) in
n
data in
O
(
Kn2
)
time. This algorithm is optimal in the sense that it computes the changepoints
and segment means which result in minimum loss given the data set. Given a
properly chosen non-negative penalty λ, the optimal solution can be computed
in
O
(
n2
)time using the classic Optimal Partitioning algorithm of Jackson et al.
[2005]. Newer pruning methods have allowed the algorithm to decrease the
number of changepoints assessed, which lowers the time complexity to
O
(
nlog n
)
Northern Arizona University
1
arXiv:2210.02580v1 [cs.LG] 5 Oct 2022
Table 1: Comparison between proposed FLOPART algorithm and previous
algorithms for changepoint detection (column for publication, label constraints,
up-down constraint, and time).
Algorithm Publication Labels Up-down Time
OPART Jackson et al. [2005] No No O(n2)
LOPART Hocking and Srivastava [2020] Yes No O(n2)
SegAnnot Hocking and Rigaill [2012] Yes No O(Kn2)
BINSEG Scott and Knott [1974] No No O(nlog K)
FPOP Maidstone et al. [2016] No No O(nlog n)
GFPOP Runge et al. [2020] No Yes O(nlog n)
FLOPART This paper Yes Yes O(nlog n)
while staying optimal [Maidstone et al., 2016]. Binary segmentation is a classic
heuristic algorithm which is often employed to reduce computation times, but
it is not guaranteed to compute a set of changepoints with optimal loss [Scott
and Knott, 1974]. In each of these algorithms there is a model complexity
parameter (penalty or number of splits/segments), which can be chosen using
either unsupervised [Yao, 1988, Zhang and Siegmund, 2007] as well as supervised
methods when labeled training data are available [Rigaill et al., 2013, Truong
et al., 2017]. Labels are typically created by domain experts, and indicate
presence or absence of changepoints in particular regions (Figure 1).
1.1 Algorithms for peak detection
This paper, however, focuses on changepoint models with constrained model
parameters to produce more interpretable results. One problem that motivated
the introduction of constrained models is peak detection in ChIP-seq data [Barski
et al., 2007]. Peak detection is the problem of differentiating regions with higher
counts (peaks) from areas with lower counts (background noise). Up-down
constrained changepoint models can also be computed in
O
(
nlog n
)time, and
have shown high peak detection accuracies in ChIP-seq data [Hocking et al., 2020,
Runge et al., 2020]. The up-down constraint on the model forces an up change
in the segment mean parameter to be followed by a down transition and vice
versa. This constraint is necessary for interpretation as a peak detection model
because, in an unconstrained changepoint model, the outputted segmentation
can result in consecutive up-down changes in the segment mean parameter.
In recent findings, the max jump post-processing rule has improved upon the
accuracy of the up-down constrained models in ChIP-seq data [Liehrmann et al.,
2020]. The max jump rule converts the outputted segmentation of unconstrained
changepoint models into peaks and background states. The post-processing rule
works by grouping consecutive changes in segment mean parameters (either up
or down) together and choosing the maximum mean difference from each group.
2
Novelty with respect to previous work.
We propose a new supervised
up-down constrained changepoint detection algorithm for the case of labeled
data sequences. The current fastest constrained algorithm, GFPOP [Runge
et al., 2020], can be considered unsupervised because the algorithm does not use
the labels in the train set, allowing for train errors. In contrast, changepoint
models that currently utilize labels are all unconstrained models (no constraints
between adjacent segment means), including the LOPART algorithm [Hocking
and Srivastava, 2020] and the SegAnnot algorithm [Hocking and Rigaill, 2012].
In this paper, we resolve the drawbacks of both previous algorithms, resulting
in our new Functional Labeled Optimal PARTitioning (FLOPART) algorithm,
which incorporates the label constraint ideas from LOPART with the constraints
on adjacent segment means from GFPOP (Table 1).
2 Mathematical framework and optimization prob-
lems
2.1 Previous problem with label constraints but no con-
straints on adjacent segment means
We begin this section by explaining the optimization problem solved by the
previous LOPART algorithm [Hocking and Srivastava, 2020], which has label
constraints but no up-down constraints. We are given a data sequence
z1, . . . , zn
,
and llabeled regions (p1, p1, t1),...,(pl, pl, tl), where
pj< pj
are the start/end of a labeled region (in data sequence coordinates,
1 to n),
tj∈ {
0
,
1
}
is the type of change, here the number of changes in the region
[pj, pj].
We assume the labels are ordered, 1
p1< p1 · · · pl< pln
. We would
like to compute a mean vector
mRn
which solves the following optimization
problem,
min
mRn
n
X
i=1
`(zi, mi) + λ
n1
X
i=1
I(mi6=mi+1)(1)
subject to
pj1
X
i=pj
I(mi6=mi+1) = tjfor all j∈ {1, . . . , l},(2)
where
`
is a loss function and
I
is the indicator function (1 if true, 0 otherwise).
The objective (1) in the optimization problem above contains terms for the loss
`
(to encourage data fitting), and a per-change penalty of
λ
0(to control
model complexity). There are also constraints (2) which ensure that the correct
number of changes occur in each label.
3
摘要:

FunctionalLabeledOptimalPartitioningTobyD.Hocking,Toby.Hocking@nau.edu*JacobM.Kaufman,jmk478@nau.edu*AlyssaJ.Stenberg,ajs937@nau.edu*October7,2022AbstractPeakdetectionisaprobleminsequentialdataanalysisthatinvolvesdierentiatingregionswithhighercounts(peaks)fromregionswithlowercounts(backgroundnoise)...

展开>> 收起<<
Functional Labeled Optimal Partitioning Toby D. Hocking Toby.Hockingnau.edu Jacob M. Kaufman jmk478nau.edu.pdf

共15页,预览3页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:15 页 大小:661.76KB 格式:PDF 时间:2025-05-06

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 15
客服
关注