Functional Labeled Optimal Partitioning Toby D. Hocking Toby.Hockingnau.edu Jacob M. Kaufman jmk478nau.edu

2025-05-06 0 0 661.76KB 15 页 10玖币

侵权投诉

Functional Labeled Optimal Partitioning

Toby D. Hocking, Toby.Hocking@nau.edu∗

Jacob M. Kaufman, jmk478@nau.edu∗

Alyssa J. Stenberg, ajs937@nau.edu∗

October 7, 2022

Abstract

Peak detection is a problem in sequential data analysis that involves

diﬀerentiating regions with higher counts (peaks) from regions with lower

counts (background noise). It is crucial to correctly predict areas that

deviate from the background noise, in both the train and test sets of labels.

Dynamic programming changepoint algorithms have been proposed to

solve the peak detection problem by constraining the mean to alternatively

increase and then decrease. The current constrained changepoint algo-

rithms only create predictions on the test set, while completely ignoring

the train set. Changepoint algorithms that are both accurate when ﬁtting

the train set, and make predictions on the test set, have been proposed but

not in the context of peak detection models. We propose to resolve these

issues by creating a new dynamic programming algorithm, FLOPART,

that has zero train label errors, and is able to provide highly accurate

predictions on the test set. We provide an empirical analysis that shows

FLOPART has a similar time complexity while being more accurate than

the existing algorithms in terms of train and test label errors.

1 Introduction

Changepoint detection models have become a prominent solution for ﬁelds such

as genomics and medical monitoring that produce large amounts of data in

a time or space series. Many of the proposed optimal changepoint models

follow the classical dynamic programming algorithm of Auger and Lawrence

[1989], solving for the optimal

segments (

K−

1changes) in

data in

(

Kn2

)

time. This algorithm is optimal in the sense that it computes the changepoints

and segment means which result in minimum loss given the data set. Given a

properly chosen non-negative penalty λ, the optimal solution can be computed

(

)time using the classic Optimal Partitioning algorithm of Jackson et al.

[2005]. Newer pruning methods have allowed the algorithm to decrease the

number of changepoints assessed, which lowers the time complexity to

(

nlog n

)

∗Northern Arizona University

arXiv:2210.02580v1 [cs.LG] 5 Oct 2022

Table 1: Comparison between proposed FLOPART algorithm and previous

algorithms for changepoint detection (column for publication, label constraints,

up-down constraint, and time).

Algorithm Publication Labels Up-down Time

OPART Jackson et al. [2005] No No O(n2)

LOPART Hocking and Srivastava [2020] Yes No O(n2)

SegAnnot Hocking and Rigaill [2012] Yes No O(Kn2)

BINSEG Scott and Knott [1974] No No O(nlog K)

FPOP Maidstone et al. [2016] No No O(nlog n)

GFPOP Runge et al. [2020] No Yes O(nlog n)

FLOPART This paper Yes Yes O(nlog n)

while staying optimal [Maidstone et al., 2016]. Binary segmentation is a classic

heuristic algorithm which is often employed to reduce computation times, but

it is not guaranteed to compute a set of changepoints with optimal loss [Scott

and Knott, 1974]. In each of these algorithms there is a model complexity

parameter (penalty or number of splits/segments), which can be chosen using

either unsupervised [Yao, 1988, Zhang and Siegmund, 2007] as well as supervised

methods when labeled training data are available [Rigaill et al., 2013, Truong

et al., 2017]. Labels are typically created by domain experts, and indicate

presence or absence of changepoints in particular regions (Figure 1).

1.1 Algorithms for peak detection

This paper, however, focuses on changepoint models with constrained model

parameters to produce more interpretable results. One problem that motivated

the introduction of constrained models is peak detection in ChIP-seq data [Barski

et al., 2007]. Peak detection is the problem of diﬀerentiating regions with higher

counts (peaks) from areas with lower counts (background noise). Up-down

constrained changepoint models can also be computed in

(

nlog n

)time, and

have shown high peak detection accuracies in ChIP-seq data [Hocking et al., 2020,

Runge et al., 2020]. The up-down constraint on the model forces an up change

in the segment mean parameter to be followed by a down transition and vice

versa. This constraint is necessary for interpretation as a peak detection model

because, in an unconstrained changepoint model, the outputted segmentation

can result in consecutive up-down changes in the segment mean parameter.

In recent ﬁndings, the max jump post-processing rule has improved upon the

accuracy of the up-down constrained models in ChIP-seq data [Liehrmann et al.,

2020]. The max jump rule converts the outputted segmentation of unconstrained

changepoint models into peaks and background states. The post-processing rule

works by grouping consecutive changes in segment mean parameters (either up

or down) together and choosing the maximum mean diﬀerence from each group.

Novelty with respect to previous work.

We propose a new supervised

up-down constrained changepoint detection algorithm for the case of labeled

data sequences. The current fastest constrained algorithm, GFPOP [Runge

et al., 2020], can be considered unsupervised because the algorithm does not use

the labels in the train set, allowing for train errors. In contrast, changepoint

models that currently utilize labels are all unconstrained models (no constraints

between adjacent segment means), including the LOPART algorithm [Hocking

and Srivastava, 2020] and the SegAnnot algorithm [Hocking and Rigaill, 2012].

In this paper, we resolve the drawbacks of both previous algorithms, resulting

in our new Functional Labeled Optimal PARTitioning (FLOPART) algorithm,

which incorporates the label constraint ideas from LOPART with the constraints

on adjacent segment means from GFPOP (Table 1).

2 Mathematical framework and optimization prob-

lems

2.1 Previous problem with label constraints but no con-

straints on adjacent segment means

We begin this section by explaining the optimization problem solved by the

previous LOPART algorithm [Hocking and Srivastava, 2020], which has label

constraints but no up-down constraints. We are given a data sequence

z1, . . . , zn

and llabeled regions (p1, p1, t1),...,(pl, pl, tl), where

•pj< pj

are the start/end of a labeled region (in data sequence coordinates,

1 to n),

•tj∈ {

}

is the type of change, here the number of changes in the region

[pj, pj].

We assume the labels are ordered, 1

≤p1< p1≤ · · · ≤ pl< pl≤n

. We would

like to compute a mean vector

m∈Rn

which solves the following optimization

problem,

min

m∈Rn

i=1

`(zi, mi) + λ

n−1

i=1

I(mi6=mi+1)(1)

subject to

pj−1

i=pj

I(mi6=mi+1) = tjfor all j∈ {1, . . . , l},(2)

where

is a loss function and

is the indicator function (1 if true, 0 otherwise).

The objective (1) in the optimization problem above contains terms for the loss

(to encourage data ﬁtting), and a per-change penalty of

λ≥

0(to control

model complexity). There are also constraints (2) which ensure that the correct

number of changes occur in each label.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

FunctionalLabeledOptimalPartitioningTobyD.Hocking,Toby.Hocking@nau.edu*JacobM.Kaufman,jmk478@nau.edu*AlyssaJ.Stenberg,ajs937@nau.edu*October7,2022AbstractPeakdetectionisaprobleminsequentialdataanalysisthatinvolvesdierentiatingregionswithhighercounts(peaks)fromregionswithlowercounts(backgroundnoise)...

展开>> 收起<<

Functional Labeled Optimal Partitioning Toby D. Hocking Toby.Hockingnau.edu Jacob M. Kaufman jmk478nau.edu.pdf

共15页,预览3页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

Functional Labeled Optimal Partitioning Toby D. Hocking Toby.Hockingnau.edu Jacob M. Kaufman jmk478nau.edu

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: