sults. Furthermore, we propose a more general heuristic for designing a data-driven procedure
which can be successfully applied to several problems in spectral analysis, and we give theoretical
guarantees. The interest of this approach based on integral operators lies in the fact that, unlike most of
the spectral methods based on noisy matrices (Bonhomme et al., 2016b; De Castro et al., 2017;
Lehéricy, 2019), the method does not require any choice of a functional basis or of its number of
elements. Hence, the proposed method does not require any knowledge of an upper bound on the
order of the HMM. Moreover, different types of data (including circular or mixed-type data) can
be managed. Since the distribution of the pair of consecutive observations is estimated with a kernel
method, only the empirical counterpart of the singular values of the operator can be obtained; we
therefore propose to use our new data-driven method for the thresholding procedure. At a non-asymptotic
level, an upper bound on the probability of overestimating the order of the HMM is provided. At
an asymptotic level, the consistency of the estimator is established. The controls at the non-asymptotic
and asymptotic levels are obtained from a concentration inequality for the Hilbert-Schmidt norm of
the empirical version of the operator. The statistical tools needed to establish these results differ
from those used in Kwon and Mbakop (2021), and consequently the results are different. Specifically,
using concentration results specific to Markov chains, a concentration inequality is obtained by
considering a sum of two terms, where one term does not depend on the bandwidth and the second
term does not depend on the probability of overestimating the order. Note that the bound obtained
in the i.i.d. context involves a product of the bandwidth, the probability of overestimating
the order and the sample size. That bound contains only terms that depend on the kernel and the
bandwidth. In contrast to that setting, and because of the dependence between observations, the
concentration inequality that we obtain depends on some unknown constant of the HMM (e.g., the mixing
time). To circumvent this issue, and for practical convenience, we propose a data-driven procedure
based on an unsupervised classification of the singular values of the operator, computed on
mini-batches, for estimating the constant in the concentration inequality. Note that in Lehéricy
(2019) the model selection for the spectral method is also based on a thresholding rule applied to
the singular values, whose choice is a delicate issue since it depends on the functional basis and on
its number of elements. Hence, in that paper the author proposes an empirical method based on
the slope heuristic for practical application. However, this approach requires an additional tuning
parameter that sets the number of singular values used to apply the slope heuristic. In theory,
for the spectral methods to work, the rank of the spectral matrix needs to be equal to the order
of the chain. Thus, it is necessary that the number of elements of the orthonormal basis tends to
infinity; otherwise we only obtain an estimator of an upper bound on the order. However, defining
the thresholding rule for the case of an increasing number of basis elements is still an open problem
for the spectral methods. For instance, the rank study performed in Kleibergen and Paap
(2006) should be extended to matrices with increasing dimension (but fixed rank). Thus, in practice,
the number of basis elements is set a priori. This number corresponds to an upper bound on
the order of the HMM. To the best of our knowledge, since the proposed method avoids the use
of a functional basis, it is the first method which does not make assumptions on an upper bound of
the order to be estimated. Numerical studies illustrate the relevance of this proposal and also show
that this new data-driven procedure not only guarantees good results for our estimator but also improves
the spectral results of Lehéricy (2019).
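To fix ideas, the thresholding principle described above can be sketched numerically. The following Python snippet is only an illustration under stated assumptions, not the procedure developed in this paper: the two-state Gaussian HMM, the bandwidth `h`, the evaluation grid and, above all, the fixed `threshold` are hypothetical placeholders, whereas the proposed method calibrates the threshold from the data. The sketch estimates the joint density of consecutive observations with a Gaussian kernel, discretises the corresponding integral operator on a grid, and counts the empirical singular values exceeding the threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state HMM with Gaussian emissions (illustrative values only).
Q = np.array([[0.8, 0.2],
              [0.2, 0.8]])       # transition matrix (full rank)
means = np.array([-2.0, 2.0])    # emission means; emissions have unit variance
n = 20_000                       # sample size

# Simulate the hidden chain by inverting the cumulative transition rows.
cum_Q = np.cumsum(Q, axis=1)
u = rng.random(n)
states = np.empty(n, dtype=int)
states[0] = 0                    # arbitrary initial state (assumption)
for t in range(1, n):
    idx = np.searchsorted(cum_Q[states[t - 1]], u[t], side="right")
    states[t] = min(idx, Q.shape[0] - 1)   # guard against rounding in cum_Q
y = means[states] + rng.standard_normal(n)


def kernel_features(obs, grid, h):
    """Gaussian kernel values K_h(grid_j - obs_t), shape (len(obs), len(grid))."""
    z = (grid[None, :] - obs[:, None]) / h
    return np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))


def empirical_singular_values(obs, grid, h):
    """Singular values of the discretised kernel estimate of the pair density."""
    K = kernel_features(obs, grid, h)
    # Kernel estimate of the density of (Y_t, Y_{t+1}) evaluated on grid x grid.
    joint = K[:-1].T @ K[1:] / (len(obs) - 1)
    # Scaling by the grid step approximates the integral operator on L2.
    delta = grid[1] - grid[0]
    return np.linalg.svd(delta * joint, compute_uv=False)


def estimate_order(sing_vals, threshold):
    """Number of singular values strictly above the threshold."""
    return int(np.sum(sing_vals > threshold))


grid = np.linspace(-6.0, 6.0, 41)
sing_vals = empirical_singular_values(y, grid, h=0.5)
order = estimate_order(sing_vals, threshold=0.03)   # placeholder threshold
print("leading singular values:", sing_vals[:4])
print("estimated order:", order)
```

In this sketch the grid serves only as a numerical discretisation of the operator; no functional basis or upper bound on the order is supplied. The difficulty addressed in this paper is precisely the choice of `threshold`, which here is set by hand.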
This paper is organized as follows. Section 2 introduces the specific integral operator. Section 3
presents the finite-sample and asymptotic properties of the estimator (including its consistency).
Section 4 describes the new data-driven procedure with a theoretical justification. Section 5