long multivariate time series with large gaps and generate ac-
curate explanations pinpointing the clinically meaningful data
points. The hierarchical model comprises a kernelized local attention module and a recurrent layer: the attention module captures local patterns while reducing the size of the intermediate representations, and the recurrent layer learns the long-term progression dynamics. To make the model end-to-end interpretable, we design a linear approximating network, parallel to the recurrent module, that locally models the behavior of the recurrent module.
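To make the hierarchy concrete, the following is a minimal sketch of such a forward pass in PyTorch. All names, dimensions, and the window size are illustrative assumptions rather than the exact SEHM architecture; in particular, a standard multi-head attention stands in for the kernelized local attention.

```python
import torch
import torch.nn as nn

# A minimal sketch of the hierarchical design described above. All names,
# dimensions, and the window size are illustrative assumptions; a standard
# multi-head attention stands in for the kernelized local attention.
class HierarchicalSketch(nn.Module):
    def __init__(self, n_features=16, d_model=64, window=32):
        super().__init__()
        self.window = window
        self.embed = nn.Linear(n_features, d_model)
        self.local_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        # Linear branch parallel to the RNN, used at explanation time to
        # locally approximate the RNN's behavior.
        self.linear_approx = nn.Linear(d_model, d_model)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                   # x: (batch, time, features)
        b, t, _ = x.shape                   # assumes t % window == 0
        h = self.embed(x)
        # Attend within each local window, then pool to one vector per
        # window; this shrinks the sequence fed to the recurrent layer.
        h = h.view(b * (t // self.window), self.window, -1)
        h, _ = self.local_attn(h, h, h)
        h = h.mean(dim=1).view(b, t // self.window, -1)
        out, _ = self.rnn(h)                # long-term progression dynamics
        approx = self.linear_approx(h)      # parallel linear approximation
        return self.head(out[:, -1]), approx
```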
We evaluate SEHM on an extensive dataset from a major research hospital, with experiments on predicting three postoperative complications, and on the High time Resolution ICU Dataset (HiRID) [16], with experiments on predicting circulatory failure. In the evaluation, we show that SEHM outperforms other state-of-the-art models in predictive performance. We also demonstrate that the proposed model achieves better computational efficiency, an advantage in supporting clinical decisions for perioperative care. We evaluate the model's interpretability through both quantitative evaluation on the datasets and clinician reviews of exemplar surgical cases. Results suggest that SEHM is better than existing model interpretation approaches at identifying data points of potential clinical importance in the input time series.
The main contributions of our work are four-fold: (1) we present a novel hierarchical model with kernelized local attention to effectively learn representations from intraoperative time series; (2) we significantly improve the computational efficiency of the hierarchical model by reducing the size of the intermediate representation passed to the recurrent layer; (3) we propose a linear approximating network to model the behavior of the RNN module, which can be integrated with the kernelized local attention to establish an end-to-end interpretable model with three guaranteed theoretical properties; (4) we evaluate SEHM with experiments from both computational and clinical perspectives and demonstrate the end-to-end interpretability of SEHM on large datasets with multiple predictive outcomes.
II. RELATED WORK
In this section, we review the literature from three perspec-
tives: A) models designed for handling long sequential data,
B) techniques for handling missing values in time series, and
C) model interpretation techniques and self-explaining models.
Traditional RNN models are widely used for learning with sequential data. However, they are ineffective on long sequences due to the vanishing gradient issue and the computation cost of recurrent operations. Temporal convolutional networks (TCNs), e.g., WaveNet [17], can capture long-range temporal dependencies via dilated causal convolutions. More recent work suggests that TCNs outperform RNNs in various prediction problems on sequential data, particularly when the input sequences are long [18]. However, TCN models rely on a deep hierarchy of dilated causal convolutions to achieve large receptive fields, and such a deep hierarchy, namely a large stack of layers, incurs significant
computation cost for inference at run time. Efficient attention
models adapted from Transformer [6] have been proposed
recently for learning representations from long sequential data,
which mainly focus on replacing the quadratic dot-product
attention calculation with more efficient operations [19], [20].
In this work, SEHM builds on previous insights and introduces
a hierarchical model that integrates kernelized local attention
and RNN. Kernelized local attention captures important local
patterns and reduces the size of the intermediate representation, while the higher-level RNN learns the long-term dynamics.
As a result, SEHM can achieve better predictive performance
and computational efficiency when learning and inferring from
long multivariate intraoperative time series.
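As a concrete illustration of the kernelized attention idea, the sketch below replaces the quadratic softmax attention with a feature-map factorization in the style of [19], [20]. The feature map and shapes are illustrative assumptions, and SEHM applies the idea within local windows rather than over the full sequence.

```python
import numpy as np

# Sketch of kernelized (linear) attention: softmax(Q K^T) V is replaced by
# phi(Q) (phi(K)^T V), avoiding the t x t attention matrix entirely.
def phi(x):
    # A simple positive feature map (relu(x) + 1); an illustrative choice.
    return np.maximum(x, 0.0) + 1.0

def kernelized_attention(Q, K, V):
    # Q, K: (t, d); V: (t, d_v)
    q, k = phi(Q), phi(K)
    kv = k.T @ V                        # (d, d_v), cost O(t * d * d_v)
    z = q @ k.sum(axis=0)               # per-position normalizer, shape (t,)
    return (q @ kv) / z[:, None]        # (t, d_v), linear in t

t, d = 1024, 32
Q, K, V = (np.random.randn(t, d) for _ in range(3))
out = kernelized_attention(Q, K, V)     # never materializes the t x t matrix
```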
Missing values are prevalent in clinical data. They pose challenges for predicting clinical outcomes but also carry predictive information. Standalone imputation models [21]–[23] impute missing values at the preprocessing stage, which prevents models from exploiting the predictive information associated with gaps. Recently, researchers
introduced imputation approaches that can be integrated with
predictive models in an end-to-end manner. RNN-based im-
putation models, such as GRU-D [24] and BRITS [25],
demonstrate better performance when learning on sequential
data with missing values. However, the recurrent nature of
these models makes it difficult to perform imputation and
predictions on long sequences. An alternative to imputation
is to treat data with missing values as irregularly sampled
time series. In this direction, models like multi-task Gaussian
process RNN (MGP-RNN) [26] and neural ordinary differential equation (ODE) based RNNs [27] have been proposed to accommodate the irregularity by creating evenly-sampled latent
values. However, these models are computationally prohibitive
for long sequences as they either operate with a very large
covariance matrix or forward intermediate values to an ODE
solver numerous times. We note that the aforementioned imputation approaches are not suitable for handling the large gaps common in intraoperative time series, because the uncertainty about missing values grows with the time elapsed since the last observation. Moreover, the large gaps in intraoperative time series may themselves carry information about the surgery. In the design of the kernelized local attention, we overcome this issue by exploiting the locality of the attention and representing missing values with 0s. This design encodes the gap information and thereby helps capture the clinical information associated with gaps.
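The following is a minimal sketch of this zero-filling scheme; the shapes and names are illustrative assumptions. Because missing entries are set to 0 rather than imputed, the pattern of zeros inside each local attention window itself carries the gap information.

```python
import numpy as np

# Represent missing values as 0s instead of imputing them; within a local
# window, the run of zeros encodes where a gap occurs and how long it is.
def zero_fill(x):
    # x: (time, variables), with NaN marking missing observations
    mask = np.isnan(x)
    return np.where(mask, 0.0, x), mask

x = np.array([[1.2, np.nan],
              [np.nan, np.nan],         # a gap spanning all variables
              [0.9, 4.1]])
filled, mask = zero_fill(x)
```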
Several approaches have been proposed for interpreting the predictions made by machine learning models, including model-agnostic approaches and feature attribution approaches designed for deep models. Model-agnostic explanation approaches, such as LIME [7] and SHAP [8], provide general frameworks that work across different models while treating them as black boxes. There are also feature attribution approaches designed for interpreting neural networks [9], [10], [28], [29].
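For intuition, the sketch below shows the local-surrogate idea underlying approaches such as LIME: perturb the input around one instance, query the model as a black box, and fit a proximity-weighted linear model whose coefficients act as the explanation. The black-box function and all hyperparameters are illustrative assumptions, not LIME's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of a local linear surrogate: sample around one instance, query the
# black-box model, and fit a proximity-weighted linear model. The black box
# and all hyperparameters here are illustrative assumptions.
def local_linear_explanation(black_box, x, n_samples=500, sigma=0.5):
    # x: (n_features,) instance to explain
    rng = np.random.default_rng(0)
    perturbed = x + sigma * rng.standard_normal((n_samples, x.size))
    preds = black_box(perturbed)                # black-box queries only
    # Weight samples by proximity to x (an RBF kernel on distance).
    dists = np.linalg.norm(perturbed - x, axis=1)
    weights = np.exp(-(dists ** 2) / (2 * sigma ** 2))
    surrogate = Ridge(alpha=1.0).fit(perturbed, preds, sample_weight=weights)
    return surrogate.coef_                      # per-feature attribution

# Example with a toy black box:
f = lambda X: X @ np.array([2.0, -1.0, 0.0]) + 0.5
coefs = local_linear_explanation(f, np.array([1.0, 0.0, 2.0]))
```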
Deep models are not always black boxes: when properly designed, attention models can be explainable by themselves. Self-explaining models allow predictions to be interpreted