
addressing the aforementioned areas, as well as a method for detecting events through SHAP. Moreover, we provide explicit SHAP values for the broadly used time series models AR, MA, ARMA, VARMA, and VARMAX. Our contributions, in the structure of the paper, are:
(1) Proof of suitability for the application of KernelSHAP in the context of time series data. Our proof builds on the approximation of SHAP with linear models in KernelSHAP, extending it to the time series domain through VAR models, whose calibration also approximates SHAP. We call this alteration VARSHAP.
(2) Explicit SHAP values for widely used time series models: autoregressive, moving average and vector models of the same with exogenous variables.
(3) We present Time Consistent SHAP Values, a new feature importance technique for the time series domain, which leverages the temporal component of the problem in order to cut down the sample space involved in the KernelSHAP computation.
(4) An aggregation technique which is able to capture surges in feature importance across time steps, which we call event detection.
1.1 Related Work
Shapley Additive Explanations (SHAP values) [LL17] are a broadly used feature importance technique that leverages an analogy between value attribution in co-operative game theory and feature importance assignment in machine learning models. SHAP operates by assessing the marginal contribution of a chosen feature to all possible combinations of features which do not contain the feature of interest. However, doing so is computationally expensive, prompting the development of approximation algorithms. An example of this is KernelSHAP [LL17], which is shown to converge to SHAP values as the sample space approaches the set of all possible coalitions.
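As a minimal illustration (not part of this paper's method), the open-source `shap` Python package exposes KernelSHAP through `shap.KernelExplainer`. The sketch below uses a placeholder linear predictor and random background data to show the typical model-agnostic workflow; the model, data, and sample count are all assumptions for the example.

```python
import numpy as np
import shap  # assumes the open-source `shap` package is installed

# Hypothetical black-box model: takes an (n_samples, n_features) array
# and returns one prediction per row. Any callable with this signature works.
def predict_fn(X: np.ndarray) -> np.ndarray:
    return X @ np.array([0.5, -1.0, 2.0]) + 0.1

background = np.random.randn(50, 3)            # background data used to mask out features
explainer = shap.KernelExplainer(predict_fn, background)

x = np.random.randn(1, 3)                      # instance to explain
shap_values = explainer.shap_values(x, nsamples=200)  # number of sampled coalitions
print(shap_values)
```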
Gradient-based methods are also popular feature importance techniques that measure the sensitivity of the output to perturbations of the input. We may use vanilla gradients for a particular sample [SVZ13], or use regularization techniques such as in SmoothGrad [STK+17] or integrated gradients [STY17]. These methods tend to be less expensive than computing SHAP values. [BHPB21] applies a range of these techniques, as well as LIME explanations [RSG16], in the context of financial time series.
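For concreteness, the snippet below is a minimal sketch of vanilla gradient (saliency) attribution for a generic differentiable PyTorch model; the model architecture and input size are placeholders, not anything prescribed by the cited works.

```python
import torch
import torch.nn as nn

# Placeholder model: any differentiable nn.Module mapping inputs to a scalar output.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))

x = torch.randn(1, 8, requires_grad=True)  # sample to explain

# Vanilla gradient: sensitivity of the output to each input component.
output = model(x)
output.sum().backward()
saliency = x.grad.abs()   # per-dimension importance proxy
print(saliency)
```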
Indeed, much research has gone into speeding up the computation of SHAP values. For example, FastSHAP uses a deep neural network to approximate the SHAP imputations [JSC+21]. While this approach does not maintain the desirable theoretical guarantees of SHAP, the technique is fast and generally accurate. However, since this work relies on neural networks, it raises the potential challenge of having to explain the explanation.
With respect to Recurrent Neural Networks in particular, many of the above model-agnostic techniques apply. However, there are certain methods that are specific to the time series domain and to certain model architectures. [MLY18] leverage a decomposition of the LSTM function to compute the relevance of a particular feature in a given context. TimeSHAP [BSC+21], the closest work to our paper, is an algorithm that extends KernelSHAP to the time series domain by selectively masking either features or time steps.
There are other ways of explaining time series predictions. By computing neural activations for a range of samples, [KJFF15] finds that certain neurons of neural networks act as symbols for events occurring in the data. [AMMS17] apply Layerwise Relevance Propagation (LRP) [BBM+15] to recurrent neural networks.
Finally, this paper makes use of traditional time series modelling techniques, including autoregressive models (AR), moving average models (MA), autoregressive moving average models (ARMA) and vector autoregressive moving average models with exogenous variables (VARMAX). We point to [SSS00] for a general introduction to the topic and to [Mil16] for specifics on the VARMAX model and how to calibrate or train these models.
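For readers who want to experiment, the following is a minimal sketch (not taken from the paper) of calibrating a VARMAX model with the `statsmodels` package on synthetic data; the series, model orders, and exogenous input are placeholders chosen only for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.varmax import VARMAX

rng = np.random.default_rng(0)

# Synthetic two-variable endogenous series and one exogenous series.
endog = pd.DataFrame(rng.standard_normal((200, 2)), columns=["y1", "y2"])
exog = pd.DataFrame(rng.standard_normal((200, 1)), columns=["x1"])

# VARMAX(p=1, q=1): vector ARMA with exogenous regressors.
model = VARMAX(endog, exog=exog, order=(1, 1))
results = model.fit(disp=False)   # maximum likelihood calibration
print(results.summary())
```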
2 SHAP FOR TIME SERIES
2.1 SHAP and KernelSHAP
Let the supervised learning problem on time series be defined by $\mathcal{D} = (\mathcal{X}, \mathcal{Y})$, the data of the problem, with $\mathcal{X} = \{X\}_{j \in I} \subset \mathbb{R}^{N \times W}$ and $\mathcal{Y} = \{y\}_{j \in I} \subset \mathbb{R}^{M}$, where $X$ is a matrix of $W$ column vectors (each column represents the $N$ features), one for each step in the lookback window, $I$ is an indexing set, and $|I|$ is the number of samples in the dataset. $W$ is the size of the window, and $N$ is the number of features shown to the network at each time step $w \in [W]$, where $[W]$ denotes $\{1, \dots, W\}$. Finally, $M$ is the dimensionality of the output space.
A function approximating the relationship described by $\mathcal{D}$ and parametrised by $\theta \in \Theta$, a parameter space, is of type $f_\theta : \mathbb{R}^{N \times W} \to \mathbb{R}^{M}$. In particular, we let $f_\theta$ be a recurrent neural network, such as an LSTM [HS97].
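As a concrete, purely hypothetical instantiation of $f_\theta$, the sketch below defines a small PyTorch LSTM that consumes a window of $W$ time steps with $N$ features each and outputs an $M$-dimensional prediction; all sizes and the architecture are placeholder assumptions, not the model used in the paper's experiments.

```python
import torch
import torch.nn as nn

class WindowLSTM(nn.Module):
    """Placeholder f_theta: maps a lookback window in R^{N x W} to R^M."""
    def __init__(self, n_features: int, hidden: int, m_outputs: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, m_outputs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, W, N) -- W lookback steps, N features per step.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])    # predict from the final hidden state

N, W, M = 4, 10, 1
f_theta = WindowLSTM(n_features=N, hidden=32, m_outputs=M)
y_hat = f_theta(torch.randn(8, W, N))   # batch of 8 windows
print(y_hat.shape)                      # torch.Size([8, 1])
```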
The formula for the SHAP value [LL17] of feature $i \in C$, where $C$ is the collection of all features, is given by:
$$\phi_v(i) = \sum_{S \in \mathcal{P}(C \setminus \{i\})} \frac{(|C| - |S| - 1)! \, |S|!}{|C|!} \, \Delta_v(S, i),$$
where $\mathcal{P}(C)$ is the powerset of the set of all features, and, for a value function $v : \mathcal{P}(C) \to \mathbb{R}$, the marginal contribution $\Delta_v(S, i)$ of a feature $i$ to a coalition $S \subseteq C \setminus \{i\}$ is given by
$$\Delta_v(S, i) = v(S \cup \{i\}) - v(S).$$
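To make the combinatorics concrete, the following sketch (an illustration, not the paper's code) computes exact Shapley values for a toy value function by enumerating every coalition; this is only feasible for a handful of features, which is precisely what motivates approximations such as KernelSHAP.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions of `features`."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):
            for S in combinations(others, size):
                # Shapley weight (|S|)! (n - |S| - 1)! / n! times the marginal contribution.
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (value_fn(set(S) | {i}) - value_fn(set(S)))
        phi[i] = total
    return phi

# Toy value function: a coalition's worth is the sum of per-feature payoffs.
payoff = {"a": 1.0, "b": 2.0, "c": -0.5}
print(shapley_values(list(payoff), lambda S: sum(payoff[f] for f in S)))
# -> each feature recovers its own payoff, as expected for an additive game
```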
Even for small values of $|C|$, $|\mathcal{P}(C)| = 2^{|C|}$ is large, implying that the SHAP values cannot be easily computed. KernelSHAP, again presented in [LL17], is a commonly used approximation of SHAP, which provably converges to SHAP values as the number of perturbed input features approaches $|\mathcal{P}(C)|$. More precisely, we define KernelSHAP as the SHAP values of the linear model $g$ given by the minimization of
$$\min_{g \in LM} \sum_{z \in \mathcal{Z}} \big(f_\theta(h_x(z)) - g(z)\big)^2 \, \pi_x(z), \qquad (1)$$
where $h_x : \mathcal{Z} \to \mathbb{R}^d$ is a masking function on a $d$-dimensional $\{0,1\}$-vector $z$ belonging to a sample $\mathcal{Z} \subseteq \{0,1\}^d$, the collection of all such possible vectors, each representing a different coalition. In practice, this function maps a coalition to the masked data point $x^\star$, on which we compute the prediction $f_\theta(h_x(z))$. Finally, $\pi_x$ is the combinatorial kernel, from which the method gets its name, given