approach is most similar to the first one. However, our models are not pre-trained as is standard in
NLP. Despite this, we show our approach remains successful.
Climate Impact.
A major source of inaccuracies in weather and climate models arises from
‘unresolved’ processes (such as those relating to convection and clouds) [27, 28, 29, 30, 31, 32].
These occur at scales smaller than the resolution of the climate model but have key effects on the
overall climate. For example, most of the variability in how much global surface temperatures
increase after CO$_2$ concentrations double is due to the representation of clouds [27, 29, 33]. There
will always be processes too costly to be explicitly resolved by our current operational models.
The standard approach to dealing with these unresolved processes is to model their effects as a function of the resolved ones. This is known as ‘parameterization’, and there is a substantial body of ML work on it [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. We propose that by using all available high-resolution
data, better ML parameterization schemes and therefore better climate models can be created.
2 Methods
Our approach is a two-step process: first, we train our model on the high-resolution data, and second,
we fine-tune it on the low-resolution (target) data.
We denote the low-resolution data at time $t$ as $X_t \in \mathbb{R}^d$. The goal is to create a sequence model for the evolution of $X_t$ through time, whilst only tracking $X_t$. We denote the high-resolution data at time $t$ as $Y_t \in \mathbb{R}^{dm}$. In parameterization, $X_t$ is often a temporal and/or spatial averaging of $Y_t$. We wish to use $Y_t$ to learn a better model of $X_t$.
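For concreteness, the snippet below sketches one such relationship, a pure spatial average in which each low-resolution value summarises $m$ high-resolution values; the dimensions are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Toy coarse-graining: each low-resolution cell is the spatial average of m
# high-resolution values. Dimensions here are illustrative only.
d, m = 8, 4
Y_t = np.random.randn(d * m)              # high-resolution state, R^{dm}
X_t = Y_t.reshape(d, m).mean(axis=1)      # low-resolution state, R^d
```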
A range of ML models for sequences may be used, but we suggest they should contain both shared
and task-specific layers.
We first model $Y_t$, training in the standard teacher-forcing way for ML sequence models. We use the framework of probability, and so train by maximising the log-likelihood of $Y_t$, $\log \Pr(y_1, y_2, \ldots, y_n)$. Informally, the likelihood measures how likely $Y_t$ is to be generated by our sequence model. Next, the weights of the shared layers are frozen and the weights of the task-specific layers are trained to model the low-resolution training data, $X_t$. Again, under the probability framework, this means maximising the log-likelihood of $X_t$, $\log \Pr(x_1, x_2, \ldots, x_n)$.
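A possible shape for this two-stage procedure is sketched below. The sketch assumes a model object exposing shared (recurrent) parameters, task-specific heads, and negative log-likelihood methods for each data stream; the function and method names are hypothetical and not taken from the paper's code.

```python
import torch

def pretrain_on_highres(model, highres_loader, epochs, lr=1e-3):
    """Stage 1: maximise log Pr(y_1, ..., y_n) with teacher forcing."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for y_seq, x_seq in highres_loader:      # paired high-/low-resolution sequences
            opt.zero_grad()
            model.nll_highres(y_seq, x_seq).backward()
            opt.step()

def finetune_on_lowres(model, lowres_loader, epochs, lr=1e-3):
    """Stage 2: freeze the shared layers, maximise log Pr(x_1, ..., x_n)."""
    for p in model.shared_parameters():          # e.g. the GRU-cell weights
        p.requires_grad = False
    opt = torch.optim.Adam(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for x_seq in lowres_loader:              # low-resolution sequences only
            opt.zero_grad()
            model.nll_lowres(x_seq).backward()
            opt.step()
```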
2.1 RNN Model
We use a recurrent neural network (RNN) to demonstrate our approach, though the approach is not limited to RNNs. RNNs are well-suited to parameterization tasks [4, 5, 11, 14, 34] as they only track a summary representation of the system history, reducing simulation cost. This is unlike the Transformer [35], which requires a slice of the actual history of $X_t$.
For our RNN, the hidden state is shared and its evolution is described by
$$h_{t+1} = f_\theta(h_t, X_t)$$
where $h_t \in \mathbb{R}^H$ and $f_\theta$ is a GRU cell [36]. We model the low-resolution data as
$$X_{t+1} = X_t + g_\theta(h_{t+1}) + \sigma z_t \tag{1}$$
and the high-resolution as
$$Y_{t+1} = Y_t + j_\theta(h_{t+1}) + \rho w_t \tag{2}$$
where the functions $g_\theta$ and $j_\theta$ are represented by task-specific dense layers, $z_t \sim \mathcal{N}(0, I)$, and $w_t \sim \mathcal{N}(0, I)$. The learnable parameters are the neural network weights $\theta$ and the noise terms $\sigma \in \mathbb{R}^1$ and $\rho \in \mathbb{R}^1$. Further details are in Appendix A.
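A minimal PyTorch sketch of this model is given below. It assumes that $g_\theta$ and $j_\theta$ are single dense layers, that the noise scales are parameterised on the log scale for positivity, and that during the high-resolution stage the GRU is driven by the corresponding low-resolution input $X_t$; these are choices of this sketch rather than details confirmed by the text (the actual configuration is in Appendix A of the paper).

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class SharedRNN(nn.Module):
    """Sketch of the shared-GRU model with task-specific heads (Eqs. 1 and 2)."""

    def __init__(self, d, m, H):
        super().__init__()
        self.gru = nn.GRUCell(d, H)                     # shared f_theta
        self.g = nn.Linear(H, d)                        # low-resolution head g_theta
        self.j = nn.Linear(H, d * m)                    # high-resolution head j_theta
        self.log_sigma = nn.Parameter(torch.zeros(1))   # sigma = exp(log_sigma) > 0
        self.log_rho = nn.Parameter(torch.zeros(1))     # rho = exp(log_rho) > 0

    def shared_parameters(self):
        return self.gru.parameters()

    def nll_lowres(self, x_seq):
        """Negative log-likelihood of X_{1:n} under Eq. (1), teacher forcing."""
        batch, T, _ = x_seq.shape
        h = x_seq.new_zeros(batch, self.gru.hidden_size)
        nll = 0.0
        for t in range(T - 1):
            h = self.gru(x_seq[:, t], h)                # h_{t+1}
            mean = x_seq[:, t] + self.g(h)              # X_t + g_theta(h_{t+1})
            nll = nll - Normal(mean, self.log_sigma.exp()).log_prob(x_seq[:, t + 1]).sum()
        return nll / batch

    def nll_highres(self, y_seq, x_seq):
        """Negative log-likelihood of Y_{1:n} under Eq. (2); driving the GRU with
        the low-resolution X_t is an assumption of this sketch."""
        batch, T, _ = y_seq.shape
        h = y_seq.new_zeros(batch, self.gru.hidden_size)
        nll = 0.0
        for t in range(T - 1):
            h = self.gru(x_seq[:, t], h)                # h_{t+1}
            mean = y_seq[:, t] + self.j(h)              # Y_t + j_theta(h_{t+1})
            nll = nll - Normal(mean, self.log_rho.exp()).log_prob(y_seq[:, t + 1]).sum()
        return nll / batch
```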
2.2 Evaluation
We use hold-out log-likelihood to assess generalization to unseen data, a standard probabilistic approach in ML. The models were trained with 15 different random seed initializations to ensure that differences in the results were due to our approach rather than to a quirk of a particular random seed; these runs are used to generate 95% confidence intervals. Likelihood is neither easily interpretable nor the end-goal of operational climate models. Ultimately, we want to use weather and climate models to make forecasts, and it is common to measure forecast skill with error and spread [6, 37], so these are also reported for evaluation.
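To make the likelihood part of this evaluation concrete, one possible computation of the hold-out log-likelihood and a 95% confidence interval across seeds is sketched below; it assumes one trained model per seed, reuses the `nll_lowres` method from the earlier sketch, and uses a normal approximation over the seed-to-seed spread, none of which is specified by the paper.

```python
import numpy as np
import torch

def heldout_ll_with_ci(models, heldout_x):
    """Hold-out log-likelihood averaged over seeds, with a 95% confidence interval.

    `models` is a list of trained models (one per random seed) and `heldout_x`
    an unseen low-resolution sequence tensor; names are illustrative.
    """
    with torch.no_grad():
        lls = np.array([-m.nll_lowres(heldout_x).item() for m in models])
    mean = lls.mean()
    half_width = 1.96 * lls.std(ddof=1) / np.sqrt(len(lls))   # normal approximation
    return mean, (mean - half_width, mean + half_width)
```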