
or with low sampling rates) trajectories, since deep learning models excel at capturing distinctive data features.
t2vec [11] uses an RNN-based sequence-to-sequence model to learn trajectory embeddings, over which similarity is then computed. It uses a spatial proximity-aware loss that encodes the spatial distance between trajectories. E2DTC [14] leverages t2vec
as the backbone encoder for trajectory clustering. It adds
two loss functions to capture the similarity between trajec-
tories from the same cluster. TrjSR [12] captures the spatial
pattern of trajectories by converting trajectories into images.
CSTRM [13] uses vanilla self-attention as its trajectory encoder
and proposes a multi-view hinge loss to capture both point-
level and trajectory-level similarities between trajectories. It
generates positive trajectory pairs using two augmentation
methods, i.e., point shifting and point masking, which are
empirically shown to be sub-optimal in Section V.
Our model is a learned trajectory similarity measure. It aims
to address the limitations of the existing learned measures in
effectiveness and efficiency as discussed in Section I.
Contrastive learning. Contrastive learning [16], [17], [30]–
[38] is a self-supervised learning technique. Its core idea is to
maximize the agreement between the learned representations
of similar objects (i.e., positive pairs) while minimizing that
between dissimilar objects (i.e., negative pairs). The positive
and the negative sample pairs are generated from an input dataset, and no labeled (supervised) data is needed. Once
trained, the representation generation model (i.e., a backbone
encoder) can be connected to downstream models, to generate
object representations for downstream learning tasks (e.g.,
classification). A few studies introduce contrastive learning
into spatial problems, such as traffic flow prediction [39].
Self-attention models. Self-attention-based models [29], [40]–[42] learn the correlation between every pair of elements of an input sequence. Studies such as T3S and CSTRM have adopted self-attention for trajectory similarity measurement. Both models adopt the vanilla
multi-head self-attention encoder [29], while we propose a
dual-feature self-attention-based encoder which can capture
trajectory features from two levels of granularity and thus
generate more robust embeddings.
III. SOLUTION OVERVIEW
We consider a trajectory $T$ as a sequence of points recording discrete locations of the movement of some entity, denoted by $T = [p_1, p_2, \ldots, p_{|T|}]$, where $p_i$ is the $i$-th point on $T$ and $|T|$ denotes the number of points on $T$. A point $p_i$ is represented by its coordinates in a Euclidean space, i.e., $p_i = (x_i, y_i)$.
Problem statement. Given a set of trajectories, we aim to learn a trajectory encoder $F: T \rightarrow \mathbf{h}$ that maps a trajectory $T$ to a $d$-dimensional embedding vector $\mathbf{h} \in \mathbb{R}^d$. The distance between the learned embeddings of two trajectories should be negatively correlated to the similarity between the two trajectories (we use the $L_1$ distance in the experiments).
Model overview. Fig. 2 shows an overview of our TrajCL
model. The model follows the dual-branch structure of a strong
contrastive learning framework, MoCo [16]. Our technical contributions lie in the design of the learning modules highlighted in red in Fig. 2, which are detailed in the next section.
Given an input trajectory $T$, it first goes through a trajectory augmentation module to generate two different trajectory views (i.e., variants) of $T$, denoted as $\tilde{T}$ and $\tilde{T}'$, respectively. We propose four augmentation methods that emphasize different features of a trajectory (Section IV-A). The augmentation process operates on $T$ directly, and hence no additional manual data labeling effort is needed.
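As a toy illustration of view generation, the sketch below uses point masking (one of the augmentations discussed for CSTRM above); the function name and mask ratio are assumptions, and the four augmentations actually used by TrajCL are defined in Section IV-A.

import random

def point_masking(T, mask_ratio=0.3):
    """Drop a random subset of points of trajectory T (a list of
    (x, y) tuples) to produce one augmented view."""
    kept = [p for p in T if random.random() > mask_ratio]
    return kept if kept else T  # never return an empty view

# Two stochastic applications of an augmentation yield the two views:
# T_view, T_view_prime = point_masking(T), point_masking(T)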
The generated views $\tilde{T}$ and $\tilde{T}'$ are fed into pointwise trajectory feature enrichment layers to generate pointwise features beyond just the coordinates, which reflect the key characteristics of $\tilde{T}$ and $\tilde{T}'$ (Section IV-B). We represent the enriched features by two types of embeddings, the structural feature embedding and the spatial feature embedding, for each point in $\tilde{T}$ (and $\tilde{T}'$). These embeddings encode pointwise structural and spatial features, and they form a structural embedding matrix $\mathbf{T}$ ($\mathbf{T}'$) and a spatial embedding matrix $\mathbf{S}$ ($\mathbf{S}'$).
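In terms of shapes, the enrichment step can be pictured as follows; the layer names are placeholders, and the actual feature designs are given in Section IV-B.

def enrich(T_view, structural_layer, spatial_layer):
    """Map an n-point view to the two embedding matrices.

    T_view           : tensor of shape (n, 2), the (x, y) coordinates
    structural_layer : produces one structural embedding per point
    spatial_layer    : produces one spatial embedding per point
    Returns T_mat and S_mat, each of shape (n, d).
    """
    T_mat = structural_layer(T_view)  # structural embedding matrix
    S_mat = spatial_layer(T_view)     # spatial embedding matrix
    return T_mat, S_mat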
Then, we input $(\mathbf{T}, \mathbf{S})$ and $(\mathbf{T}', \mathbf{S}')$ into trajectory backbone encoders $F$ and $F'$ to obtain embeddings $\mathbf{h}$ and $\mathbf{h}'$ for $\tilde{T}$ and $\tilde{T}'$, respectively (Section IV-C). Our backbone encoders are adapted from the Transformer [29], and they encode the structural and spatial features of trajectories into the embeddings.
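A vanilla Transformer encoder, shown below as a stand-in, conveys the basic encoding step; the additive fusion of the two matrices and the mean pooling are simplifying assumptions, and TrajCL's dual-feature self-attention encoder in Section IV-C replaces them.

import torch
import torch.nn as nn

d = 128  # embedding dimension (illustrative value)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
    num_layers=2,
)

def encode(T_mat, S_mat):
    """Encode one trajectory view into a single embedding h of size d."""
    x = (T_mat + S_mat).unsqueeze(0)   # (1, n, d); naive feature fusion
    out = backbone(x)                  # (1, n, d)
    return out.mean(dim=1).squeeze(0)  # pool the n points into one vector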
Next, $\mathbf{h}$ and $\mathbf{h}'$ go through two projection heads $P$ and $P'$ (which are fully connected layers of the same structure) to be mapped into lower-dimensional vectors $\mathbf{z}$ and $\mathbf{z}'$, respectively:

$$\mathbf{z} = P(\mathbf{h}) = (\mathrm{FC} \circ \mathrm{ReLU} \circ \mathrm{FC})(\mathbf{h}) \quad (1)$$
Here, FC denotes a fully connected layer, ReLU denotes the ReLU activation function, and $\circ$ denotes function composition. We omit the equation for $P'$ as it is the same. Such projections have been shown to improve the embedding quality [17], [30].
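Eq. (1) translates directly into a two-layer head; the layer widths below are illustrative assumptions. Composition applies right to left, so the inner FC runs first.

import torch.nn as nn

d, d_proj = 128, 32  # input and output widths (assumed values)

# Projection head P of Eq. (1): z = (FC ∘ ReLU ∘ FC)(h).
P = nn.Sequential(
    nn.Linear(d, d),       # inner FC
    nn.ReLU(),
    nn.Linear(d, d_proj),  # outer FC, yields the lower-dimensional z
)
# z = P(h); P' is a separate head with the same structure.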
Model training. Following previous contrastive learning models, we use the InfoNCE [43] loss for model training. We use $\mathbf{z}$ and $\mathbf{z}'$ as a pair of positive samples, as they both come from variants of $T$ and are supposed to be similar in the learned latent space. The embeddings (except $\mathbf{z}'$) from projection head $P'$ in the current and recent past training batches are used as negative samples of $\mathbf{z}$. The InfoNCE loss $\mathcal{L}$ maximizes the agreement between positive samples and minimizes that between negative samples:
$$\mathcal{L}(T) = -\log \frac{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}')/\tau)}{\exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}')/\tau) + \sum_{j=1}^{|Q_{neg}|} \exp(\mathrm{sim}(\mathbf{z}, \mathbf{z}_j^-)/\tau)} \quad (2)$$
Here, $\mathrm{sim}$ is the cosine similarity. $\tau$ is a temperature parameter that controls the contribution of the negative samples [44].
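A minimal sketch of Eq. (2) for a single anchor follows; the temperature value is a common default rather than the paper's setting, and the code uses the standard cross-entropy reformulation of InfoNCE with the positive at index 0.

import torch
import torch.nn.functional as F

def info_nce(z, z_pos, z_negs, tau=0.07):
    """InfoNCE loss of Eq. (2) for one anchor embedding.

    z      : (d,)   projection of one view
    z_pos  : (d,)   projection of the other view (the positive)
    z_negs : (k, d) negative samples drawn from the queue Q_neg
    """
    pos = F.cosine_similarity(z, z_pos, dim=0) / tau                 # scalar
    negs = F.cosine_similarity(z.unsqueeze(0), z_negs, dim=1) / tau  # (k,)
    logits = torch.cat([pos.unsqueeze(0), negs]).unsqueeze(0)        # (1, k+1)
    # -log(exp(pos) / (exp(pos) + Σ exp(neg))) equals cross-entropy
    # with class 0 (the positive) as the target.
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))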
We use a queue $Q_{neg}$ of a fixed size (an empirical parameter) to store negative samples. The queue includes the embeddings from $P'$ in recent batches, to enlarge the negative sample pool, since more negative samples help produce more robust embeddings [16], [17]. To reuse negative samples from recent batches, the parameters of $F'$ and $P'$ should change smoothly between batches. We follow the momentum update [16] procedure to satisfy this requirement:
$$\Theta_{F'} = m\,\Theta_{F'} + (1-m)\,\Theta_F; \quad \Theta_{P'} = m\,\Theta_{P'} + (1-m)\,\Theta_P \quad (3)$$

Here, $\Theta$ denotes model parameters and $m \in [0, 1)$ is the momentum coefficient.
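In code, the momentum update of Eq. (3) is an in-place exponential moving average over parameters; m = 0.999 below follows MoCo [16] and is not necessarily the value used in our experiments.

import torch

@torch.no_grad()
def momentum_update(online, target, m=0.999):
    """Apply Eq. (3): Θ_target ← m·Θ_target + (1 − m)·Θ_online.

    `online` is F (or P); `target` is the slowly moving F' (or P').
    """
    for p_t, p_o in zip(target.parameters(), online.parameters()):
        p_t.data.mul_(m).add_((1.0 - m) * p_o.data)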