
context prior. Furthermore, as shown in the experiments, our
method leads to improved domain generalization compared
to other state-of-the-art approaches. Nevertheless, in several
real-world applications (e.g., radar tracking), improved domain
generalization cannot address the limitations inherent to the
task itself. For example, tracking in small, crowded scenes
with obstacles becomes increasingly difficult. To assess the
reliability of our Meta-RL method, we develop an uncertainty
mechanism via bootstrapped networks. The uncertainty
mechanism is combined with the context prior, which encodes
information about the task difficulty. In this approach, scenes
where tracking is prone to failure are classified as OOD, thus
quantifying the reliability of the tracker in the current scenario.
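As a rough sketch of this mechanism (with hypothetical names and interfaces, not the paper's implementation), the disagreement between bootstrapped value heads can serve as an uncertainty score that, weighted by a task-difficulty prior, flags a scene as OOD:

```python
import numpy as np

def ood_score(head_q_values: np.ndarray, difficulty_prior: float) -> float:
    """Epistemic uncertainty from bootstrapped heads, scaled by task difficulty.

    head_q_values:    array of shape (K, num_actions), one row per head.
    difficulty_prior: scalar in [0, 1]; a hypothetical stand-in for the
                      context prior's task-difficulty estimate.
    """
    # Disagreement between heads: mean per-action standard deviation.
    disagreement = head_q_values.std(axis=0).mean()
    # Harder tasks raise the score, making an OOD flag more likely.
    return disagreement * (1.0 + difficulty_prior)

def is_ood(head_q_values: np.ndarray, difficulty_prior: float,
           threshold: float = 0.5) -> bool:
    # Scenes whose weighted disagreement exceeds the threshold are
    # classified as OOD, i.e., tracking is prone to fail there.
    return ood_score(head_q_values, difficulty_prior) > threshold
```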
In summary, the contributions of this paper are the following:
1) Meta-RL for domain generalization using context priors,
without additional memory footprint
2) Enhanced OOD detection with context priors that encode
task difficulty
The remainder of the paper proceeds as follows: in Section II,
we introduce the radar-tracking problem and how to tackle it
with RL. Afterward, we explain the specific signal processing
in Section III. In the same section, we show how the input data
distribution is used to compute an informative context variable.
At the end of this section, using the context variable, we
propose a Meta-RL algorithm for environment generalization
and detection of OOD scenarios. In Section IV, we evaluate our
method on a multi-target radar-tracking dataset against related
Meta-RL methods. Our proposed approach outperforms
comparable Meta-RL approaches in peak performance by 16%
on the test scenarios and a fixed-parameter baseline by 35%.
Moreover, it detects OOD scenarios with an F1-score of 72%.
Thus, our approach is more robust to environmental changes
and reliably detects OOD scenarios.
Finally, in Section V, we summarize our results and give an
outlook on future work.
II. BACKGROUND AND MOTIVATION
In this section, we review the background and related work.
In Section II-A, we first outline the principle of radar tracking.
Afterward, we explain how RL can be used to optimize radar
tracking. Additionally, we extend this concept by introducing
the fundamentals of Meta-RL and Uncertainty-based RL.
A. Radar Tracking
Frequency Modulated Continuous Wave (FMCW) radars
can estimate the range, Angle of Arrival (AoA), and velocity of
targets. In the case of radar tracking, we use the range and AoA
to determine the target position. The typical radar tracking
pipeline can be divided into signal processing, detection, clus-
tering, and tracking, as shown in [13]. A high-level description
is given in Figure 1. The signal processing stage processes the
sensor data from each radar antenna to estimate the reflected
signal’s range and angle. The resulting image is a so-called
Range-Angle Image (RAI). Afterward, the RAI is convolved
with a window that determines the signal threshold based
on the surrounding noise. Usually, a Constant False Alarm
Rate (CFAR) algorithm [14] or a variation thereof defines
the threshold. A clustering algorithm groups nearby detected
signals, and the respective cluster means are input to the track-
ing stage. In this part of the pipeline, the track management
determines whether to assign the measurement to a track, open
a new track, discard the measurement, or delete inactive
tracks. Before updating the track, the measurement has to be
filtered by the tracking filter based on the last position and an
underlying movement model. The Unscented Kalman Filter
(UKF) is a commonly used tracking filter [15].

Fig. 1: High-level description of a radar tracking pipeline:
radar data passes through signal processing, detection,
clustering, track management, and the track filter to produce
target proposals, with hyperparameters governing each stage.

The presented tracking pipeline heavily
relies on hyperparameters. Namely, the tracking performance
depends on the gating threshold for assigning tracks and the
covariance matrices of the measurement and state-transition
models. Typically, these hyperparameters are determined by
an expert user from recorded data with ground-truth positions
by evaluating the Normalized Estimation Error Squared (NEES).
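For reference, the NEES at time step $k$ is commonly defined as
$$\epsilon_k = (\mathbf{x}_k - \hat{\mathbf{x}}_k)^\top \mathbf{P}_k^{-1} (\mathbf{x}_k - \hat{\mathbf{x}}_k),$$
where $\mathbf{x}_k$ is the ground-truth state, $\hat{\mathbf{x}}_k$ the filter estimate, and $\mathbf{P}_k$ the filter's state covariance; for a consistent filter, $\epsilon_k$ averages to the state dimension.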
However, this approach is unlikely to perform well once the
radar is deployed in a different environment. Thus, recent work
proposed to use RL to tackle the combinatorial problem of
finding the best set of parameters for any scenario [16].
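To make the detection stage concrete, the following is a minimal sketch of a one-dimensional cell-averaging CFAR detector (a simplified member of the CFAR family in [14]; function and parameter names are illustrative, not from the pipeline above):

```python
import numpy as np

def ca_cfar_1d(power, num_train=8, num_guard=2, scale=4.0):
    """Minimal cell-averaging CFAR along one range profile.

    power:     1D array of received power per range cell.
    num_train: training cells on each side used to estimate the noise floor.
    num_guard: guard cells on each side excluded around the cell under test.
    scale:     threshold factor controlling the false-alarm rate.
    Returns a boolean detection mask.
    """
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    for i in range(num_train + num_guard, n - num_train - num_guard):
        # Estimate the local noise floor from the training cells,
        # skipping the guard cells around the cell under test.
        left = power[i - num_guard - num_train : i - num_guard]
        right = power[i + num_guard + 1 : i + num_guard + 1 + num_train]
        noise = np.concatenate([left, right]).mean()
        detections[i] = power[i] > scale * noise
    return detections
```

In the pipeline of Figure 1, an analogous window is applied across both dimensions of the RAI, and the surviving cells are forwarded to the clustering stage.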
B. Reinforcement Learning
In RL, the problem is formalized as a Markov Decision
Process (MDP) $(\mathcal{S}, \mathcal{A}, R, p, \gamma)$, where $\mathcal{S}$ is the state space
given by the radar sensor input, $\mathcal{A}$ is the action space defined as the hyper-
parameters, $R$ is the reward given by the tracking performance shown
in [16], $p_\pi$ is the unknown transition probability between
states following policy $\pi$, and $\gamma$ is the discount factor. Let
$\tau = (s_t, a_t, r_t, s_{t+1})$ define the transition from state $s_t$ at time
step $t$ to the next state $s_{t+1}$ following action $a_t$ with reward $r_t$.
In traditional RL, the goal is to maximize the sum of expected
rewards
$$\sum_{t} \mathbb{E}_{(s_t, a_t) \sim p_\pi}\left[ r(s_t, a_t) \right]. \tag{1}$$
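To ground this formalization, a minimal, gym-style environment interface for the radar-tracking MDP could look as follows (a sketch with illustrative names and stubbed internals, not the implementation of [16]):

```python
import numpy as np

class RadarTrackingEnv:
    """Sketch of the MDP (S, A, R, p, gamma) for tracking-hyperparameter tuning.

    States s_t are range-angle images (RAIs), actions a_t are hyperparameter
    vectors (e.g., gating threshold, noise covariances), and the reward r_t
    scores the resulting tracking performance. All internals are placeholders.
    """

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma  # discount factor

    def reset(self) -> np.ndarray:
        return self._next_rai()  # initial state s_0

    def step(self, action: np.ndarray):
        # Apply the chosen hyperparameters to the tracking pipeline
        # (stubbed) and score the resulting tracks, e.g., via the NEES.
        reward = self._tracking_performance(action)
        next_state = self._next_rai()  # draw s_{t+1} from the unknown p_pi
        return next_state, reward, False  # (s_{t+1}, r_t, done)

    def _next_rai(self) -> np.ndarray:
        return np.zeros((64, 64))  # placeholder RAI

    def _tracking_performance(self, action: np.ndarray) -> float:
        return 0.0  # placeholder tracking score
```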
Maximizing this objective can be achieved by value iteration
methods [17]. There, we define a Q-value $Q(s_t, a_t)$ for each
state-action pair that estimates the expected return. Afterward, for each state