
ENSEMBLEMOT: A STEP TOWARDS ENSEMBLE LEARNING OF MULTIPLE OBJECT
TRACKING
Yunhao Du1, Zihang Liu1, Fei Su1,2
1Beijing University of Posts and Telecommunications
2Beijing Key Laboratory of Network System and Network Culture, China
{dyh_bupt,henry0820,sufei}@bupt.edu.cn
ABSTRACT
Multiple Object Tracking (MOT) has rapidly progressed in
recent years. Existing works tend to design a single tracking
algorithm to perform both detection and association. Though
ensemble learning has been exploited in many tasks, e.g., classification
and object detection, it has not been studied in the
MOT task, mainly because of the complexity of the task and the
nature of its evaluation metrics. In this paper, we propose a simple but
effective ensemble method for MOT, called EnsembleMOT,
which merges multiple tracking results from various trackers
with spatio-temporal constraints. Meanwhile, several post-processing
procedures are applied to filter out abnormal results.
Our method is model-independent and requires no
training. Moreover, it can easily work in
conjunction with other algorithms, e.g., tracklet interpolation.
Experiments on the MOT17 dataset demonstrate the effectiveness
of the proposed method. Code is available at
https://github.com/dyhBUPT/EnsembleMOT.
Index Terms—Multiple Object Tracking, Ensemble
Learning
1. INTRODUCTION
Multiple Object Tracking (MOT) aims to detect and track all
objects of specific classes frame by frame, and plays an essential
role in video analysis and understanding. In the past
few years, the MOT task has been dominated by the tracking-by-detection
(TBD) paradigm [3,4], which performs detection
per frame and formulates the MOT problem as a data association
task. Recently, some works integrate the detector and embedding
model (i.e., appearance or motion embedding) into a
unified framework, which can benefit from multi-task learning
and tends to achieve a better speed-accuracy trade-off [1,
5].
Ensemble learning [6] generally refers to training and/or
combining multiple models, which is widely used in machine
learning [7,8,9,10] and computer vision [11,12,13,14].
For example, for image classification, Wortsman et al. propose
Model Soups, which average the weights of multiple models to
improve classification accuracy [11]. To estimate more
stable and accurate pseudo labels for semi-supervised image
classification, Temporal Ensembling [12] aggregates the pre-
dictions of multiple previous network evaluations into an en-
semble prediction. For the object detection task, Soft-NMS
[13] and WBF [14] are widely used to combine results from
multiple detectors.
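As an illustration of how detections pooled from several detectors can be combined, the following is a minimal pure-Python sketch of Gaussian Soft-NMS [13]. The function names, the (x1, y1, x2, y2) box format and the default values of `sigma` and `score_thresh` are assumptions made for this example, not taken from any particular implementation.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS over boxes pooled from one or more detectors.

    Returns (box, rescored_confidence) pairs in selection order.
    sigma and score_thresh are illustrative defaults (assumptions).
    """
    pool = list(zip(boxes, scores))
    kept = []
    while pool:
        # Select the remaining box with the highest current score.
        best = max(range(len(pool)), key=lambda i: pool[i][1])
        box, score = pool.pop(best)
        kept.append((box, score))
        # Decay the scores of the other boxes according to their overlap
        # with the selected box, instead of discarding them outright.
        pool = [(b, s * math.exp(-box_iou(box, b) ** 2 / sigma))
                for b, s in pool]
        # Drop boxes whose score has decayed below the threshold.
        pool = [(b, s) for b, s in pool if s >= score_thresh]
    return kept
```

Here the per-detector outputs are simply concatenated into `boxes` and `scores` before the call. Heavily overlapping boxes survive with reduced scores, which suits score-based metrics such as mAP.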
Ensemble methods are also used in several MOT works.
Peng et al. propose the Layer-wise Aggregation Discriminative
Model (LADM) [15], which uses the weighted average
of predictions from three softmax layers to judge whether a
detection box represents a person or not. However, it operates at
the detection stage and does not combine tracking
algorithms. Inspired by Soft-NMS, TrackNMS is designed in
GIAOTracker [16] to fuse multiple tracking results. It first
sorts trajectories by the average confidence scores, and then
performs non-maximum suppression (NMS) based on the
temporal IoU. Though it is designed for combining multiple
trackers, it is evaluated with the score-based metric mAP
[17], under which redundant low-score results can benefit performance.
In contrast, the instance-based metrics, i.e., MOTA [18],
IDF1 [19] and HOTA [20], are more common and reasonable
evaluation metrics for the MOT task.
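The two TrackNMS steps described above (sorting trajectories by average confidence, then suppressing by temporal IoU) can be sketched as follows. This is a simplified illustration rather than the GIAOTracker implementation: the track representation (a dict mapping frame index to a (box, score) pair), the definition of temporal IoU used here (per-frame box IoU summed over shared frames, divided by the number of frames in the union of both tracks), the hard suppression rule and the threshold value are all assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def temporal_iou(t1, t2):
    """Assumed temporal IoU: summed box IoU on shared frames,
    normalized by the number of frames covered by either track."""
    shared = t1.keys() & t2.keys()
    union = t1.keys() | t2.keys()
    if not union:
        return 0.0
    return sum(box_iou(t1[f][0], t2[f][0]) for f in shared) / len(union)


def mean_score(track):
    """Average detection confidence of a non-empty track."""
    return sum(score for _, score in track.values()) / len(track)


def track_nms(tracks, tiou_thresh=0.5):
    """Keep high-confidence tracks; drop tracks that overlap a kept one.

    tracks: list of non-empty dicts {frame: ((x1, y1, x2, y2), score)},
    pooled from several trackers. tiou_thresh is an illustrative value.
    """
    kept = []
    # Visit tracks from highest to lowest average confidence.
    for cand in sorted(tracks, key=mean_score, reverse=True):
        # Keep the candidate only if it does not duplicate a kept track.
        if all(temporal_iou(cand, k) < tiou_thresh for k in kept):
            kept.append(cand)
    return kept
```

Hard suppression is used here for brevity; a score-decay variant in the spirit of Soft-NMS would retain low-score duplicates, which helps under mAP but is penalized by instance-based metrics.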
To sum up, ensemble methods remain underexplored in the MOT
task. We summarize the reasons as follows:
• MOT is a complex downstream task. The diversity and
complexity of various tracking algorithms make it difficult
to design a general and effective ensemble algorithm.
• The tracking results are temporal sequences, not just
classification scores or detection bounding boxes (bboxes).
Therefore, intuitive methods like voting can’t be di-
rectly applied.
• The widely used metrics are instance-based. Compared
with score-based metrics (e.g., mAP) in image classifi-
cation and object detection, the instance-based metrics
have no tolerance for redundant results, which intro-
duces greater risk to ensemble methods.
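To make the last point concrete, the sketch below computes MOTA from its standard CLEAR MOT definition [18], MOTA = 1 - (FN + FP + IDSW) / GT, where GT is the total number of ground-truth boxes. The function name and all counts in the example are hypothetical values chosen for illustration, not results from any experiment.

```python
def mota(num_gt, fn, fp, idsw):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + idsw) / num_gt


# Hypothetical counts for a single tracker (illustrative only).
single = mota(num_gt=1000, fn=200, fp=50, idsw=10)

# Naively merging a second tracker's output without removing duplicates:
# assume 100 redundant boxes are added, each counted as a false positive.
merged = mota(num_gt=1000, fn=200, fp=150, idsw=10)

print(round(single, 2), round(merged, 2))
```

Every redundant box lowers MOTA regardless of its confidence score, whereas under mAP a low-score duplicate is ranked last and costs little. An ensemble method for MOT therefore has to remove duplicates rather than merely down-weight them.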
arXiv:2210.05278v2 [cs.CV] 17 Feb 2023