2 Trong-Tung Nguyen et al.
changing background as well as variations of pill instances in terms of shape, color,
and texture. Several works have been developed to mitigate such challenges,
most of them based on hand-crafted features [3, 5, 6, 10]. Ling et al. [16]
build on these works, combining them with a two-stage training
strategy to create a novel framework for few-shot pill recognition.
Another approach is to exploit external knowledge from medical text
data (e.g., prescriptions) to improve the detection performance of visual-based
models [18, 19]. However, existing models are often limited by the novel pill
categories that frequently arrive at a pill recognition system. This often
happens when a novel class of pill is introduced through images uploaded
by end-users from mobile devices or by the healthcare community. A
report [1] shows that roughly 40-50 novel drugs are approved each
year. In such a scenario, the core learning model of the system, which is often
deployed on a lightweight device (e.g., mobile phones), might need to be retrained
on the whole training set, including the novel categories.
This is not an effective strategy for many reasons. The memory available for
such extensive training data is often limited. Acquiring novel knowledge while
maintaining what the model has learned so far requires the system to store a
huge number of samples for both old and new classes, which is infeasible.
Another solution is to provide an initial training dataset for the model.
The model is then fine-tuned on novel categories to update its knowledge
of new pill instances. However, this fine-tuning scheme suffers from a
well-known failure mode of learning systems called catastrophic
forgetting [8, 9]: performance on old tasks degrades as the model is trained on
data from novel tasks. The system therefore needs a flexible and effective strategy
to handle novel real-world pill categories. With such a strategy,
it would be able to learn incrementally from new classes without
exhaustively storing samples of old categories. This scenario is called class-incremental
learning (CIL).
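To make the scenario concrete, the following is a minimal sketch of how a labelled pill dataset could be split into a sequence of incremental steps, each introducing a disjoint group of classes. The function name `make_class_incremental_tasks`, the `classes_per_task` parameter, and the toy data are illustrative assumptions rather than part of any cited method.

```python
import random


def make_class_incremental_tasks(labels, classes_per_task, seed=0):
    """Split a labelled dataset into steps with disjoint class groups.

    labels: list of integer class ids, one per sample.
    Returns a list of (class_group, sample_indices) pairs in arrival order.
    """
    classes = sorted(set(labels))
    # Assumption: the arrival order of classes is a seeded random permutation.
    random.Random(seed).shuffle(classes)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        group = classes[start:start + classes_per_task]
        members = set(group)
        indices = [i for i, y in enumerate(labels) if y in members]
        tasks.append((group, indices))
    return tasks


# Toy data: 6 pill categories with 3 images each, arriving 2 categories per step.
labels = [c for c in range(6) for _ in range(3)]
tasks = make_class_incremental_tasks(labels, classes_per_task=2)

seen = set()
for step, (group, indices) in enumerate(tasks):
    seen.update(group)
    # At each step only the new classes' samples are available for training,
    # while evaluation covers every class seen so far.
    print(f"step {step}: new classes {group}, "
          f"{len(indices)} training samples, "
          f"evaluate on {len(seen)} classes")
```

At each step the learner only receives samples of the newly arrived classes, yet it is evaluated on all classes seen so far, which is where catastrophic forgetting becomes visible.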
Research on class-incremental learning (CIL) for visual tasks
has progressed significantly over the years. In the general CIL setting,
disjoint sets of classes arrive at the learning algorithm gradually.
Many works, such as [4, 13, 21–23], have proposed methods that
employ existing techniques to tackle their common challenge: catastrophic forgetting.
Knowledge distillation [12] is the most widely adopted technique
for tackling catastrophic forgetting and was first applied to the CIL setting
by Li et al. [15]. Subsequently, an extended version [21] that additionally uses
representation learning was proposed, in which valuable herding exemplars are
replayed frequently to retain old knowledge. The herding strategy
picks the samples that are nearest to the mean sample of their class. Using
this herding strategy, Castro et al. [4] built an end-to-end framework
with an additional balanced fine-tuning strategy. On the other hand, Wu et
al. [22] introduced a bias correction approach by adding a bias correction layer.
This correction is applied at the last layer in each incremental learning task to refine the
overall scores for the final prediction. Meanwhile, Hou et al. [13] identified the