
Homogeneous and random basis-masks. Somewhat surprisingly, we can even generate all basis-masks
from the same initial task using different random seeds for the learning algorithm (but the same randomly
initialized backbone network). We show that with a sufficiently large number of such homogeneous im-
pressions, our algorithm learns linear combinations with close-to-benchmark accuracy on new tasks. We
are reminded of an infant taking different “snapshots” of the same object in order to infer the properties of
a new one. This homogeneous setting is particularly useful for addressing possible drift in the data; akin to
ensembling, it leverages the power of linear combinations for transfer learning. An additional important
advantage is that the homogeneous setting places no limit on the number of basis-masks we can generate ab initio.
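To make this concrete, the following is a minimal PyTorch sketch of how a single layer can modulate frozen, randomly initialized weights with a learned linear combination of fixed binary basis-masks; the class name, shapes, and initialization are illustrative assumptions rather than the exact implementation used in our experiments, but the only per-task parameters are the mixing coefficients (one per basis-mask for the layer).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskCombinationLinear(nn.Module):
    """Sketch of a layer whose frozen random weights are modulated by a
    learned linear combination of fixed binary basis-masks. Only the
    mixing coefficients `alpha` are trained for a new task."""

    def __init__(self, in_features, out_features, basis_masks):
        super().__init__()
        # Frozen, randomly initialized backbone weights (never trained).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01,
                                   requires_grad=False)
        # Fixed basis-masks, stacked into a (K, out_features, in_features) buffer.
        self.register_buffer("basis_masks", torch.stack(basis_masks))
        # Per-task parameters: one coefficient per basis-mask for this layer.
        self.alpha = nn.Parameter(torch.ones(len(basis_masks)) / len(basis_masks))

    def forward(self, x):
        # Soft mask = sum_k alpha_k * m_k, applied elementwise to the frozen weights.
        soft_mask = torch.einsum("k,koi->oi", self.alpha, self.basis_masks)
        return F.linear(x, self.weight * soft_mask)
```

For a new task, the backbone weights and the basis-masks stay fixed and only the coefficients are optimized with the task's training loss.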
As another baseline for ImpressLearn, we experimented with optimizing for a linear combination of entirely
random masks of desired sparsity. We demonstrate on several benchmarks that if we choose a sufficiently
large collection of such random basis-masks, our optimization still yields competitive performance. While
combinations of random masks naturally lag behind the heterogeneous and the homogeneous settings, we
show that there is a trade-off between the number of masks and their task-specificity (non-randomness). In
settings where producing task-specific basis-masks is costly, optimizing for a linear combination of a large
number of random masks can still yield satisfactory results.
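As an illustration of this baseline, the snippet below samples a collection of random binary basis-masks in which roughly a chosen fraction of the weights stays active (the fraction we refer to as sparsity above); the function name, seeding, and sampling scheme are illustrative assumptions and may differ from the exact procedure used in our experiments.

```python
import torch

def random_basis_masks(shape, num_masks, density, seed=0):
    """Sample `num_masks` independent binary masks in which roughly a
    `density` fraction of the entries are ones, i.e., roughly that
    fraction of the backbone weights in this layer stays active."""
    gen = torch.Generator().manual_seed(seed)
    return [(torch.rand(shape, generator=gen) < density).float()
            for _ in range(num_masks)]

# e.g., 100 random basis-masks keeping ~10% of the first LeNet-300-100 layer
masks = random_basis_masks((300, 784), num_masks=100, density=0.10)
```

Masks sampled this way can be plugged into the layer sketched earlier, with only the mixing coefficients trained per task.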
Example: LeNet-300-100 on RotatedMNIST. As a preview of our approach and its performance,
Figure 2 shows the accuracy of ImpressLearn compared to SupSup on the RotatedMNIST dataset. First, as a
sanity check, we apply basis-masks obtained from one task to tasks they were not optimized for. As expected,
this yields essentially random accuracy (see X in Figure 2), confirming that the performance of ImpressLearn
is beyond pure transfer learning and comes from linearly combining the initial impressions. Next, Figure
2 illustrates that, even with a small number of heterogeneous basis-masks, ImpressLearn is on par with or
superior to SupSup on unseen tasks. Note that, for each additional non-basis task, ImpressLearn with 10
basis-masks requires only 3×10 = 30 parameters (one per basis-mask per layer); in contrast, SupSup needs
to generate an entire binary mask, which at 10% sparsity requires specifying 25,000+ tensor indices.
Figure 2 also illustrates the performance of ImpressLearn with homogeneous basis-masks; in this setting,
we need a larger number of basis-masks to achieve accuracy comparable to the heterogeneous scenario.
Ultimately, however, we still match the performance of SupSup with a vastly smaller number of additional
per-task parameters. Lastly, the rightmost plot in Figure 2 shows that our algorithm can successfully
operate when task identity is not provided at inference (cf. GN regime (Wortsman et al., 2020)). Here, our
optimization procedure with the entropy objective finds the “correct” basis-mask or a linear combination
of basis-masks that performs similarly to or better than the one-shot baseline of Wortsman et al. (2020).
In Section 4, we provide empirical evidence of the efficacy of ImpressLearn on a variety of benchmarks,
outlining the radical savings in the number of parameters that must be stored per task compared to SupSup.
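The per-task storage comparison quoted above can be reproduced with a quick back-of-the-envelope calculation over the three LeNet-300-100 weight matrices (biases and any per-mask bookkeeping overhead are ignored).

```python
# LeNet-300-100 weight-matrix shapes for the 784-300-100-10 architecture.
layer_shapes = [(784, 300), (300, 100), (100, 10)]
total_weights = sum(fan_in * fan_out for fan_in, fan_out in layer_shapes)  # 266,200

# SupSup: a fresh binary mask per task; at 10% sparsity, ~10% of the
# weight indices must be recorded.
supsup_per_task = round(0.10 * total_weights)                 # 26,620 indices

# ImpressLearn: one coefficient per basis-mask per layer.
num_basis_masks = 10
impresslearn_per_task = num_basis_masks * len(layer_shapes)   # 30 coefficients

print(supsup_per_task, impresslearn_per_task)                 # 26620 30
```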
The GN regime. In principle, both SupSup and ImpressLearn require access to task identities in order
to apply the right mask to the backbone network during inference. To relax this requirement, Wortsman
et al. (2020) extend their algorithm to a more challenging regime, coined GN (Given/Not given), where
task identifiers are present during training but unavailable at inference. In this regime, Wortsman et al.
(2020) use a one-shot minimization of the entropy of the model’s outputs to single out the correct mask. In
Section 4, we show that our algorithm is able to achieve this feat, too. We demonstrate that, when applied
to basis-tasks in the GN regime, the optimization routine of ImpressLearn either identifies the corresponding
basis-mask or even finds a better performing combination of basis-masks.
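To convey the flavor of entropy-based task inference, the sketch below scores each candidate basis-mask by the mean entropy of the model's predictions on an unlabeled batch and returns the most confident one; `apply_mask` is a hypothetical callable producing logits under a given mask, and this simple argmin is a stand-in rather than the exact one-shot procedure of Wortsman et al. (2020) or our coefficient optimization.

```python
import torch
import torch.nn.functional as F

def infer_task_by_entropy(apply_mask, candidate_masks, x):
    """Return the index of the candidate mask under which the model's
    predictions on the unlabeled batch `x` have the lowest mean entropy."""
    entropies = []
    for mask in candidate_masks:
        log_probs = F.log_softmax(apply_mask(x, mask), dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        entropies.append(entropy)
    return int(torch.stack(entropies).argmin())
```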
The remainder of the paper is organized as follows. In Section 2, we briefly review related work and
general approaches to countering catastrophic forgetting, highlighting research that motivated our approach.
In Section 3, we detail the ImpressLearn algorithm and discuss its components and features. In Section
4, we demonstrate the effectiveness of ImpressLearn on a variety of datasets and architectures, showing
close-to-benchmark performance on unseen tasks with a drastically reduced parameter count, especially
when the number of incoming tasks is large. Additionally, we conduct several ablation studies to provide
a better understanding of the algorithm. Finally, in Section 5, we discuss the limitations of our work and
avenues for future research.