
Algorithm 1 Computation of information measures. Algorithmic procedures ENTROPY and MI are specified by Algorithms 2 and 3 in Appendix A.
1:  A_1, ..., A_N ← {f(x_i)}_{i=1}^{S}      ▷ Computing activations for all neurons
2:  H ← {}; I ← {}                          ▷ Initiating computations for entropy and MI
3:  for i ∈ {1, ..., N} do                  ▷ Iterating over the set of neurons
4:      I_i ← {}                            ▷ Initiating MI for a particular neuron
5:      H_i ← ENTROPY(A_i)                  ▷ Following Equation 1 and Algorithm 2
6:      for j ∈ {1, ..., N} do              ▷ Inner loop over the set of neurons
7:          I_i ← I_i ∪ MI(A_i, A_j)        ▷ Following Equation 3 and Algorithm 3
8:      end for
9:      H ← H ∪ H_i
10:     I ← I ∪ I_i                         ▷ Following Equation 2
11: end for
memorization corresponds to low intra-neuron and inter-neuron diversity, while example-level memorization corresponds to high diversity (Figures 1e and 1f). We expect this difference in diversity to be captured by the information measures defined above.
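To make the procedure concrete, the following is a minimal Python sketch of Algorithm 1. It assumes the activations are collected into an S × N array (S examples, N neurons) and substitutes simple histogram estimators for the ENTROPY and MI procedures of Algorithms 2 and 3; the bin count and the estimator choice are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch of Algorithm 1 (histogram estimators and bin count are assumptions).
import numpy as np
from itertools import product


def entropy(a, bins=30):
    """Histogram estimate of a single neuron's activation entropy (cf. Equation 1)."""
    counts, _ = np.histogram(a, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def mutual_information(a, b, bins=30):
    """Histogram estimate of MI between two neurons' activations (cf. Equation 3)."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal over the first neuron
    py = pxy.sum(axis=0, keepdims=True)   # marginal over the second neuron
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2(pxy[mask] / (px * py)[mask]))


def information_measures(activations):
    """activations: array of shape (S, N), i.e. S examples by N neurons.

    Returns the vector H of per-neuron entropies and the N x N matrix I of
    pairwise mutual information, mirroring the two loops of Algorithm 1.
    """
    S, N = activations.shape
    H = np.array([entropy(activations[:, i]) for i in range(N)])
    I = np.zeros((N, N))
    for i, j in product(range(N), repeat=2):
        I[i, j] = mutual_information(activations[:, i], activations[:, j])
    return H, I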
Figure 1g presents the distribution of entropy values for each of the three networks with varying generalization behaviors. Throughout this work, we visualize such entropy distributions using box plots, where a black marker within each box depicts the median of the distribution and a notch neighboring this marker depicts the 95% confidence interval around the median. We observe that entropy for the network exhibiting heuristic memorization is distributed around a lower point than for the others, whereas entropy for the network with example-level memorization is higher.
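For illustration, a notched box plot of this kind can be drawn with matplotlib as in the sketch below; the per-neuron entropy vectors here are random placeholders rather than values from the paper.

# Hypothetical per-neuron entropy vectors for three networks (placeholders only).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
entropies = {
    "heuristic": rng.normal(1.0, 0.3, size=512),
    "generalizing": rng.normal(2.0, 0.3, size=512),
    "example-level": rng.normal(3.0, 0.3, size=512),
}

fig, ax = plt.subplots()
# notch=True draws the confidence interval around the median,
# matching the box plots used throughout the paper.
ax.boxplot(list(entropies.values()), notch=True, showfliers=False,
           medianprops={"color": "black"})
ax.set_xticklabels(entropies.keys())
ax.set_ylabel("Neuron entropy")
plt.show()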
Furthermore, Figure 1h shows the distribution of MI for the three networks. To interpret the distribution of MI (an N×N square matrix), we fit a Gaussian mixture model over all values and visualize it through a density plot, where the density (y-axis) at each point corresponds to the number of neuron pairs that exhibit that MI value (x-axis). Larger peaks in these density plots suggest that a large number of neuron pairs are concentrated in that region. Interestingly, we see such peaks for the three networks at distinct values of MI. For the network showing example-level memorization (high inter-neuron diversity), most neuron pairs show low values of MI. In contrast, for heuristic memorization (low inter-neuron diversity), the neuron-pair density is concentrated at higher MI values.²
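Such a density curve can be obtained, for example, by flattening the N×N MI matrix and fitting a Gaussian mixture model; the sketch below uses scikit-learn's GaussianMixture, with the number of mixture components as an illustrative assumption. Peaks in the returned curve correspond to MI values shared by many neuron pairs.

import numpy as np
from sklearn.mixture import GaussianMixture


def mi_density(I, n_components=3, grid_points=200):
    """Fit a GMM over all pairwise MI values and return a density curve.

    I: N x N matrix of pairwise mutual information (e.g. from Algorithm 1).
    The number of mixture components is an illustrative choice.
    """
    values = I.reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components).fit(values)
    grid = np.linspace(values.min(), values.max(), grid_points).reshape(-1, 1)
    # score_samples returns the log-density; exponentiate for the density plot.
    density = np.exp(gmm.score_samples(grid))
    return grid.ravel(), density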
Based on these findings, we formulate two hypotheses, summarized in Table 1:
H1: Networks exhibiting heuristic memorization would show low inter- and intra-neuron diversity, reflected through low entropy and high MI values.

H2: Networks exhibiting example-level memorization would show high inter- and intra-neuron diversity, reflected through high entropy and low MI values.
Table 1: Summarizing our hypotheses.

Memorization      Intra-neuron diversity (∝ Entropy)    Inter-neuron diversity (∝ MI⁻¹)
Heuristic                        ↓                                    ↓
Example-level                    ↑                                    ↑
3 Heuristic Memorization
Here, we study networks with varying degrees of heuristic memorization and examine whether the information measures, which aim to capture neuron diversity, indicate the extent of memorization.
3.1 Semi-synthetic Setups
We synthetically introduce spurious artifacts into the training examples such that they co-occur with target labels. Networks trained on such a set are prone to memorizing these artifacts. These artifact-label correlations do not hold in the validation sets. To obtain a set of networks with varying
² This difference in neuron activation patterns for the two memorizing sets could be caused by several factors, including functional complexity (Lee et al., 2020): functions that encode individual data points (as in example-level memorization) need to be much more complex than functions that learn shortcuts (heuristic memorization). We make a comparison with standard complexity measures in Appendix C.4 and observe that our information measures correlate more strongly with generalization performance, especially for heuristic memorization.