
II. SCALABLE IBU IMPLEMENTATION
Here, we discuss some assumptions and computational approaches we use to scalably implement IBU (as in Eq. 3) for measurement error mitigation (see Algorithm 1).
a. Vectorized Implementation. First, we show how to vectorize the IBU update rule given in Equation 3 to enable fast parallel computation on GPUs. Let $Z$ be a random variable representing the true distribution over bitstrings without any measurement error and $Z'$ be a random variable representing the distribution over bitstrings with measurement error. With $c(z'_i)$ the counts of bitstring $z'_i$ and $S$ shots, let $\vec{p} \in \mathbb{R}^{2^n}$ be the noisy distribution with $p_i = c(z'_i)/S$, let $\vec{\theta}_k \in \mathbb{R}^{2^n}$ be the $k$th iteration guess of the error-mitigated distribution with $\theta_{kj} = P_{\vec{\theta}_k}(Z_j)$, and let $R \in \mathbb{R}^{2^n \times 2^n}$ be the response matrix where $R_{ij} = P(Z'_i \mid Z_j)$. Then, the IBU update rule is:

\begin{equation}
\vec{\theta}_{k+1}
= \sum_{i=1}^{2^n} p_i \cdot \frac{R_i^T \odot \vec{\theta}_k}{R_i \vec{\theta}_k}
= \left( \vec{p}^{\,T} \cdot \mathrm{diag}\!\left(\frac{1}{R\vec{\theta}_k}\right) \cdot R \cdot \mathrm{diag}(\vec{\theta}_k) \right)^{T}
= \vec{\theta}_k \odot R^T\!\left( \vec{p} \oslash R\vec{\theta}_k \right)
\tag{4}
\end{equation}

Here, $\odot$ is element-wise multiplication, $\oslash$ is element-wise division, and $R_i$ denotes the $i$th row of $R$. This expression fully vectorizes the IBU update rule by writing it solely in terms of elementwise operations and matrix products. It also optimizes the order of operations to avoid constructing large intermediate matrices; only the additional memory for storing $\vec{\theta}_{k+1}$ is needed.
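As a concrete illustration, Eq. (4) maps directly onto a few lines of JAX. The sketch below is illustrative rather than a verbatim excerpt of our implementation: the function name `ibu_update` and the toy single-qubit response matrices in the usage comment are assumptions for the example, and it presumes a dense $R$ small enough to hold in memory (structured cases are handled next).

```python
import jax
import jax.numpy as jnp

@jax.jit
def ibu_update(theta_k, p_noisy, R):
    """One IBU iteration, Eq. (4): theta_{k+1} = theta_k * R^T (p / (R theta_k))."""
    predicted = R @ theta_k           # R theta_k: predicted noisy distribution
    ratio = p_noisy / predicted       # elementwise division
    return theta_k * (R.T @ ratio)    # elementwise multiplication

# Illustrative usage on 2 qubits with hypothetical single-qubit response matrices:
# R = jnp.kron(jnp.array([[0.97, 0.05], [0.03, 0.95]]),
#              jnp.array([[0.98, 0.04], [0.02, 0.96]]))
# p_noisy = counts / counts.sum()     # empirical noisy distribution, shape (4,)
# theta = jnp.full(4, 0.25)           # uniform initial guess
# for _ in range(50):
#     theta = ibu_update(theta, p_noisy, R)
```

Each iteration costs one matrix-vector product with $R$, one with $R^T$, and elementwise work, matching the operation count implied by Eq. (4).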
b. Tensor Product Structure of Noise Model. Despite the efficient update scheme above, the memory requirements are still quite unfavorable. As previously discussed, the first big memory bottleneck is that the response matrix $R$ grows exponentially in the number of qubits: it is $2^n \times 2^n$. To get around this issue, we assume that there is no measurement crosstalk and hence $R$ has a convenient tensor product decomposition [7–9], i.e., $R = R^{(1)} \otimes R^{(2)} \otimes \cdots \otimes R^{(n)}$. While we restrict ourselves to this scenario, our implementation can be extended to cases where measurement crosstalk is restricted to $k$ qubits. With this tensor product structure on $R$, we now have $4n$ parameters to represent $R$, as opposed to the original $4^n$ parameters. This structure also allows the matrix-vector product $R\vec{x}$ to be computed very quickly using only matrix products and reshapes [21] without any additional memory requirement, as sketched below. This is the simplest structure we can impose on $R$ to obtain tractability, although it can be relaxed to other tensor network decompositions of $R$ as well (such as matrix product states (MPS)).
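For concreteness, one standard way to exploit this structure with only reshapes and small matrix contractions is sketched below. This is an illustrative sketch, not our exact code; it assumes each $R^{(k)}$ is a $2 \times 2$ single-qubit response matrix and that bit $k$ corresponds to the $k$th tensor axis of the reshaped vector.

```python
import jax.numpy as jnp

def kron_matvec(single_qubit_mats, x):
    """Compute (R^(1) kron ... kron R^(n)) @ x using only small matmuls and reshapes.

    single_qubit_mats: list of n arrays, each of shape (2, 2)
    x: vector of length 2**n (e.g. the current guess theta_k)
    """
    n = len(single_qubit_mats)
    t = x.reshape((2,) * n)                      # view x as an n-index tensor
    for k, Rk in enumerate(single_qubit_mats):
        # contract R^(k) against the k-th tensor index, then restore the axis order
        t = jnp.tensordot(Rk, t, axes=([1], [k]))
        t = jnp.moveaxis(t, 0, k)
    return t.reshape(-1)
```

The intermediate tensor always has $2^n$ entries, so no memory beyond the size of $\vec{x}$ itself (up to a constant) is needed, and the cost is $O(n 2^n)$ rather than the $O(4^n)$ of a dense matrix-vector product.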
c. Subspace Reduction for Tractability. While we managed the exponential scaling of $R$ with a tensor product decomposition, the parameters of the true latent distribution $\vec{\theta} \in \mathbb{R}^{2^n}$ also scale exponentially in the number of qubits. To manage this, we can use the subspace reduction used in Ref. [8]. In particular, instead of maintaining and updating a vector of probabilities over $2^n$ bitstrings, we only maintain and update an $M$-dimensional vector over select bitstrings. Specifically, we pre-select $M$ bitstrings (e.g., to be all those bitstrings within Hamming distance $d$ from any bitstring in our dataset) and initialize $\vec{\theta}_0$ with non-zero values in the entries corresponding to those $M$ bitstrings; all other entries are zero. Under such an initialization $\vec{\theta}_0$, the IBU update rule guarantees that all entries that started as zero remain zero. Thus, we can ignore those entries and track only the $M \times 1$ sub-vector corresponding to non-zero entries. We refer to this subspace-reduced implementation as IBU (Reduced), and IBU without subspace reduction as IBU (Full). Under this subspace reduction trick, we can no longer easily leverage the Kronecker product structure of $R$ to compute $R\vec{x}$ for any $\vec{x}$. Instead, we must construct a reduced matrix $\tilde{R}$ whose rows and columns correspond to the $M$ selected bitstrings. Unlike Ref. [8], we are not solving a least-squares problem, so matrix-free methods like the generalized minimal residual method (GMRES) will not be helpful; we need to construct the reduced matrix $\tilde{R}$. Even if this matrix does not grow exponentially with the number of qubits, it may still be prohibitively large to store in memory. Our computational solution to this problem is to build the result of $\tilde{R}\vec{x}$ by weighting each column $\tilde{R}_j$ by $x_j$ and keeping only the running sum. We also use the JAX library's built-in accelerated linear algebra routines and just-in-time compilation, vectorize as many operations as possible, and use a GPU for parallelization. Although we incur the cost of repeatedly computing $\tilde{R}$, there is a tradeoff between this cost and the memory cost of explicitly storing $\tilde{R}$.
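The sketch below illustrates this running-sum strategy as an explicit loop. It is a simplified illustration (our actual implementation vectorizes these operations and relies on JAX's just-in-time compilation), and the names `reduced_matvec`, `single_qubit_mats`, and `bitstrings` are placeholders introduced here.

```python
import jax.numpy as jnp

def reduced_matvec(single_qubit_mats, bitstrings, x):
    """Compute y = R_tilde @ x on the reduced subspace without storing R_tilde.

    single_qubit_mats: list of n single-qubit response matrices, each (2, 2),
                       with R^(k)[a, b] = P(measure a | prepared b) on qubit k
    bitstrings:        (M, n) integer array of the M pre-selected bitstrings
    x:                 (M,) vector living on the reduced subspace
    """
    M, n = bitstrings.shape
    y = jnp.zeros(M)
    for j in range(M):
        # build column j of R_tilde on the fly from the tensor product structure
        col = jnp.ones(M)
        for k in range(n):
            Rk = single_qubit_mats[k]
            col = col * Rk[bitstrings[:, k], bitstrings[j, k]]
        # weight the column by x_j and keep only the running sum
        y = y + x[j] * col
    return y
```

Only one length-$M$ column and the running sum are held in memory at any time, which is exactly the tradeoff between recomputing $\tilde{R}$ and storing it explicitly.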
d. Worst-case Computational Complexity. Suppose we run an experiment with $S$ shots on an $n$-qubit system and track strings up to Hamming distance $d$ from the measured bitstrings. In the worst case, there will be $S$ different measurements, so subspace-reduced IBU tracks $S \cdot \sum_{k=0}^{d} \binom{n}{k} \sim O(Sn^d)$ bitstrings. The three main computational subroutines are 1) constructing the subspace-reduced response matrix, 2) matrix-vector multiplication, and 3) elementwise multiplication and division. First, the subspace-reduced response matrix will have $O(S^2 n^{2d})$ cells to be constructed. Each cell of the matrix requires a single call to each of the $n$ single-qubit response matrices, so constructing the matrix takes $O(S^2 n^{2d+1})$ operations. Second, multiplying this matrix with a vector with $O(Sn^d)$ entries takes $O(S^2 n^{2d})$ operations. Third, the elementwise multiplication and division operations take $O(Sn^d)$ operations. Overall, a single IBU iteration scales as $O(S^2 n^{2d+1})$, better than the exponential scaling of the naive approach.
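As a quick sanity check on the counting above, the worst-case number of tracked bitstrings can be evaluated directly; the numbers below are hypothetical and purely for illustration.

```python
from math import comb

def n_tracked_bitstrings(S, n, d):
    """Worst case for IBU (Reduced): S distinct measured strings, each
    expanded to its Hamming ball of radius d, i.e. S * sum_k C(n, k)."""
    return S * sum(comb(n, k) for k in range(d + 1))

# Hypothetical example: n = 20 qubits, S = 1000 distinct measured strings, d = 2
# gives 1000 * (1 + 20 + 190) = 211,000 tracked strings, versus 2**20 = 1,048,576.
print(n_tracked_bitstrings(1000, 20, 2))   # 211000
```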
III. ITERATIVE BAYESIAN UNFOLDING AS
EXPECTATION-MAXIMIZATION
Here we summarize the argument that iterative Bayesian unfolding is an instantiation of the well-known Expectation-Maximization (EM) algorithm from machine learning. See Appendix A for the complete derivation as well as a 'true' Bayesian extension that computes the maximum a posteriori (MAP) estimate of $\vec{\theta}$ given Dirichlet priors. Shepp and Vardi [18] have noted connections between IBU and EM for Poisson