
method BMC (Balance Model and Changes). Experiments show
that our method achieves good results.
II. RELATED WORK
A. Attribution Interpretation Methods
Post-hoc interpretation methods include "variable importance"
methods and gradient-based methods. "Variable importance"
methods [15], [16] measure the difference in a model's
prediction performance when the value of a variable changes.
Gradient-based methods use the magnitude of the gradient as
the feature importance score and are applicable to
differentiable models [33]. Erasure [14] is a "variable
importance" method and is model-independent. Its advantage is
that it is conceptually simple and can be optimized for
well-defined objectives [34].
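As a concrete illustration, the following is a minimal sketch of
erasure-based sentence importance, assuming a hypothetical scorer
predict_prob that returns the probability assigned to the gold label
for a (document, statement) pair; it is not part of any cited
method's API.

    # A minimal sketch of erasure-based "variable importance": a sentence's
    # importance is the drop in the model's prediction when it is erased.
    # `predict_prob` is a hypothetical scorer for a (document, statement) pair.
    from typing import Callable, List

    def erasure_importance(
        sentences: List[str],
        statement: str,
        predict_prob: Callable[[List[str], str], float],
    ) -> List[float]:
        base = predict_prob(sentences, statement)
        scores = []
        for i in range(len(sentences)):
            erased = sentences[:i] + sentences[i + 1:]   # erase sentence i
            scores.append(base - predict_prob(erased, statement))
        return scores  # larger drop => more important sentence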
B. Evidence Extraction
In early MRC research, some works focused on better
representations of the features of the question and the context
[17] and explored better fusion and matching between them
[4], [18]. With the emergence of pre-trained models such as
BERT [19], some works aim to understand the basis on which the
model predicts the answer. Extracting evidence in MRC is
attracting increasing attention, although it remains quite
challenging. Evidence extraction aims to find evidence and
relevant information for downstream processes in the task,
which arguably improves the interpretability of the task.
Evidence extraction is intuitively useful and has become an
important part of fact verification [38], [39], multiple-choice
reading comprehension [40], open-domain question answering
[10], multi-hop reading comprehension [41], natural language
inference [42], and a wide range of other tasks [43].
Existing evidence extraction methods can be roughly divided
into two categories. The first is supervised learning, which
requires substantial resources to manually label all evidence
sentences: HOTPOTQA [20] selects evidence sentences on the
basis of the specific task to be answered; building on it, [21]
iteratively ranks sentences by importance to select evidence
sentences; and [22] decomposes the question into single-hop
MRC problems, extracting the evidence sentence while selecting
the answer.
The second is semi-supervised learning. Because it is
difficult to extract evidence sentences in non-extractive MRC,
some works use semi-supervised methods to extract evidence:
[23] use distant supervision to generate imperfect labels and
then apply deep probabilistic logic learning to remove noise,
while [12] label evidence and improve model performance by
combining specific tasks with weakly supervised evidence
extraction. Finally, on the basis of weakly supervised learning,
[28] use reinforcement learning to obtain better evidence
extraction strategies.
In contrast, our method U3E uses an unsupervised approach and
completes the extraction task in stages. It adapts the erasure
method so that sentence importance can be obtained explicitly.
Compared with other methods, U3E not only incurs a small cost
but also achieves remarkable results.
Fig. 1. The overall structure of U3E, which includes three stages: T&A, S&R,
and A&R.
III. METHOD
The overall architecture of U3E is shown in Figure 1, which
consists of three stages:
Train and Acquire (T&A): train models on the specific task and
acquire the prediction changes.
Select and Reacquire (S&R): select the optimal memory model
according to our proposed BMC method and use that model to
reacquire the changes.
Apply and Retrain (A&R): extract evidence from the changes and
retrain on the extracted evidence.
In the following subsections, we explain these stages in order;
a schematic sketch of the pipeline is given below.
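To make the control flow concrete, the sketch below strings the
three stages together as one pipeline. It is schematic only: the
five callables (train_and_save, compute_changes, select_by_bmc,
extract_evidence, retrain) are hypothetical placeholders for the
stage implementations described in this section, not the paper's
actual API.

    # A schematic sketch of the U3E pipeline; every callable passed in is a
    # hypothetical placeholder for one of the stage operations described above.
    def u3e_pipeline(dataset, train_and_save, compute_changes,
                     select_by_bmc, extract_evidence, retrain):
        # T&A: train on the task, keeping one checkpoint per epoch,
        # and acquire the prediction changes for each checkpoint.
        checkpoints = train_and_save(dataset)                 # [M_1, ..., M_x]
        changes = [compute_changes(m, dataset) for m in checkpoints]

        # S&R: select the optimal memory model by the BMC criterion,
        # then reacquire the changes with that single model.
        best = select_by_bmc(checkpoints, changes)
        best_changes = compute_changes(best, dataset)

        # A&R: turn the changes into evidence sentences and retrain on them.
        evidence = extract_evidence(best_changes)
        return retrain(dataset, evidence)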
A. Task Definition
Assume that each sample of the dataset can be formalized as
follows: we are given a reference document consisting of
multiple sentences, $D = \{S_1, S_2, \ldots, S_m\}$, and a
statement $O$ (if there is a question, then $O$ is the
concatenation of the question and the candidate). The model
should determine whether the document supports this statement:
support is labeled 1, otherwise 0. The task can also be used to
extract the evidence sentence set
$E = \{S_j, S_{j+1}, \ldots, S_{j+k-1}\}$, which contains
$k\,(< m)$ sentences of $D$.
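For illustration, a single sample under this definition might
look as follows; the field names and the toy content are
assumptions for exposition, not any dataset's actual schema.

    # A toy sample matching the task definition; field names are hypothetical.
    sample = {
        "document": [                       # D = {S_1, ..., S_m}, here m = 2
            "The Eiffel Tower is in Paris.",
            "It was completed in 1889.",
        ],
        "statement": "The Eiffel Tower was finished in 1889.",  # O
        "label": 1,                         # 1 = supported, 0 = not supported
        "evidence": [1],                    # indices of the k (< m) evidence sentences
    }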
B. Train and Acquire
1) Task-specific Training: We first train on the specific task
(here, a classification task) and save the model checkpoints
$M = \{M_1, M_2, \ldots, M_x\}$ from all epochs, where $x$ is
the largest training epoch. The model used during training
consists of a pretrained model¹ and a linear layer. The input
follows the format "[CLS] + Option + [SEP] + Document + [SEP]".
The hidden representation of the [CLS] token is passed through
a linear layer for binary classification to predict whether the
document $D$ supports the statement $O$:
\hat{y} = \mathrm{softmax}(W_p h_{\mathrm{cls}}) \quad (1)
¹Different pretrained models are used on different datasets.
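As a concrete illustration of this stage, the following is a
minimal sketch of the classifier in Eq. (1), assuming PyTorch
and HuggingFace Transformers; "bert-base-uncased" is only a
placeholder, since different pretrained models are used on
different datasets.

    # A minimal sketch of the T&A classifier in Eq. (1); the encoder name is a
    # placeholder, as different pretrained models are used per dataset.
    import torch
    from transformers import AutoModel, AutoTokenizer

    class SupportClassifier(torch.nn.Module):
        def __init__(self, name: str = "bert-base-uncased"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(name)
            # W_p in Eq. (1): maps the [CLS] representation to two classes
            self.linear = torch.nn.Linear(self.encoder.config.hidden_size, 2)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            h_cls = out.last_hidden_state[:, 0]       # hidden state of [CLS]
            return torch.softmax(self.linear(h_cls), dim=-1)  # Eq. (1)

    # Input format: [CLS] Option [SEP] Document [SEP]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    enc = tokenizer("option text", "document text", return_tensors="pt")
    probs = SupportClassifier()(enc["input_ids"], enc["attention_mask"])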