
| Task | Example | Answer Candidates |
|---|---|---|
| COMPARISON | 1.59mg is [MASK] than 3.8g | larger, smaller |
| ARGMIN/MAX | [MASK] value among 0.5mg, 3.4g, 2.8mg is 0.5mg | largest, smallest, middle |
| SORTING | sort 0.53mg, 32.54g, 2.8mg in [MASK] order is 0.53mg, 32.54g, 2.8mg | increasing, decreasing, random |
| UNIT CONVERSION | 3.5g and 3500mg are [MASK] value | same, different |
| REFERENCE RANGE DETECTION | 85mg/dL of Glucose is [MASK] | normal, abnormal |

Table 1: Examples of measuring skill tests (MSTs). We underline the correct answer for each example.

| Task | Template |
|---|---|
| COMPARISON | [M] is [MASK] than [M] |
| ARGMIN/MAX | [MASK] value among [LoM] is [M] |
| SORTING | sort [LoM] in [MASK] order is [LoM] |
| UNIT CONVERSION | [M] and [M] are [MASK] value |
| REFERENCE RANGE DETECTION | [M] of [ENT] is [MASK] |

Table 2: Templates which we used for data generation. [M], [LoM], and [ENT] are the placeholders for the measurement, the list of measurements, and the biomedical entity, respectively.
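
As a minimal sketch of how the templates in Table 2 might be instantiated, the snippet below fills the [M] and [ENT] slots by plain string substitution. The template strings are taken from Table 2; the function name `fill_template` and its arguments are hypothetical and are not part of the original data generation code.

```python
# Templates copied from Table 2 (subset).
TEMPLATES = {
    "COMPARISON": "[M] is [MASK] than [M]",
    "UNIT CONVERSION": "[M] and [M] are [MASK] value",
    "REFERENCE RANGE DETECTION": "[M] of [ENT] is [MASK]",
}


def fill_template(template, measurements=(), entity=None):
    """Fill [M] slots left to right, then the optional [ENT] slot.

    Hypothetical helper: the [MASK] token is left untouched so that the
    result can be passed to a masked language model.
    """
    text = template
    for measurement in measurements:
        # Replace only the first remaining [M] slot on each pass.
        text = text.replace("[M]", measurement, 1)
    if entity is not None:
        text = text.replace("[ENT]", entity, 1)
    return text


comparison = fill_template(TEMPLATES["COMPARISON"], ["1.59mg", "3.8g"])
reference = fill_template(
    TEMPLATES["REFERENCE RANGE DETECTION"], ["85mg/dL"], entity="Glucose"
)
```

With the inputs above, `comparison` and `reference` reproduce the COMPARISON and REFERENCE RANGE DETECTION examples listed in Table 1.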
stands the conversion of units correctly. By convention, a unit (e.g., liter, meter) is combined with a prefix (e.g., kilo, milli) so that the numerical value of the measurement falls within the range $[10^{-3}, 10^{3})$. Therefore, various unit prefixes can appear in a single passage, even if the units are the same. To handle this, UNIT CONVERSION is essential for complex reasoning over measurements. To succeed in UNIT CONVERSION, we expect the model to handle the unit and numerical value jointly, based on an understanding of the system of measurement.
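
The joint handling of unit and numerical value can be illustrated with a small sketch that derives the gold label for UNIT CONVERSION. It assumes a tiny, illustrative subset of prefixes and base units; the names `PREFIX_EXPONENT`, `BASE_UNITS`, `to_base`, and `same_value` are hypothetical and are not taken from the paper.

```python
import re
from decimal import Decimal

# Power-of-ten exponent for a few metric prefixes (illustrative subset).
PREFIX_EXPONENT = {"k": 3, "": 0, "m": -3}
# Hypothetical subset of base units handled by this sketch.
BASE_UNITS = ("g", "L")


def to_base(measurement):
    """Convert a string such as '3500mg' to (value in base unit, base unit)."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*", measurement)
    if match is None:
        raise ValueError(f"cannot parse measurement: {measurement!r}")
    number, unit = match.groups()
    for base in BASE_UNITS:
        if unit.endswith(base):
            prefix = unit[: len(unit) - len(base)]
            if prefix in PREFIX_EXPONENT:
                # Shift the decimal point exactly; Decimal avoids float rounding.
                return Decimal(number).scaleb(PREFIX_EXPONENT[prefix]), base
    raise ValueError(f"unsupported unit: {unit!r}")


def same_value(left, right):
    """Gold label for UNIT CONVERSION: True means 'same', False means 'different'."""
    left_value, left_unit = to_base(left)
    right_value, right_unit = to_base(right)
    return left_unit == right_unit and left_value == right_value


label = "same" if same_value("3.5g", "3500mg") else "different"
```

Using `Decimal` instead of floating point keeps the equality check exact, which matters because the label depends on whether two values match after conversion.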
2.2 Reference Range Detection
Given a biomedical entity and measurement, this task requires a model to predict whether the measurement falls within the reference range. Knowledge of the biomedical entity plays a crucial role in understanding measurements, since the unit is determined by the biomedical entity. For example, we measure the hemoglobin level in g/dL. In addition to understanding UoMs, PLMs must rely on domain knowledge embedded in their parameters to solve this task, as context alone does not provide sufficient clues as to what the reference range is for the given biomedical entity.
2.3 Measurement Comparison
Given two measurements (or a series of n measurements), the task is to predict the correct relationship between them. We created the synthetic dataset following other well-known NRoT tasks. Here, we consider three numerical reasoning tasks: COMPARISON (Talmor et al., 2020), ARGMIN/MAX (Wallace et al., 2019), and SORTING (Pal and Baral, 2021), all requiring the model to compare numbers. Note that each measurement in this task can have a different unit prefix. For example, the sample "1.59mg is [MASK] than 3.8g", which contains two different units, "mg" and "g", appears in the COMPARISON dataset. This task assesses the model's ability to combine an understanding of measurements and numerical reasoning skills.
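
To make the three label spaces concrete, the following sketch computes gold labels for COMPARISON, ARGMIN/MAX, and SORTING after normalizing every measurement to grams. It is an illustration under stated assumptions rather than the authors' generation code: the unit table `UNIT_IN_GRAMS` and all function names are hypothetical, and ties between measurements are assumed not to occur.

```python
import re
from decimal import Decimal

# Hypothetical conversion table: value of one unit expressed in grams.
UNIT_IN_GRAMS = {"mg": Decimal("0.001"), "g": Decimal("1"), "kg": Decimal("1000")}


def grams(measurement):
    """Parse a string such as '1.59mg' and return its value in grams."""
    number, unit = re.fullmatch(r"([0-9.]+)([a-z]+)", measurement).groups()
    return Decimal(number) * UNIT_IN_GRAMS[unit]


def comparison_label(left, right):
    # Ties are assumed not to occur in this sketch.
    return "larger" if grams(left) > grams(right) else "smaller"


def argminmax_label(measurements, target):
    values = [grams(m) for m in measurements]
    if grams(target) == max(values):
        return "largest"
    if grams(target) == min(values):
        return "smallest"
    return "middle"


def sorting_label(measurements):
    values = [grams(m) for m in measurements]
    if values == sorted(values):
        return "increasing"
    if values == sorted(values, reverse=True):
        return "decreasing"
    return "random"


comparison = comparison_label("1.59mg", "3.8g")
extreme = argminmax_label(["0.5mg", "3.4g", "2.8mg"], "0.5mg")
order = sorting_label(["0.53mg", "2.8mg", "32.54g"])
```

Comparing the raw numbers alone would give the wrong answer for "1.59mg" versus "3.8g" only by coincidence of the digits; the conversion step is what makes the label depend on both the number and the unit prefix.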
3 Experiments
Probing Setup
We formulated MSTs as a cloze test (Talmor et al., 2020) to fully utilize the knowledge captured by masked language modeling (MLM). Specifically, a PLM received the masked inputs given in Table 1, and the MLM head output the probability distribution over the answer candidates for [MASK]. Among the answer candidates, we chose the one with the highest probability as the final prediction.
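
The selection step can be sketched as follows. The snippet assumes that the MLM head has already produced one logit per vocabulary token at the [MASK] position; the logit values below are made up for illustration, and renormalizing over the candidates only (rather than over the full vocabulary) is an assumption of this sketch. The function name `predict_from_candidates` is hypothetical.

```python
import math


def predict_from_candidates(mask_logits, candidates):
    """Pick the answer candidate with the highest probability at [MASK].

    mask_logits maps vocabulary tokens to logits at the masked position.
    Only the listed candidates compete; all other tokens are ignored.
    """
    scores = [mask_logits[c] for c in candidates]
    peak = max(scores)
    # Softmax restricted to the candidates, shifted by the peak for stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = {c: e / total for c, e in zip(candidates, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Made-up logits for the input "1.59mg is [MASK] than 3.8g".
example_logits = {"larger": 0.3, "smaller": 1.7, "the": 4.0}
prediction, distribution = predict_from_candidates(
    example_logits, ["larger", "smaller"]
)
```

Because the softmax is monotonic, restricting it to the candidates does not change which candidate wins; it only makes the reported probabilities sum to one over the answer set.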
We probed four transformer-based PLMs. BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) were trained on Wikipedia articles and Book Corpus. BioBERT (Lee et al., 2020) was trained on biomedical articles from PubMed abstracts, and BlueBERT (Peng et al., 2020) used both clinical (MIMIC-III (Johnson et al., 2016)) and biomedical (PubMed abstracts) corpora for pre-training. We also tested a randomly initialized transformer encoder (i.e., Scratch) to evaluate the difficulty of our MSTs. For each model, we did not update the parameters during training, except for the MLM head in the last transformer layer. In all tasks, the models were trained with three random seeds, and we report the mean classification accuracy for all the probing tasks. Appendix A provides further details on training and evaluation.
Data Preparation
We manually crafted the templates in Table 2, each containing at most two slots for measurements and a [MASK] token for the answer. We instantiated [M] and [LoM] by sampling the measurement and the list of measurements, respectively. For measurement sampling, we independently sampled a number and a unit and then combined them.
Specifically, we sampled units from the predefined