Do Language Models Understand Measurements?
Sungjin Park Seungwoo Ryu Edward Choi
KAIST
{zxznm, swryu, edwardchoi}@kaist.ac.kr
Abstract
The recent success of pre-trained language models (PLMs) has stimulated interest in their ability to understand and work with numbers. Yet, numerical reasoning over measurements has not been formally studied, despite its importance. In this study, we show that PLMs lack the capability required for reasoning over measurements. Furthermore, we find that a language model trained on a measurement-rich corpus performs better at understanding measurements. We propose a simple embedding strategy to better distinguish between numbers and units, which leads to a significant improvement in the probing tasks.
1 Introduction
The success of pre-trained language models (PLMs) has led to more research on their ability to understand common sense. In this context, numerical reasoning over text (NRoT) is an NLP model's ability to interpret and work with numbers in either digit or word form (Spithourakis and Riedel, 2018). Recent studies on NRoT test PLMs on questions about numeracy (Wallace et al., 2019), scalar magnitude comparison (Zhang et al., 2020), numerical facts (Lin et al., 2020), and math word problems (Wu et al., 2021).
Despite these efforts, existing works lack an analysis of the forms in which numbers appear. In particular, we focus on the case where numbers appear as measurements in context. In most scientific articles, measurements are an integral part of the context and are essential to capturing its meaning. For example, the two sentences "40g of Aspirin is lethal" and "40mg of Aspirin is lethal" contain the same words except for the unit of measurement (UoM), but the second sentence is incorrect precisely because of the UoM.
In this work, we examine the measuring skill of PLMs: the ability to understand the system of measurement and perform numerical reasoning over measurements. We design three measuring skill tests (MSTs) and study to what extent these skills are acquired. Specifically, UNIT CONVERSION, REFERENCE RANGE DETECTION, and MEASUREMENT COMPARISON respectively require understanding the system of measurement, knowing the normal range of a biomedical entity, and combining knowledge of the system of measurement with NRoT. Table 1 shows an example of each measuring skill test.
MST results showed that the models struggled to find the largest (or smallest) value in a list of measurements and to convert a measurement to another unit, while they performed well on the other tests. Compared to the other PLMs, BioBERT (Lee et al., 2020) showed superior performance on UNIT CONVERSION and REFERENCE RANGE DETECTION, which implies that pre-training on measurement-rich text helps a model understand the system of measurement. Finally, we speculate that the models fail on some MSTs because they lack the skill to distinguish numbers, units, and other words in the context. To mitigate this, we introduce scale embedding, which provides the model with information about the position and scale of the numbers in the input text. We show that scale embedding significantly improves the MST performance of all PLMs.
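The paper describes scale embedding only at this high level here. As one plausible realization (our sketch, not the authors' implementation), the module below adds a learned vector, indexed by the base-10 exponent of each numeric token, to the usual token embeddings; the class name, the bucket range, and the treatment of non-numeric tokens are all our assumptions.

```python
import math
import torch
import torch.nn as nn

class ScaleEmbedding(nn.Module):
    """Hypothetical sketch: a learned embedding indexed by the base-10
    exponent of each numeric token (bucket 0 for non-numeric tokens),
    added on top of the ordinary token embeddings."""

    def __init__(self, hidden_size: int, min_exp: int = -3, max_exp: int = 3):
        super().__init__()
        self.min_exp, self.max_exp = min_exp, max_exp
        # One bucket per exponent in [min_exp, max_exp], plus bucket 0
        # reserved for tokens that are not numbers.
        self.embed = nn.Embedding(max_exp - min_exp + 2, hidden_size)

    def bucket(self, token: str) -> int:
        try:
            value = abs(float(token))
        except ValueError:
            return 0                      # not a number
        if value == 0:
            return 1                      # map zero to the smallest bucket
        exp = int(math.floor(math.log10(value)))
        exp = min(max(exp, self.min_exp), self.max_exp)
        return exp - self.min_exp + 1     # shift into [1, max_exp - min_exp + 1]

    def forward(self, tokens: list, token_embeddings: torch.Tensor) -> torch.Tensor:
        ids = torch.tensor([self.bucket(t) for t in tokens],
                           device=token_embeddings.device)
        return token_embeddings + self.embed(ids)
```

In practice such a layer would sit alongside BERT's token, position, and segment embeddings; whether the paper buckets magnitudes exactly this way is not stated in this excerpt.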
2 Measuring Skill Test
In this section, we describe the three MSTs we use to study the ability of PLMs to understand the system of measurement and perform numerical reasoning over measurements.
2.1 Unit Conversion
This task requires the model to decide whether two measurements represent the same quantity. For example, given a sentence such as "3.5g and 3500mg are [MASK] value", the model should predict that [MASK] is filled with same if it understands the conversion of units correctly.
Task                      | Example                                                              | Answer Candidates
COMPARISON                | 1.59mg is [MASK] than 3.8g                                           | larger, *smaller*
ARGMIN/MAX                | [MASK] value among 0.5mg, 3.4g, 2.8mg is 0.5mg                       | largest, *smallest*, middle
SORTING                   | sort 0.53mg, 32.54g, 2.8mg in [MASK] order is 0.53mg, 32.54g, 2.8mg | increasing, decreasing, *random*
UNIT CONVERSION           | 3.5g and 3500mg are [MASK] value                                     | *same*, different
REFERENCE RANGE DETECTION | 85mg/dL of Glucose is [MASK]                                         | *normal*, abnormal

Table 1: Examples of measuring skill tests (MSTs). We mark the correct answer for each example with asterisks.
Task                      | Template
COMPARISON                | [M] is [MASK] than [M]
ARGMIN/MAX                | [MASK] value among [LoM] is [M]
SORTING                   | sort [LoM] in [MASK] order is [LoM]
UNIT CONVERSION           | [M] and [M] are [MASK] value
REFERENCE RANGE DETECTION | [M] of [ENT] is [MASK]

Table 2: Templates used for data generation. [M], [LoM], and [ENT] are placeholders for a measurement, a list of measurements, and a biomedical entity, respectively.
In general, it is a convention to combine a unit (e.g., liter, meter) with a prefix (e.g., kilo, milli) so that the numerical value of a measurement falls within the range $[10^{-3}, 10^3)$. Therefore, various unit prefixes can appear in a single passage even when the units are the same, and UNIT CONVERSION is thus essential for complex reasoning over measurements. To succeed at UNIT CONVERSION, we expect the model to handle the unit and the numerical value jointly, based on an understanding of the system of measurement.
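As a concrete illustration of the reasoning this task demands (our sketch, not code from the paper), the snippet below normalizes two measurements to base units via SI prefix multipliers and checks whether they denote the same quantity. The prefix table is abbreviated and the parser is simplified.

```python
import math
import re

# Abbreviated prefix table for illustration; the paper's full unit set
# is listed in its Table 7.
PREFIX_MULTIPLIERS = {"": 1.0, "k": 1e3, "d": 1e-1, "c": 1e-2, "m": 1e-3}

def parse(measurement: str):
    """Split e.g. '3500mg' into (3500.0, 'm', 'g')."""
    match = re.fullmatch(r"([0-9.]+(?:[eE][+-]?[0-9]+)?)([kdcm]?)([gLm])",
                         measurement)
    if match is None:
        raise ValueError(f"cannot parse {measurement!r}")
    number, prefix, unit = match.groups()
    return float(number), prefix, unit

def same_quantity(a: str, b: str) -> bool:
    """True if the two measurements denote the same physical quantity."""
    va, pa, ua = parse(a)
    vb, pb, ub = parse(b)
    if ua != ub:   # different base units (e.g., g vs. L) can never match
        return False
    # isclose guards against floating-point error in the prefix scaling
    return math.isclose(va * PREFIX_MULTIPLIERS[pa], vb * PREFIX_MULTIPLIERS[pb])

assert same_quantity("3.5g", "3500mg")        # the example above
assert not same_quantity("1.59mg", "3.8g")
```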
2.2 Reference Range Detection
Given a biomedical entity and a measurement, this task requires a model to predict whether the measurement falls within the reference range. Knowledge of the biomedical entity plays a crucial role in understanding measurements, since the unit is determined by the entity; for example, the hemoglobin level is measured in g/dL. In addition to understanding UoMs, PLMs must rely on domain knowledge embedded in their parameters to solve this task, as the context alone provides no clue as to the reference range of the given biomedical entity.
2.3 Measurement Comparison
Given two measurements (or a series of n measurements), the task is to predict the correct relationship between them. We created a synthetic dataset following other well-known NRoT tasks. Here, we consider three numerical reasoning tasks: COMPARISON (Talmor et al., 2020), ARGMIN/MAX (Wallace et al., 2019), and SORTING (Pal and Baral, 2021), all of which require the model to compare numbers. Note that each measurement in this task can have a different unit prefix; for example, the COMPARISON dataset contains the sample "1.59mg is [MASK] than 3.8g", which involves the two different units "mg" and "g". This task assesses the model's ability to combine an understanding of measurements with numerical reasoning skills.
3 Experiments
Probing Setup
We formulated MSTs as a Cloze test (Talmor et al., 2020) to fully utilize the knowledge captured by masked language modeling (MLM). Specifically, a PLM received the masked inputs given in Table 1, and the MLM head output a probability distribution over the answer candidates for [MASK]. Among the answer candidates, we chose the one with the highest probability as the final prediction.
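Concretely, this candidate-scoring step can be reproduced with the HuggingFace transformers library roughly as follows. This is our sketch of the described setup, not the authors' released code: it scores candidates with a stock checkpoint (the paper additionally trains the MLM head, as described below) and assumes each candidate is a single token in the vocabulary.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def predict(masked_text: str, candidates: list) -> str:
    """Return the candidate with the highest MLM probability at [MASK]."""
    inputs = tokenizer(masked_text, return_tensors="pt")
    # Position of the [MASK] token in the input sequence
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    cand_ids = tokenizer.convert_tokens_to_ids(candidates)
    return candidates[int(torch.argmax(logits[cand_ids]))]

# UNIT CONVERSION example from Table 1
print(predict("3.5g and 3500mg are [MASK] value", ["same", "different"]))
```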
We probed four transformer-based PLMs. BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) were trained on Wikipedia articles and the Book Corpus. BioBERT (Lee et al., 2020) was trained on biomedical articles from PubMed abstracts, and BlueBERT (Peng et al., 2020) used both clinical (MIMIC-III (Johnson et al., 2016)) and biomedical (PubMed abstracts) corpora for pre-training. We also tested a randomly initialized transformer encoder (i.e., Scratch) to gauge the difficulty of our MSTs. For each model, we did not update the parameters during training, except for the MLM head on the last transformer layer. In all tasks, the models were trained with three random seeds, and we report the mean classification accuracy for all probing tasks. Appendix A provides further details on training and evaluation.
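To mirror this frozen-encoder probing, one can freeze every parameter except the MLM head. A minimal sketch for the BERT family follows (our sketch; module names differ across the other architectures):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Freeze the whole network, then unfreeze only the MLM head.
for param in model.parameters():
    param.requires_grad = False
for param in model.cls.parameters():   # `cls` is BERT's MLM head module
    param.requires_grad = True

# Caveat: BERT ties the MLM decoder weight to the input embeddings,
# so unfreezing the head also makes that shared matrix trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```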
Data Preparation
We manually crafted the templates in Table 2, which contain at most two slots for measurements and a [MASK] token for the answer. We instantiated [M] and [LoM] by sampling a measurement and a list of measurements, respectively. For measurement sampling, we independently sampled a number and a unit and then combined them. Specifically, we sampled units from the predefined set in Table 7, which consists of SI units and some units appearing in MIMIC-III.
                    MEASUREMENT COMPARISON                  UNIT          REF
                    COMP          ARG           SORT
Model     Notation  in    ex     in    ex     in    ex     in    ex     in    ex
ALBERT    Sci       81.2  77.3   60.4  58.0   78.2  76.5   48.6  49.9   71.9  59.9
          Deci      81.8  72.1   57.1  50.5   82.5  74.3   61.5  56.2   71.1  61.0
BERT      Sci       73.3  72.4   55.1  52.2   45.6  45.0   52.7  51.2   73.5  64.3
          Deci      81.4  77.0   60.9  54.3   54.9  54.5   61.9  59.2   77.2  67.5
BioBERT   Sci       82.7  82.3   55.0  54.4   68.2  69.1   58.7  57.3   81.3  63.7
          Deci      90.1  88.0   59.0  57.6   77.3  73.0   73.0  70.5   87.0  64.2
BlueBERT  Sci       77.3  76.3   46.9  46.9   63.6  64.3   53.0  51.3   73.6  65.4
          Deci      74.6  73.2   57.0  55.5   73.0  68.0   59.2  57.1   77.1  69.0
Scratch   Sci       50.9  50.8   40.2  37.1   33.3  33.8   52.5  50.7   66.3  60.8
          Deci      57.7  51.3   44.3  43.0   33.3  33.7   56.8  53.9   62.6  65.0

Table 3: Test-set results on MSTs. We report classification accuracy on the interpolation (in) and extrapolation (ex) test datasets. COMP, ARG, SORT, UNIT, and REF abbreviate COMPARISON, ARGMIN/MAX, SORTING, UNIT CONVERSION, and REFERENCE RANGE DETECTION, respectively. Sci and Deci stand for scientific and decimal notation, respectively.
The numbers in the training dataset were sampled from $[10^{-2}, 10^2)$. For evaluation, we constructed two test datasets: 1) Interpolation, which sampled numbers from the same range as the training dataset; and 2) Extrapolation, which sampled numbers from $[10^{-3}, 10^3)$. Note that we did not consider numbers outside the range $[10^{-3}, 10^3)$, because many of the unit prefixes come in powers of a thousand. Zhang et al. (2020) reported that representing numbers in scientific notation makes it easier for a language model to capture their scale. Following this observation, we tested two number notations: decimal and scientific. For example, 32.6 is represented as 32.6 in decimal notation and as 3.26E+01 in scientific notation. We randomly varied the number of digits after the decimal point between zero and three, and the significant digits were maintained when converting between notations.
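A minimal sketch of this generation procedure is given below. The uniform sampling distribution, the abbreviated unit and prefix lists, and the helper names are our assumptions; the paper specifies only the ranges, the notations, and the unit set of its Table 7.

```python
import random

UNITS = ["g", "L", "m"]              # abbreviated; the paper samples from its Table 7
PREFIXES = ["m", "c", "d", "", "k"]

def sample_number(lo_exp=-2, hi_exp=2, notation="decimal"):
    """Sample a number from [10**lo_exp, 10**hi_exp) and format it with
    zero to three digits after the decimal point."""
    value = random.uniform(10.0 ** lo_exp, 10.0 ** hi_exp)  # uniform is our assumption
    digits = random.randint(0, 3)
    if notation == "scientific":
        return f"{value:.{digits}E}"     # e.g., 3.26E+01
    return f"{value:.{digits}f}"         # e.g., 32.6

def sample_measurement(notation="decimal"):
    # A number and a unit are sampled independently, then combined.
    return sample_number(notation=notation) + random.choice(PREFIXES) + random.choice(UNITS)

# Instantiate the COMPARISON template from Table 2; the extrapolation
# set would pass lo_exp=-3, hi_exp=3 instead.
random.seed(0)
a, b = sample_measurement(), sample_measurement()
print(f"{a} is [MASK] than {b}")
```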
For REFERENCE RANGE DETECTION, we collected biomedical entities from six tables in MIMIC-III (INPUT, OUTPUT, LAB, PRESCRIPTION, PROCEDURE, and CHART) and chose a subset of them.
We report the number of samples and the distribution of labels for each MST in Table 8.
4 Results and Analysis
Measuring Skills of PLMs
Table 3 shows the results of the MSTs described in Section 2.
PLMs performed reasonably well on COMPARISON, SORTING, and REFERENCE RANGE DETECTION, but struggled considerably on the ARGMIN/MAX and UNIT CONVERSION tasks. This shows that some measuring skills are difficult to learn from an LM objective. Similar to previous NRoT studies (Wallace et al., 2019; Pal and Baral, 2021), PLMs often failed to extrapolate to values outside the training range. Further, in most cases, MST results got worse when numbers were represented in scientific notation.
We observed that BioBERT outperformed the other PLMs on UNIT CONVERSION, REFERENCE RANGE DETECTION, and COMPARISON, and showed comparable performance on the remaining MSTs. Compared to BioBERT, BlueBERT was pre-trained on a larger volume of biomedical text, yet showed worse performance. This suggests that pre-training on measurement-rich corpora helps a model acquire measuring skills, but that further training on noisy clinical text can hurt its reasoning over measurements. We also found that ALBERT outperformed its competitors on SORTING even though it performed the same or worse on the other tasks. This may be because ALBERT benefits from its sentence order prediction (SOP) objective, which predicts the ordering of two consecutive segments of text.
Effect of Using Different Prompts
One can expect the choice of prompt to have an impact on the results, and recent studies (Jiang et al., 2020; Petroni et al., 2019) support this. To see whether