

Do Language Models Understand Measurements?

Sungjin Park Seungwoo Ryu Edward Choi

KAIST

{zxznm, swryu, edwardchoi}@kaist.ac.kr

Abstract

The recent success of pre-trained language models (PLMs) has stimulated interest in their ability to understand and work with numbers. Yet numerical reasoning over measurements has not been formally studied despite its importance. In this study, we show that PLMs lack the capability required for reasoning over measurements. Furthermore, we find that a language model trained on a measurement-rich corpus shows better performance on understanding measurements. We propose a simple embedding strategy to better distinguish between numbers and units, which leads to a significant improvement in the probing tasks.

1 Introduction

The success of pre-trained language models (PLMs) has led to more research on their ability to understand commonsense. In this context, numerical reasoning over text (NRoT) is an NLP model’s ability to interpret and work with numbers in either digit or word form (Spithourakis and Riedel, 2018). Recent studies on NRoT test whether PLMs can answer questions on numeracy (Wallace et al., 2019), scalar magnitude comparison (Zhang et al., 2020), numerical facts (Lin et al., 2020), and math word problems (Wu et al., 2021).

Despite these efforts, existing works lack an analysis of the forms in which numbers appear. In particular, we focus on the case where numbers appear as a measurement in the context. In most scientific articles, measurements are an integral part of the context for capturing its appropriate meaning. For example, the two sentences "40g of Aspirin is lethal" and "40mg of Aspirin is lethal" contain the same words except for the unit of measurement (UoM), but the second sentence is incorrect because of the UoM.

In this work, we examine the measuring skill of PLMs: the ability to understand the system of measurement and perform numerical reasoning over measurements. We design three measuring skill tests (MSTs) and study how many measuring skills can be acquired. Specifically, UNIT CONVERSION, REFERENCE RANGE DETECTION, and MEASUREMENT COMPARISON require understanding of the system of measurement, the normal range of the biomedical entity, and the ability to combine knowledge about the system of measurement and NRoT, respectively. Table 1 shows an example of each of the measuring skill tests.

MST results showed that the models struggled to find the largest (or smallest) value in a list of measurements and to convert a measurement to another unit, while they performed well on the other tests. Compared to other PLMs, BioBERT (Lee et al., 2020) showed superior performance on UNIT CONVERSION and REFERENCE RANGE DETECTION, which implies that pre-training with measurement-rich text helps the model understand the system of measurement. Finally, we speculate that the lack of skills to distinguish numbers, units, and other words in the context makes the models fail in some MSTs. To mitigate this, we introduce scale embedding, which provides the model with information regarding the position and scale of the numbers in the input text. We show that scale embedding significantly improves the MST performance of all PLMs.

2 Measuring Skill Test

In this section, we describe three MSTs to carefully study the ability of PLMs to understand the system of measurement and perform numerical reasoning over the measurements.

2.1 Unit Conversion

This task requires the model to decide whether the two measurements represent the same quantity. For example, the model might correctly predict [MASK] in a sentence such as "3.5g and 3500mg are [MASK] value" to be filled with same if it understands the conversion of units correctly. In general, it is a convention to combine the unit (e.g., liter, meter) and its prefix (e.g., kilo, milli) to represent the numerical value of the measurement within a range $[10^{-3}, 10^{3})$. Therefore, various unit prefixes can appear in a single passage, even if the units are the same. To handle this, UNIT CONVERSION is essential for complex reasoning over measurements. To succeed in UNIT CONVERSION, we expect the model to handle the unit and numerical value jointly, based on an understanding of the system of measurement.

arXiv:2210.12694v1 [cs.CL] 23 Oct 2022

| Task | Example | Answer Candidates |
|---|---|---|
| COMPARISON | 1.59mg is [MASK] than 3.8g | larger, _smaller_ |
| ARGMIN/MAX | [MASK] value among 0.5mg, 3.4g, 2.8mg is 0.5mg | largest, _smallest_, middle |
| SORTING | sort 0.53mg, 32.54g, 2.8mg in [MASK] order is 0.53mg, 32.54g, 2.8mg | increasing, decreasing, _random_ |
| UNIT CONVERSION | 3.5g and 3500mg are [MASK] value | _same_, different |
| REFERENCE RANGE DETECTION | 85mg/dL of Glucose is [MASK] | _normal_, abnormal |

Table 1: Examples of measuring skill tests (MSTs). We underline the correct answer for each example.

| Task | Template |
|---|---|
| COMPARISON | [M] is [MASK] than [M] |
| ARGMIN/MAX | [MASK] value among [LoM] is [M] |
| SORTING | sort [LoM] in [MASK] order is [LoM] |
| UNIT CONVERSION | [M] and [M] are [MASK] value |
| REFERENCE RANGE DETECTION | [M] of [ENT] is [MASK] |

Table 2: Templates which we used for data generation. [M], [LoM], and [ENT] are the placeholders for the measurement, the list of measurements, and the biomedical entity, respectively.
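As an illustration of what the UNIT CONVERSION label encodes, the following is a minimal sketch that normalizes two measurement strings to a base unit and compares them exactly. The prefix table, the set of base units, and the function names are assumptions made for this example; they are not the generation code used in the paper.

```python
import re
from fractions import Fraction

# Illustrative subset of SI prefixes (assumption: the paper's unit set is larger).
PREFIX = {"k": Fraction(1000), "": Fraction(1), "m": Fraction(1, 1000)}
# Hypothetical base units for this sketch: gram, liter, meter.
BASE_UNITS = ("g", "L", "m")


def parse(measurement: str):
    """Split a string such as '3500mg' into (value in base units, base unit)."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*([A-Za-z]+)\s*", measurement)
    if match is None:
        raise ValueError(f"cannot parse {measurement!r}")
    number, unit = match.groups()
    for base in BASE_UNITS:
        prefix = unit[: -len(base)]
        if unit.endswith(base) and prefix in PREFIX:
            # Fraction keeps the arithmetic exact, so 3.5 g == 3500 mg holds.
            return Fraction(number) * PREFIX[prefix], base
    raise ValueError(f"unknown unit {unit!r}")


def unit_conversion_label(a: str, b: str) -> str:
    """Return 'same' if both measurements denote the same quantity."""
    value_a, unit_a = parse(a)
    value_b, unit_b = parse(b)
    return "same" if (unit_a, value_a) == (unit_b, value_b) else "different"


print(unit_conversion_label("3.5g", "3500mg"))
```

Exact rational arithmetic is used here instead of floating point so that equality between converted values is not affected by rounding.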

2.2 Reference Range Detection

Given a biomedical entity and measurement, this task requires a model to predict whether the measurement falls within the reference range. Knowledge of the biomedical entity plays a crucial role in understanding measurements, since the unit is determined by the biomedical entity. For example, we measure the hemoglobin level in g/dL. In addition to understanding UoMs, PLMs must rely on domain knowledge embedded in their parameters to solve this task, as context alone does not provide sufficient clues as to what the reference range is for the given biomedical entity.
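The labeling rule behind this task can be sketched as a lookup followed by a range check. The reference range below is a placeholder chosen only so that the Table 1 example (85mg/dL of Glucose) comes out as normal; it is an assumption of this sketch, not a value taken from the paper or from MIMIC-III, and the function name is hypothetical.

```python
# Placeholder reference ranges: entity -> (low, high, unit).
# ASSUMPTION: illustrative values only, not clinical guidance and not from the paper.
REFERENCE_RANGES = {
    "Glucose": (70.0, 100.0, "mg/dL"),
}


def reference_range_label(value: float, unit: str, entity: str) -> str:
    """Return 'normal' if the measurement lies inside the entity's range."""
    low, high, expected_unit = REFERENCE_RANGES[entity]
    if unit != expected_unit:
        # The unit is tied to the entity, so a mismatch is treated as an error here.
        raise ValueError(f"{entity} is expected in {expected_unit}, got {unit}")
    return "normal" if low <= value <= high else "abnormal"


print(reference_range_label(85, "mg/dL", "Glucose"))
```

The point of the sketch is that the range table is external knowledge: nothing in the input sentence itself reveals it, which is why the task probes what the model has stored in its parameters.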

2.3 Measurement Comparison

Given two measurements (or a series of $n$ measurements), the task is to predict the correct relationship between them. We created the synthetic dataset following other well-known NRoT tasks. Here, we consider three numerical reasoning tasks: COMPARISON (Talmor et al., 2020), ARGMIN/MAX (Wallace et al., 2019), and SORTING (Pal and Baral, 2021), all requiring the model to compare numbers. Note that each measurement in this task can have a different unit prefix. For example, the sample "1.59mg is [MASK] than 3.8g" containing two different units "mg" and "g" appears in the COMPARISON dataset. This task assesses the model’s ability to combine an understanding of measurements and numerical reasoning skills.
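To make the three label spaces concrete, here is a small sketch that derives the gold label for each task after converting every measurement to a common unit. The conversion table, the tuple representation, and the reading of "middle" as "neither the smallest nor the largest" are assumptions of this example rather than details stated in the paper; ties are not handled.

```python
# Multipliers to grams for the two units used in the Table 1 examples
# (assumption: an illustrative subset of the full unit set).
TO_GRAMS = {"g": 1.0, "mg": 1e-3}


def grams(measurement):
    """Convert a (value, unit) pair to grams."""
    value, unit = measurement
    return value * TO_GRAMS[unit]


def comparison_label(a, b):
    """Label for '[M] is [MASK] than [M]'."""
    return "larger" if grams(a) > grams(b) else "smaller"


def argminmax_label(items, picked):
    """Label for '[MASK] value among [LoM] is [M]'."""
    ordered = sorted(items, key=grams)
    if picked == ordered[0]:
        return "smallest"
    if picked == ordered[-1]:
        return "largest"
    return "middle"


def sorting_label(items):
    """Label for 'sort [LoM] in [MASK] order is [LoM]' given the output list."""
    values = [grams(m) for m in items]
    if values == sorted(values):
        return "increasing"
    if values == sorted(values, reverse=True):
        return "decreasing"
    return "random"


print(comparison_label((1.59, "mg"), (3.8, "g")))
print(argminmax_label([(0.5, "mg"), (3.4, "g"), (2.8, "mg")], (0.5, "mg")))
print(sorting_label([(0.53, "mg"), (32.54, "g"), (2.8, "mg")]))
```

Comparing the raw numbers alone would give the wrong answer in the first example (1.59 < 3.8 happens to agree, but 3.4g versus 2.8mg in the second would not if the units were swapped), which is what distinguishes these tasks from plain number comparison.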

3 Experiments

Probing Setup

We formulated MSTs as a Cloze test (Talmor et al., 2020) to fully utilize the knowledge captured by masked language modeling (MLM). Specifically, a PLM received the masked inputs given in Table 1, and the MLM head output the probability distribution of the answer candidates for [MASK]. Among the answer candidates, we chose the one with the highest probability as the final prediction.
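The prediction rule can be sketched independently of any particular model: take the scores that the MLM head assigns at the [MASK] position, keep only the answer candidates, and return the most probable one. In this sketch the logits are supplied as a plain dictionary, and renormalizing over the candidates is an assumption made for readability (it does not change the argmax); the function name is hypothetical.

```python
import math


def predict(mask_logits: dict, candidates: list):
    """Pick the answer candidate with the highest probability at [MASK].

    mask_logits maps vocabulary tokens to the MLM head's logits at the
    masked position; tokens outside `candidates` are ignored.
    """
    top = max(mask_logits[c] for c in candidates)
    # Softmax restricted to the candidates (shifted by `top` for stability).
    exp = {c: math.exp(mask_logits[c] - top) for c in candidates}
    total = sum(exp.values())
    probs = {c: v / total for c, v in exp.items()}
    return max(probs, key=probs.get), probs


# Made-up logits for "1.59mg is [MASK] than 3.8g".
logits = {"larger": 1.2, "smaller": 2.5, "the": 4.0}
prediction, probs = predict(logits, ["larger", "smaller"])
print(prediction, probs)
```

Restricting the choice to the candidate set means a high-scoring but irrelevant token (here "the") cannot be selected, so accuracy reflects only how the model ranks the valid answers.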

We probed four transformer-based PLMs. BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020) were trained on Wikipedia articles and Book Corpus. BioBERT (Lee et al., 2020) was trained on biomedical articles from PubMed abstracts, and BlueBERT (Peng et al., 2020) used both clinical (MIMIC-III (Johnson et al., 2016)) and biomedical (PubMed abstracts) corpora for pre-training. We also tested a randomly initialized transformer encoder (i.e., Scratch) to evaluate the difficulty of our MSTs. For each model, we did not update the parameters during training, except for the MLM head in the last transformer layer. In all tasks, the models were trained with three random seeds and we report the mean classification accuracy for all the probing tasks. Appendix A provides further details on training and evaluation.

Data Preparation

We manually crafted templates in Table 2 that contained at most two slots for measurements and a [MASK] token for an answer. We instantiated [M] and [LoM] by sampling the measurement and the list of measurements, respectively. For measurement sampling, we independently sampled a number and a unit and then combined them. Specifically, we sampled units from the predefined set in Table 7, which consists of SI units and some units in MIMIC-III.

| Model | Notation | COMP (in) | COMP (ex) | ARG (in) | ARG (ex) | SORT (in) | SORT (ex) | UNIT (in) | UNIT (ex) | REF (in) | REF (ex) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ALBERT | Sci | 81.2 | 77.3 | 60.4 | 58.0 | 78.2 | 76.5 | 48.6 | 49.9 | 71.9 | 59.9 |
| ALBERT | Deci | 81.8 | 72.1 | 57.1 | 50.5 | 82.5 | 74.3 | 61.5 | 56.2 | 71.1 | 61.0 |
| BERT | Sci | 73.3 | 72.4 | 55.1 | 52.2 | 45.6 | 45.0 | 52.7 | 51.2 | 73.5 | 64.3 |
| BERT | Deci | 81.4 | 77.0 | 60.9 | 54.3 | 54.9 | 54.5 | 61.9 | 59.2 | 77.2 | 67.5 |
| BioBERT | Sci | 82.7 | 82.3 | 55.0 | 54.4 | 68.2 | 69.1 | 58.7 | 57.3 | 81.3 | 63.7 |
| BioBERT | Deci | 90.1 | 88.0 | 59.0 | 57.6 | 77.3 | 73.0 | 73.0 | 70.5 | 87.0 | 64.2 |
| BlueBERT | Sci | 77.3 | 76.3 | 46.9 | 46.9 | 63.6 | 64.3 | 53.0 | 51.3 | 73.6 | 65.4 |
| BlueBERT | Deci | 74.6 | 73.2 | 57.0 | 55.5 | 73.0 | 68.0 | 59.2 | 57.1 | 77.1 | 69.0 |
| Scratch | Sci | 50.9 | 50.8 | 40.2 | 37.1 | 33.3 | 33.8 | 52.5 | 50.7 | 66.3 | 60.8 |
| Scratch | Deci | 57.7 | 51.3 | 44.3 | 43.0 | 33.3 | 33.7 | 56.8 | 53.9 | 62.6 | 65.0 |

Table 3: Test-set results on MSTs. We report the classification accuracy on the interpolation (in) and extrapolation (ex) test datasets. COMP, ARG, and SORT are the three MEASUREMENT COMPARISON tasks. COMP, ARG, SORT, UNIT, and REF are abbreviations of COMPARISON, ARGMIN/MAX, SORTING, UNIT CONVERSION, and REFERENCE RANGE DETECTION, respectively. Sci and Deci stand for scientific and decimal notations, respectively.

The numbers in the training dataset were sampled from $[10^{-2}, 10^{2})$. For evaluation, we constructed two evaluation datasets: 1) Interpolation sampled numbers from the same range as the training dataset; 2) Extrapolation sampled numbers from $[10^{-3}, 10^{3})$. Note that we did not consider numbers outside the range $[10^{-3}, 10^{3})$, because many of the unit prefixes are powers of a thousand. Zhang et al. (2020) reported that representing numbers in scientific notation made it easier for the language model to capture the scale of numbers. Following this observation, we tested two different number notations: decimal and scientific. For example, 32.6 can be represented as 32.6 and 3.26E+01 in decimal and scientific notation, respectively. We randomly varied the number of digits after the decimal point between zero and three, and the significant digits were maintained after converting the number notation.
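The generation steps described above can be sketched as follows: draw a number with zero to three fractional digits, optionally rewrite it in scientific notation with the same significant digits, attach an independently sampled unit, and fill the [M] slots of a template. The uniform sampling distribution, the small unit list, and all function names are assumptions of this sketch; the paper does not specify them in this section.

```python
import random
from decimal import Decimal

# Hypothetical stand-in for the predefined unit set referenced as Table 7.
UNITS = ["mg", "g", "kg"]


def to_scientific(decimal_text: str) -> str:
    """Rewrite a decimal string in scientific notation, keeping its significant digits."""
    significant = len(Decimal(decimal_text).as_tuple().digits)
    return f"{float(decimal_text):.{significant - 1}E}"


def sample_number(rng: random.Random, low: float = 0.01, high: float = 100.0) -> str:
    """Draw a decimal string in [low, high) with zero to three fractional digits.

    ASSUMPTION: uniform sampling; values that leave the range after rounding
    are redrawn.
    """
    while True:
        text = f"{rng.uniform(low, high):.{rng.randint(0, 3)}f}"
        if low <= float(text) < high:
            return text


def instantiate(template: str, rng: random.Random, scientific: bool = False) -> str:
    """Fill every [M] slot with an independently sampled number and unit."""
    while "[M]" in template:
        number = sample_number(rng)
        if scientific:
            number = to_scientific(number)
        template = template.replace("[M]", number + rng.choice(UNITS), 1)
    return template


rng = random.Random(0)
print(to_scientific("32.6"))
print(instantiate("[M] is [MASK] than [M]", rng))
print(instantiate("[M] and [M] are [MASK] value", rng, scientific=True))
```

For the extrapolation set, the same sketch would be called with `low=0.001` and `high=1000.0`. Counting significant digits from the decimal string, rather than from the float, is what keeps trailing zeros such as those in "0.050" after conversion.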

For REFERENCE RANGE DETECTION, we collected biomedical entities from six tables in MIMIC-III (INPUT, OUTPUT, LAB, PRESCRIPTION, PROCEDURE, and CHART) and chose a subset of them.

We report the number of samples and the distribution of labels for each MST in Table 8.

4 Results and Analysis

Measuring Skills of PLMs

Table 3 shows the results of MSTs stated in Section 2.

PLMs performed reasonably well on COMPARISON, SORTING, and REFERENCE RANGE DETECTION, but struggled considerably on ARGMIN/MAX and UNIT CONVERSION tasks. This shows that some measuring skills are difficult to learn from an LM objective. Similar to previous NRoT studies (Wallace et al., 2019; Pal and Baral, 2021), PLMs often failed to successfully extrapolate to values outside the training range. Further, in most cases, MST results got worse when we represented numbers in scientific notation.
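The two "in most cases" statements can be checked directly against the Table 3 accuracies. The sketch below copies the rows for the four pre-trained models (Scratch is left out, since the claims concern PLMs) and counts how often extrapolation accuracy is lower than interpolation accuracy, and how often decimal notation beats scientific notation. The data layout and function names are choices made for this example.

```python
# Table 3 accuracies; column order:
# COMP in, COMP ex, ARG in, ARG ex, SORT in, SORT ex, UNIT in, UNIT ex, REF in, REF ex
RESULTS = {
    ("ALBERT", "Sci"): [81.2, 77.3, 60.4, 58.0, 78.2, 76.5, 48.6, 49.9, 71.9, 59.9],
    ("ALBERT", "Deci"): [81.8, 72.1, 57.1, 50.5, 82.5, 74.3, 61.5, 56.2, 71.1, 61.0],
    ("BERT", "Sci"): [73.3, 72.4, 55.1, 52.2, 45.6, 45.0, 52.7, 51.2, 73.5, 64.3],
    ("BERT", "Deci"): [81.4, 77.0, 60.9, 54.3, 54.9, 54.5, 61.9, 59.2, 77.2, 67.5],
    ("BioBERT", "Sci"): [82.7, 82.3, 55.0, 54.4, 68.2, 69.1, 58.7, 57.3, 81.3, 63.7],
    ("BioBERT", "Deci"): [90.1, 88.0, 59.0, 57.6, 77.3, 73.0, 73.0, 70.5, 87.0, 64.2],
    ("BlueBERT", "Sci"): [77.3, 76.3, 46.9, 46.9, 63.6, 64.3, 53.0, 51.3, 73.6, 65.4],
    ("BlueBERT", "Deci"): [74.6, 73.2, 57.0, 55.5, 73.0, 68.0, 59.2, 57.1, 77.1, 69.0],
}
MODELS = ["ALBERT", "BERT", "BioBERT", "BlueBERT"]


def count_extrapolation_drops():
    """Count (model, notation, task) cells where ex accuracy is below in accuracy."""
    drops = total = 0
    for row in RESULTS.values():
        for interp, extrap in zip(row[0::2], row[1::2]):
            drops += extrap < interp
            total += 1
    return drops, total


def count_decimal_wins():
    """Count (model, task, split) cells where decimal beats scientific notation."""
    wins = total = 0
    for model in MODELS:
        for sci, deci in zip(RESULTS[(model, "Sci")], RESULTS[(model, "Deci")]):
            wins += deci > sci
            total += 1
    return wins, total


print(count_extrapolation_drops())  # (36, 40)
print(count_decimal_wins())         # (33, 40)
```

On these numbers, extrapolation is worse in 36 of 40 cells and decimal notation is better in 33 of 40 cells, with ALBERT accounting for five of the seven exceptions to the notation trend.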

We observed that BioBERT outperformed other PLMs in UNIT CONVERSION, REFERENCE RANGE DETECTION, and COMPARISON, and showed comparable performance in the rest of the MSTs. Compared to BioBERT, BlueBERT was pre-trained on a larger volume of biomedical text, but showed worse performance. This shows that pre-training on measurement-rich corpora assists the model in acquiring measuring skills, but further training on the noisy clinical text could harm it when performing reasoning over measurements. We also found that ALBERT outperformed its competitors in SORTING even though it performed the same or worse on other tasks. This may be because ALBERT benefits from its sentence order prediction (SOP) objective, which predicts the ordering of two consecutive segments of text.

Effect of using Different Prompts

One can expect that the choice of prompt has an impact on the results, and recent studies (Jiang et al., 2020; Petroni et al., 2019) support this. To see whether
