LANGUAGE MODELS ARE
MULTILINGUAL CHAIN-OF-THOUGHT REASONERS
Freda Shi1,2, Mirac Suzgun1,3, Markus Freitag1, Xuezhi Wang1
Suraj Srivats4, Soroush Vosoughi4, Hyung Won Chung1, Yi Tay1
Sebastian Ruder1, Denny Zhou1, Dipanjan Das1, Jason Wei1
1Google Research 2Toyota Technological Institute at Chicago
3Stanford University 4Dartmouth College
ABSTRACT
We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.
[Figure 1: MGSM accuracy (%) plotted against the frequency of each language in the pre-training dataset (token percentage, 0.01% to 100%), grouping underrepresented languages (SW, BN, TE, TH), high-resource languages (JA, ZH, RU, ES, FR, DE), and English (EN). Three settings are shown: translating to English with Google Translate and solving with English intermediate steps; intermediate reasoning steps in the language of the question; and intermediate reasoning steps in English.]
Figure 1: Correlation between language frequency and MGSM accuracy for PaLM-540B. The
accuracy is surprisingly high, even for underrepresented languages like Swahili (SW) and Bengali
(BN), which account for less than 0.01% of the pre-training dataset.
Equal contribution. Work done during internship at Google Research.
arXiv:2210.03057v1 [cs.CL] 6 Oct 2022
1 INTRODUCTION
Recent work has shown that presenting explicit reasoning steps (i.e., chains of thought; CoT) in English elicits multi-step reasoning abilities of large language models such as GPT-3 and PaLM (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022b, inter alia). Pretrained multilingual language models have also achieved impressive performance on various NLP tasks across typologically distinct languages (Conneau et al., 2020; Xue et al., 2021; Chowdhery et al., 2022; Clark et al., 2020; Hu et al., 2020; Ruder et al., 2021, inter alia). Tasks in existing multilingual benchmarks usually require only simple reasoning steps, and so it is still unclear how well language models perform on tasks that require more complex reasoning in a multilingual setting.
In this work, we introduce the MGSM benchmark to bridge the gap between the progress on English-based chain-of-thought reasoning and multilingual NLP. We extend a subset of the English-language GSM8K dataset (Cobbe et al., 2021) to ten typologically diverse languages via manual translation of problems into target languages. To the best of our knowledge, this is the first multilingual benchmark to evaluate the arithmetic reasoning abilities of language models.
We evaluate two large language models, GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2022), on this benchmark. While both models solve less than 20% of problems with standard prompting, the 540-billion-parameter PaLM model in particular shows exceptional multilingual reasoning abilities with intermediate reasoning steps (Figure 1), solving more than 40% of the problems in any investigated language, including underrepresented languages such as Bengali and Swahili. In our best setting, PaLM achieves an average solve rate of 55% across languages. We find that intermediate reasoning steps in English consistently lead to competitive or better results than those written in the native language of the question, suggesting that English chain-of-thought prompting may be a useful baseline for future multilingual reasoning work.
We further demonstrate that the multilingual reasoning abilities of pretrained models extend to common-sense reasoning (Ponti et al., 2020) and word-in-context semantic judgment (Raganato et al., 2020). By presenting the models with few-shot examples in different languages, PaLM sets a new state-of-the-art performance (89.9%) on XCOPA (Ponti et al., 2020), outperforming the prior approaches that require thousands of training examples.
2 THE MGSM BENCHMARK
In this section, we describe the collection process of Multilingual Grade School Math (MGSM), to
our knowledge the first multilingual arithmetic reasoning benchmark.
Figure 2: MGSM problem distribution with respect to the number of reasoning steps in the standard solution.
Source data. We used GSM8K (Cobbe et al., 2021), an English-language human-annotated grade-school math problem dataset, as the base data source. For MGSM, we took the first 250 examples from the GSM8K official test example list. Each problem requires two to eight steps to solve according to the official solution (Figure 2). The answer for each question in GSM8K was written as an Arabic numeral, which we kept consistent across all languages to facilitate cross-lingual prediction.1
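Because every MGSM answer is kept as an Arabic numeral, a single language-agnostic scorer can grade model outputs in all eleven languages. The following sketch is our own illustration, not the paper's released evaluation code; it takes the last number in a completion, matching the "The answer is 11." format of the chain-of-thought solutions in Table 1:

```python
from __future__ import annotations
import re

def extract_answer(completion: str) -> str | None:
    """Return the last Arabic numeral in a model completion.

    MGSM keeps gold answers as Arabic numerals in every language, so
    the final number of a chain-of-thought solution (e.g. "... The
    answer is 11.") serves as the prediction.
    """
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(completion: str, gold: str) -> bool:
    """Compare the extracted prediction against the gold numeral."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(gold)
```

Since the scorer never inspects the reasoning text itself, the same function grades German, Bengali, or Thai solutions unchanged.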
Target language selection. We selected a typologically diverse set of ten languages other than English (EN), spanning eight language families and different levels of representation in standard pretraining datasets such as mC4 (Xue et al., 2021): Bengali (BN), Chinese (ZH), French (FR), German (DE), Japanese (JA), Russian (RU), Spanish (ES), Swahili (SW), Telugu (TE), and Thai (TH).
1 Certain scripts such as Devanagari employ different numerals. We restrict the data to Arabic numerals for consistency, but future work may investigate cross-lingual numeracy by mapping Arabic numerals to those of the corresponding script (see Spithourakis & Riedel, 2018).
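For scripts with a contiguous decimal digit block, the numeral mapping mentioned in the footnote is mechanical. A minimal sketch for Devanagari (digits U+0966 through U+096F), purely as an illustration of that future-work direction:

```python
# Devanagari decimal digits (U+0966 through U+096F), in order 0-9.
_DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")

def to_devanagari(text: str) -> str:
    """Rewrite Arabic numerals in a string with Devanagari digits."""
    return text.translate(_DEVANAGARI)
```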
Original Question
  Frage: Roger hat 5 Tennisbälle. Er kauft noch 2 Dosen Tennisbälle. In jeder Dose sind 3 Tennisbälle. Wie viele Tennisbälle hat er jetzt?
DIRECT
  Antwort: 11
NATIVE-COT
  Schritt-für-Schritt-Antwort: Roger begann mit 5 Bällen. 2 Dosen von jeweils 3 Tennisbällen macht 6 Tennisbälle. 5 + 6 = 11. Die Antwort ist 11.
EN-COT
  Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Translated English Question
  Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
TRANSLATE-EN
  Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Table 1: Example solution formats (§3) for a German exemplar problem, where German-specific components are underlined and are changed to the corresponding translations for other investigated languages. For DIRECT, NATIVE-COT, and EN-COT, we provide the original German question as input to the model and expect an answer in the corresponding format; for TRANSLATE-EN, we input the translated question in English and expect a step-by-step solution in English. To obtain the desirable output format, we prepend few-shot examples in the corresponding format.
Manual translation process. We enlisted the help of paid professional translators (two for Chinese and German, three for Russian, five for Thai, and one for each remaining target language) for the manual translation of the 250 selected English-language examples from GSM8K. All translators involved were native speakers of the target language and had at least two years of professional experience in translating between English and the target language. All translators signed a machine translation (MT) non-usage declaration before they started to work. To verify the quality of the human translations, the vendor sent a random subset of translations to an additional translator for review, and checked for n-gram overlap with popular MT providers to ensure that no machine translation toolkit had been used. We employ the translation results as gold-standard translations.
3 MULTILINGUAL CHAIN-OF-THOUGHT PROMPTING
We provide an overview of standard prompting and chain-of-thought prompting, as well as their extensions to the multilingual setting, which we illustrate in Table 1 and use in our experiments (§4).
In standard prompting, given a prompt in the source language, the model is asked to predict the answer (Brown et al., 2020; Schick & Schütze, 2021). This can be done in a zero-shot or few-shot setting by providing exemplars following the same template as additional input to the model. We refer to this setting as direct answer prediction (DIRECT), as the model directly predicts the answer to the problem. This setting measures the model's ability to solve problems without any intermediate reasoning steps.
Chain-of-thought (CoT; Wei et al., 2022b) prompting helps improve many few-shot reasoning tasks by augmenting few-shot examples with intermediate reasoning steps that should be predicted by the model. In the multilingual setting, we can apply CoT to solve the problem in the native language (NATIVE-COT) by predicting the reasoning steps in the original language of the problem. This measures the model's ability to both understand and solve the problem in a specific language.
Alternatively, we can ask the model to predict the chain of thought in English (EN-COT), regardless of the problem language. Such an approach may be useful, as English is often used as the source language for cross-lingual transfer (Hu et al., 2020) and has been found effective when used as the prompt language (Zhao & Schütze, 2021; Winata et al., 2021; Lin et al., 2021b).
Finally, we can translate the problem to English and solve it with English CoT (TRANSLATE-EN). In this setting, we use the Google Translate API to translate problems into English. This mirrors the translate-train setup (Hu et al., 2020; Xue et al., 2021; Ruder et al., 2021), the best-performing setting for fine-tuning multilingual models, where the training data is translated to English.
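The four solution strategies differ only in the answer prefix appended to the question and in whether the question is machine-translated first. A minimal sketch of that composition (the prefix dictionaries and the `translate` callback are our assumptions; the German and English prefix strings come from Table 1, and the exact full prompts are in the paper's appendix):

```python
# Answer prefixes per language; the German and English entries match
# Table 1. Other languages would be added analogously.
DIRECT_PREFIX = {"de": "Antwort:", "en": "Answer:"}
COT_PREFIX = {"de": "Schritt-für-Schritt-Antwort:", "en": "Step-by-Step Answer:"}

def build_prompt(question: str, lang: str, setting: str, translate=None) -> str:
    """Compose the model input for one problem under the four solution
    strategies of Section 3. Few-shot exemplars, which would be
    prepended in the same format, are omitted here for brevity."""
    if setting == "DIRECT":        # answer only, no reasoning steps
        return f"{question}\n{DIRECT_PREFIX[lang]}"
    if setting == "NATIVE-COT":    # reason in the question's language
        return f"{question}\n{COT_PREFIX[lang]}"
    if setting == "EN-COT":        # native question, English reasoning
        return f"{question}\n{COT_PREFIX['en']}"
    if setting == "TRANSLATE-EN":  # translate first, then English CoT;
        # `translate` stands in for the Google Translate API call
        return f"{translate(question)}\n{COT_PREFIX['en']}"
    raise ValueError(f"unknown setting: {setting}")
```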
                        DIRECT   NATIVE-COT   EN-COT   TRANSLATE-EN
NATIVE-EXEMPLARS          X          X           X          X
ENGLISH-EXEMPLARS         X         N/A          X         N/A
MULTILINGUAL-EXEMPLARS    X          X           X         N/A
Table 2: Possible combinations between few-shot exemplar selection and solution strategies.
[Figure 3 contrasts two prompting setups. With native exemplar prompting, the model input is a Bengali exemplar question with its Bengali chain of thought, followed by a new Bengali question; the model then outputs a Bengali chain of thought. With multilingual exemplar prompting, the input interleaves Bengali, Russian, and Chinese exemplar questions, each paired with a chain of thought in its own language, before the Bengali question of interest.]
Figure 3: The chain-of-thought prompts and example model outputs in the MGSM experiments. The
solutions are written in the same language as the questions of interest (NATIVE-COT).
Beyond the prompting methods, there are different ways to provide few-shot examples in context for
multilingual prompting:
All native question exemplars (NATIVE-EXEMPLARS).
We use a few in-language questions
together with their solutions as the few-shot prompt exemplars. This is the most natural setting
when we have a few examples in each investigated language.
All English question exemplars (ENGLISH-EXEMPLARS).
When we are unable to access any
existing questions or solution examples in some languages, an intuitive way is to use English
questions and solutions as exemplars to perform zero-shot cross-lingual transfer. Note that it is
unrealistic to combine this exemplar selection setting with NATIVE-COT, since we assume no
access to the native language for prompting.
Generic multilingual question exemplars (MULTILINGUAL-EXEMPLARS).
Similar to
ENGLISH-EXEMPLARS, we assume access to questions and solutions in a few languages, and test
if multilingual exemplars better elicit the multilingual reasoning ability of models.
For TRANSLATE-EN, as all exemplar questions and solutions are in English, we only experiment
with the translated native question exemplars and English CoT. We summarize the combinations
of prompting and exemplar methods in Table 2, and present an illustration in Figure 3. Detailed
prompting input for each investigated combination can be found in Appendix A.2.
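Concretely, each exemplar-selection scheme only determines which formatted (question + solution) strings are concatenated ahead of the target problem. A sketch under assumed data structures (the function and dictionary keys are hypothetical illustrations, not the paper's code):

```python
def assemble_few_shot(exemplars, target_prompt, scheme, k=6):
    """Prepend k formatted exemplars to the target problem.

    `exemplars` maps "native" and "en" to lists of formatted
    (question + solution) strings, and "pool" to per-language lists
    for the multilingual setting.
    """
    if scheme == "NATIVE-EXEMPLARS":     # exemplars in the target language
        shots = exemplars["native"][:k]
    elif scheme == "ENGLISH-EXEMPLARS":  # zero-shot cross-lingual transfer
        shots = exemplars["en"][:k]
    elif scheme == "MULTILINGUAL-EXEMPLARS":
        # one exemplar from each of several languages
        shots = [per_lang[0] for per_lang in exemplars["pool"]][:k]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return "\n\n".join(shots + [target_prompt])
```

The N/A cells of Table 2 correspond to scheme/strategy pairs this helper would never be called with, e.g. ENGLISH-EXEMPLARS with NATIVE-COT.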
4 EXPERIMENTS ON MGSM
In this section, we evaluate the multilingual reasoning abilities of two representative state-of-the-art pretrained large language models, GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), on our MGSM benchmark in various prompting settings using exemplars in the source language