LANGUAGE MODELS ARE
MULTILINGUAL CHAIN-OF-THOUGHT REASONERS
Freda Shi1,2, Mirac Suzgun1,3, Markus Freitag1, Xuezhi Wang1
Suraj Srivats4, Soroush Vosoughi4, Hyung Won Chung1, Yi Tay1
Sebastian Ruder1, Denny Zhou1, Dipanjan Das1, Jason Wei1
1Google Research 2Toyota Technological Institute at Chicago
3Stanford University 4Dartmouth College
ABSTRACT
We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale, and that models have strikingly strong multilingual reasoning abilities, even in underrepresented languages such as Bengali and Swahili. Finally, we show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment. The MGSM benchmark is publicly available at https://github.com/google-research/url-nlp.
[Figure 1: MGSM accuracy (%) plotted against the frequency of each language in the pre-training dataset (token percentage, 0.01% to 100%), grouping underrepresented languages (SW, BN, TE, TH), high-resource languages (JA, ZH, RU, ES, FR, DE), and English (EN). Three settings are shown: translating to English with Google Translate and solving with English intermediate steps; intermediate reasoning steps in the language of the question; and intermediate reasoning steps in English.]
Figure 1: Correlation between language frequency and MGSM accuracy for PaLM-540B. The
accuracy is surprisingly high, even for underrepresented languages like Swahili (SW) and Bengali
(BN), which account for less than 0.01% of the pre-training dataset.
Equal contribution. Work done during internship at Google Research.
arXiv:2210.03057v1 [cs.CL] 6 Oct 2022
1 INTRODUCTION
Recent work has shown that presenting explicit reasoning steps (i.e., chains of thought; CoT) in English elicits multi-step reasoning abilities of large language models such as GPT-3 and PaLM (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022b, inter alia). Pretrained multilingual language models have also achieved impressive performance on various NLP tasks across typologically distinct languages (Conneau et al., 2020; Xue et al., 2021; Chowdhery et al., 2022; Clark et al., 2020; Hu et al., 2020; Ruder et al., 2021, inter alia). Tasks in existing multilingual benchmarks usually require only simple reasoning steps, and so it is still unclear how well language models perform on tasks that require more complex reasoning in a multilingual setting.
In this work, we introduce the MGSM benchmark to bridge the gap between the progress on English-based chain-of-thought reasoning and multilingual NLP. We extend a subset of the English-language GSM8K dataset (Cobbe et al., 2021) to ten typologically diverse languages via manual translation of problems into target languages. To the best of our knowledge, this is the first multilingual benchmark to evaluate the arithmetic reasoning abilities of language models.
We evaluate two large language models, GPT-3 (Brown et al., 2020; Ouyang et al., 2022) and PaLM (Chowdhery et al., 2022), on this benchmark. While both models solve less than 20% of problems with standard prompting, the 540-billion-parameter PaLM model in particular shows exceptional multilingual reasoning abilities with intermediate reasoning steps (Figure 1), solving more than 40% of the problems in any investigated language, including underrepresented languages such as Bengali and Swahili. In our best setting, PaLM achieves an average solve rate of 55% across languages. We find that intermediate reasoning steps in English consistently lead to competitive or better results than those written in the native language of the question, suggesting that English chain-of-thought prompting may be a useful baseline for future multilingual reasoning work.
We further demonstrate that the multilingual reasoning abilities of pretrained models extend to common-sense reasoning (Ponti et al., 2020) and word-in-context semantic judgment (Raganato et al., 2020). By presenting the models with few-shot examples in different languages, PaLM sets a new state-of-the-art performance (89.9%) on XCOPA (Ponti et al., 2020), outperforming the prior approaches that require thousands of training examples.
2 THE MGSM BENCHMARK
In this section, we describe the collection process of Multilingual Grade School Math (MGSM), to
our knowledge the first multilingual arithmetic reasoning benchmark.
Figure 2: MGSM problem distribution with respect to the number of reasoning steps in the standard solution.
Source data. We used GSM8K (Cobbe et al., 2021), an English-language human-annotated grade-school math problem dataset, as the base data source. For MGSM, we took the first 250 examples from the GSM8K official test example list. Each problem requires two to eight steps to solve according to the official solution (Figure 2). The answer for each question in GSM8K was written as an Arabic numeral, which we kept consistent across all languages to facilitate cross-lingual prediction.1
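Because every MGSM answer is kept as an Arabic numeral, a single language-agnostic scorer can grade model outputs in all eleven languages. The following sketch is our own illustration, not the paper's released evaluation code; it takes the last number in a completion, matching the "The answer is 11." format of the chain-of-thought solutions in Table 1:

```python
from __future__ import annotations
import re

def extract_answer(completion: str) -> str | None:
    """Return the last Arabic numeral in a model completion.

    MGSM keeps gold answers as Arabic numerals in every language, so
    the final number of a chain-of-thought solution (e.g. "... The
    answer is 11.") serves as the prediction.
    """
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(completion: str, gold: str) -> bool:
    """Compare the extracted prediction against the gold numeral."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(gold)
```

Since the scorer never inspects the reasoning text itself, the same function grades German, Bengali, or Thai solutions unchanged.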
Target language selection. We selected a typologically diverse set of ten languages other than English (EN), spanning eight language families and different levels of representation in standard pretraining datasets such as mC4 (Xue et al., 2021): Bengali (BN), Chinese (ZH), French (FR), German (DE), Japanese (JA), Russian (RU), Spanish (ES), Swahili (SW), Telugu (TE), and Thai (TH).
1 Certain scripts such as Devanagari employ different numerals. We restrict the data to Arabic numerals for consistency, but future work may investigate cross-lingual numeracy by mapping Arabic numerals to those of the corresponding script (see Spithourakis & Riedel, 2018).
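For scripts with a contiguous decimal digit block, the numeral mapping mentioned in the footnote is mechanical. A minimal sketch for Devanagari (digits U+0966 through U+096F), purely as an illustration of that future-work direction:

```python
# Devanagari decimal digits (U+0966 through U+096F), in order 0-9.
_DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")

def to_devanagari(text: str) -> str:
    """Rewrite Arabic numerals in a string with Devanagari digits."""
    return text.translate(_DEVANAGARI)
```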
Original Question
  Frage: Roger hat 5 Tennisbälle. Er kauft noch 2 Dosen Tennisbälle. In jeder Dose sind 3 Tennisbälle. Wie viele Tennisbälle hat er jetzt?
DIRECT
  Antwort: 11
NATIVE-COT
  Schritt-für-Schritt-Antwort: Roger begann mit 5 Bällen. 2 Dosen von jeweils 3 Tennisbällen macht 6 Tennisbälle. 5 + 6 = 11. Die Antwort ist 11.
EN-COT
  Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Translated English Question
  Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
TRANSLATE-EN
  Step-by-Step Answer: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Table 1: Example solution formats (§3) for a German exemplar problem, where German-specific components are underlined and are changed to the corresponding translations for other investigated languages. For DIRECT, NATIVE-COT, and EN-COT, we provide the original German question as input to the model and expect an answer in the corresponding format; for TRANSLATE-EN, we input the translated question in English and expect a step-by-step solution in English. To obtain the desirable output format, we prepend few-shot examples in the corresponding format.
Manual translation process. We enlisted the help of paid professional translators (two for Chinese and German, three for Russian, five for Thai, and one for each remaining target language) for the manual translation of the 250 selected English-language examples from GSM8K. All translators involved were native speakers of the target language and had at least two years of professional experience in translating between English and the target language. All translators signed a machine translation (MT) non-usage declaration before they started to work. To verify the quality of the human translations, the vendor sent a random subset of translations to an additional translator for review, and checked for n-gram overlap with popular MT providers to ensure that no machine translation toolkit had been used. We employ the translation results as gold-standard translations.
3 MULTILINGUAL CHAIN-OF-THOUGHT PROMPTING
We provide an overview of standard prompting and chain-of-thought prompting, as well as their extensions to the multilingual setting, which we illustrate in Table 1 and use in our experiments (§4).
In standard prompting, given a prompt in the source language, the model is asked to predict the answer (Brown et al., 2020; Schick & Schütze, 2021). This can be done in a zero-shot or few-shot setting by providing exemplars following the same template as additional input to the model. We refer to this setting as direct answer prediction (DIRECT), as the model directly predicts the answer to the problem. This setting measures the model's ability to solve problems without any intermediate reasoning steps.
Chain-of-thought (CoT; Wei et al., 2022b) prompting helps improve many few-shot reasoning tasks by augmenting few-shot examples with intermediate reasoning steps that should be predicted by the model. In the multilingual setting, we can apply CoT to solve the problem in the native language (NATIVE-COT) by predicting the reasoning steps in the original language of the problem. This measures the model's ability to both understand and solve the problem in a specific language.
Alternatively, we can ask the model to predict the chain of thought in English (EN-COT), regardless of the problem language. Such an approach may be useful, as English is often used as the source language for cross-lingual transfer (Hu et al., 2020) and has been found effective when used as the prompt language (Zhao & Schütze, 2021; Winata et al., 2021; Lin et al., 2021b).
Finally, we can translate the problem to English and solve it with English CoT (TRANSLATE-EN). In this setting, we use the Google Translate API to translate problems into English. This mirrors the translate-train setup (Hu et al., 2020; Xue et al., 2021; Ruder et al., 2021), the best-performing setting for fine-tuning multilingual models, where the training data is translated to English.
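The four solution strategies differ only in the answer prefix appended to the question and in whether the question is machine-translated first. A minimal sketch of that composition (the prefix dictionaries and the `translate` callback are our assumptions; the German and English prefix strings come from Table 1, and the exact full prompts are in the paper's appendix):

```python
# Answer prefixes per language; the German and English entries match
# Table 1. Other languages would be added analogously.
DIRECT_PREFIX = {"de": "Antwort:", "en": "Answer:"}
COT_PREFIX = {"de": "Schritt-für-Schritt-Antwort:", "en": "Step-by-Step Answer:"}

def build_prompt(question: str, lang: str, setting: str, translate=None) -> str:
    """Compose the model input for one problem under the four solution
    strategies of Section 3. Few-shot exemplars, which would be
    prepended in the same format, are omitted here for brevity."""
    if setting == "DIRECT":        # answer only, no reasoning steps
        return f"{question}\n{DIRECT_PREFIX[lang]}"
    if setting == "NATIVE-COT":    # reason in the question's language
        return f"{question}\n{COT_PREFIX[lang]}"
    if setting == "EN-COT":        # native question, English reasoning
        return f"{question}\n{COT_PREFIX['en']}"
    if setting == "TRANSLATE-EN":  # translate first, then English CoT;
        # `translate` stands in for the Google Translate API call
        return f"{translate(question)}\n{COT_PREFIX['en']}"
    raise ValueError(f"unknown setting: {setting}")
```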
                        DIRECT   NATIVE-COT   EN-COT   TRANSLATE-EN
NATIVE-EXEMPLARS          X          X           X          X
ENGLISH-EXEMPLARS         X         N/A          X         N/A
MULTILINGUAL-EXEMPLARS    X          X           X         N/A
Table 2: Possible combinations between few-shot exemplar selection and solution strategies.
[Figure 3 contrasts two prompting setups. With native exemplar prompting, the model input is a Bengali exemplar question with its Bengali chain of thought, followed by a new Bengali question; the model then outputs a Bengali chain of thought. With multilingual exemplar prompting, the input interleaves Bengali, Russian, and Chinese exemplar questions, each paired with a chain of thought in its own language, before the Bengali question of interest.]
Figure 3: The chain-of-thought prompts and example model outputs in the MGSM experiments. The
solutions are written in the same language as the questions of interest (NATIVE-COT).
Beyond the prompting methods, there are different ways to provide few-shot examples in context for
multilingual prompting:
All native question exemplars (NATIVE-EXEMPLARS).
We use a few in-language questions
together with their solutions as the few-shot prompt exemplars. This is the most natural setting
when we have a few examples in each investigated language.
All English question exemplars (ENGLISH-EXEMPLARS).
When we are unable to access any
existing questions or solution examples in some languages, an intuitive way is to use English
questions and solutions as exemplars to perform zero-shot cross-lingual transfer. Note that it is
unrealistic to combine this exemplar selection setting with NATIVE-COT, since we assume no
access to the native language for prompting.
Generic multilingual question exemplars (MULTILINGUAL-EXEMPLARS).
Similar to
ENGLISH-EXEMPLARS, we assume access to questions and solutions in a few languages, and test
if multilingual exemplars better elicit the multilingual reasoning ability of models.
For TRANSLATE-EN, as all exemplar questions and solutions are in English, we only experiment
with the translated native question exemplars and English CoT. We summarize the combinations
of prompting and exemplar methods in Table 2, and present an illustration in Figure 3. Detailed
prompting input for each investigated combination can be found in Appendix A.2.
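Concretely, each exemplar-selection scheme only determines which formatted (question + solution) strings are concatenated ahead of the target problem. A sketch under assumed data structures (the function and dictionary keys are hypothetical illustrations, not the paper's code):

```python
def assemble_few_shot(exemplars, target_prompt, scheme, k=6):
    """Prepend k formatted exemplars to the target problem.

    `exemplars` maps "native" and "en" to lists of formatted
    (question + solution) strings, and "pool" to per-language lists
    for the multilingual setting.
    """
    if scheme == "NATIVE-EXEMPLARS":     # exemplars in the target language
        shots = exemplars["native"][:k]
    elif scheme == "ENGLISH-EXEMPLARS":  # zero-shot cross-lingual transfer
        shots = exemplars["en"][:k]
    elif scheme == "MULTILINGUAL-EXEMPLARS":
        # one exemplar from each of several languages
        shots = [per_lang[0] for per_lang in exemplars["pool"]][:k]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return "\n\n".join(shots + [target_prompt])
```

The N/A cells of Table 2 correspond to scheme/strategy pairs this helper would never be called with, e.g. ENGLISH-EXEMPLARS with NATIVE-COT.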
4 EXPERIMENTS ON MGSM
In this section, we evaluate the multilingual reasoning abilities of two representative state-of-the-art pretrained large language models, GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), on our MGSM benchmark in various prompting settings using exemplars in the source language