Extractive Question Answering on Queries
in Hindi and Tamil
Adhitya Thirumala, Elisa Ferracane
Abstract
Indic languages like Hindi and Tamil are underrepresented in the natural language
processing (NLP) field compared to languages like English. Due to this
underrepresentation, performance on NLP tasks (such as search algorithms) in Indic
languages is inferior to performance on the same tasks in English. This difference
disproportionately affects people of lower socioeconomic status because they
consume the most Internet content in local languages. The goal of this project is
to build an NLP model that performs better than pre-existing models on the task
of extractive question answering (QA) on a public dataset in Hindi and Tamil.
Extractive QA is an NLP task in which answers to questions are extracted from
a corresponding body of text. To build the best solution, we used three different
models. The first model is an unmodified cross-lingual version of the NLP model
RoBERTa, known as XLM-RoBERTa, that is pretrained on 100 languages. The
second model is based on the pretrained RoBERTa model with an extra
classification head for question answering, but we used a custom Indic tokenizer,
then optimized hyperparameters and fine-tuned on the Indic dataset. The third
model is based on XLM-RoBERTa, but with additional fine-tuning and training on the
Indic dataset. We hypothesized that the third model would perform best because of
the variety of languages XLM-RoBERTa has been pretrained on and the additional
fine-tuning on the Indic dataset. This hypothesis was proven wrong: the paired
RoBERTa models performed best, because their training data was the most specific
to the task, whereas much of the data used for the XLM-RoBERTa models was in
neither Hindi nor Tamil.
arXiv:2210.06356v1 [cs.CL] 27 Sep 2022
I. Introduction
Indic languages are underrepresented in the natural language processing (NLP)
field compared to languages like English. For example, of the corpora provided by
the Natural Language Toolkit (NLTK) (Arora [2020]), over half are in English.
Indic languages include Hindi, Tamil, Telugu, Malayalam, and others that
originate from the Indian subcontinent, and are spoken by the 1.7 billion people
who live there. As seen in Table 1, around 57% of people in India speak Hindi,
while only 10.6% speak English.

An English-centric view is bad for the NLP community for a variety of reasons.
First, English’s script differs from those of many other languages, and a focus
on English or only English-like languages (such as French, German, or Spanish)
can make current systems work poorly on languages with different writing systems,
such as Chinese or Korean. Second, tokenization, the process of splitting a
document into smaller pieces such as paragraphs, sentences, or words, is more
straightforward in English and other languages in which a “word” is easier to
define. However, in languages such as Tamil, where one does not traditionally put
spaces between words, the task of tokenizing becomes completely different
(Bender [2021]).

Due to this underrepresentation, NLP tasks in Indic languages, such as search
algorithms and sentiment analysis, perform worse than their English counterparts
(Arora [2020]). This difference disproportionately affects people of lower
socioeconomic status because they consume the most Internet content in local
languages (S [2019]). One way to make information easier to access is to improve
the performance of extractive question-answering algorithms in search engines for
Indic languages. Extractive QA is an NLP task that extracts an answer to a
question from a given body of text. This body of text is known as the context and
normally contains the answer to the question provided. In the dataset we used,
the contexts (and answers) came from Wikipedia articles, and the questions were
written by native speakers of Hindi and Tamil. For this project, we use Hindi and
Tamil as training data because they are the languages provided by the Kaggle
competition (Google [2021]) that inspired this project.
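To make the definition of extractive QA concrete, the following is a minimal sketch of how one example can be represented: the answer is a span copied verbatim from the context and located by a character offset. The class name, field names, and helper function are hypothetical and chosen for illustration (they mirror SQuAD-style conventions and are not confirmed to match the competition data), and the sentence is an invented English example rather than a record from the dataset.

```python
from dataclasses import dataclass


@dataclass
class QAExample:
    """One extractive QA record (hypothetical field names)."""
    context: str       # body of text that contains the answer
    question: str      # question asked about the context
    answer_text: str   # answer, copied verbatim from the context
    answer_start: int  # character offset of the answer in the context

    def answer_end(self) -> int:
        # Exclusive end offset of the answer span.
        return self.answer_start + len(self.answer_text)

    def is_consistent(self) -> bool:
        # The span given by the offsets must reproduce the answer text.
        return self.context[self.answer_start:self.answer_end()] == self.answer_text


def make_example(context: str, question: str, answer_text: str) -> QAExample:
    # Locate the first verbatim occurrence of the answer in the context.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer must appear verbatim in the context")
    return QAExample(context, question, answer_text, start)


# Invented example for illustration only.
example = make_example(
    context="Chennai is the capital of the Indian state of Tamil Nadu.",
    question="What is the capital of Tamil Nadu?",
    answer_text="Chennai",
)
print(example.answer_start, example.answer_end(), example.is_consistent())
```

Under this representation, a model for the task only has to predict where the span starts and ends, rather than generate free text.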
Table 1: Total speakers of languages in India (Source: 2011 Indian Census)

Language     Speakers (M)   Speakers (% of pop.)
Hindi        692            57.1
English      129            10.6
Bengali      107            8.9
Marathi      99             8.2
Telugu       95             7.8
Tamil        77             6.3
Gujarati     60             5
Urdu         63             5.2
Kannada      59             4.9
Malayalam    36             2.9
Punjabi      36             3
Assamese     24             2
Maithili     14             1.2
Sanskrit     0.025          0.3
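The tokenization point raised in the Introduction can be illustrated with a short sketch: splitting on whitespace recovers words in English but returns a single undivided token for text written without spaces. The function name is hypothetical, and the Chinese sentence is an illustrative input not taken from the paper; it stands in for any text in which word boundaries are not marked by spaces.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    # Naive tokenizer: split on runs of whitespace only.
    return text.split()


english = "I like reading books"
unsegmented = "我喜欢读书"  # illustrative sentence with no spaces between words

english_tokens = whitespace_tokenize(english)
unsegmented_tokens = whitespace_tokenize(unsegmented)

print(len(english_tokens), english_tokens)
print(len(unsegmented_tokens), unsegmented_tokens)
```

The second call yields one token for the whole sentence, which is why a tokenizer suited to the target language is needed instead of a whitespace rule.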
II. Related Work
Previously, the task of extractive QA on English datasets was popularized by
SQuAD: 100,000+ Questions for Machine Comprehension of Text (Rajpurkar et al.
[2016]). This paper outlined the creation process for an English dataset for the
task of extractive QA. The questions were posed by human annotators and based on
Wikipedia articles. Similarly, the dataset that we used was derived from
Wikipedia articles and had questions produced by human annotators. However, our
dataset is different because it is one of the first such datasets produced in
Hindi and Tamil. Currently, the best models for extractive question answering in
English are transformer models. Transformers are a computationally
efficient deep learning model that intro-