
ALT∗: A software for readability analysis of Portuguese-language texts

Gleice Carvalho de Lima Moreno1,3, Marco P. M. de Souza2, Nelson Hein3, Adriana Kroenke Hein3

1Departamento de Ciências Contábeis, Universidade Federal de Rondônia, 76801-974, Porto Velho, Rondônia, Brasil

2Departamento de Física, Universidade Federal de Rondônia, 76900-726, Ji-Paraná, Rondônia, Brasil

3Programa de Pós-Graduação em Ciências Contábeis, Universidade Regional de Blumenau, 89030-903, Blumenau, Santa Catarina, Brasil

October 11, 2022

From the earliest stage of human life, communication, seen as a process of social interaction, has always been the best way to reach consensus between parties. Understanding and credibility in this process are essential for mutual agreement to be validated. But how can this be done so that the communication reaches the general public? This is the main challenge when the goal is to disseminate information and have it approved. In this context, this study presents the ALT software, built on original readability metrics adapted to the Portuguese language and available on the web, to reduce communication difficulties. The development of the software was motivated by Habermas's theory of communicative action, which takes a multidisciplinary approach to measuring the credibility of discourse in the communication channels used to build and maintain a safe and healthy relationship with the public.

——————

No estágio inicial da vida humana a comunicação, vista como um processo de interação social, foi sempre o melhor caminho para o consenso entre as partes. O entendimento e a credibilidade nesse processo são fundamentais para que o acordo mútuo seja validado. Mas, como fazê-lo de forma que essa comunicação alcance a grande massa? Esse é o principal desafio quando o que se busca é a difusão da informação e a sua aprovação. Nesse contexto, este estudo apresenta o software ALT, desenvolvido a partir de métricas de legibilidade originais adaptadas para a Língua Portuguesa, disponível na web, para reduzir as dificuldades na comunicação. O desenvolvimento do software foi motivado pela teoria do agir comunicativo de Habermas, que faz uso de um estilo multidisciplinar para medir a credibilidade do discurso nos canais de comunicação utilizados para construir e manter uma relação segura e saudável com o público.

1 Introduction

Jürgen Habermas is a German philosopher and sociologist of the Frankfurt School who devoted much of his work to the study of democracy, notably through the theory of communicative action, published in 1981. This theory emphasizes how communication should occur, treating in a multidisciplinary way the credibility of the relationship between the system (economic and political) and the world of life (represented by people's prior knowledge, the standards of society, and cultural knowledge). Starting from the individual's innate language, he argued that dialog should occur freely, with communicative rationality and critical analysis of the interaction, in order to reach the essence and apex of communication [1].

∗https://legibilidade.com/

arXiv:2210.00553v2 [cs.CL] 8 Oct 2022

In view of this, and influenced by Habermas's theory of communicative action, this work was developed to reduce the flaws that prevent the understanding of communications written in Portuguese. The most common flaws are a lack of objectivity, clarity, and simplicity.

Communication is the main driver in the search for a better relationship between people or groups of people. In dialog, whether written or oral, it is important that the transmission of information be backed by ethics and morals in order to be persuasive. When these criteria are followed, the prospect of achieving credibility and, consequently, the agreement of the community is greater.

However, this is not always the case. Written communications are often aimed at a specific audience. As a result, complex words (common within the group) and long sentences are typical, which makes reading difficult and prevents readers from other groups from understanding the text. The message therefore does not reach the expected range, and it fails to reach a larger number of people (whether laypersons or specialists). It is almost a game of chance, played by trial and error [2].

In this sense, Habermas argued, through the theory of communicative action, that the terms used in a communication must satisfy four validity claims (intelligibility, sincerity, normative correction and truth) in order to reach the summit (credibility) at this important and necessary stage of human interaction.

The claim of intelligibility concerns a communication process that has been carried out clearly, allowing an easy understanding of what has been stated so that the parties can reach consensus. This claim therefore refers to the indicator of comprehensibility, which is met when communication achieves its intended degree of effectiveness. Several textual readability metrics have been developed over the last decades to measure it. We deal with some of them in this Article.

As for the claim of sincerity, it concerns the disclosure of information in a detailed manner, presenting honestly what has and has not been done. To measure this indicator accurately, keywords should be used to assess whether, and in how much detail, the author(s) dealt with the subject the text is intended to cover.

Regarding the claim of normative correction, its principle is compliance with established legal rules or standards. It deals with the adequacy of reports to a specific situation (environmental, social, cultural, among others) or with the reader's ability to respond to what is proposed [ ]. This indicator is measured by means of readability metrics, which indicate the intended audience of the communication.

The claim of truth, in turn, deals with the availability of reliable information, always considering the truth of the facts. Reliability is fundamental to the communication process and is obtained from the relationship between the guidelines actually observed and the guidelines suggested by pre-established standards (norms and rules).

According to Habermas's theory, if these four claims are satisfied, the communication in the relationship between the system (government and market) and the world of life (subjective, normative and objective) will be more solid.

This has been seen as a major challenge for humanity, since it influences the communication process in its various relations. Examples include the relationships between government and taxpayers, between company and society, between manager and collaborators, between doctor and patients, between scientist and public, and many others. Among the efforts to attenuate the imperfections of the communication process, particularly in written form, some studies stand out as pioneers in this field.

The field of study in question is readability, which aims to analyze how difficult a text is to understand. Some studies, such as Flesch [ ] in 1948, Gunning [ ] in 1952, Smith and Senter [ ] in 1967, Coleman and Liau [ ] in 1975, Kincaid [ ] and collaborators in 1975, and the Gulpease index [ ] in 1988, in addition to others with scientific evidence, were important for proposing ways to measure the degree of reading difficulty.

Thus, in this work we present the ALT software: Textual Readability Analysis [ ], a tool developed to measure textual readability indexes of Portuguese texts, using formulas adapted for that language from the originals. Within Habermas's theory of communicative action, readability indices can be used when the objective is to obtain quantitative data on the claims of intelligibility and normative correction. In addition, the software also covers the claim of sincerity, which seeks to measure the completeness of the text by observing how often the chosen keywords are mentioned.

That said, the ALT program was built to meet two needs:

1. To enable the analysis of textual readability for texts written in Portuguese.

2. To fill an existing gap in the scientific environment, since researchers from several areas develop studies focused on textual readability in Portuguese and end up using international software based on readability indexes not suitable for this language.

Still on the second point of the list above, it should be noted that even recent works, published within the last four years, used textual readability indices designed for their original language (English). Among these studies are references [ ]; the first four use the Flesch readability index and the last one the Gunning fog index. This is due to the absence of studies adapting the foreign readability indexes to the Portuguese language.

This Article is organized as follows: in Section 2 we present a brief review of the readability indices of interest, including their original formulas. In Section 3 we show the algorithms responsible for counting the characters, words, sentences, and syllables of any text. The ratios characters/words, words/sentences and other combinations are the basic variables of the main known readability indexes. The adaptation of the formulas to the Portuguese language is shown in Section 4. An overview of the ALT program and its features is given in Section 5. To assess ALT's degree of accuracy, in Section 6 we compare the indices obtained by ALT with those obtained by the original formulas. Finally, we discuss the limitations of applying readability formulas in Section 7 and conclude this Article in Section 8.

2 The Readability Indexes – original formulas

In this section we discuss the readability indexes used in the ALT program – textual readability

analysis.

2.1 Flesch reading ease

The Flesch reading ease index is one of the oldest methods capable of quantifying the “difficulty of understanding” texts, in the words of its creator, Rudolf Flesch [ ]. The method was developed in 1943 and the formula was revised in 1948, when it was normalized to a scale ranging from 0 (minimum readability) to 100 (maximum readability). The formula, intended only for English texts, is given by

\[
\text{Flesch reading ease} = 206.835 - 1.015 \times \frac{\text{words}}{\text{sentences}} - 84.6 \times \frac{\text{syllables}}{\text{words}}. \tag{1}
\]

Since Flesch's readability index was presented to the public many decades before the popularization of personal computers, applying the formula to long articles or books was impractical. For this reason, the author recommended sampling: syllables, words and sentences could be counted in samples of 100 words, with three to five samples for an article and 25 to 30 samples for a book. The choice of samples could follow a fixed pattern, such as starting from the third paragraph of each page, for example [6].

With the popularization of computers, counting words and sentences, even in texts as long as books, became a very simple process. The syllable count, however, remains a challenge. There is currently no algorithm capable of providing, without errors, the number of syllables of a word, and the very concept of a syllable is not a consensus among linguists [ ].

2.2 Gunning fog index

The Gunning fog index, often translated literally as Índice de Nevoeiro de Gunning, was developed by Robert Gunning in 1952. Gunning, a consultant who came to work for United Press, the Wall Street Journal and Newsweek, was the first to present a readability formula that estimates the years of formal education a person needs in order to understand a text without difficulty [ ]. For example, a text with an index of 12 would be suitable for a reader with a high school education, which corresponds to about 12 years of study (9 years of elementary school + 3 years of high school). This scale, based on years of study rather than on the arbitrary centigrade scale, is now known as grade level (level of education or level of schooling).

The formula for obtaining the grade level of a text is given by

\[
\text{Gunning fog index} = 0.4 \times \frac{\text{words}}{\text{sentences}} + 40 \times \frac{\text{complex words}}{\text{words}}. \tag{2}
\]

As with the Flesch readability index, the Gunning fog index is based on the concepts of “complex sentences” and “complex words”. Complex sentences, in Gunning's sense, are those that are very long in number of words: note that the first variable (words/sentences) is one of the determinants of the grade level of the text. It is intuitive that long sentences, built with much recursion [ ], make reading difficult. It is not rare for even an educated reader to have to reread a very long sentence in order to understand the information it contains.

As for complex words, Gunning defines them as those that contain three or more syllables. Proper names, familiar jargon, and compound words should not be counted. It is interesting to note the difference between Flesch and Gunning: while the former encompasses a broad spectrum of words, ranging from the very simple (monosyllabic), through the moderate (disyllabic and trisyllabic), to the very difficult (with 4 or more syllables), the latter treats words as either common or complex. In this sense, the Gunning fog index considers that the words “message” and “heterozygote” have the same degree of complexity.
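The following sketch evaluates Equation (2) from counts that are assumed to be already available. The function name and the sample numbers are hypothetical, and the number of complex words (three or more syllables, after Gunning's exclusions) is taken as an input rather than computed.

```python
def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Original Gunning fog index, Equation (2), on the grade level scale."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    # complex_words: words with three or more syllables, counted by the caller.
    return 0.4 * (words / sentences) + 40 * (complex_words / words)


# Hypothetical sample: 100 words, 5 sentences, 10 complex words.
print(round(gunning_fog(words=100, sentences=5, complex_words=10), 2))
```

With these sample counts, the sentence term contributes 8 points and the complex-word term 4 points, which gives the 12-year level used as an example in the text.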

2.3 Automated readability index

The ARI – Automated readability index – was developed by E. A. Smith and R. J. Senter in 1967. As pointed out in the seminal paper [ ], its objective was to offer a readability index for books, reports and technical manuals of the United States Air Force, in order to reduce the time needed to extract information from these documents.

By 1967, there were at least three methods of obtaining textual readability indices. Two of them were the Flesch readability index and the Gunning fog index which, as already pointed out, used the number of syllables to infer the complexity of a word. The other was the 1948 Dale-Chall algorithm [ ], which presented the index on a scale from 4.9 (or less) up to 9.9 points. The problem with the latter was that the formula depended on comparing the text with a list of 763 “very common words”, such as yes and no. This index was quite useful for children's texts, since the repertoire of words known to children is small. However, the formula was not suitable for texts aimed at adults. It was at this juncture that the authors of the ARI method presented a formula that avoided both the complexity of the syllable count in the Flesch readability index and the reliance on a word list in the Dale-Chall index. The Automated readability index, which is based on the grade level scale, is given by

\[
\text{ARI} = -21.43 + 0.50 \times \frac{\text{words}}{\text{sentences}} + 4.71 \times \frac{\text{characters}}{\text{words}}. \tag{3}
\]

The Automated readability index is thus also based on the concepts of “complex words” and “complex sentences”. The advantage of ARI is the ease with which these concepts can be quantified, since it is enough to count the number of characters (keystrokes), words and sentences. Counting these three variables is quite simple today with the help of a word processor such as Microsoft Word and the like.
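To illustrate how little is needed to evaluate Equation (3), the sketch below pairs the formula with a deliberately naive counter. The helper `naive_counts` is an assumption made only for this example (whitespace-separated tokens, letters and digits as characters, runs of `.`, `!` or `?` as sentence ends); it is not the ALT counting procedure.

```python
import re


def ari(characters: int, words: int, sentences: int) -> float:
    """Automated readability index, Equation (3)."""
    return -21.43 + 0.50 * (words / sentences) + 4.71 * (characters / words)


def naive_counts(text: str) -> tuple[int, int, int]:
    """Rough (characters, words, sentences) counts; illustrative only."""
    # A "word" is any whitespace-separated token containing a letter or digit.
    tokens = [t for t in text.split() if any(c.isalnum() for c in t)]
    # Only letters and digits inside words are counted as characters.
    characters = sum(c.isalnum() for t in tokens for c in t)
    # Each run of terminal punctuation closes one sentence (at least one).
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    return characters, len(tokens), sentences


chars, words, sents = naive_counts("The cat sat. The dog ran!")
print(chars, words, sents, round(ari(chars, words, sents), 2))
```

Very short, simple samples such as this one produce a negative grade level, a reminder that the formula was calibrated on technical documents rather than on toy sentences.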

2.4 Flesch–Kincaid grade level

In 1975, J. Peter Kincaid and collaborators recalculated three readability indices (ARI, Gunning fog index and Flesch reading ease) for texts related to the United States Navy. In addition, the Flesch formula rewritten on the grade level scale, now known as the Flesch–Kincaid grade level (Nível de Instrução de Flesch-Kincaid), was also presented:

\[
\text{Flesch–Kincaid grade level} = -15.59 + 0.39 \times \frac{\text{words}}{\text{sentences}} + 11.8 \times \frac{\text{syllables}}{\text{words}}. \tag{4}
\]


The conversion formula between the centigrade scale (from 0 to 100) and the grade level, obtained by manipulating Equations (1) and (4), is given by

\[
\text{Flesch–Kincaid grade level} = 63.88 - 0.38424 \times (\text{Flesch reading ease}) - 20.7 \times \frac{\text{syllables}}{\text{words}}. \tag{5}
\]

2.5 Coleman–Liau index

The Coleman–Liau index (Índice de Coleman-Liau) follows the criterion adopted by the ARI method, in the sense that it was developed to be an index that is easy to implement computationally. Its formula, created by Meri Coleman and T. L. Liau, is given by [9]

\[
\text{Coleman–Liau grade level} = -15.8 - 29.6 \times \frac{\text{sentences}}{\text{words}} + 5.88 \times \frac{\text{letters}}{\text{words}}. \tag{6}
\]

This index is quite similar to ARI. The most notable difference is that it infers the complexity of sentences from the sentences/words ratio, which is the inverse of the ratio that appears in ARI and in the other indexes already presented (words/sentences).
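A minimal sketch of the Coleman–Liau computation follows. It assumes the coefficients of the formula as it is usually published, 0.0588 L − 0.296 S − 15.8 with L letters and S sentences per 100 words, rewritten here in terms of per-word ratios (5.88 × letters/words and 29.6 × sentences/words). The function name and the sample counts are hypothetical.

```python
def coleman_liau(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau grade level from per-word ratios (assumed coefficients)."""
    if words <= 0:
        raise ValueError("words must be positive")
    # More sentences per word (shorter sentences) lowers the grade level;
    # more letters per word (longer words) raises it.
    return -15.8 - 29.6 * (sentences / words) + 5.88 * (letters / words)


# Hypothetical sample: 450 letters, 100 words, 5 sentences.
print(round(coleman_liau(letters=450, words=100, sentences=5), 2))
```

Because the sentence term uses sentences/words, its sign is negative, whereas the words/sentences term of ARI is positive.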

2.6 Indice Gulpease

Developed by the Gruppo Universitario Linguistico Pedagogico (GULP) of the University of Rome La Sapienza in 1987, the Indice Gulpease (Gulpease Index) provides a number for the readability of texts in Italian. Also expressed on the centigrade scale, its formula is given by

\[
\text{Indice Gulpease} = 89 + 300 \times \frac{\text{sentences}}{\text{words}} - 10 \times \frac{\text{letters}}{\text{words}}. \tag{7}
\]

This index also does not use the criterion of the number of syllables to delimit complex words.
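Equation (7) can be sketched in the same way as the previous indexes. The function name and the sample counts are illustrative assumptions.

```python
def gulpease(letters: int, words: int, sentences: int) -> float:
    """Indice Gulpease, Equation (7); higher values mean easier texts."""
    if words <= 0:
        raise ValueError("words must be positive")
    return 89 + 300 * (sentences / words) - 10 * (letters / words)


# Hypothetical sample: 450 letters, 100 words, 5 sentences.
print(round(gulpease(letters=450, words=100, sentences=5), 2))
```

Like the Coleman–Liau index, it needs only letter, word and sentence counts, so no syllable counter is required.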

3 The algorithms

To count the number of characters, words, sentences, and syllables in the ALT software, the first step is to store all the characters of the document in a vector with N components, which we will call text, as in the example shown in Figure 1. These characters can include letters, numbers, punctuation marks, other symbols available on the keyboard, and symbols found in other languages. The counting of the four variables mentioned above is based on the analysis of the components of the text vector, using the algorithms described in the sections below.

Figure 1: Storing the content of a document that starts with “This report presents ...” in the text vector. In this example, the first five vector components are the characters T, h, i, s and a blank space. An arbitrary k-th component of the vector is represented by text[k].

3.1 Characters count

We consider all letters, both upper and lower case, and the - symbol (hyphen), as well as numbers, signs, and other symbols as characters. We then define the variable qntCharacters, which represents the total number of characters in the text. As the components of the text vector are read, a function is called to check whether the symbol stored in the component is a letter or a hyphen. In either case, the qntCharacters variable is incremented by one unit.

At the end of the reading of the N components of the text vector, the qntCharacters variable will contain the total number of characters in the document.
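The counting loop described above can be sketched as follows. This is only an illustration of the letter-or-hyphen test, not the ALT source code: the name `qnt_characters` mirrors qntCharacters, the function names are hypothetical, and Python's `str.isalpha()` is assumed to be an acceptable letter test (it also accepts accented letters such as ç and ã).

```python
def is_letter_or_hyphen(symbol: str) -> bool:
    """Test applied to each component of the text vector."""
    return symbol.isalpha() or symbol == "-"


def count_characters(document: str) -> int:
    # Store every character of the document in a vector.
    text = list(document)
    qnt_characters = 0
    # Read the N components and increment the counter by one unit
    # whenever the component is a letter or a hyphen.
    for symbol in text:
        if is_letter_or_hyphen(symbol):
            qnt_characters += 1
    return qnt_characters


print(count_characters("This report presents ..."))
print(count_characters("guarda-chuva"))
```

In this sketch, blank spaces, digits and punctuation are stored in the vector but do not increment the counter.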
