ALT: A software for readability analysis of Portuguese-language texts
Gleice Carvalho de Lima Moreno1,3, Marco P. M. de Souza2, Nelson Hein3, Adriana Kroenke Hein3
1Departamento de Ciências Contábeis, Universidade Federal de Rondônia, 76801-974, Porto Velho, Rondônia, Brasil
2Departamento de Física, Universidade Federal de Rondônia, 76900-726, Ji-Paraná, Rondônia, Brasil
3Programa de Pós-Graduação em Ciências Contábeis, Universidade Regional de Blumenau, 89030-903, Blumenau, Santa Catarina, Brasil
October 11, 2022
Since the earliest stages of human life, communication, understood as a process of social
interaction, has been the best way to reach consensus between parties. Understanding and
credibility in this process are essential for a mutual agreement to be validated. But how can
communication be made to reach the general public? This is the main challenge when the goal is
to disseminate information and have it accepted. In this context, this study presents the ALT
software, available on the web and built on readability metrics adapted from their original
formulas to the Portuguese language, to reduce communication difficulties. The development of
the software was motivated by Habermas's theory of communicative action, which takes a
multidisciplinary approach to measuring the credibility of discourse in the communication
channels used to build and maintain a safe and healthy relationship with the public.
1 Introduction
Jürgen Habermas is a German philosopher and sociologist of the Frankfurt School who devoted much
of his work to the study of democracy, most notably through the theory of communicative action,
published in 1981. This theory emphasizes how communication should occur, taking a
multidisciplinary view of credibility in the relationship between the system (economic and
political) and the lifeworld, or world of life (represented by people's prior knowledge, the
standards of society, and cultural knowledge). Based on the individual's innate language, he argued
that dialog should occur freely, with communicative rationality and critical analysis, in order to
reach the essence and apex of communication [1].
https://legibilidade.com/
arXiv:2210.00553v2 [cs.CL] 8 Oct 2022
Influenced by Habermas's theory of communicative action, this work was developed to reduce the
flaws that hinder the understanding of communications written in Portuguese. The most common
flaws are a lack of objectivity, clarity, and simplicity.
Communication is the main driver in the search for better relationships between people or groups
of people. Through dialog, whether written or oral, it is important to convey information in a way
that is backed by ethics and morals in order to persuade. When these criteria are followed, the
prospect of achieving credibility and, consequently, the agreement of the community is greater.
However, this is not always the case. Written communications are often aimed at a specific
audience. As a result, complex words (common within the group) and long sentences are the norm,
which makes reading difficult and prevents readers from other groups from understanding the text.
The message therefore does not achieve the expected reach and fails to reach a larger number of
people, whether laypersons or specialists. It is almost a game of chance, played by trial and
error [2].
In this sense, Habermas established, through the theory of communicative action, that the terms
used in a communication must satisfy four validity claims (intelligibility, sincerity, normative
correctness, and truth) in order to reach the summit (credibility) at this important and necessary
stage of human interaction.
The claim of intelligibility concerns whether the communication has been carried out clearly,
allowing easy understanding of what has been stated so that the parties can reach consensus. This
claim corresponds to the indicator of comprehensibility, which is met when communication is
effective. To measure it, several textual readability metrics have been developed over the last
decades. We deal with some of them in this Article.
The claim of sincerity concerns the disclosure of information in a detailed manner, honestly
presenting what has and has not been done. To measure this indicator, keywords are used to assess
whether the author(s) addressed the subject of the text, and whether they did so in detail.
The claim of normative correctness has as its principle compliance with established legal rules
or standards, dealing with the adequacy of the reports to a specific situation (environmental,
social, cultural, among others) or considering the reader's ability to respond to what is
proposed [5]. This indicator is measured by means of the readability metrics, which indicate the
intended audience for the communication.
The claim of truth, in turn, deals with the availability of reliable information, always
considering the truth of the facts. Reliability is fundamental to the communication process and is
assessed by comparing the guidelines actually observed with the guidelines suggested by
pre-established standards and rules.
According to Habermas's theory, if these claims are satisfied, the communication in the
relationship between the system (government and market) and the world of life (subjective,
normative and objective) will be more solid.
This has been seen as a major challenge for humanity, since it influences the communication
process in its various relations. Examples include the relationships between government and
taxpayers; between company and society; between manager and employees; between doctor and
patients; between scientist and public; and many others. In the effort to attenuate imperfections
in the communication process, particularly in its written form, some studies stand out as pioneers
in the field.
The field in question is readability, which aims to analyze how difficult a text is to understand.
Several studies were important for proposing ways to measure the degree of reading difficulty,
among them Flesch [6] in 1948, Gunning [7] in 1952, Smith and Senter [8] in 1967, Coleman and
Liau [9] in 1975, Kincaid and collaborators [10] in 1975, and the Gulpease index [11] in 1988, in
addition to others with scientific evidence.
Thus, in this work we present the ALT software: Textual Readability Analysis [12], a tool
developed to measure textual readability indices of Portuguese texts, using formulas adapted for
that language from the originals. Within Habermas's theory of communicative action, readability
indices can be used when the objective is to obtain quantitative data on the claims of
intelligibility and normative correctness. In addition, the software also offers a measure for the
claim of sincerity, which seeks to assess the completeness of the text by observing the frequency
with which chosen keywords are mentioned.
That said, the ALT program was built to meet two needs:
1. To enable the analysis of textual readability for texts written in Portuguese.
2. To fill an existing gap in the scientific environment, since researchers from several areas
develop studies focused on textual readability in Portuguese and end up using international
software based on readability indices not suited to this language.
Regarding the second point of the list above, it should be noted that even recent works,
published four years ago or less, applied textual readability indices in their original
(English-language) form. Among these studies are references [13, 14, 15, 16, 17], the first four
using the Flesch readability index and the last one the Gunning fog index. This is, of course, due
to the absence of studies adapting the foreign readability indices to the Portuguese language.
This Article is organized as follows: in Section 2 we present a brief review of the readability
indices of interest, including their original formulas. In Section 3 we show the algorithms
responsible for counting the characters, words, sentences, and syllables of any text. The ratios
characters/words, words/sentences and other combinations are the basic variables of the main known
readability indices. The adaptation of the formulas to the Portuguese language is shown in
Section 4. An overview of the ALT program and its features is given in Section 5. To assess ALT's
accuracy, in Section 6 we compare the indices obtained by ALT with those obtained by the original
formulas. Finally, we discuss the limitations of applying readability formulas in Section 7 and
conclude this Article in Section 8.
2 The Readability Indexes – original formulas
In this section we discuss the readability indexes used in the ALT program – textual readability
analysis.
2.1 Flesch reading ease
The Flesch reading ease index is one of the oldest methods capable of quantifying the “difficulty
understanding” texts, in the words of its creator, Rudolf Flesch [6]. The method was developed in
1943 and the formula was revised in 1948, normalized to a scale ranging from 0 (minimum
readability) to 100 (maximum readability). The formula, designed only for English texts, is given by
\[
\text{Flesch reading ease} = 206.835 - 1.015 \times \frac{\text{words}}{\text{sentences}} - 84.6 \times \frac{\text{syllables}}{\text{words}}. \tag{1}
\]
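As a minimal sketch, Equation (1) can be evaluated directly once the three counts are known. The function name and the sample counts below are illustrative assumptions; they are not taken from the ALT source code.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease, Eq. (1): higher scores indicate easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts for a 100-word sample (not data from the paper).
score = flesch_reading_ease(words=100, sentences=5, syllables=150)
print(f"Flesch reading ease: {score:.2f}")
```

With 20 words per sentence and 1.5 syllables per word, the score is about 59.6, roughly the middle of the 0 to 100 scale.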
Since Flesch's readability index was presented to the public many decades before the
popularization of personal computers, applying the formula to long articles or books was
impractical. For this reason, the author recommended sampling: syllables, words and sentences
could be counted in samples of 100 words, with three to five samples for an article and 25 to 30
for a book. The choice of samples could follow a fixed pattern, such as starting from the third
paragraph of each page [6].
With the popularization of computers, counting words and sentences in texts as long as books
became a very simple process. Counting syllables, however, remains a challenge. There is currently
no algorithm capable of providing, without errors, the number of syllables in a word, and even the
concept of the syllable is not a matter of consensus among linguists [18].
2.2 Gunning fog index
The Gunning fog index, often translated literally as Índice de Nevoeiro de Gunning, was developed
by Robert Gunning in 1952. Gunning, a consultant who worked for United Press, the Wall Street
Journal and Newsweek, was the first to present a readability formula that estimates the years of
formal education a person needs in order to understand a text without difficulty [7].
For example, a text with an index of 12 would be suitable for a reader with a high school
education, which corresponds to about 12 years of study (9 years of elementary school + 3 years of
high school). This scale, based on years of study rather than on an arbitrary 0 to 100 scale, is
now known as level of education or level of schooling (grade level).
The formula for obtaining the level of text instruction is given by
\[
\text{Gunning fog index} = 0.4 \times \frac{\text{words}}{\text{sentences}} + 40 \times \frac{\text{complex words}}{\text{words}}. \tag{2}
\]
As with the Flesch reading ease, the Gunning fog index is based on the concepts of “complex
sentences” and “complex words.” Complex sentences, in Gunning's sense, are those that are very
long in terms of the number of words: note that the first variable (words/sentences) is one of the
determinants of the grade level of the text. It is intuitive that long sentences, built with much
recursion [19], make reading difficult. It is not rare for even an educated reader to have to
reread a very long sentence in order to understand the information it contains.
As for complex words, Gunning defines them as those that contain three or more syllables. Proper
names, familiar jargon, and compound words should not be counted. It is interesting to note the
difference between Flesch and Gunning: while the former covers a broad spectrum of words, ranging
from the very simple (monosyllabic), through the moderate (disyllabic and trisyllabic), to the
very difficult (with 4 or more syllables), the latter treats words as either common or complex. In
this sense, the Gunning fog index considers that the words “message” and “heterozygote” have the
same degree of complexity.
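The binary treatment of words in Equation (2) can be sketched as follows. The helper names are hypothetical, the per-word syllable counts are assumed to come from some external syllable counter, and Gunning's exclusions (proper names, jargon, compound words) are deliberately not handled here.

```python
from typing import Iterable


def count_complex_words(syllables_per_word: Iterable[int]) -> int:
    """Count words with three or more syllables (Gunning's threshold)."""
    return sum(1 for n in syllables_per_word if n >= 3)


def gunning_fog_index(words: int, sentences: int, complex_words: int) -> float:
    """Gunning fog index, Eq. (2), expressed on the grade level scale."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.4 * (words / sentences) + 40.0 * (complex_words / words)


# Hypothetical sample: 100 words, 5 sentences, 10 complex words.
fog = gunning_fog_index(words=100, sentences=5, complex_words=10)
print(f"Gunning fog index: {fog:.1f}")
```

For this sample the index is 12, the high school level used as an example in the text. Note that a word with three syllables and a word with six contribute equally to the count.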
2.3 Automated readability index
The ARI (Automated readability index) was developed by E. A. Smith and R. J. Senter in 1967. As
pointed out in the seminal paper [8], its objective was to offer a readability index for books,
reports and technical manuals of the United States Air Force, with the purpose of reducing the
time needed to extract information from these documents.
Until 1967, there were at least three methods of obtaining textual readability indices. Two of
them were the Flesch reading ease and the Gunning fog index which, as already pointed out, use the
number of syllables to infer the complexity of a word. The other was the 1948 Dale-Chall
algorithm [20], which presented the index on a scale from 4.9 (or less) up to 9.9 points. The
problem with the latter was that the formula depended on comparing the text with a list of 763
“very common words”, such as yes and no. This index was quite useful for children's texts, since
the repertoire of words known to children is small. However, the formula was not suitable for
texts aimed at an adult public. It was at this juncture that the authors of the ARI method
presented a formula that avoided both the complexity of the syllable count in the Flesch reading
ease and the word list required by the Dale-Chall index. The Automated Readability Index, which is
based on the grade level scale, is given by
\[
\text{ARI} = -21.43 + 0.50 \times \frac{\text{words}}{\text{sentences}} + 4.71 \times \frac{\text{characters}}{\text{words}}. \tag{3}
\]
As can be seen, the Automated Readability Index is also based on the concepts of “complex words”
and “complex sentences.” The advantage of ARI is the ease with which these concepts can be
quantified, since it is enough to count the number of characters (strokes), words and sentences.
Counting these three variables is quite simple today with the help of a word processor such as
Microsoft Word.
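Because Equation (3) needs no syllable count, it reduces to a one-line computation. The sketch below uses an assumed function name and hypothetical counts, with the constant term taken as negative, as in the standard form of the index.

```python
def automated_readability_index(characters: int, words: int, sentences: int) -> float:
    """Automated Readability Index, Eq. (3), on the grade level scale."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return -21.43 + 0.50 * (words / sentences) + 4.71 * (characters / words)


# Hypothetical sample: 450 characters, 100 words, 5 sentences.
ari = automated_readability_index(characters=450, words=100, sentences=5)
print(f"ARI: {ari:.2f}")
```

With 20 words per sentence and 4.5 characters per word, the result is a grade level a little below 10.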
2.4 Flesch–Kincaid grade level
In 1975, J. Peter Kincaid and collaborators recalculated three readability indices (ARI, Gunning
fog index and Flesch reading ease) for texts related to the United States Navy. In addition, they
presented the Flesch formula rewritten on the grade level scale, now known as the Flesch–Kincaid
grade level (Nível de Instrução de Flesch-Kincaid):
\[
\text{Flesch–Kincaid grade level} = -15.59 + 0.39 \times \frac{\text{words}}{\text{sentences}} + 11.8 \times \frac{\text{syllables}}{\text{words}}. \tag{4}
\]
The conversion formula between the 0 to 100 scale and the grade level, obtained by eliminating
the words/sentences ratio between Equations (1) and (4), is given by
\[
\text{Flesch–Kincaid grade level} = 63.88 - 0.38424 \times (\text{Flesch reading ease}) - 20.7 \times \frac{\text{syllables}}{\text{words}}. \tag{5}
\]
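The consistency of Equation (5) with Equations (1) and (4) can be checked numerically. The sketch below, with assumed function names and hypothetical counts, computes the grade level both directly and through the conversion; the two values should agree up to the rounding of the published coefficients.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Eq. (1): score on the 0 to 100 scale."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade_level(words: int, sentences: int, syllables: int) -> float:
    """Eq. (4): grade level computed directly from the counts."""
    return -15.59 + 0.39 * (words / sentences) + 11.8 * (syllables / words)


def grade_level_from_reading_ease(reading_ease: float, words: int, syllables: int) -> float:
    """Eq. (5): grade level obtained from an existing Flesch reading ease score."""
    return 63.88 - 0.38424 * reading_ease - 20.7 * (syllables / words)


# Hypothetical sample: 100 words, 5 sentences, 150 syllables.
w, s, syl = 100, 5, 150
direct = flesch_kincaid_grade_level(w, s, syl)
converted = grade_level_from_reading_ease(flesch_reading_ease(w, s, syl), w, syl)
print(f"direct: {direct:.3f}, via Eq. (5): {converted:.3f}")
```

The coefficients of Equation (5) follow from 0.39 / 1.015 ≈ 0.38424, 0.38424 × 206.835 − 15.59 ≈ 63.88 and 11.8 − 0.38424 × 84.6 ≈ −20.7, so a small discrepancy in the second decimal place is expected.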
2.5 Coleman–Liau index
The Coleman–Liau index (Índice de Coleman-Liau) follows the criterion adopted by the ARI method,
in the sense that it was developed to be an index that is easy to implement computationally. Its
formula, created by Meri Coleman and T. L. Liau, is given by [9]
\[
\text{Coleman–Liau grade level} = -15.8 - 29.6 \times \frac{\text{sentences}}{\text{words}} + 5.88 \times \frac{\text{letters}}{\text{words}}. \tag{6}
\]
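A minimal sketch of the Coleman–Liau computation follows. It assumes the commonly cited form of the index, with coefficients 0.0588 and 0.296 applied to letters and sentences per 100 words, rescaled here to per-word ratios (5.88 and 29.6). The function name and the counts are illustrative.

```python
def coleman_liau_grade_level(letters: int, words: int, sentences: int) -> float:
    """Coleman-Liau grade level using per-word ratios (assumed rescaling)."""
    if words <= 0:
        raise ValueError("words must be positive")
    return -15.8 - 29.6 * (sentences / words) + 5.88 * (letters / words)


# Hypothetical sample: 450 letters, 100 words, 5 sentences.
cli = coleman_liau_grade_level(letters=450, words=100, sentences=5)
print(f"Coleman-Liau grade level: {cli:.2f}")
```

Because the sentence term uses sentences/words, shorter sentences (a larger ratio) lower the grade level, which is the same direction of effect as in ARI.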
This index is quite similar to ARI. The most notable difference is that sentence complexity is
inferred from the sentences/words ratio, which is the inverse of the ratio that appears in ARI and
in the other indices already presented (words/sentences).
2.6 Indice Gulpease
Developed by the Gruppo Universitario Linguistico Pedagogico (GULP) of the University of Rome La
Sapienza in 1987, the Indice Gulpease (Gulpease Index) provides a number for the readability of
texts in Italian. Also bounded by the 0 to 100 scale, its formula is given by
\[
\text{Indice Gulpease} = 89 + 300 \times \frac{\text{sentences}}{\text{words}} - 10 \times \frac{\text{letters}}{\text{words}}. \tag{7}
\]
This index also does not use the criterion of the number of syllables to delimit complex words.
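Equation (7) can be sketched in the same way as the previous indices. The function name and the counts are assumptions made for illustration only.

```python
def gulpease_index(letters: int, words: int, sentences: int) -> float:
    """Indice Gulpease, Eq. (7): higher values indicate easier text."""
    if words <= 0:
        raise ValueError("words must be positive")
    return 89 + 300 * (sentences / words) - 10 * (letters / words)


# Hypothetical sample: 450 letters, 100 words, 5 sentences.
gulpease = gulpease_index(letters=450, words=100, sentences=5)
print(f"Indice Gulpease: {gulpease:.1f}")
```

Like the Coleman–Liau index, it needs only letter, word and sentence counts, so no syllable counter is required.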
3 The algorithms
To count the number of characters, words, sentences, and syllables in the ALT software, the first
procedure is to store all the characters of the document in a vector with N components, which we
will call text, as in the example shown in Figure 1. These characters can include letters, numbers,
punctuation marks, other symbols available on the keyboard, and symbols found in other languages.
The count of the four variables mentioned is based on the analysis of the components of the text
vector, with the algorithms described in the sections below.
Figure 1: Storing the content of a document that starts with “This report presents ...” in the
text vector. In this example, the first five vector components are the characters T, h, i, s and a
blank space. An arbitrary k-th component of the vector is represented by text[k].
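The storage step illustrated in Figure 1 can be sketched in Python as follows. This is only an illustration of the idea, not ALT's actual code; note that Python lists are indexed from 0, whereas the figure numbers the components from 1.

```python
# Document content used in Figure 1 (truncated in the original caption).
document = "This report presents ..."

# Store every character of the document as one component of the vector.
text = list(document)
N = len(text)

# The figure's text[1]..text[5] correspond to text[0]..text[4] here.
first_five = text[:5]
print(first_five, N)
```

Every later count (characters, words, sentences, syllables) can then be obtained by a single pass over the N components of this list.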
3.1 Characters count
We consider all letters, both upper and lower case, and the - symbol (hyphen), as well as numbers,
signs, and other symbols as characters. We then define the variable qntCharacters, which
represents the total number of characters in the text. As the components of the text vector are
read, a function is called to check whether the symbol stored in the vector is a letter or a
hyphen. In either case, the qntCharacters variable is incremented by one.
At the end of the reading of the N components of the text vector, the qntCharacters variable will
contain the total number of characters in the document.
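The counting loop described above can be sketched as follows. The sketch assumes that only letters and hyphens increment the counter, following the check described in the text, and it uses Python's `str.isalpha` as the letter test; ALT's own test function is not shown in the source, so this choice is an assumption.

```python
from typing import Sequence


def count_characters(text: Sequence[str]) -> int:
    """Count the components of the text vector that are letters or hyphens."""
    qnt_characters = 0
    for symbol in text:
        # str.isalpha() is True for letters, including accented ones.
        if symbol.isalpha() or symbol == "-":
            qnt_characters += 1
    return qnt_characters


# Hypothetical input: spaces, the comma and the exclamation mark are skipped.
sample = list("Bem-vindo, leitor!")
total = count_characters(sample)
print(total)
```

If digits or other symbols should also be counted, the condition inside the loop is the only place that needs to change.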