
where visual information can provide input about
the world, stimulating hypotheses about pertinent
aspects of the linguistic system (Andersen et al.,
1984; Pérez-Pereira and Conti-Ramsden, 2013).
The puzzling question of the role of sensory vs.
linguistic input in shaping our color perception
therefore remains open. In this work, we take a
step towards a better understanding of the
conceptual perception of the red and green colors
in red-green color-blind individuals, as mirrored in
their spontaneous linguistic production.
We perform the first (to the best of our knowledge)
large-scale computational study on the usage of
the "red" and "green" color terms in a (self-reported)
population with deutan and protan visual impairments.
Using a novel dataset of linguistic productions
by color-blind (CB) individuals, we show that
they use the "red" and "green" color terms in less
predictable contexts, and in linguistic environments
that evoke mental imagery to a lesser extent, compared
to normal-sighted (NS) authors.
The contribution of this study is, therefore,
twofold: First, we release a large, diverse, and
carefully curated dataset of linguistic productions
by red-green CB authors, accompanied by a cor-
pus of utterances by NS individuals, aligned on
various linguistic properties. Second, we show pre-
liminary evidence for subtle, yet reliably detected,
divergences in the usage of "red" and "green" by
CB speakers, compared to their NS counterparts.
We make the dataset and our code available to
facilitate future research in this field.1
2 Datasets
We collected the datasets used in this work from Reddit
– an online community-driven platform consisting
of numerous forums for news aggregation, content
rating, and discussions. As of 2021, it had over
430 million monthly active users, positioning it as
the sixth most popular social site in the US. Con-
tent entries are organized by areas of interest called
subreddits, ranging from main forums that receive
extensive attention to smaller ones that foster dis-
cussion on niche areas.
2.1 Collection of Posts by CB Users
Multiple subreddits allow their contributors to specify
a flair – a metadata attribute adding context to
the specific subreddit, such as country of origin,
political association, occupation, age, etc. We collected
the set of color-blind Reddit authors from
r/colorblind, considering only those self-reported
as having one of the red-green color blindness
types we study in this work: deuteranopia,
deuteranomaly, protanopia, and protanomaly. This
procedure resulted in 2,523 authors in total. Using
the collected list of user IDs, we were further able
to retrieve their entire digital footprint from Reddit,
spanning years 2005 through 2021.
1 Code is available at https://github.com/IBM/colorblind-language;
complying with Reddit’s terms of use, we provide a full
pipeline for reproducing the dataset (extraction and
filtering), rather than the data itself.
Manual inspection of utterances produced by
the color-blind Reddit users reveals that CB au-
thors occasionally discuss various aspects related
to the impairment, as in "this game’s color-scheme
is not a good fit for colorblind, I cannot tell red
from green". Aiming at the analysis of deficiency-
agnostic linguistic productions, we apply strict fil-
ters on user utterances, by excluding (1) sentences
originating from a manually collected list of sub-
reddits potentially related to the color blindness
phenomenon, and (2) sentences containing words
possibly indicative of the CB impairment, such
as "color", "colorblind", "vision", their inflections
and spelling alternatives (e.g., "colour"), to prevent
potential biases stemming from deficiency-related
discussions. The full list of excluded subreddits
can be found in Appendix A.1.
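The two exclusion filters described above can be sketched roughly as
follows. This is only an illustrative sketch, not the released pipeline:
the record format (subreddit, sentence), the function names, the
one-element subreddit set, and the regular expression are all assumptions
made for this example; the actual excluded subreddits are those listed in
Appendix A.1.

```python
import re

# Hypothetical placeholder: the real list of excluded subreddits is in Appendix A.1.
EXCLUDED_SUBREDDITS = {"colorblind"}

# Assumed pattern: "color"/"colour" and "vision" followed by any word characters,
# which also covers "colorblind", "colors", "coloured", etc. It deliberately
# over-matches (e.g. "Colorado"), erring on the side of exclusion.
CB_TERMS = re.compile(r"\b(?:colou?r\w*|vision\w*)\b", re.IGNORECASE)


def keep_sentence(subreddit: str, sentence: str) -> bool:
    """Return True if the sentence passes both filters."""
    # Filter (1): drop sentences originating from excluded subreddits.
    if subreddit.lower() in EXCLUDED_SUBREDDITS:
        return False
    # Filter (2): drop sentences containing words indicative of the impairment.
    return CB_TERMS.search(sentence) is None


def filter_sentences(records):
    """Keep only (subreddit, sentence) pairs that pass both filters."""
    return [(sub, sent) for sub, sent in records if keep_sentence(sub, sent)]
```

Under this sketch, a sentence such as "I bought a red car yesterday" posted
in an unrelated subreddit is retained, whereas any sentence mentioning
"colour" or "vision" is dropped regardless of where it was posted.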
2.2 Collection of Posts by NS Users
The comparative nature of our analysis requires
a collection of utterances produced by normal-sighted
Reddit authors. Assuming the relatively
low ratio of ∼8% of people with the CB deficiency
in the population (Wong, 2011), we sampled a large
set of posts and comments from the general population
of Reddit authors, excluding the (self-reported)
set of CB users. We believe that this approach
largely targets the language of NS authors due to
their large numbers and extensive diversity.
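As a rough illustration of this sampling step, the sketch below draws a
random sample of posts from a general pool after removing the self-reported
CB authors. The post schema (dicts with "author" and "body" keys), the
function name, and the fixed seed are assumptions made for this example and
are not taken from the released code.

```python
import random


def sample_ns_posts(posts, cb_authors, k, seed=0):
    """Sample up to k posts written by authors outside the CB author set.

    posts: list of dicts with "author" and "body" keys (assumed schema).
    cb_authors: set of user names self-reported as color-blind.
    """
    # Exclude every post written by a (self-reported) CB author.
    candidates = [p for p in posts if p["author"] not in cb_authors]
    # A seeded generator keeps the sample reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))
```

Note that this only removes authors who declared a CB flair; undeclared CB
authors may remain in the sample, which is why the resulting set is treated
as largely, rather than exclusively, NS language.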
Usage patterns of color terms in linguistic pro-
ductions can be affected by several dimensions: de-
mographic factors (gender, age), language modality
(spoken vs. written), linguistic register (formal vs.
informal), topical preferences, etc. Multiple works
have shown that there exist detectable differences
in the language of male and female speakers, and
that topical tendencies shape both the frequency
and contextual environment of word usage. Therefore,
we strove to create a control set of NS productions
that would be aligned with CB language