Electroencephalography (EEG) is a non-invasive method for recording electrical activity on the scalp, which has been shown to represent the macroscopic activity of the underlying brain. It is used in many studies to assess individuals’ health conditions, to study brain function in healthy individuals, and to diagnose diseases that alter the brain’s electrical activity, such as Parkinson’s disease, epilepsy, Alzheimer’s disease, sleep disorders, and schizophrenia (Soufineyestani et al., 2020).
EEG signals are known to have a low signal-to-noise ratio, which complicates their analysis. EEG noise is defined as any measured signal whose source is not the brain activity of interest (Urigüen and Garcia-Zapirain, 2015). Unfortunately, in most cases the EEG signal is contaminated by unwanted artefacts, even when care is taken to limit their occurrence during the recording session. These artefacts are entangled with the desired brain activity and can have an amplitude up to 100 times larger. The artefacts most commonly encountered in EEG recordings are ocular, muscular, cardiac, perspiration-related, and line noise (Luca; Urigüen and Garcia-Zapirain (2015) give more details).
Another difficulty encountered in EEG analysis is volume conduction, i.e. the transmission of electric fields from a primary current source through biological tissue towards the recording electrodes (Olejniczak, 2006). Because of volume conduction, unwanted artefacts affect a broader region and therefore contaminate more electrodes. In addition, we lose the ability to study a single source or brain region of interest: information is diluted, and the signal recorded at one electrode is a combination of the electrical activities present elsewhere (Urigüen and Garcia-Zapirain, 2015).
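The mixing effect described above is often written as a linear instantaneous model. The formulation below is a common textbook sketch rather than a model taken from the cited works, and the symbols (x, A, s, n, M, K) are notation assumed here for illustration only.

```latex
% Assumed notation (not from the source):
%   x(t) in R^M : signals recorded at the M scalp electrodes
%   s(t) in R^K : activities of K underlying sources (brain and artefact)
%   A in R^{M x K} : mixing matrix induced by volume conduction
%   n(t) in R^M : additive measurement noise
\mathbf{x}(t) = \mathbf{A}\,\mathbf{s}(t) + \mathbf{n}(t),
\qquad
x_i(t) = \sum_{k=1}^{K} a_{ik}\, s_k(t) + n_i(t), \quad i = 1, \dots, M .
```

Under this assumption, each electrode signal x_i(t) is a weighted sum of all sources, which is why a single artefact source with large coefficients a_ik can contaminate many electrodes at once.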
Parkinson’s disease diagnosis using EEG has been studied in several works. Cavanagh et al. (2018) use a selection of Fourier transform coefficients to achieve a maximum accuracy of 82 %. Note that our study uses the same data as Cavanagh et al. (2018). Oh et al. (2020) propose a fully automated approach based on a one-dimensional convolutional neural network (1-D CNN). The model directly classifies the temporal EEG epochs, achieving an accuracy of 88.2 %. To perform the diagnosis, Bhurane et al. (2019) rely on correlation coefficients calculated between channels, as well as the coefficients of an autoregressive (AR) model identified on the EEG, and report an accuracy of 99.1 %. Yuvaraj et al. (2018) use higher-order spectra, extracting thirteen features from the EEG frequency spectrum, and report an accuracy of 99.25 %. Han et al. (2013) use the coefficients of an AR model and the wavelet packet entropy to investigate whether parkinsonian and healthy individuals differ, without attempting to classify the subjects. Finally, Liu et al. (2017) use entropy-based features from 10 channels and a three-way decision model to obtain a classification accuracy of 92.9 %. This last study would have been more relevant if the authors had addressed the imbalance of their data set. We note that the majority of studies rely only on frequency features of the EEG and that few focus on temporal features, although the two domains should complement each other. Moreover, only a few of the features used are explainable, in the sense that their design rationale can be understood and used to draw conclusions for future work.
We strongly believe that some of the above-mentioned methods (Cavanagh et al., 2018; Oh et al., 2020; Bhurane et al., 2019; Yuvaraj et al., 2018) are subject to data leakage. Data leakage is defined as the use, during model training, of information that is not supposed to be available at the time of prediction (Kaufman et al., 2012). Such information would not be available in a real-life scenario, where we receive new, unlabelled samples that need to be categorised. Data leakage biases the evaluation of the model: it appears to perform well on the available data but performs poorly on new data. The first type of data leakage that some of the proposed methods suffer from is group leakage, where correlated data from the same subject are present in both the training and the test sets (Ayotte et al., 2021). In this case, and with limited amounts of data, a complex model such as the 1-D CNN can even learn to identify the subject’s signature. The second type of data leakage consists of optimising hyper-parameters and performing feature selection directly on the test set, i.e. without a validation set (Kaufman et al., 2012).
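As an illustration of how both kinds of leakage can be avoided, the sketch below partitions epochs at the subject level into training, validation, and test sets. It is a minimal example under assumed conditions, not the procedure used in this paper: the function name `subject_wise_split`, the split fractions, and the layout of 50 subjects with 10 epochs each are hypothetical.

```python
import random


def subject_wise_split(subject_ids, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Split epoch indices so that every subject lands in exactly one set.

    subject_ids[i] is the subject that epoch i was recorded from.
    Returns a dict mapping "train", "val" and "test" to lists of epoch indices.
    """
    # Shuffle the unique subjects, not the epochs, to prevent group leakage.
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)

    n_val = round(val_fraction * len(subjects))
    n_test = round(test_fraction * len(subjects))
    test_subjects = set(subjects[:n_test])
    val_subjects = set(subjects[n_test:n_test + n_val])

    split = {"train": [], "val": [], "test": []}
    for i, s in enumerate(subject_ids):
        if s in test_subjects:
            split["test"].append(i)    # held out until the final evaluation
        elif s in val_subjects:
            split["val"].append(i)     # used for hyper-parameter and feature selection
        else:
            split["train"].append(i)   # used to fit the model
    return split


# Hypothetical layout: 50 subjects with 10 epochs each (assumption for the demo).
subject_ids = [s for s in range(50) for _ in range(10)]
split = subject_wise_split(subject_ids)
subjects_in = {name: {subject_ids[i] for i in idx} for name, idx in split.items()}
```

Hyper-parameters and features would then be chosen using the validation subjects only, and the test subjects would be used once, for the final performance estimate.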
The aim of this paper is to propose a method for PD diagnosis using EEG signals recorded during a 3-oddball auditory task. The data at our disposal comprise N = 50 subjects, of whom 25 are patients with Parkinson’s disease. Our main focus is not to achieve the highest accuracy at any cost, but rather to develop a valid method with minimal bias. We aim to identify new biomarkers that go beyond the traditional EEG statistics and spectral content found in the literature by considering the combination of frequency content, dynamics, and temporal aspects of the EEG.