Classification of cow diet based on milk Mid
Infrared Spectra: a data analysis competition
at the “International Workshop of
Spectroscopy and Chemometrics 2022”
Maria Frizzarin (1,2), Giulio Visentin (3), Alessandro Ferragina (4), Elena Hayes (5), Antonio
Bevilacqua (6), Bhaskar Dhariyal (6), Katarina Domijan (7), Hussain Khan (4), Georgiana Ifrim (6),
Thach Le Nguyen (6), Joe Meagher (2,8), Laura Menchetti (9), Ashish Singh (6), Suzy
Whoriskey (2,8), Robert Williamson (10), Martina Zappaterra (11), and Alessandro Casa (12)
(1) Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Ireland
(2) School of Mathematics and Statistics, University College Dublin, Ireland
(3) Department of Veterinary Medical Sciences, University of Bologna, Italy
(4) Teagasc Food Research Centre, Ashtown, Ireland
(5) Teagasc, Food Research Centre, Moorepark, Ireland
(6) School of Computer Science, University College Dublin, Ireland
(7) Department of Mathematics and Statistics, National University of Ireland, Maynooth, Ireland
(8) Insight Centre for Data Analytics, University College Dublin, Ireland
(9) School of Biosciences and Veterinary Medicine, University of Camerino, Italy
(10) School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, UK
(11) Department of Agricultural and Food Sciences, University of Bologna, Italy
(12) Faculty of Economics and Management, Free University of Bozen-Bolzano, Italy
Abstract
In April 2022, the Vistamilk SFI Research Centre organized the second edition of the
“International Workshop on Spectroscopy and Chemometrics – Applications in Food and
Agriculture”. Within this event, a data challenge was organized among the workshop
participants. The competition aimed to develop a prediction model to discriminate
dairy cows’ diet based on milk spectral information collected in the mid-infrared region.
An accurate and reliable discriminant model for dairy cows’ diet can provide dairy
processors with an important authentication tool to guarantee to dairy food manufacturers
that products originate from grass-fed animals. Different statistical and machine learning
modelling approaches were employed during the workshop, involving different pre-processing
steps and different degrees of complexity. The present paper describes the
statistical methods adopted by participants to develop such a classification model.
Keywords: Chemometrics, Fourier transform mid-infrared spectroscopy, machine learning,
milk quality, food authenticity
Corresponding author. Email: giulio.visentin@unibo.it
arXiv:2210.04479v1 [q-bio.QM] 10 Oct 2022
1 Introduction
The use of mid-infrared spectroscopy (MIRS) has become a relevant topic in agri-food sciences,
due to its capacity to routinely quantify a wide range of important characteristics rapidly and
cost-effectively. In particular, MIRS is nowadays commonly employed to monitor and quantify
milk quality parameters, such as concentrations of fat, protein, casein, and lactose. These
parameters are used for milk quality-based payment schemes, genetic and genomic selection,
and as a support tool for farmers. Spectral information generated from MIRS analysis has also
proven to be effective in predicting fine milk quality parameters, including protein fractions,
free amino acids [Bonfatti et al., 2011; McDermott et al., 2016], individual and groups of fatty
acids [Soyeurt et al., 2006; Fleming et al., 2017], milk processing traits [Ferragina et al., 2013;
Visentin et al., 2015], and animal-related characteristics [McParland et al., 2014; Shetty et al.,
2017; Ho et al., 2019], and can be used as a tool for verifying the authenticity of agricultural
foods [Cozzolino, 2012]. A more extended list of applications of MIRS in dairy science
can be retrieved from the reviews by De Marchi et al. [2014] and Tiplady et al.
[2020].
The two-day event “International Workshop on Spectroscopy and Chemometrics” was organized
by Vistamilk SFI Research Centre in April 2022, following its first edition held in 2021
[Frizzarin et al., 2021a]. The workshop focused on describing the main challenges and
applications of near and mid-infrared spectroscopy in food, animal, and agricultural sciences with
internationally recognised researchers. Moreover, participants were provided, on a voluntary
basis, with a large dataset containing individual cow milk spectra, with the animal’s diet as the
sole accompanying information, for a chemometric data competition. These data presented many
challenges from a methodological and statistical point of view, due to the high dimensionality of
the spectral matrices and the strong collinearity between adjacent spectral wavelengths. The
chemometric challenge, therefore, encouraged the engagement of participants with different
backgrounds and skills and required the application of different statistical and machine learning
strategies.
The purpose of the data challenge was to develop a model to predict the diet fed to dairy
cows by exploiting mid-infrared spectral information. Participants, or groups of participants,
were required to apply their developed model to a test set containing only individual milk spectra
and to submit their predictions of the animals’ diet. Although participation in the chemometric
challenge was extremely high, only the best six contributions, in terms of
accuracy of prediction and methodological innovativeness, were selected to present their results
both at the workshop and in the present manuscript.
2 Data description and challenge
A dataset consisting of 4,364 individual milk spectra from 120 cows was collected between May
and August in 2015, 2016 and 2017 [O’Callaghan et al., 2016]. The samples were from Holstein
Friesian cows of different parities from the Irish Dairy Research Herd at Teagasc Moorepark,
Fermoy, Co. Cork. Three dietary groups were evaluated, with 54 cows being assigned to each
dietary group each year. The three diet treatments were grass (GRS), which consisted of perennial
ryegrass only, clover (CLV), which consisted of perennial ryegrass with 20% annual clover sward,
and total mixed ration (TMR), where cows were fed grass silage, maize silage and concentrates
while being maintained indoors for the full season. Milk samples were collected at the morning
(AM) and evening (PM) milking sessions; AM and PM samples were subsequently pooled and
analysed weekly using a Pro-FOSS FT6000 (FOSS). A total of 1060 transmittance data points in the
region from 925 cm⁻¹ to 5,000 cm⁻¹ were collected.
The dataset was divided into training (3275 spectra) and test (1089 spectra) data; for the
latter only spectral information was provided, while diet information, to be used as the
classification variable, was available for the training set. The training data included 1094 spectra
for GRS, 1120 spectra for CLV and 1061 spectra for TMR. There were no missing values in the
training or test set. The specific information about the wavenumbers was not shared with
the participants.
The three dietary groups were carefully selected based on their characteristics. As described
by Frizzarin et al. [2021b], pasture-based diets are easily discriminated from TMR diets, while
discriminating between GRS and CLV diets is much more difficult due to the similarities in the
sward composition resulting in similar milk composition. However, with the increased pressure
to reduce fertilizer use, and the introduction of multi-species swards, the development of a robust
discriminant model for classifying milk spectra based on diet is of paramount importance.
After the analysis, the participants submitted their predicted values for the test dataset and
a short explanation of the methodology used. The best methods were selected based on the
novelty of the contribution and on the accuracy of the predictions for the test dataset. The
accuracy was calculated as the number of correctly classified samples divided by the total
number of samples in the test dataset.
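The accuracy criterion described above can be written out in a few lines. The following is a minimal sketch in plain Python; the function name `accuracy` and the toy labels are made up for illustration and are not part of the competition material.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted diet matches the true diet."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    # Count the positions where prediction and truth agree.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only (not competition data).
y_true = ["GRS", "CLV", "TMR", "GRS", "CLV"]
y_pred = ["GRS", "GRS", "TMR", "GRS", "CLV"]
toy_accuracy = accuracy(y_true, y_pred)  # 4 of the 5 toy samples agree
```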
3 Modelling approaches and results
3.1 Participant 1
The data were analyzed following different modelling strategies, focusing both on methods that
consider the ordering of the wavelengths and on methods that do not. The analyses were
mainly conducted using the Python libraries pandas, sklearn, sktime and matplotlib [see
Pedregosa et al., 2011, and references therein]; the code is available at
https://github.com/mlgig/vistamilk_diet_challenge.
As a first step, some descriptive statistics were computed, and the outliers were removed,
following both the recommendations given prior to the competition and a visual inspection of the
data. In the subsequent step, the labeled dataset was split according to a 3-fold cross-validation
(3CV) strategy. The best model was then selected based on cross-validation accuracy,
trained on the full training set and used to perform prediction on the provided unlabeled
test set.
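The selection procedure just described (compare candidates by 3-fold cross-validation accuracy, refit the winner on all labeled data, predict the unlabeled test set) might look roughly as follows with sklearn. This is only a sketch under stated assumptions: the data are synthetic stand-ins, the two candidates and their hyperparameters are illustrative, and the stratified, shuffled folds are an assumption, since the text only states that 3-fold cross-validation was used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the labeled spectra: three classes with shifted
# means, 100 samples each, 50 "wavelengths" (all values are invented).
n_per_class, n_features = 100, 50
X_train = np.vstack([rng.normal(loc=shift, size=(n_per_class, n_features))
                     for shift in (0.0, 0.5, 1.0)])
y_train = np.repeat(["GRS", "CLV", "TMR"], n_per_class)
X_test = rng.normal(loc=0.5, size=(20, n_features))  # unlabeled spectra

# Candidate models; alpha=1.0 is an arbitrary illustrative value.
candidates = {
    "ridge": RidgeClassifier(alpha=1.0),
    "lda": LinearDiscriminantAnalysis(),
}

# Assumption: stratified and shuffled folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cv_accuracy = {
    name: cross_val_score(model, X_train, y_train,
                          cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Refit the best candidate on the full training set, then predict.
best_name = max(cv_accuracy, key=cv_accuracy.get)
best_model = candidates[best_name].fit(X_train, y_train)
test_predictions = best_model.predict(X_test)
```

Because `cross_val_score` fits clones of each estimator, the final `fit` call is what actually trains the selected model on the full training set.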
In order to predict the diet, the following classification strategies were considered:
• Tabular models: each sample is considered as a vector of unordered features. In particular,
  Ridge Classifier and Linear Discriminant Analysis (LDA) were tested. These methods were
  then coupled both with feature selection strategies and with random polynomial feature
  transformations. The latter approach, by generating new polynomial variables
  from the original ones, aimed to check whether non-linear interactions improved the
  classification accuracy. In particular, a new approach is presented which aims to diversify
  polynomial features while keeping computational requirements low.
• Deep Neural Network Models: a family of approaches based on deep neural networks,
  both fully connected and convolutional, was tested. This strategy implicitly generates
  complex feature interactions, as captured by the network architecture.
Note that previously obtained results [Frizzarin et al., 2021a] suggest that tabular methods
work quite well with spectroscopy data. Moreover, following the suggestions in Frizzarin et al.
[2021b], feature selection strategies were coupled with the information about the presence of
water regions in the spectra. In addition, state-of-the-art time series classification algorithms,
such as ROCKET [Dempster et al., 2020], MiniROCKET [Dempster et al., 2021], MrSQM
[Nguyen and Ifrim, 2021, 2022] and FreshPrince [Middlehurst and Bagnall, 2022], were tested.
Lastly, ensemble methods were applied, mixing time series and tabular models in order to
combine their predictions and strengths. Nonetheless, the time series and ensemble approaches
were outperformed by the tabular and deep learning models listed above, so the corresponding
results are not shown in the next sections.
3.1.1 Tabular models, feature selection and transformation
In Table 1, results for the best tabular methods are presented. Both the ridge classifier,
appropriately tuned, and LDA performed quite well, while being extremely fast to train. Nonetheless,
the selection of some specific wavelengths seemed to improve the accuracy further. In fact, both
the removal of the noisy water regions and the data-driven feature selection (performed using
the SelectFromModel routine in Python) provided better results.

Table 1: Accuracy results, evaluated with 3-fold cross-validation, for the tabular methods
considered, coupled with feature selection strategies.
Method                                                      Accuracy
Ridge Classifier                                            0.760
LDA                                                         0.747
Feature Selection + Ridge Classifier                        0.777
Feature Selection + LDA                                     0.778
No water + Ridge Classifier                                 0.777
No water + LDA                                              0.783
Feature Selection + Polynomial Features + LDA               0.844
No water + Feature Selection + Polynomial Features + LDA    0.844

Figure 1: LDA visualisation for the model Feature Selection + Polynomial Features + LDA, applied to
the unlabeled test data to predict class labels.
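The two filtering steps mentioned above, dropping water regions and data-driven selection with SelectFromModel, can be illustrated as follows. This is a hedged sketch and not the participants' code. The wavenumber grid is assumed to be evenly spaced (the actual wavenumbers were not shared), the retained regions are borrowed from the region labels reported in Table 3, the estimator inside SelectFromModel is an assumption, and the spectra are synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Assumption: 1060 evenly spaced points between 925 and 5000 cm^-1.
wavenumbers = np.linspace(925.0, 5000.0, 1060)

# Regions to keep (cm^-1); everything else is treated as water/noise.
keep_regions = [(925.0, 1585.0), (1720.0, 2989.0)]
keep_mask = np.zeros(wavenumbers.shape, dtype=bool)
for low, high in keep_regions:
    keep_mask |= (wavenumbers >= low) & (wavenumbers <= high)

# Synthetic spectra: only the first 20 columns carry class information.
n_samples = 600
X = rng.normal(size=(n_samples, wavenumbers.size))
y = np.repeat([0, 1, 2], n_samples // 3)
X[:, :20] += y[:, None]

# Step 1: manual removal of the water regions.
X_no_water = X[:, keep_mask]

# Step 2: data-driven selection followed by LDA. Using RidgeClassifier as
# the selector's estimator is an assumption; its default threshold keeps
# features whose coefficient-based importance is at least the mean.
model = make_pipeline(
    SelectFromModel(RidgeClassifier(alpha=1.0)),
    LinearDiscriminantAnalysis(),
)
model.fit(X_no_water, y)
n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
train_predictions = model.predict(X_no_water)
```

Wrapping both steps in a pipeline means the selection is refit inside each cross-validation fold when the pipeline is passed to a routine such as `cross_val_score`, which avoids selecting features on held-out data.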
Nevertheless, all these approaches hover around 80% accuracy. In order to improve on this,
the data were augmented with polynomial features of degree two (using the sklearn
class PolynomialFeatures(degree = 2)). This led to an increase of the accuracy to 84.4%.
The LDA component visualisation for the model with Feature Selection and Polynomial Features,
applied to the unlabeled test dataset, is shown in Figure 1, and a good discrimination between
the three classes is clearly visible.
The improvements obtained when considering polynomial features come at a price in terms
of computational requirements. In fact, starting from the 1060 original wavelengths, the
addition of second-degree polynomial features resulted in a total number of variables which
made the model estimation task infeasible. To address this issue, a new Random
Polynomial Features approach (RPolyTransformer in the following) was introduced in this work.
The key idea was to implement random sampling in the non-linear feature space. This led to
relevant advantages, as the total number of features can be controlled and the approach can
consider both higher-degree (>2) polynomial features and complex mathematical functions
(e.g., cosine, exp).
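To give a sense of the scale involved, the number of columns produced by a full polynomial expansion can be counted directly. The snippet below is a small worked example using only the standard library; the helper name `n_polynomial_features` is made up here, and the count follows the standard formula for the number of monomials of total degree at most d in n variables, C(n + d, d), which matches what a full degree-d expansion with a bias column generates.

```python
from math import comb


def n_polynomial_features(n_inputs, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)
    # Without the constant (bias) column there is one monomial fewer.
    return total if include_bias else total - 1


# Full degree-2 expansion of the 1060 original wavelengths:
# 1 bias + 1060 linear + 1060 squares + 561270 pairwise products.
n_full = n_polynomial_features(1060, 2)  # 563391 columns
```

With more than half a million columns against a few thousand training spectra, it is easy to see why the expansion is only practical after feature selection or with a sampled subset of the non-linear terms.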
This strategy first generated K random arithmetic expressions (see Table 2 for some
examples), which are then used to compute K non-linear features. From the new and the original
features, a subset of variables is selected using SelectKBest from sklearn. The number of
generated expressions (K in RPolyTransformer) and the number of selected variables (K in
SelectKBest) were optimized via cross-validation in the final model (see the final row of Table 3).

Table 2: Examples of RPolyTransformer features used. Here x_j denotes the j-th wavelength.
(x_32 x_19) + x_103 x_2
(x_102 (x_78) + x_26)
(x_1 x_150) + x_64 x_4 x_5

Table 3: Results for different combinations with RPolyTransformer. SelectFromModel and SelectKBest
are feature selection modules to remove noise from data (the former) and select the most
discriminative non-linear features (the latter).
Method                                                      Accuracy
Region: FULL
RPolyTransformer + Ridge Classifier                         0.717
RPolyTransformer + LDA                                      0.619
SelectFromModel + RPolyTransformer + SelectKBest + LDA      0.848
Region: [925:1585, 1720:2989]
RPolyTransformer + Ridge Classifier                         0.805
RPolyTransformer + LDA                                      0.847
SelectFromModel + RPolyTransformer + SelectKBest + LDA      0.843
Region: [925:1585, 1720:2989, 3738:3807]
RPolyTransformer + Ridge Classifier                         0.811
RPolyTransformer + LDA                                      0.833
SelectFromModel + RPolyTransformer + SelectKBest + LDA      0.835
Optimized model
Region: [925:1585, 1720:2989]
RPolyTransformer(K = 17000) + SelectKBest(K = 7000) + LDA   0.864
In Table 3 the results obtained with this method, again combined with different classifiers and
feature selection approaches and tested with the full data and the data after water region removal,
are presented. At first, when combining RPolyTransformer directly with a classifier, a significant
drop in accuracy was observed compared with the simple tabular models. Ridge was more accurate
than LDA but it was still far behind the previous results. However, by carefully filtering the
features, either automatically with SelectFromModel or manually by removing the water regions,
the results improved noticeably. In these experiments, LDA outperformed Ridge consistently.
Compared to the PolynomialFeatures method, the one proposed here is faster (a few seconds
versus a few minutes) and just as accurate. However, the initial results without noise reduction
(i.e., feature selection) suggest that this strategy is more sensitive to noise in the data.
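The general idea of sampling a fixed number of random non-linear features, then keeping only the most discriminative ones before LDA, can be sketched as below. This is an illustrative approximation and not the participants' RPolyTransformer: the helper `make_random_products` is hypothetical, it only builds random products of two or three columns (no sums, cosine or exp terms), the scoring function `f_classif` is an assumption, and the data are synthetic.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer


def make_random_products(n_inputs, n_expressions, max_degree, seed):
    """Hypothetical helper: returns a function appending random column products."""
    rng = np.random.default_rng(seed)
    # Each expression multiplies between 2 and max_degree randomly drawn columns.
    terms = [rng.integers(0, n_inputs, size=rng.integers(2, max_degree + 1))
             for _ in range(n_expressions)]

    def transform(X):
        new_cols = [np.prod(X[:, idx], axis=1) for idx in terms]
        # Keep the original features and append the sampled non-linear ones.
        return np.hstack([X, np.column_stack(new_cols)])

    return transform


rng = np.random.default_rng(0)
n_samples, n_inputs = 300, 30
X = rng.normal(size=(n_samples, n_inputs))
# Three synthetic classes driven by an interaction between two columns.
y = np.digitize(X[:, 0] * X[:, 1], bins=[-0.3, 0.3])

n_expressions, n_kept = 200, 50  # illustrative values, not the tuned ones
expand = make_random_products(n_inputs, n_expressions, max_degree=3, seed=1)
expanded = expand(X)

model = make_pipeline(
    FunctionTransformer(expand),
    SelectKBest(f_classif, k=n_kept),
    LinearDiscriminantAnalysis(),
)
model.fit(X, y)
selected_count = int(model.named_steps["selectkbest"].get_support().sum())
predictions = model.predict(X)
```

The number of sampled expressions caps the size of the expanded matrix regardless of the polynomial degree, which is the property that keeps the cost under control compared with a full expansion.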
3.1.2 Deep Learning Models
When considering deep learning models, the task of expanding the feature space and learning
feature interactions is completely deferred to the network, without requiring any feature
engineering steps. In turn, deep neural networks require a careful design process, to avoid
overfitting and to identify the best model architecture and input modality.
The designed model architectures considered here can be grouped into two main categories,
namely, Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). FCNs
do not require any manipulation or adaptation of the input data, as each single wavelength
is treated as an independent feature and fed to an input unit. In contrast, CNNs require the