Classification of cow diet based on milk Mid Infrared Spectra: a data analysis competition
at the “International Workshop of Spectroscopy and Chemometrics 2022”

Maria Frizzarin¹,², Giulio Visentin*³, Alessandro Ferragina⁴, Elena Hayes⁵,
Antonio Bevilacqua⁶, Bhaskar Dhariyal⁶, Katarina Domijan⁷, Hussain Khan⁴,
Georgiana Ifrim⁶, Thach Le Nguyen⁶, Joe Meagher²,⁸, Laura Menchetti⁹, Ashish Singh⁶,
Suzy Whoriskey²,⁸, Robert Williamson¹⁰, Martina Zappaterra¹¹, and Alessandro Casa¹²

¹ Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Ireland
² School of Mathematics and Statistics, University College Dublin, Ireland
³ Department of Veterinary Medical Sciences, University of Bologna, Italy
⁴ Teagasc Food Research Centre, Ashtown, Ireland
⁵ Teagasc, Food Research Centre, Moorepark, Ireland
⁶ School of Computer Science, University College Dublin, Ireland
⁷ Department of Mathematics and Statistics, National University of Ireland, Maynooth, Ireland
⁸ Insight Centre for Data Analytics, University College Dublin, Ireland
⁹ School of Biosciences and Veterinary Medicine, University of Camerino, Italy
¹⁰ School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, UK
¹¹ Department of Agricultural and Food Sciences, University of Bologna, Italy
¹² Faculty of Economics and Management, Free University of Bozen-Bolzano, Italy

Abstract

In April 2022, the Vistamilk SFI Research Centre organized the second edition of the
“International Workshop on Spectroscopy and Chemometrics – Applications in Food and
Agriculture”. Within this event, a data challenge was organized among the workshop
participants. The competition aimed to develop a prediction model to discriminate dairy
cows’ diet based on milk spectral information collected in the mid-infrared region. An
accurate and reliable discriminant model for dairy cows’ diet can give dairy processors an
important authentication tool to guarantee, for dairy food manufacturers, that products
originate from grass-fed animals. Different statistical and machine learning modelling
approaches were employed during the workshop, involving different pre-processing steps and
different degrees of complexity. The present paper describes the statistical methods
adopted by participants to develop such classification models.

Keywords: Chemometrics, Fourier transform mid-infrared spectroscopy, machine learning, milk
quality, food authenticity

* Corresponding author. Email: giulio.visentin@unibo.it

arXiv:2210.04479v1 [q-bio.QM] 10 Oct 2022

1 Introduction

The use of mid-infrared spectroscopy (MIRS) has become a relevant topic in agri-food
sciences, due to its capacity to routinely quantify a wide range of important
characteristics rapidly and cost-effectively. In particular, MIRS is nowadays commonly
employed to monitor and quantify milk quality parameters, such as concentrations of fat,
protein, casein, and lactose. These parameters are used for milk quality-based payment
schemes, for genetic and genomic selection, and as a support tool for farmers. Spectral
information generated from MIRS analysis has also proven to be effective in predicting fine
milk quality parameters, including protein fractions, free amino acids [Bonfatti et al.,
2011; McDermott et al., 2016], individual and groups of fatty acids [Soyeurt et al., 2006;
Fleming et al., 2017], milk processing traits [Ferragina et al., 2013; Visentin et al.,
2015], and animal-related characteristics [McParland et al., 2014; Shetty et al., 2017; Ho
et al., 2019], and it can be used as a tool for verifying the authenticity of agricultural
foods [Cozzolino, 2012]. A more extended list of applications of MIRS in the dairy science
framework can be found in the reviews by De Marchi et al. [2014] and Tiplady et al. [2020].

The two-day event “International Workshop on Spectroscopy and Chemometrics” was organized
by Vistamilk SFI Research Centre in April 2022, following its first edition held in 2021
[Frizzarin et al., 2021a]. The workshop brought together internationally recognised
researchers to describe the main challenges and applications of near and mid-infrared
spectroscopy in food, animal, and agricultural sciences. Moreover, participants were
provided, on a voluntary basis, with a large dataset for a chemometric data competition.
The dataset contained individual cow milk spectra, with the animal’s diet as the sole
additional information. Such data presented many methodological and statistical challenges,
due to the high dimensionality of the spectral matrices and the strong collinearity between
adjacent spectral wavelengths. The chemometric challenge therefore encouraged the
engagement of participants with different backgrounds and skills, and required the
application of different statistical and machine learning strategies.

The purpose of the data challenge was to develop a model to predict the diet fed to dairy
cows by exploiting mid-infrared spectral information. Participants, or groups of
participants, were required to apply their model to a test set containing only individual
milk spectra and to submit their predictions of the animals’ diet. Although participation
in the chemometric challenge was extremely high, only the best six contributions, in terms
of prediction accuracy and methodological innovativeness, were selected to present their
results both at the workshop and in the present manuscript.

2 Data description and challenge

A dataset consisting of 4,364 individual milk spectra from 120 cows was collected between
May and August in 2015, 2016 and 2017 [O’Callaghan et al., 2016]. The samples were from
Holstein Friesian cows of different parities from the Irish Dairy Research Herd at Teagasc
Moorepark, Fermoy, Co. Cork. Three dietary groups were evaluated with 54 cows being
assigned to each dietary group each year. The three diet treatments were grass (GRS), which
consisted of perennial ryegrass only; clover (CLV), which consisted of perennial ryegrass
with 20% annual clover sward; and total mixed ration (TMR), where cows were fed grass
silage, maize silage and concentrates while being maintained indoors for the full season.
Milk samples were collected at the morning (AM) and evening (PM) milking sessions;
subsequently, the AM and PM samples were pooled and analysed weekly using a Pro-FOSS FT6000
(FOSS). A total of 1060 transmittance data points in the region from 925 cm⁻¹ to
5,000 cm⁻¹ were collected.

The dataset was divided into training (3275 spectra) and test (1089 spectra) data. For the
test set only spectral information was provided, while diet information, to be used as the
classification variable, was available for the training set. The training data included
1094 spectra for GRS, 1120 spectra for CLV and 1061 spectra for TMR. There were no missing
values in the training or test set. The specific wavenumber values were not shared with the
participants.

The three dietary groups were carefully selected based on their characteristics. As
described by Frizzarin et al. [2021b], pasture-based diets are easily discriminated from
TMR diets, while discriminating between GRS and CLV diets is much more difficult due to the
similarities in the sward composition resulting in similar milk composition. However, with
the increased pressure to reduce fertilizer use, and the introduction of multi-species
swards, the development of a robust discriminant model for classifying milk spectra based
on diet is of paramount importance.

After the analysis, the participants submitted their predicted values for the test dataset
and a short explanation of the methodology used. The best methods were selected based on
the novelty of the contribution and on the accuracy of the predictions for the test
dataset. The accuracy was calculated as the number of correctly classified samples divided
by the total number of samples in the test dataset.
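As a small illustration of the evaluation metric described above, the following sketch computes the accuracy as the fraction of matching labels. The function name and the toy labels are illustrative assumptions and are not competition data.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Number of correctly classified samples divided by the total number of samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))


# Toy diet labels for illustration only (not taken from the challenge data).
y_true = ["GRS", "CLV", "TMR", "GRS", "CLV"]
y_pred = ["GRS", "GRS", "TMR", "GRS", "CLV"]
score = accuracy(y_true, y_pred)  # 4 of the 5 toy labels match
```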

3 Modelling approaches and results

3.1 Participant 1

The data were analyzed following different modelling strategies, focusing both on methods
that consider the ordering of the wavelengths and on methods that do not. The analyses were
mainly conducted using the Python libraries pandas, sklearn, sktime and matplotlib [see
Pedregosa et al., 2011, and references therein]; the code is available at
https://github.com/mlgig/vistamilk_diet_challenge.

As a first step, some descriptive statistics were computed and the outliers were removed,
following both the recommendations given prior to the competition and a visual inspection
of the data. In the subsequent step, the labeled dataset was split according to a 3-fold
cross-validation (3CV) strategy. The best model was selected based on its cross-validation
accuracy, then trained on the full training set and used to predict the labels of the
provided unlabeled test set.
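The selection procedure just described can be sketched as follows. This is a minimal illustration on synthetic data, not the participants' code. The stratified and shuffled folds, the `alpha` value, and all variable names are assumptions, since the text only specifies that 3-fold cross-validation was used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the labeled spectra (NOT the competition data):
# 300 samples, 50 "wavelengths", 3 diet classes encoded as 0, 1, 2.
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 50))
X[:, :5] += y[:, None]  # make the first five columns informative

candidates = {
    "ridge": RidgeClassifier(alpha=1.0),
    "lda": LinearDiscriminantAnalysis(),
}

# 3-fold cross-validation; stratification and shuffling are assumptions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Keep the candidate with the best mean CV accuracy and refit it on all labeled data.
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)

# Predict on an unlabeled set (synthetic here).
X_test = rng.normal(size=(10, 50))
test_predictions = best_model.predict(X_test)
```

Refitting on the full training set after model selection uses all labeled spectra for the final model, at the cost of having no held-out estimate for that final fit.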

In order to predict the diet, the following classification strategies were considered:

• Tabular models: each sample is considered as a vector of unordered features. In
  particular, Ridge Classifier and Linear Discriminant Analysis (LDA) were tested. These
  methods were subsequently coupled both with feature selection strategies and with random
  polynomial feature transformations. The latter approach, by generating new polynomial
  variables from the original ones, aimed to check whether non-linear interactions improved
  the classification accuracy. In particular, a new approach is presented which aims to
  diversify the polynomial features while keeping computational requirements low.

• Deep Neural Network Models: a family of approaches based on deep neural networks, both
  fully connected and convolutional, was tested. This strategy implicitly generates complex
  feature interactions, as captured by the network architecture.

Note that previously obtained results [Frizzarin et al., 2021a] suggest that tabular
methods work quite well with spectroscopy data. Moreover, following the suggestions in
Frizzarin et al. [2021b], feature selection strategies were coupled with the information
about the presence of water regions in the spectra. In addition, state-of-the-art time
series classification algorithms, such as ROCKET [Dempster et al., 2020], MiniROCKET
[Dempster et al., 2021], MrSQM [Nguyen and Ifrim, 2021, 2022] and FreshPrince [Middlehurst
and Bagnall, 2022], were tested. Lastly, ensemble methods were applied, aiming to mix time
series and tabular models in order to combine their predictions and strengths. Nonetheless,
the time series and ensemble approaches were outperformed by the ones listed above, so the
corresponding results are not shown in the next sections.

3.1.1 Tabular models, feature selection and transformation

Table 1 presents the results for the best tabular methods. Both the ridge classifier,
appropriately tuned, and LDA performed quite well, while being extremely fast to train.
Nonetheless, the selection of some specific wavelengths improved the accuracy further. In
fact, both the removal of the noisy water regions and the data-driven feature selection
(performed using the SelectFromModel routine in Python) provided better results.

Table 1: Accuracy results, evaluated on the 3-fold cross-validation, for the tabular
methods considered, coupled with feature selection strategies.

Method                                                      Accuracy
Ridge Classifier                                            0.760
LDA                                                         0.747
Feature Selection + Ridge Classifier                        0.777
Feature Selection + LDA                                     0.778
No water + Ridge Classifier                                 0.777
No water + LDA                                              0.783
Feature Selection + Polynomial Features + LDA               0.844
No water + Feature Selection + Polynomial Features + LDA    0.844

Figure 1: LDA visualisation for the model Feature Selection + Polynomial Features + LDA,
applied to the unlabeled test data to predict class labels.
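The two filtering steps mentioned above (dropping the water regions and data-driven selection with SelectFromModel) might look roughly like the sketch below. Several things here are assumptions rather than facts from the text: the wavenumber grid is taken to be evenly spaced between 925 and 5,000 cm⁻¹, the retained ranges reuse the region labels reported in Table 3 and interpret them as wavenumbers, and a ridge classifier with the default threshold is used as the selection model. The data are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifier

N_POINTS = 1060
# ASSUMPTION: evenly spaced grid; the actual wavenumbers are not given in the text.
wavenumbers = np.linspace(925.0, 5000.0, N_POINTS)

# ASSUMPTION: ranges to keep, read as cm^-1 from the region labels in Table 3.
KEEP_REGIONS = [(925.0, 1585.0), (1720.0, 2989.0)]


def region_mask(wn, regions):
    """Boolean mask that is True for wavenumbers inside any (low, high) range."""
    mask = np.zeros(wn.shape, dtype=bool)
    for low, high in regions:
        mask |= (wn >= low) & (wn <= high)
    return mask


rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 60)              # three synthetic diet classes
X = rng.normal(size=(180, N_POINTS))      # synthetic "spectra", not real data
X[:, :10] += y[:, None]                   # informative columns in the first kept range

# Step 1: manual filtering, keep only the columns outside the water regions.
mask = region_mask(wavenumbers, KEEP_REGIONS)
X_no_water = X[:, mask]

# Step 2: data-driven filtering; features whose coefficient magnitude is below
# the mean importance are discarded (default threshold for this estimator).
selector = SelectFromModel(RidgeClassifier(alpha=1.0))
X_selected = selector.fit_transform(X_no_water, y)
```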

Nevertheless, all these approaches remained just below 80% accuracy (at most 0.783 in
Table 1). To improve on this, the data were augmented with polynomial features of degree
two (using the sklearn method PolynomialFeatures(degree = 2)). This increased the accuracy
to 84.4%. The LDA component visualisation for the model with Feature Selection and
Polynomial Features, applied to the unlabeled test dataset, is shown in Figure 1, and a
good discrimination between the three classes is clearly visible.
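A pipeline of the kind named in Table 1 (Feature Selection + Polynomial Features + LDA) can be sketched with sklearn as below. Only `PolynomialFeatures(degree=2)`, SelectFromModel and LDA come from the text. The selection estimator, the cap of 20 retained features, and the synthetic data are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)          # three synthetic classes
X = rng.normal(size=(300, 60))         # synthetic data, not real spectra
X[:, :5] += y[:, None]

pipe = make_pipeline(
    # ASSUMPTION: keep the 20 features with the largest ridge coefficients.
    SelectFromModel(RidgeClassifier(), max_features=20, threshold=-np.inf),
    # Degree-two expansion as named in the text: bias, linear terms,
    # squares and pairwise products of the selected features.
    PolynomialFeatures(degree=2),
    LinearDiscriminantAnalysis(),
)
pipe.fit(X, y)

n_expanded = pipe.named_steps["polynomialfeatures"].n_output_features_
predictions = pipe.predict(X)
```

Selecting features before the expansion keeps the number of generated terms manageable: 20 inputs expand to 231 columns here.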

The improvements obtained with polynomial features come at a price in terms of
computational requirements. In fact, starting from the 1060 original wavelengths, the
addition of second-degree polynomial features resulted in a total number of variables that
made the model estimation task unfeasible. To address this issue, a new Random Polynomial
Features approach (RPolyTransformer in the following) was introduced in this work. The key
idea was to implement random sampling in the non-linear feature space. This led to relevant
advantages: the total number of features can be controlled, and both higher-degree (> 2)
polynomial features and complex mathematical functions (e.g., cosine, exp) can be
considered.
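To make the size of the full degree-two expansion concrete, the short calculation below counts the monomials of degree at most d in n variables, which is C(n + d, d) when the constant term is included. The helper name is hypothetical, and the memory figure assumes dense 64-bit floats for the 3275 training spectra.

```python
from math import comb


def n_poly_features(n_inputs, degree, include_bias=True):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables."""
    total = comb(n_inputs + degree, degree)
    return total if include_bias else total - 1


# All 1060 wavelengths, degree two:
# 1 constant + 1060 linear + 1060 squares + 561270 pairwise products.
n_full = n_poly_features(1060, 2)

# Dense float64 storage for the 3275 training spectra (assumed layout).
bytes_needed = 3275 * n_full * 8
gigabytes = bytes_needed / 1e9  # roughly 14.8 GB
```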

This strategy first generated K random arithmetic expressions (see Table 2 for some
examples), which were then used to compute K non-linear features. From the new and the
original features, K* variables were selected using SelectKBest from sklearn. The
hyperparameters K and K* were optimized via cross-validation in the final model (see the
final row of Table 3).

Table 2: Examples of RPolyTransformer features used. Here x_j denotes the j-th wavelength.

(x_32 * x_19) + x_103 − x_2
(x_102 * (x_78) + x_26)
(x_1 − x_150) + x_64 * x_4 * x_5

Table 3: Results for different combinations with RPolyTransformer. SelectFromModel and
SelectKBest are feature selection modules used to remove noise from the data (the former)
and to select the most discriminative non-linear features (the latter).

Method                                                          Accuracy

Region: FULL
RPolyTransformer + Ridge Classifier                             0.717
RPolyTransformer + LDA                                          0.619
SelectFromModel + RPolyTransformer + SelectKBest + LDA          0.848

Region: [925:1585, 1720:2989]
RPolyTransformer + Ridge Classifier                             0.805
RPolyTransformer + LDA                                          0.847
SelectFromModel + RPolyTransformer + SelectKBest + LDA          0.843

Region: [925:1585, 1720:2989, 3738:3807]
RPolyTransformer + Ridge Classifier                             0.811
RPolyTransformer + LDA                                          0.833
SelectFromModel + RPolyTransformer + SelectKBest + LDA          0.835

Optimized model
Region: [925:1585, 1720:2989]
RPolyTransformer(K = 17000) + SelectKBest(K* = 7000) + LDA      0.864
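The idea of sampling a fixed number of random non-linear features and then keeping the most discriminative ones can be sketched as follows. This is a simplified stand-in and not the authors' RPolyTransformer, whose actual implementation is in the repository cited earlier. The class `RandomProductFeatures` is hypothetical, it only generates random products of two or three columns (no sums, cosine or exp terms), and the values of K and K* are far smaller than the 17000 and 7000 reported in Table 3.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline


class RandomProductFeatures(TransformerMixin, BaseEstimator):
    """Hypothetical sketch: append K random monomials to the original columns."""

    def __init__(self, n_features=100, max_degree=3, random_state=0):
        self.n_features = n_features      # K: number of random features
        self.max_degree = max_degree      # largest number of factors per term
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X)
        rng = np.random.default_rng(self.random_state)
        # Each term is a random tuple of 2..max_degree column indices.
        self.terms_ = [
            rng.integers(0, X.shape[1], size=int(rng.integers(2, self.max_degree + 1)))
            for _ in range(self.n_features)
        ]
        return self

    def transform(self, X):
        X = np.asarray(X)
        new = np.column_stack([X[:, idx].prod(axis=1) for idx in self.terms_])
        # Original and new features are kept together for the later selection.
        return np.hstack([X, new])


rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 40))         # synthetic data, not real spectra
X[:, :5] += y[:, None]

pipe = make_pipeline(
    RandomProductFeatures(n_features=200, max_degree=3, random_state=0),  # K = 200
    SelectKBest(f_classif, k=50),                                         # K* = 50
    LinearDiscriminantAnalysis(),
)
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Because K is fixed in advance, the cost of the expansion is controlled by the user rather than growing with the number of input wavelengths, which is the advantage the text attributes to random sampling.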

Table 3 presents the results obtained with this method, again combined with different
classifiers and feature selection approaches, and tested on both the full data and the data
after water region removal. At first, when RPolyTransformer was combined directly with a
classifier on the full spectra, a marked drop in accuracy was observed compared with the
simple tabular models. Ridge was more accurate than LDA, but it was still far behind the
previous results. However, by carefully filtering the features, either automatically with
SelectFromModel or manually by removing the water regions, the results improved noticeably.
In these experiments, LDA outperformed Ridge consistently. Compared to the
PolynomialFeatures method, the one proposed here is faster (a few seconds versus a few
minutes) and just as accurate. However, the initial results without noise reduction (i.e.,
feature selection) suggest that this strategy is more sensitive to noise in the data.

3.1.2 Deep Learning Models

When considering deep learning models, the task of expanding the feature space and learning
feature interactions is completely deferred to the network, without requiring any feature
engineering steps. In turn, deep neural networks require a careful design process, both to
avoid overfitting and to identify the best model architecture and input modality.

The designed model architectures considered here can be grouped into two main categories,
namely, Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). FCNs do
not require any manipulation or adaptation of the input data, as each single wavelength is
treated as an independent feature and fed to an input unit. In contrast, CNNs require the
