A systematic review of the use of Deep Learning in Satellite Imagery for Agriculture

Brandon Victor, Aiden Nibali, Zhen He
Abstract—Agricultural research is essential for increasing food
production to meet the needs of a rapidly growing human
population. Collecting large quantities of agricultural data helps
to improve decision making for better food security at various
levels: from international trade and policy decisions, down to
individual farmers. At the same time, deep learning has seen
a wave of popularity across many different research areas and
data modalities. Meanwhile, satellite imagery has become available in
unprecedented quantities, driving much research in the wider
remote sensing community. The data-hungry nature of deep
learning models and this huge data volume seem like a perfect
match. But has deep learning been adopted for agricultural tasks
using satellite images? This systematic review of 193 studies
analyses the tasks that have reaped benefits from deep learning
algorithms, and those that have not. It was found that while
Land Use / Land Cover research has embraced deep learning
algorithms, research on other agricultural tasks has not. This
poor adoption appears to be due to a critical lack of labelled
datasets for these other tasks. Thus, we give suggestions for
collecting larger datasets. Additionally, satellite images differ
from ground-based images in a number of ways, resulting in a
proliferation of interesting data interpretations unique to satellite
images. So, this review also introduces a taxonomy of data
input shapes and how they are interpreted in order to facilitate
easier communication of algorithm types and enable quantitative
analysis.
Index Terms—Systematic Review, Deep learning, Satellite im-
agery, Agriculture, Computer Vision
I. INTRODUCTION
THERE are big agricultural challenges coming. The human
population is expected to increase significantly in the next
decades [122], which will require an estimated global yield
increase of 25-70% [47]. At the same time, a changing climate
brings additional challenges [118] and the need to reduce the
environmental impact of agriculture [38].
Collecting more agricultural data is one path to improving
global food production and distribution. Remote sensing is
an extremely useful tool for this purpose because it provides a non-
destructive and non-intrusive way to monitor agricultural fields
simultaneously at a fine level of detail and across wide areas
and times. This makes it a technology that can be used for
fast and targeted interventions at a farm-level [41], regional
studies of ecological change over time [99], county-level yield
prediction for logistics [32] and international trade decisions
[75]. To support these applications, there are many sources of worldwide
satellite imagery freely available to the public. The most pop-
ular for agricultural purposes are: Sentinel [120, 25], Landsat
[100] and MODIS. Each of these satellite programs stores and
Brandon Victor, Aiden Nibali and Zhen He work within the School of
Computing, Engineering and Mathematical Sciences at La Trobe University,
Melbourne, Victoria, Australia
manages enormous collections of historical worldwide imagery
e.g. Sentinel added 7.34 PiB of imagery to their archive in
2021 [108]. The availability of this data is recognised as a
key driver of research [80].
At the same time, deep learning has become the dominant
approach in all generic computer vision dataset competitions
[21, 70, 64]. On those tasks, deep learning outperforms tra-
ditional feature engineering and machine learning by a wide
margin. It is generally agreed that these successes are only
possible because of the availability of large training datasets
[40]. Thus, given the large volumes of data available for
remote sensing, a similar shift is expected in remote sensing
algorithms. But has this been the case for agricultural tasks?
This systematic review of 193 studies investigates and
quantifies the use of deep learning on satellite images across
agricultural tasks. It was found that there were very few
examples of modern deep learning methods from before 2020,
after which they have become increasingly popular, with an
explosion of research in just the last few years. However, this
increase in popularity has been mostly for Land Use / Land
Cover (LULC) tasks, and to a lesser extent in yield prediction.
Research into other agricultural tasks has not seen the same
level of adoption. Consequently, there was a wide variety
of approaches taken for LULC tasks, but the most common
approach for the other tasks is still a pixel-based Random
Forest (RF) or Multilayer Perceptron (MLP). Nevertheless, it
was found that where spatial Convolutional Neural Networks
(CNNs) were used, they consistently outperformed traditional
machine learning methods (ML) across all tasks. However,
Long Short-Term Memory models (LSTMs) did not consis-
tently outperform traditional ML methods for yield prediction.
There were few papers that included attention-based models
(both ViT [125] and custom architectures [36]), but there was
no consistent improvement observed over other modern deep
learning techniques in the reviewed studies.
Compared to ground-level images, satellite images have
lower spatial resolution, higher spectral resolution, and are
typically processed to obtain reflectances (a physical measure-
ment). Labels for remote sensing tasks are often annotated for
objects (parcels of many pixels; e.g. a field), rather than at
either image (classification) or pixel (segmentation) level.
Differences such as these between satellite and ground-based
images have encouraged researchers to explore
novel data interpretations to use deep learning on satellite
images in ways not typically seen in generic computer vision
research. For example, a single satellite image might naturally
be used in a 2DCNN, but might also be flattened and used in
an MLP, or the spectral data might be considered a sequence
and be used in a 3DCNN. To describe these differences and
arXiv:2210.01272v3 [cs.CV] 14 Jan 2025
quantify their utilisation, we introduce a taxonomy of data
interpretations in Section V-A.
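To make these alternative interpretations concrete, the following sketch (using numpy and illustrative array sizes) reshapes the same hypothetical single-date image for each model family:

```python
import numpy as np

# A hypothetical single satellite image patch: 10 spectral bands, 32x32 pixels.
patch = np.random.rand(10, 32, 32)

# Interpretation 1: keep the spatial structure for a 2DCNN
# (batch, channels, height, width).
cnn2d_input = patch[np.newaxis, ...]               # (1, 10, 32, 32)

# Interpretation 2: flatten everything into one feature vector for an MLP.
mlp_input = patch.reshape(1, -1)                   # (1, 10240)

# Interpretation 3: treat the spectral axis as a dimension in its own right,
# so a 3DCNN can convolve jointly over (spectral, height, width).
cnn3d_input = patch[np.newaxis, np.newaxis, ...]   # (1, 1, 10, 32, 32)

print(cnn2d_input.shape, mlp_input.shape, cnn3d_input.shape)
```

The underlying numbers are identical in all three cases; only the declared shape, and therefore the class of model that can consume it, changes.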
Large labelled datasets are critical for training robust deep
learning models. But, while there is an abundance of images,
the corresponding labels can be much more difficult to come
by. For some tasks, like crop segmentation, the labels can be
discerned directly from the image, but for most agricultural
tasks, the target quantity is not so directly visible. This can
be because the relationship between reflectance and the target
value is complicated by various soil and biochemical attributes,
as in the case of predicting Leaf Area Index (LAI), or because
the target quantity is only knowable by analysing a time
sequence, as in yield prediction. For such tasks, collecting
data from ground level is necessary, but is more expensive to
obtain. So, this review analyses the data sources for each task
and highlights the publicly available data.
The ultimate goal of most agricultural research is to help
improve the yield and quality of our crops. But, the processes
which turn sunlight, water, carbon dioxide, nutrients and
minerals into the food we eat are varied and complex. They
can manifest as broad visible changes, or as subtle chemical
changes. Agricultural research can target any one of those
pathways. By using a systematic search, this review identifies
which agricultural quantities researchers are attempting to
measure from satellite images using deep learning in practice
(see Section VI). However, open questions remain.
Which quantities could or should be measured from space?
Are there subtle signals in satellite images that truly provide
information about the plants? Can deep learning uncover them
if there are? While there have been some successes using deep
learning on satellite images in crop segmentation and yield
prediction, difficult challenges remain for other tasks.
In summary, the contributions of this review are:
1) A gentle introduction to the use of satellite images, and
how this differs from generic computer vision tasks.
2) A taxonomy of data shapes and interpretations, and a
quantification of how often each is used for each task.
3) A tabulated list of references which includes this taxon-
omy, identifying which methods were used and which
worked best in each study (see Supplementary materi-
als).
4) Quantitative analysis of the performance of various deep
learning approaches on agricultural tasks.
5) An investigation of what datasets and data sources are
available.
6) Identification of the breadth of agricultural tasks using
satellite images, including general information, specific
challenges and suggestions to help adopt/improve deep
learning for each task.
II. SEARCH STRATEGY
To create an initial list of papers, we used a search query
for Clarivate’s Web of Science. To broadly find papers at the
intersection of deep learning, satellite images and agriculture,
we used both generic and specific terms for each (see Table
I). For deep learning, the specific terms were algorithm names;
for agriculture, they were crop names from the Cropland
Data Layer [123]. The resultant tagged library of studies is
available as supplementary materials.
The initial search yielded 770 studies. We performed an
initial rapid pass through the collection of studies to filter out
studies that were not at the intersection of deep learning, satel-
lite imagery and agriculture, ultimately yielding 193 studies.
The majority of these studies were for crop segmentation and
yield prediction, thus, the studies for those tasks were further
filtered as follows:
• 2020 and earlier: a study is included if it has at least x citations on Google Scholar (x = 50 for crop segmentation; x = 25 for yield prediction).
• Jan 2021 - October 2022: all studies were included.
We did not include methods using UAV imagery because we
were interested in methods for resolving the tension between
object size and pixel size in satellite imagery. For crop segmen-
tation studies (Section VI-A), we only include studies which
used multiple agricultural classes. Soil monitoring studies
(Section VI-B) often only implied an agricultural significance,
but, since soil has such a strong influence on agriculture, and
relatively few studies, we include all found soil monitoring
studies, even if they did not explicitly have an agricultural
motivation.
Although this review is systematic, it is not exhaustive,
and not just because of the above filtering. By limiting the
review to studies indexed by Clarivate’s Web of Science, we
are deliberately selecting for higher profile works than if we
included searches across all published literature. We rely on
the manual filtering stage to ensure that we only include
relevant works. And while the search terms may not reveal all
possible relevant studies, we believe that they are sufficient to
return a representative sample of all relevant studies.
There was also some inconsistency in terminology in the
reviewed studies. In the interest of clarity, and to assist anyone
unfamiliar with these terms, the variations are summarised in
Table II.
III. SATELLITE IMAGES
Objects imaged by satellites are typically significantly
smaller than the ground sample distance (GSD) covered by
each pixel. For example, the colour of each pixel in a satel-
lite image of farmland might be aggregated from hundreds,
thousands or even millions of individual plants. This massive
difference in scale between object and pixel sizes has encour-
aged researchers to focus on understanding the contents of
individual pixels as a combination of various surface types.
This naturally encouraged per-pixel algorithms [7, 8], rather
than the typical computer vision approaches which primarily
use the structured pattern of multiple spatially-related pixels
to understand an image [33, 132].
While the spatial resolution relative to the imaged objects
is much worse for satellite imagery, the spectral resolution is
often significantly better. Almost all satellite imagery has at
least 4 colour channels (red, green, blue and near-infrared),
many have more than 10 colour channels (e.g. Sentinel-2),
and some have over 100 different colour channels [112],
providing significantly more information per pixel than typical
TABLE I: The search terms used in the query for Clarivate’s Web of Science. There is an "AND" between each top-level
concept (i.e. [Deep Learning] AND [Satellite] AND [Agriculture]), and an "OR" between each term under that. The list of
specific crops comes from the CDL [123]. The search interface enforces a limit to the number of "All" search terms, so
only the abstract and topic were searched for specific agriculture terms. The full list of agricultural terms is available in the
supplementary materials section.
Deep Learning (All): Deep Learn*, CNN, RNN, LSTM, GRU, Transformer, Neural Network, Deep Belief Network, Autoencoder
Satellite (All): Satellite
Agriculture (All): Farm, Agri*, Crop
Agriculture (Abstract): Wheat, Corn, Maize, Orchard, Coffee, Vineyard, Soy, Rice, Cotton, Sorghum, Peanut*, Tobacco, Barley, Grain, Rye, Oat, Millet, Speltz, Canola, ... [+52 more]
Agriculture (Topic): *wheat, *flower*, *berries, *melon*, *berry
TABLE II: Some definitions for (sometimes inconsistent) terminology found in the literature.

Radiative transfer; reflectance; backscatter: Reflectance is the proportion of light which reflects off a surface. This is a physical property of the surface, and can be measured in a laboratory. Radiative transfer models describe the physical process of reflectance, while backscatter is reflectance resulting from artificial illumination, typically microwaves.

Sub-pixel fractional estimation; Linear Unmixing Model; Linear Mixture Model: A model of a pixel as some proportion of just a few types of land cover, such that every pixel's colour is explainable as an (often linear) combination of these cover types (see Section III).

Downscale; upsample; finer resolution: Downscaling and upsampling refer to the same operation; remote sensing scientists and computer scientists use these terms with inverse meanings. In this review, we have used "coarser" or "finer" to avoid confusion.

Multitemporal images; time series; Satellite Image Time Series (SITS); temporal data: Indicates the use of temporal data. Generally, stacked images of the same location over weeks/months (see Section V).

Multi-layer perceptron (MLP); Artificial Neural Network (ANN); Deep Neural Network (DNN): Although ANN can technically refer to any neural network, it is typically used to refer to a small MLP. Generally, DNN refers to an MLP, but DCNN refers to a CNN specifically.

Model inversion: Training a statistical model to predict the inputs of a theoretical model from either ground-measured outputs or outputs of the theoretical model itself. A good summary of the ways this is used is given in [136].

Object-based; field-based; parcel-based; superpixel: Using aggregated colour information across a whole object, field, parcel or superpixel for prediction.
ground-based sources. Additionally, satellite image sensors
are calibrated to obtain functions for converting from sensor
brightness to reflectance - a physical property of the imaged
surface - which allows quantitative analysis of the Earth’s
surface which is (mostly) independent of illumination and
sensor.
At its core, reflectance is simply the ratio between reflected
and incident light:

ρ = r / i    (1)
But determining each of these values is confounded by com-
plex shape geometries, atmospheric effects, sensor calibration
errors, unexpected solar variation and the stochastic nature of
photons.
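As a minimal sketch of how this works in practice: sensors record digital numbers, which are converted to top-of-atmosphere reflectance using per-band calibration coefficients published in the image metadata, then corrected for solar geometry. The gain/offset values below are illustrative, not taken from any particular product:

```python
import numpy as np

# Illustrative per-band calibration (real values come from sensor metadata):
# reflectance = gain * DN + offset.
gain, offset = 2.0e-5, -0.1

dn = np.array([8000.0, 9500.0, 12000.0])       # raw digital numbers
toa = gain * dn + offset                       # uncorrected TOA reflectance

# Divide by the sine of the solar elevation so the result approximates the
# true ratio rho = r / i of reflected to incident light.
sun_elevation_deg = 55.0
reflectance = toa / np.sin(np.radians(sun_elevation_deg))

print(np.round(reflectance, 4))
```

Atmospheric effects, terrain geometry and sensor noise all perturb this simple correction, which is why reflectance products are only (mostly) independent of illumination and sensor.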
Theoretically, with sufficiently precise measurements all
surfaces could be uniquely identified by matching each pixel
to a spectral signature measured in a lab. Indeed, this ideal is
the basis of many hand-crafted models (e.g. Linear Mixture
Model [2]). But such a precise sensor does not exist, and
significant noise is introduced by limited spatial and
spectral resolution, on top of the above errors in calculating
reflectance. These significant sources of noise have led to
the dominance of machine learning algorithms that learn the
varied appearances of surfaces from data [72, 58]. Such
machine learning algorithms require many training examples
to discover the existence/degree of a relationship between
reflectance and the variable of interest.
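To illustrate the hand-crafted ideal that these learned methods replace, here is a sketch of the Linear Mixture Model with made-up endmember spectra: a noisy mixed pixel is decomposed into land-cover fractions by least squares.

```python
import numpy as np

np.random.seed(0)

# Hypothetical endmember spectra (rows: soil, vegetation, water) over four
# bands (R, G, B, NIR). Values are illustrative, not lab measurements.
endmembers = np.array([
    [0.30, 0.28, 0.25, 0.40],   # bare soil
    [0.05, 0.10, 0.04, 0.50],   # vegetation
    [0.02, 0.03, 0.05, 0.01],   # water
])

# Observed pixel: 40% soil + 60% vegetation, plus sensor noise.
true_fractions = np.array([0.4, 0.6, 0.0])
pixel = true_fractions @ endmembers + np.random.normal(0.0, 0.005, 4)

# Solve the linear mixture for abundance fractions, then enforce the
# physical constraints (non-negative, sum to one).
fractions, *_ = np.linalg.lstsq(endmembers.T, pixel, rcond=None)
fractions = np.clip(fractions, 0.0, None)
fractions /= fractions.sum()

print(np.round(fractions, 2))
```

With realistic sensor noise and imperfect endmember libraries, the recovered fractions drift from the truth, which is exactly the gap that data-driven methods aim to close.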
Fortunately, there are several sources of freely available
satellite imagery with worldwide coverage to train these al-
gorithms. In the reviewed studies, the most common were:
• Moderate Resolution Imaging Spectroradiometer (MODIS) imagery at 250-1000m resolution, publicly available since 2000, along with many model-based maps, such as land surface temperatures, evapotranspiration and leaf area index (LAI).
• Landsat imagery, freely available to the public since 2008 [148]; most reviewed studies used Landsat-8 imagery at 30m resolution.
• Sentinel imagery from the Sentinel program of the European Space Agency, which has provided optical imagery at 10-60m resolution and Synthetic Aperture Radar (SAR) imagery at 5-40m resolution since 2014.
The resolution of these data sources can dictate the resolu-
tion at which analysis can be performed; for example, county-
level yield prediction always uses MODIS imagery and field-
level yield prediction always uses Landsat/Sentinel imagery.
These are obvious pairings because MODIS pixels are larger
than individual fields, and images of entire counties using
Landsat/Sentinel imagery would require a significant amount
of disk space and computation time. At the coarser resolutions,
there was a strong preference in the reviewed articles to pose
the problem as just time series analysis, rather than a spatio-
temporal one. Further, in several works [53, 109, 83] and
datasets like LUCAS [19], the problem is posed as a single-
pixel problem, only providing labels for a set of sparsely
distributed points. Although this doesn’t preclude the use of
CNNs [52], such a dataset discourages it.
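One common workaround, sketched below with hypothetical coordinates, is to cut a small spatial patch around each labelled point so that a CNN can still exploit local context despite the point-wise labels:

```python
import numpy as np

# Hypothetical scene (10 bands, 200x200 pixels) with sparse labelled points,
# as in point-sampled datasets. Coordinates are illustrative.
scene = np.random.rand(10, 200, 200)
points = [(50, 120), (87, 33), (150, 150)]     # (row, col) of labelled pixels

def extract_patch(img, row, col, size=9):
    """Cut a size x size patch centred on a labelled pixel, turning a
    single-pixel label into a CNN-ready training example."""
    half = size // 2
    return img[:, row - half:row + half + 1, col - half:col + half + 1]

patches = np.stack([extract_patch(scene, r, c) for r, c in points])
print(patches.shape)   # one (bands, size, size) patch per labelled point
```

Only the centre pixel of each patch carries a label, so the surrounding pixels act purely as spatial context for the prediction.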
Spatial resolution in satellite imagery has increased over
the years, such that some commercial satellite providers now
sell images with resolution as fine as 34cm per pixel, a
resolution sufficiently fine to detect individual trees from
satellite images [39, 68, 31, 69]. This increased resolution
has encouraged satellite imagery analysis to utilise the spatial
information - as in generic computer vision - as well as the
higher spectral resolution and reflectance calibration typically
used for satellite images (e.g. [27, 105, 130]). We note that
although spatio-temporal input has the richest information,
it is not always available. For example, very-high resolution
commercial satellite imagery is expensive and sparsely col-
lected, thus most studies using commercial satellite imagery
operated on a relatively small number of individual images
(e.g. [94, 18, 107]).
IV. DEEP LEARNING
In many domains, machine learning has found accurate
relationships despite wide variations in appearance and
substantial noise. The data-driven nature of machine learning tech-
niques handles such variations and models arbitrarily complex
relationships, while simultaneously including tools to prevent
overfitting to the noise. We found that in single-pixel problems,
Random Forests (RFs), Support Vector Machines (SVMs) and
Multi-layer Perceptrons (MLPs) were generally close competi-
tors, with each method being more accurate in different studies
in roughly equal proportions (e.g. [30, 54, 109]).
Compared to other machine learning methods, deep learning
is known to be able to construct significantly more complex
models [40], allowing them to be more robust to noisy training
data. Additionally, deep learning models learn to create their
own features, which greatly reduces the need for manual
feature engineering. This comes at the price of requiring larger
datasets to observe this improved performance. In all studies
reviewed in all tasks except yield prediction, modern deep
learning methods outperformed traditional machine learning
methods. In yield prediction, 2DCNNs consistently outper-
formed traditional machine learning methods, but LSTMs did
not.
In the literature, various algorithms are called “deep learn-
ing”. In this review, we refer to three main types of modern
deep learning: CNNs, RNNs and Attention. With a decade
since AlexNet [62], and an explosion of research, Convo-
lutional Neural Networks (CNNs) are the current de facto
standard in generic computer vision tasks. Recurrent Neural
Networks (RNNs) are a common deep learning method for
sequence modelling, but almost all studies use an extension
of RNNs; either Long Short-term Memory (LSTM) or Gated
Recurrent Unit (GRU). “Attention” can mean many different
things; here we will use it to mean, specifically, multi-head
attention as described by [125], as this is the basis for the
recently popularised Vision Transformers [24] which have
outperformed CNNs on recent ImageNet competitions. In this
review we use “deep learning” to mean any neural network
method, and “modern deep learning” to exclude MLP-only
Fig. 1: Depending on the images available, a problem can
be posed as: a) a relationship between a single pixel input
(blue cell) from a single image (yellow grid of cells) and
each prediction (red cell), or it can include contextual pixels
from spatial (b) or temporal domains (c) or both (d), known
as spatio-temporal (ST) data. For example, a model which
operates on a sequence of co-located Sentinel-2 images would
be said to use spatio-temporal input data.
algorithms. We will not discuss the technical details of these
deep learning algorithms in this review; instead we will
mention the most significant modern advances and refer the
reader to existing explanations for more details [40, 13, 55].
The ImageNet classification dataset [21] has had an enor-
mous influence on the trajectory of computer vision research.
It is common - when deep learning is applied to a new
domain - for authors to use architectures that achieved a high
rank in the ImageNet competition. In particular, AlexNet [62],
VGG [111] and ResNet [44] have received the most attention.
Similarly, segmentation models that performed
well on the MS COCO and PASCAL VOC datasets have been
adopted. The most popular segmentation architectures are
based on UNet [98] and DeepLabv3 [17].
Two of the most significant innovations of modern deep
learning are focused around training deeper models: skip con-
nections [44] and inter-layer normalisation (e.g. BatchNorm
[50], LayerNorm [4], etc). These two ideas have been almost
universally adopted by all popular modern deep learning
architectures, and with modern programming libraries these
are easily incorporated into custom architectures created by
individual studies (e.g. [5, 34]). Many reviewed works also
used Dropout [46], another popular technique for training
more robust models.
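As a rough sketch of how these pieces fit together (a plain numpy toy, not any reviewed architecture), a residual block combines normalisation, a learned transform, dropout and a skip connection:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Inter-layer normalisation: standardise each sample's features.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, w1, w2, drop_rate=0.1, training=True):
    """Normalise, transform, regularise, then add the skip connection so
    gradients (and information) can bypass the inner layers."""
    h = layer_norm(x)
    h = np.maximum(h @ w1, 0.0)                  # linear + ReLU
    if training:                                 # inverted dropout
        mask = rng.random(h.shape) >= drop_rate
        h = h * mask / (1.0 - drop_rate)
    return x + h @ w2                            # skip connection

d = 16
x = rng.normal(size=(4, d))
w1 = rng.normal(size=(d, d)) * 0.1
w2 = rng.normal(size=(d, d)) * 0.1
out = residual_block(x, w1, w2)
print(out.shape)
```

Because the output is the input plus a learned correction, stacking many such blocks does not degrade the signal, which is what makes very deep models trainable.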
[Figure panels: Type P-f, a colour feature stack fed to an RF, SVM or MLP; Type P-s, a spectral sequence fed to a 1DCNN with a 3x1 kernel.]
(a) Type P (pixel data) can be interpreted as either a feature stack
(P-f) or as a spectral sequence (P-s). For example, using a single
pixel from a Landsat image.
(b) Type T (temporal data) can be interpreted as a single feature
stack (T-f), as a sequence of feature stacks (T-s) or as an “image”
with spectral and temporal dimensions (T-i).
(c) Type S (spatial data) can be interpreted as a “data cube” with
two spatial and one spectral dimensions (S-c), as an image with
just two spatial dimensions (S-i), or as a bundle of unstructured
pixels that fit within an object boundary (S-o)
(d) Type ST (spatio-temporal data) can be interpreted as a data
cube with two spatial and one temporal dimensions (ST-c), or as
temporal sequence of pixels within an object boundary (ST-o)
Fig. 2: The data can be interpreted in different ways to allow the use of different models. Each of these data shape/interpretations
pairs is given a name like type X-x to denote the original type and its interpretation. Subfigures here match those from Figure
1; i.e. (a) here shows the interpretations of Figure 1a. The most notable distinction is how spectral data is interpreted: as either
a bag of features (P-f, T-s, S-i), or as a dimension in its own right (P-s, T-i, S-c). Although these depictions show spectral
information, several studies replaced the spectral information with a set of other features: vegetation indices, topographical,
atmospheric, soil, etc.
V. COMMON METHODS
A. Taxonomy
Satellite images are quantised measurements of our real
world along multiple dimensions: one spectral, two spatial and
one temporal. In the best case every prediction is based on a 4-
dimensional data cube of spatio-temporal (ST) data, but such
data can be computationally expensive to use or impractical
to obtain, so many works operate on data without a spatial or
temporal (or both) dimension. The shape of the input data places
important emphasis and limitations on models trained on it:
for example, LSTMs are usually applied by processing
each pixel's temporal sequence independently, so the model
can observe changes over time but cannot use spatial
contextual clues in its predictions. Thus, we create a taxonomy
of input types to help understand how satellite images are
being used differently across studies and agricultural tasks.
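The per-pixel use of LSTMs mentioned above amounts to a simple rearrangement of the spatio-temporal cube; the sketch below (with illustrative dimensions) shows why spatial context is lost: every pixel becomes its own independent sequence.

```python
import numpy as np

# Hypothetical spatio-temporal stack: 12 timesteps, 10 bands, 64x64 pixels.
st_cube = np.random.rand(12, 10, 64, 64)
T, C, H, W = st_cube.shape

# For a per-pixel LSTM, rearrange (T, C, H, W) -> (H*W, T, C): one temporal
# sequence per pixel. Spatial neighbours land in separate sequences, so the
# model cannot use spatial contextual clues.
sequences = st_cube.transpose(2, 3, 0, 1).reshape(H * W, T, C)

print(sequences.shape)   # (4096, 12, 10)
```

Sequence i = r * W + c is exactly the time series of pixel (r, c); a spatio-temporal model such as a 3DCNN or convolutional LSTM would instead keep the (H, W) axes intact.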
Typically, studies will describe their input data, but not how
they interpret that data for models. For example, a model might
be called "pixel-based" because it operates on one pixel at a
time. However, there are actually two ways that a single pixel
in remote sensing can be interpreted. Is it a bag of features, or
is it a spectral sequence? The latter interpretation is relatively
common, but there has not previously been any consistent
language or system to identify this distinction. Thus we start
from the typical terminology of pixel-based (type P), spatial
(type S), temporal (type T) and spatiotemporal (type ST) (see
Figure 1). Then, in Figure 2 we describe our taxonomy of
different interpretations of those data shapes which authors
have used to structure their data for use in modern deep
learning algorithms.
We name the interpretations by their initial data shape as
the first character and their interpreted shape as the second
character (i.e. [Shape]-[Interpretation]). The initial data shapes