A systematic review of the use of Deep Learning in Satellite Imagery for Agriculture

Brandon Victor, Aiden Nibali, Zhen He
Abstract—Agricultural research is essential for increasing food
production to meet the needs of a rapidly growing human
population. Collecting large quantities of agricultural data helps
to improve decision making for better food security at various
levels: from international trade and policy decisions, down to
individual farmers. At the same time, deep learning has seen
a wave of popularity across many different research areas and
data modalities. Meanwhile, satellite imagery has become available in
unprecedented quantities, driving much research in the wider
remote sensing community. The data-hungry nature of deep
learning models and this huge data volume seem like a perfect
match. But has deep learning been adopted for agricultural tasks
using satellite images? This systematic review of 193 studies
analyses the tasks that have reaped benefits from deep learning
algorithms, and those that have not. It was found that while
Land Use / Land Cover research has embraced deep learning
algorithms, research on other agricultural tasks has not. This
poor adoption appears to be due to a critical lack of labelled
datasets for these other tasks. Thus, we give suggestions for
collecting larger datasets. Additionally, satellite images differ
from ground-based images in a number of ways, resulting in a
proliferation of interesting data interpretations unique to satellite
images. So, this review also introduces a taxonomy of data
input shapes and how they are interpreted in order to facilitate
easier communication of algorithm types and enable quantitative
analysis.
Index Terms—Systematic Review, Deep learning, Satellite im-
agery, Agriculture, Computer Vision
I. INTRODUCTION
THERE are big agricultural challenges coming. The human
population is expected to increase significantly in the next
decades [122], which will require an estimated global yield
increase of 25-70% [47]. At the same time, a changing climate
brings additional challenges [118] and the need to reduce the
environmental impact of agriculture [38].
Collecting more agricultural data is one path to improving
global food production and distribution. Remote sensing is
an extremely useful tool for this purpose because it provides a non-
destructive and non-intrusive way to monitor agricultural fields
simultaneously at a fine level of detail and across wide areas
and times. This makes it a technology that can be used for
fast and targeted interventions at a farm-level [41], regional
studies of ecological change over time [99], county-level yield
prediction for logistics [32] and international trade decisions
[75]. To support these applications, there are many sources of worldwide
satellite imagery freely available to the public. The most pop-
ular for agricultural purposes are: Sentinel [120, 25], Landsat
[100] and MODIS. Each of these satellite programs stores and
Brandon Victor, Aiden Nibali and Zhen He work within the School of
Computing, Engineering and Mathematical Sciences at La Trobe University,
Melbourne, Victoria, Australia
manages enormous collections of historical worldwide imagery
e.g. Sentinel added 7.34 PiB of imagery to their archive in
2021 [108]. The availability of this data is recognised as a
key driver of research [80].
At the same time, deep learning has become the dominant
approach in all generic computer vision dataset competitions
[21, 70, 64]. On those tasks, deep learning outperforms tra-
ditional feature engineering and machine learning by a wide
margin. It is generally agreed that these successes are only
possible because of the availability of large training datasets
[40]. Thus, given the large volumes of data available for
remote sensing, a similar shift is expected in remote sensing
algorithms. But has this been the case for agricultural tasks?
This systematic review of 193 studies investigates and
quantifies the use of deep learning on satellite images across
agricultural tasks. It was found that there were very few
examples of modern deep learning methods from before 2020,
after which they have become increasingly popular, with an
explosion of research in just the last few years. However, this
increase in popularity has been mostly for Land Use / Land
Cover (LULC) tasks, and to a lesser extent in yield prediction.
Research into other agricultural tasks has not seen the same
level of adoption. Consequently, there was a wide variety
of approaches taken for LULC tasks, but the most common
approach for the other tasks is still a pixel-based Random
Forest (RF) or Multilayer Perceptron (MLP). Nevertheless, it
was found that where spatial Convolutional Neural Networks
(CNNs) were used, they consistently outperformed traditional
machine learning methods (ML) across all tasks. However,
Long Short-Term Memory models (LSTMs) did not consis-
tently outperform traditional ML methods for yield prediction.
There were few papers that included attention-based models
(both ViT [125] and custom architectures [36]), but there was
no consistent improvement observed over other modern deep
learning techniques in the reviewed studies.
Compared to ground-level images, satellite images have
lower spatial resolution, higher spectral resolution, and are
typically processed to obtain reflectances (a physical measure-
ment). Labels for remote sensing tasks are often annotated for
objects (parcels of many pixels; e.g. a field), rather than at
either image (classification) or pixel (segmentation) level.
Differences such as these between satellite and ground-based
images have encouraged researchers to explore
novel data interpretations to use deep learning on satellite
images in ways not typically seen in generic computer vision
research. For example, a single satellite image might naturally
be used in a 2DCNN, but might also be flattened and used in
an MLP, or the spectral data might be considered a sequence
and be used in a 3DCNN. To describe these differences and
arXiv:2210.01272v3 [cs.CV] 14 Jan 2025
quantify their utilisation, we introduce a taxonomy of data
interpretations in Section V-A.
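To make these alternative interpretations concrete, the following sketch (using numpy and illustrative array sizes) reshapes the same hypothetical single-date image for each model family:

```python
import numpy as np

# A hypothetical single satellite image patch: 10 spectral bands, 32x32 pixels.
patch = np.random.rand(10, 32, 32)

# Interpretation 1: keep the spatial structure for a 2DCNN
# (batch, channels, height, width).
cnn2d_input = patch[np.newaxis, ...]               # (1, 10, 32, 32)

# Interpretation 2: flatten everything into one feature vector for an MLP.
mlp_input = patch.reshape(1, -1)                   # (1, 10240)

# Interpretation 3: treat the spectral axis as a dimension in its own right,
# so a 3DCNN can convolve jointly over (spectral, height, width).
cnn3d_input = patch[np.newaxis, np.newaxis, ...]   # (1, 1, 10, 32, 32)

print(cnn2d_input.shape, mlp_input.shape, cnn3d_input.shape)
```

The underlying numbers are identical in all three cases; only the declared shape, and therefore the class of model that can consume it, changes.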
Large labelled datasets are critical for training robust deep
learning models. But, while there is an abundance of images,
the corresponding labels can be much more difficult to come
by. For some tasks, like crop segmentation, the labels can be
discerned directly from the image, but for most agricultural
tasks, the target quantity is not so directly visible. This can
be because the relationship between reflectance and the target
value is complicated by various soil and biochemical attributes,
as in the case of predicting Leaf Area Index (LAI), or because
the target quantity is only knowable by analysing a time
sequence, as in yield prediction. For such tasks, collecting
data from ground level is necessary, but is more expensive to
obtain. So, this review analyses the data sources for each task
and highlights the publicly available data.
The ultimate goal of most agricultural research is to help
improve the yield and quality of our crops. But, the processes
which turn sunlight, water, carbon dioxide, nutrients and
minerals into the food we eat are varied and complex. They
can manifest as broad visible changes, or as subtle chemical
changes. Agricultural research can target any one of those
pathways. By using a systematic search, this review identifies
which agricultural quantities researchers are attempting to
measure from satellite images using deep learning in practice
(see Section VI). However, open questions remain.
Which quantities could or should be measured from space?
Are there subtle signals in satellite images that truly provide
information about the plants? Can deep learning uncover them
if there are? While there have been some successes using deep
learning on satellite images in crop segmentation and yield
prediction, difficult challenges remain for other tasks.
In summary, the contributions of this review are:
1) A gentle introduction to the use of satellite images, and
how this differs from generic computer vision tasks.
2) A taxonomy of data shapes and interpretations, and a
quantification of how often each is used for each task.
3) A tabulated list of references which includes this taxon-
omy, identifying which methods were used and which
worked best in each study (see Supplementary materi-
als).
4) Quantitative analysis of the performance of various deep
learning approaches on agricultural tasks.
5) An investigation of what datasets and data sources are
available.
6) Identification of the breadth of agricultural tasks using
satellite images, including general information, specific
challenges and suggestions to help adopt/improve deep
learning for each task.
II. SEARCH STRATEGY
To create an initial list of papers, we used a search query
for Clarivate’s Web of Science. To broadly find papers at the
intersection of deep learning, satellite images and agriculture,
we used both generic and specific terms for each (see Table
I). For deep learning, the specific terms were algorithm names;
for agriculture, they were crop names from the Cropland
Data Layer [123]. The resultant tagged library of studies is
available as supplementary materials.
The initial search yielded 770 studies. We performed an
initial rapid pass through the collection of studies to filter out
studies that were not at the intersection of deep learning, satel-
lite imagery and agriculture, ultimately yielding 193 studies.
The majority of these studies were for crop segmentation and
yield prediction, thus, the studies for those tasks were further
filtered as follows:
• 2020 and earlier: a study is included if it has at least x citations on Google Scholar (x = 50 for crop segmentation; x = 25 for yield prediction).
• Jan 2021 - October 2022: all studies were included.
We did not include methods using UAV imagery because we
were interested in methods for resolving the tension between
object size and pixel size in satellite imagery. For crop segmen-
tation studies (Section VI-A), we only include studies which
used multiple agricultural classes. Soil monitoring studies
(Section VI-B) often only implied an agricultural significance,
but, since soil has such a strong influence on agriculture, and
relatively few studies, we include all found soil monitoring
studies, even if they did not explicitly have an agricultural
motivation.
Although this review is systematic, it is not exhaustive,
and not just because of the above filtering. By limiting the
review to studies indexed by Clarivate’s Web of Science, we
are deliberately selecting for higher profile works than if we
included searches across all published literature. We rely on
the manual filtering stage to ensure that we only include
relevant works. And while the search terms may not reveal all
possible relevant studies, we believe that they are sufficient to
return a representative sample of all relevant studies.
There was also some inconsistency in terminology in the
reviewed studies. In the interest of clarity, and to assist anyone
unfamiliar with these terms, the variations are summarised in
Table II.
III. SATELLITE IMAGES
Objects imaged by satellites are typically significantly
smaller than the ground sample distance (GSD) covered by
each pixel. For example, the colour of each pixel in a satel-
lite image of farmland might be aggregated from hundreds,
thousands or even millions of individual plants. This massive
difference in scale between object and pixel sizes has encour-
aged researchers to focus on understanding the contents of
individual pixels as a combination of various surface types.
This naturally encouraged per-pixel algorithms [7, 8], rather
than the typical computer vision approaches which primarily
use the structured pattern of multiple spatially-related pixels
to understand an image [33, 132].
While the spatial resolution relative to the imaged objects
is much worse for satellite imagery, the spectral resolution is
often significantly better. Almost all satellite imagery has at
least 4 colour channels (red, green, blue and near-infrared),
many have more than 10 colour channels (e.g. Sentinel-2),
and some have over 100 different colour channels [112],
providing significantly more information per pixel than typical
TABLE I: The search terms used in the query for Clarivate’s Web of Science. There is an "AND" between each top-level
concept (i.e. [Deep Learning] AND [Satellite] AND [Agriculture]), and an "OR" between each term under that. The list of
specific crops comes from the CDL [123]. The search interface enforces a limit to the number of "All" search terms, so
only the abstract and topic were searched for specific agriculture terms. The full list of agricultural terms is available in the
supplementary materials section.
Deep Learning (All): Deep Learn*, CNN, RNN, LSTM, GRU, Transformer, Neural Network, Deep Belief Network, Autoencoder
Satellite (All): Satellite
Agriculture (All): Farm, Agri*, Crop
Agriculture (Abstract): Wheat, Corn, Maize, Orchard, Coffee, Vineyard, Soy, Rice, Cotton, Sorghum, Peanut*, Tobacco, Barley, Grain, Rye, Oat, Millet, Speltz, Canola, ... [+52 more]
Agriculture (Topic): *wheat, *flower*, *berries, *melon*, *berry
TABLE II: Some definitions for (sometimes inconsistent) terminology found in the literature.

Radiative transfer; reflectance; backscatter: Reflectance is the proportion of light which reflects off a surface. This is a physical property of the surface, and can be measured in a laboratory. Radiative transfer models describe the physical process of reflectance, while backscatter is reflectance resulting from artificial illumination, typically microwaves.

Sub-pixel fractional estimation; Linear Unmixing Model; Linear Mixture Model: A model of a pixel as some proportion of just a few types of land cover, such that every pixel's colour is explainable as an (often linear) combination of these cover types (see Section III).

Downscale; upsample; finer resolution: Downscaling and upsampling refer to the same operation; remote sensing scientists and computer scientists use these terms with inverse meanings. In this review, we have used "coarser" or "finer" to avoid confusion.

Multitemporal images; time series; Satellite Image Time Series (SITS); temporal data: Indicates the use of temporal data. Generally, stacked images of the same location over weeks/months (see Section V).

Multi-layer perceptron (MLP); Artificial Neural Network (ANN); Deep Neural Network (DNN): Although ANN can technically refer to any neural network, it is typically used to refer to a small MLP. Generally, DNN refers to an MLP, but DCNN refers to a CNN specifically.

Model inversion: Training a statistical model to predict the inputs of a theoretical model from either ground-measured outputs or outputs of the theoretical model itself. A good summary of the ways this is used is given in [136].

Object-based; field-based; parcel-based; superpixel: Using aggregated colour information across a whole object, field, parcel or superpixel for prediction.
ground-based sources. Additionally, satellite image sensors
are calibrated to obtain functions for converting from sensor
brightness to reflectance - a physical property of the imaged
surface - which allows quantitative analysis of the Earth’s
surface which is (mostly) independent of illumination and
sensor.
At its core, reflectance is simply the ratio between reflected
and incident light:

ρ = r / i    (1)
But determining each of these values is confounded by com-
plex shape geometries, atmospheric effects, sensor calibration
errors, unexpected solar variation and the stochastic nature of
photons.
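As a minimal sketch of how this works in practice: sensors record digital numbers, which are converted to top-of-atmosphere reflectance using per-band calibration coefficients published in the image metadata, then corrected for solar geometry. The gain/offset values below are illustrative, not taken from any particular product:

```python
import numpy as np

# Illustrative per-band calibration (real values come from sensor metadata):
# reflectance = gain * DN + offset.
gain, offset = 2.0e-5, -0.1

dn = np.array([8000.0, 9500.0, 12000.0])       # raw digital numbers
toa = gain * dn + offset                       # uncorrected TOA reflectance

# Divide by the sine of the solar elevation so the result approximates the
# true ratio rho = r / i of reflected to incident light.
sun_elevation_deg = 55.0
reflectance = toa / np.sin(np.radians(sun_elevation_deg))

print(np.round(reflectance, 4))
```

Atmospheric effects, terrain geometry and sensor noise all perturb this simple correction, which is why reflectance products are only (mostly) independent of illumination and sensor.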
Theoretically, with sufficiently precise measurements all
surfaces could be uniquely identified by matching each pixel
to a spectral signature measured in a lab. Indeed, this ideal is
the basis of many hand-crafted models (e.g. Linear Mixture
Model [2]). But such a precise sensor does not exist, and
significant noise is introduced by limited spatial and
spectral resolution, on top of the above errors in calculating
reflectance. These significant sources of noise have led to
the dominance of machine learning algorithms that learn the
varied appearances of surfaces from data [72, 58]. Such
machine learning algorithms require many training examples
to discover the existence/degree of a relationship between
reflectance and the variable of interest.
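To illustrate the hand-crafted ideal that these learned methods replace, here is a sketch of the Linear Mixture Model with made-up endmember spectra: a noisy mixed pixel is decomposed into land-cover fractions by least squares.

```python
import numpy as np

np.random.seed(0)

# Hypothetical endmember spectra (rows: soil, vegetation, water) over four
# bands (R, G, B, NIR). Values are illustrative, not lab measurements.
endmembers = np.array([
    [0.30, 0.28, 0.25, 0.40],   # bare soil
    [0.05, 0.10, 0.04, 0.50],   # vegetation
    [0.02, 0.03, 0.05, 0.01],   # water
])

# Observed pixel: 40% soil + 60% vegetation, plus sensor noise.
true_fractions = np.array([0.4, 0.6, 0.0])
pixel = true_fractions @ endmembers + np.random.normal(0.0, 0.005, 4)

# Solve the linear mixture for abundance fractions, then enforce the
# physical constraints (non-negative, sum to one).
fractions, *_ = np.linalg.lstsq(endmembers.T, pixel, rcond=None)
fractions = np.clip(fractions, 0.0, None)
fractions /= fractions.sum()

print(np.round(fractions, 2))
```

With realistic sensor noise and imperfect endmember libraries, the recovered fractions drift from the truth, which is exactly the gap that data-driven methods aim to close.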
Fortunately, there are several sources of freely available
satellite imagery with worldwide coverage to train these al-
gorithms. In the reviewed studies, the most common were:
• Moderate Resolution Imaging Spectroradiometer (MODIS) imagery at 250-1000m resolution, publicly available since 2000, along with many model-based maps, such as land surface temperatures, evapotranspiration and leaf area index (LAI).
• Landsat imagery, freely available to the public since 2008 [148]; most reviewed studies used Landsat-8 imagery at 30m resolution.
• Sentinel imagery from the Sentinel program of the European Space Agency, which has provided optical imagery at 10-60m resolution and Synthetic Aperture Radar (SAR) imagery at 5-40m resolution since 2014.
The resolution of these data sources can dictate the resolu-
tion at which analysis can be performed; for example, county-
level yield prediction always uses MODIS imagery and field-
level yield prediction always uses Landsat/Sentinel imagery.
These are obvious pairings because MODIS pixels are larger
than individual fields, and images of entire counties using
Landsat/Sentinel imagery would require a significant amount
of disk space and computation time. At the coarser resolutions,
there was a strong preference in the reviewed articles to pose
the problem as just time series analysis, rather than a spatio-
temporal one. Further, in several works [53, 109, 83] and
datasets like LUCAS [19], the problem is posed as a single-
pixel problem, only providing labels for a set of sparsely
distributed points. Although this doesn’t preclude the use of
CNNs [52], such a dataset discourages it.
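One common workaround, sketched below with hypothetical coordinates, is to cut a small spatial patch around each labelled point so that a CNN can still exploit local context despite the point-wise labels:

```python
import numpy as np

# Hypothetical scene (10 bands, 200x200 pixels) with sparse labelled points,
# as in point-sampled datasets. Coordinates are illustrative.
scene = np.random.rand(10, 200, 200)
points = [(50, 120), (87, 33), (150, 150)]     # (row, col) of labelled pixels

def extract_patch(img, row, col, size=9):
    """Cut a size x size patch centred on a labelled pixel, turning a
    single-pixel label into a CNN-ready training example."""
    half = size // 2
    return img[:, row - half:row + half + 1, col - half:col + half + 1]

patches = np.stack([extract_patch(scene, r, c) for r, c in points])
print(patches.shape)   # one (bands, size, size) patch per labelled point
```

Only the centre pixel of each patch carries a label, so the surrounding pixels act purely as spatial context for the prediction.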
Spatial resolution in satellite imagery has increased over
the years, such that some commercial satellite providers now
sell images with resolution as fine as 34cm per pixel, a
resolution sufficiently fine to detect individual trees from
satellite images [39, 68, 31, 69]. This increased resolution
has encouraged satellite imagery analysis to utilise the spatial
information - as in generic computer vision - as well as the
higher spectral resolution and reflectance calibration typically
used for satellite images (e.g. [27, 105, 130]). We note that
although spatio-temporal input has the richest information,
it is not always available. For example, very-high resolution
commercial satellite imagery is expensive and sparsely col-
lected, thus most studies using commercial satellite imagery
operated on a relatively small number of individual images
(e.g. [94, 18, 107]).
IV. DEEP LEARNING
In many domains, machine learning has found accurate
relationships despite wide variations in appearance and
substantial noise. The data-driven nature of machine learning tech-
niques handles such variations and models arbitrarily complex
relationships, while simultaneously including tools to prevent
overfitting to the noise. We found that in single-pixel problems,
Random Forests (RFs), Support Vector Machines (SVMs) and
Multi-layer Perceptrons (MLPs) were generally close competi-
tors, with each method being more accurate in different studies
in roughly equal proportions (e.g. [30, 54, 109]).
Compared to other machine learning methods, deep learning
is known to be able to construct significantly more complex
models [40], allowing them to be more robust to noisy training
data. Additionally, deep learning models learn to create their
own features, which greatly reduces the need for manual
feature engineering. This comes at the price of requiring larger
datasets to observe this improved performance. In all studies
reviewed in all tasks except yield prediction, modern deep
learning methods outperformed traditional machine learning
methods. In yield prediction, 2DCNNs consistently outper-
formed traditional machine learning methods, but LSTMs did
not.
In the literature, various algorithms are called “deep learn-
ing”. In this review, we refer to three main types of modern
deep learning: CNNs, RNNs and Attention. With a decade
since AlexNet [62], and an explosion of research, Convo-
lutional Neural Networks (CNNs) are the current de facto
standard in generic computer vision tasks. Recurrent Neural
Networks (RNNs) are a common deep learning method for
sequence modelling, but almost all studies use an extension
of RNNs; either Long Short-term Memory (LSTM) or Gated
Recurrent Unit (GRU). “Attention” can mean many different
things; here we will use it to mean, specifically, multi-head
attention as described by [125], as this is the basis for the
recently popularised Vision Transformers [24] which have
outperformed CNNs on recent ImageNet competitions. In this
review we use “deep learning” to mean any neural network
method, and “modern deep learning” to exclude MLP-only
Fig. 1: Depending on the images available, a problem can
be posed as: a) a relationship between a single pixel input
(blue cell) from a single image (yellow grid of cells) and
each prediction (red cell), or it can include contextual pixels
from spatial (b) or temporal domains (c) or both (d), known
as spatio-temporal (ST) data. For example, a model which
operates on a sequence of co-located Sentinel-2 images would
be said to use spatio-temporal input data.
algorithms. We will not discuss the technical details of these
deep learning algorithms in this review; instead we will
mention the most significant modern advances and refer the
reader to existing explanations for more details [40, 13, 55].
The ImageNet classification dataset [21] has had an enor-
mous influence on the trajectory of computer vision research.
It is common - when deep learning is applied to a new
domain - for authors to use architectures that achieved a high
rank in the ImageNet competition. In particular, AlexNet [62],
VGG [111] and ResNet [44] have received the most attention.
Similarly, segmentation models that performed
well on the MS COCO and PASCAL VOC datasets have been
adopted. The most popular segmentation architectures are
based on UNet [98] and DeepLabv3 [17].
Two of the most significant innovations of modern deep
learning are focused around training deeper models: skip con-
nections [44] and inter-layer normalisation (e.g. BatchNorm
[50], LayerNorm [4], etc). These two ideas have been almost
universally adopted by all popular modern deep learning
architectures, and with modern programming libraries these
are easily incorporated into custom architectures created by
individual studies (e.g. [5, 34]). Many reviewed works also
used Dropout [46], another popular technique for training
more robust models.
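As a rough sketch of how these pieces fit together (a plain numpy toy, not any reviewed architecture), a residual block combines normalisation, a learned transform, dropout and a skip connection:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    # Inter-layer normalisation: standardise each sample's features.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, w1, w2, drop_rate=0.1, training=True):
    """Normalise, transform, regularise, then add the skip connection so
    gradients (and information) can bypass the inner layers."""
    h = layer_norm(x)
    h = np.maximum(h @ w1, 0.0)                  # linear + ReLU
    if training:                                 # inverted dropout
        mask = rng.random(h.shape) >= drop_rate
        h = h * mask / (1.0 - drop_rate)
    return x + h @ w2                            # skip connection

d = 16
x = rng.normal(size=(4, d))
w1 = rng.normal(size=(d, d)) * 0.1
w2 = rng.normal(size=(d, d)) * 0.1
out = residual_block(x, w1, w2)
print(out.shape)
```

Because the output is the input plus a learned correction, stacking many such blocks does not degrade the signal, which is what makes very deep models trainable.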
[Figure panels: Type P-f, a colour feature stack fed to an RF, SVM or MLP; Type P-s, a spectral sequence fed to a 1DCNN with a 3x1 kernel.]
(a) Type P (pixel data) can be interpreted as either a feature stack
(P-f) or as a spectral sequence (P-s). For example, using a single
pixel from a Landsat image.
(b) Type T (temporal data) can be interpreted as a single feature
stack (T-f), as a sequence of feature stacks (T-s) or as an “image”
with spectral and temporal dimensions (T-i).
(c) Type S (spatial data) can be interpreted as a “data cube” with
two spatial and one spectral dimensions (S-c), as an image with
just two spatial dimensions (S-i), or as a bundle of unstructured
pixels that fit within an object boundary (S-o)
(d) Type ST (spatio-temporal data) can be interpreted as a data
cube with two spatial and one temporal dimensions (ST-c), or as
temporal sequence of pixels within an object boundary (ST-o)
Fig. 2: The data can be interpreted in different ways to allow the use of different models. Each of these data shape/interpretations
pairs is given a name like type X-x to denote the original type and its interpretation. Subfigures here match those from Figure
1; i.e. (a) here shows the interpretations of Figure 1a. The most notable distinction is how spectral data is interpreted: as either
a bag of features (P-f, T-s, S-i), or as a dimension in its own right (P-s, T-i, S-c). Although these depictions show spectral
information, several studies replaced the spectral information with a set of other features: vegetation indices, topographical,
atmospheric, soil, etc.
V. COMMON METHODS
A. Taxonomy
Satellite images are quantised measurements of our real
world along multiple dimensions: one spectral, two spatial and
one temporal. In the best case every prediction is based on a 4-
dimensional data cube of spatio-temporal (ST) data, but such
data can be computationally expensive to use or impractical
to obtain, so many works operate on data without a spatial or
temporal (or both) dimension. The shape of the input data places
important emphasis and limitations on models trained on it:
for example, LSTMs are usually applied by processing
each pixel's temporal sequence independently, so the model
can observe changes over time but cannot use spatial
contextual clues in its predictions. Thus, we create a taxonomy
of input types to help understand how satellite images are
being used differently across studies and agricultural tasks.
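The per-pixel use of LSTMs mentioned above amounts to a simple rearrangement of the spatio-temporal cube; the sketch below (with illustrative dimensions) shows why spatial context is lost: every pixel becomes its own independent sequence.

```python
import numpy as np

# Hypothetical spatio-temporal stack: 12 timesteps, 10 bands, 64x64 pixels.
st_cube = np.random.rand(12, 10, 64, 64)
T, C, H, W = st_cube.shape

# For a per-pixel LSTM, rearrange (T, C, H, W) -> (H*W, T, C): one temporal
# sequence per pixel. Spatial neighbours land in separate sequences, so the
# model cannot use spatial contextual clues.
sequences = st_cube.transpose(2, 3, 0, 1).reshape(H * W, T, C)

print(sequences.shape)   # (4096, 12, 10)
```

Sequence i = r * W + c is exactly the time series of pixel (r, c); a spatio-temporal model such as a 3DCNN or convolutional LSTM would instead keep the (H, W) axes intact.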
Typically, studies will describe their input data, but not how
they interpret that data for models. For example, a model might
be called "pixel-based" because it operates on one pixel at a
time. However, there are actually two ways that a single pixel
in remote sensing can be interpreted. Is it a bag of features, or
is it a spectral sequence? The latter interpretation is relatively
common, but there has not previously been any consistent
language or system to identify this distinction. Thus we start
from the typical terminology of pixel-based (type P), spatial
(type S), temporal (type T) and spatiotemporal (type ST) (see
Figure 1). Then, in Figure 2 we describe our taxonomy of
different interpretations of those data shapes which authors
have used to structure their data for use in modern deep
learning algorithms.
We name the interpretations by their initial data shape as
the first character and their interpreted shape as the second
character (i.e. [Shape]-[Interpretation]). The initial data shapes