
regions. PDSA is essential for identifying submerged regions, sanity-checking large building
structures, identifying debris, and supporting search-and-rescue (S&R) operations.
Flood Forecasting: Flood segmentation techniques can be critically important for flooding-related
Early Warning Systems (EWS). According to research in [12], Indians given a flood warning are
twice as likely to evacuate safely as those without any notice. Such warning systems require constant
monitoring of river or sea water levels; comparing current levels with historical evidence of
flood-prone water levels can help determine when to trigger warnings appropriately.
Constraints: Developing countries are plagued by resource and economic constraints. Failure of
macro- and micro-infrastructure planning in Nicaragua led to reconstruction on top of an earthquake
faultline [7]. Weak social safety and insurance policies inflate recovery time [27]. Economic
vulnerability renders countries like Haiti, Ethiopia, Nepal, and El Salvador in a near-permanent state of
emergency alert [7]. In these countries, processing and analysis of large-scale visual data from UAVs
for PDSA in flood response is a manual process that requires multi-team intervention, which poses a
serious bottleneck in search-and-response speed. Deployment of EWSs is infeasible because human
monitoring of video feeds is too cumbersome and expensive.
AI Technology: Deep learning is well suited to scale, automate, and expedite these operations,
reducing the burden of manual analysis on crisis responders. The last few years have witnessed
a tremendous rise in CNN-based image classification and segmentation research [21]. However,
CNNs suffer from a well-known problem: large inductive biases. Concretely, CNNs assume
locality and translation equivariance, which hurt the interpretability of pure CNN-based algorithms.
Recently, visual transformers have garnered attention for image classification, segmentation, and
object detection tasks [2, 20, 26, 3], as they challenge these assumptions while achieving comparable accuracy.
Contributions: In this work, we propose FloodTransformer, a hybrid fused CNN-Transformer, to
tackle flood water segmentation on the Water Segmentation Open Collection (WSOC) dataset [22].
First, we achieve state-of-the-art results and are, to the best of our knowledge, the first work to apply
recent transformer-driven research to the flood data domain. Second, our approach is extensible:
we demonstrate the ability of our model to generalize well to unseen data sources. Further local
calibration, if required at all, simply requires fine-tuning the weights with previous, region-specific flood
scene data. Third, our model does not suffer from data scarcity: it only requires image data as input,
not complex sensor data that is hard to collect [13]. Last, the transformer-based encoder applies
recent DL innovations to the flood data domain. Although the hybrid method still uses CNNs in
the decoder network, the aforementioned spatial inductive biases no longer occur throughout the
entire network; dependencies between patch embeddings are learnt from scratch. This improves the
robustness of our approach.
2 Methodology
To achieve Flood Scene Understanding, we introduce a deep learning model for flood image
segmentation and quantify the impact of flooding with a custom metric called Flooding Capacity.
2.1 Method
Inspired by Zhang et al. [23], we propose FloodTransformer to solve segmentation for the flood data
domain. It is a fusion architecture of a Visual Transformer [25] and Convolutional Neural Networks
(CNNs); its model architecture is displayed in Figure 1.
Complex flooding imagery may contain heterogeneous objects, flooding patterns, and backgrounds.
Using the self-attention module of the visual Transformer from [25] and the global vector
representation learned from the CNN network, FloodTransformer fuses the trained embeddings to
learn long-term spatial relationships between the aforementioned entities in images of flood-affected
areas. Using the Hadamard bilinear product [23], the fusion module fuses information via embeddings
from both parallel streams into a dense representation. The combination of multi-level fusion maps
generates the segmentation output of the model. We summarize each component below, per Zhang et
al. [23].
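The fusion step can be pictured as an element-wise (Hadamard) product of projected feature maps from the two parallel streams. The following PyTorch sketch is only illustrative, assuming hypothetical layer names and shapes rather than the authors' exact implementation; the combination of multi-level fusion maps described above is omitted.

```python
import torch
import torch.nn as nn


class HadamardFusion(nn.Module):
    """Minimal sketch: fuse CNN feature maps with transformer patch embeddings
    via an element-wise (Hadamard) product. Layer names and dimensions are
    assumptions for illustration, not the paper's exact design."""

    def __init__(self, cnn_channels: int, embed_dim: int, fused_dim: int):
        super().__init__()
        # Project both streams to a common channel width before fusing.
        self.proj_cnn = nn.Conv2d(cnn_channels, fused_dim, kernel_size=1)
        self.proj_vit = nn.Linear(embed_dim, fused_dim)

    def forward(self, cnn_feat: torch.Tensor, vit_tokens: torch.Tensor) -> torch.Tensor:
        # cnn_feat:   (B, C, H/F, W/F) feature map from the CNN stream
        # vit_tokens: (B, N, D) patch embeddings, with N = (H/F) * (W/F)
        b, _, h, w = cnn_feat.shape
        vit_map = self.proj_vit(vit_tokens)                      # (B, N, fused_dim)
        vit_map = vit_map.transpose(1, 2).reshape(b, -1, h, w)   # (B, fused_dim, h, w)
        cnn_map = self.proj_cnn(cnn_feat)                        # (B, fused_dim, h, w)
        return cnn_map * vit_map                                 # Hadamard product fusion


# Example: fuse a 512-channel CNN map with 768-dim ViT tokens on a 16x16 patch grid.
fusion = HadamardFusion(cnn_channels=512, embed_dim=768, fused_dim=256)
fused = fusion(torch.randn(2, 512, 16, 16), torch.randn(2, 256, 768))
print(fused.shape)  # torch.Size([2, 256, 16, 16])
```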
Transformer Module: We use the encoder-decoder network of the Visual Transformer [25]. The
input image $x \in \mathbb{R}^{H \times W \times 3}$ is sliced into $N$ patches, where
$N = \frac{H}{F} \times \frac{W}{F}$ and $F$ is usually set to 16 or