High-Fidelity Visual Structural Inspections through
Transformers and Learnable Resizers
Kareem Eltouny1†[0000-0003-3918-2297], Seyedomid Sajedi1†[0000-0002-4552-4794], and Xiao
Liang1*[0000-0003-4788-8759]
1 University at Buffalo, the State University of New York, Buffalo NY 14260, USA
*liangx@buffalo.edu
† Equal contribution
Abstract. Visual inspection is the predominant technique for evaluating the
condition of civil infrastructure. Recent advances in unmanned aerial vehicles
(UAVs) and artificial intelligence have made visual inspections faster,
safer, and more reliable. Camera-equipped UAVs are becoming the new stand-
ard in the industry by collecting massive amounts of visual data for human in-
spectors. Meanwhile, there has been significant research on autonomous visual
inspections using deep learning algorithms, including semantic segmentation.
While UAVs can capture high-resolution images of buildings' façades, high-
resolution segmentation is extremely challenging due to the high computational
memory demands. Typically, images are uniformly downsized at the price of
losing fine local details. Conversely, breaking the images into multiple smaller
patches can cause a loss of global contextual information. We propose a hybrid
strategy that can adapt to different inspection tasks by managing the global and
local semantics trade-off. The framework comprises a compound, high-
resolution deep learning architecture equipped with an attention-based segmen-
tation model and learnable downsampler-upsampler modules designed for op-
timal efficiency and information retention. The framework also utilizes vision
transformers on a grid of image crops aiming for high precision learning with-
out downsizing. An augmented inference technique is used to boost the perfor-
mance and reduce the possible loss of context due to grid cropping. Comprehensive
experiments have been performed on synthetic environments generated with 3D
physics-based graphics models in the QuakeCity dataset. The proposed framework
is evaluated using several metrics on three segmentation tasks: component type,
component damage state, and global damage (crack, rebar, spalling).
Keywords: Semantic segmentation, crack detection, ResNeSt, Swin Trans-
formers
1 Introduction
Innovations in sensing technology and deep learning have made a significant impact
on autonomous structural visual inspections [1-3]. More recently, camera-equipped
unmanned aerial vehicles (UAVs) are proving to be an important asset in visual
inspections by rapidly capturing high-resolution visual data of structures while
increasing the safety of human workers. The massive amounts of collected footage are often
post-processed by an off-the-shelf deep learning model. However, challenges arise
when a trade-off between computational efficiency and high-fidelity precision needs
to be achieved. In many situations, computational resource constraints necessitate the
downsizing of high-resolution data for various tasks, including semantic segmenta-
tion, causing a loss in critical information.
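To make the information loss concrete, the following sketch (our illustration, not from the paper; the image size, downsampling factor, and threshold are arbitrary) shows how a one-pixel-wide crack disappears under uniform area downsampling:

```python
import numpy as np

# Illustration (ours, not from the paper): a hairline crack, one pixel
# wide, in a 2048 x 2048 binary mask.
H = W = 2048
mask = np.zeros((H, W), dtype=np.float32)
mask[:, 1024] = 1.0                      # 2048 crack pixels in total

# Uniform 4x downsizing by area averaging (2048 -> 512 per side).
s = 4
down = mask.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

# The crack's intensity is diluted to 4/16 = 0.25 per block, so a hard
# 0.5 threshold erases it completely in the low-resolution mask.
survivors = (down > 0.5).sum()           # no crack pixels survive
```

The crack occupies four of the sixteen pixels in each averaged block, so its score never exceeds the threshold and the entire defect vanishes before segmentation even begins.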
Different tasks may need different levels of context in visual structural inspections.
In this study, we propose a twin-model framework to tackle the challenges arising
from the different natures of these tasks. Trainable Resizing for high-resolution
Segmentation Network (TRS-Net) is developed for component detection and damage
severity segmentation, tasks that rely heavily on global contextual information. DmgFormer
is dedicated to crack, rebar, and spalling segmentation, tasks that are highly sensitive
to resolution loss. The proposed twin models are customized based on the latest ad-
vances in computer vision research on super-resolution, image segmentation, and
transformer architectures.
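As a rough sketch of the grid-crop inference idea with shift-augmented averaging (our reconstruction under assumptions, not the paper's implementation; `segment` is a stand-in for a model such as DmgFormer, and the crop size and shift are illustrative):

```python
import numpy as np

def segment(crop):
    # Stand-in for a real per-crop segmentation model; any predictor
    # returning a score map of the same size as the crop works here.
    return (crop > 0.5).astype(np.float32)

def grid_inference(img, crop=256, offset=0):
    """Segment an image as a grid of crops, optionally with a shifted grid."""
    H, W = img.shape
    pad = np.pad(img, ((offset, crop), (offset, crop)))  # room for the shift
    out = np.zeros_like(pad, dtype=np.float32)
    for y in range(0, H + offset, crop):
        for x in range(0, W + offset, crop):
            out[y:y + crop, x:x + crop] = segment(pad[y:y + crop, x:x + crop])
    return out[offset:offset + H, offset:offset + W]    # undo the padding

img = np.random.rand(512, 512).astype(np.float32)
plain = grid_inference(img)                    # regular crop grid
shifted = grid_inference(img, offset=128)      # grid shifted by half a crop
fused = 0.5 * (plain + shifted)                # averaged inference output
```

With a real model, the shifted grid places crop boundaries at different positions, so averaging the two passes softens seam artifacts from grid cropping; the dummy pixelwise predictor above makes the two passes agree exactly.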
2 Autonomous visual inspections framework
2.1 Components and damage state segmentation
Fig. 1. TRS-Net architecture. (Conv: 2D convolution, BN: batch normalization, ReLU: rectified linear unit, E-i: encoder block, D-i,j: decoder block, n: classes).
It is a common practice to uniformly downsample high-resolution images before in-
serting them into deep learning segmentation models [4], and if needed, interpolation
is used for upsampling the predicted masks. By giving equal significance to all pixels
in the image, these operations may result in an unsatisfactory segmentation
performance. In contrast, feeding high-resolution images directly to state-of-the-art segmentation …
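The common downsample-segment-upsample pipeline described here might be sketched as follows (our minimal illustration with nearest-neighbor resampling and a dummy predictor; not the paper's method):

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    # Nearest-neighbor resampling by integer index mapping.
    h, w = arr.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[rows][:, cols]

image = np.random.rand(2048, 2048)

small = resize_nearest(image, 512, 512)    # 16x fewer pixels to segment
pred = (small > 0.5).astype(np.uint8)      # stand-in low-res prediction
mask = resize_nearest(pred, 2048, 2048)    # interpolate mask back to full size

# Every pixel is treated equally, so structures narrower than ~4 px at
# full resolution can vanish in `small` and never reach `mask`.
```

This is exactly the uniform treatment criticized above: the resizer has no notion of which pixels carry fine structural detail, which motivates replacing it with a learnable downsampler-upsampler pair.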