High-Fidelity Visual Structural Inspections through
Transformers and Learnable Resizers
Kareem Eltouny1†[0000-0003-3918-2297], Seyedomid Sajedi1†[0000-0002-4552-4794], and Xiao
Liang1*[0000-0003-4788-8759]
1 University at Buffalo, the State University of New York, Buffalo NY 14260, USA
*liangx@buffalo.edu
† Equal contribution
Abstract. Visual inspection is the predominant technique for evaluating the
condition of civil infrastructure. Recent advances in unmanned aerial vehicles
(UAVs) and artificial intelligence have made visual inspections faster,
safer, and more reliable. Camera-equipped UAVs are becoming the new stand-
ard in the industry by collecting massive amounts of visual data for human in-
spectors. Meanwhile, there has been significant research on autonomous visual
inspections using deep learning algorithms, including semantic segmentation.
While UAVs can capture high-resolution images of buildings' façades, high-
resolution segmentation is extremely challenging due to the high computational
memory demands. Typically, images are uniformly downsized at the price of
losing fine local details. Conversely, breaking the images into multiple smaller
patches can cause a loss of global contextual information. We propose a hybrid
strategy that can adapt to different inspection tasks by managing the global and
local semantics trade-off. The framework comprises a compound, high-
resolution deep learning architecture equipped with an attention-based segmen-
tation model and learnable downsampler-upsampler modules designed for op-
timal efficiency and information retention. The framework also utilizes vision
transformers on a grid of image crops aiming for high precision learning with-
out downsizing. An augmented inference technique is used to boost the perfor-
mance and reduce the possible loss of context due to grid cropping. Comprehensive
experiments have been performed on synthetic environments generated with 3D
physics-based graphics models in the QuakeCity dataset. The proposed framework
is evaluated using several metrics on three segmentation tasks: component type,
component damage state, and global damage (crack, rebar, spalling).
Keywords: Semantic segmentation, crack detection, ResNeSt, Swin Trans-
formers
1 Introduction
Innovations in sensing technology and deep learning have made a significant impact
on autonomous structural visual inspections [1-3]. More recently, camera-equipped
unmanned aerial vehicles (UAVs) are proving to be an important asset in visual
inspections by rapidly capturing high-resolution visual data of structures while
increasing the safety of human workers. The massive amounts of collected footage are often
post-processed by an off-the-shelf deep learning model. However, challenges arise
when a trade-off between computational efficiency and high-fidelity precision needs
to be achieved. In many situations, computational resource constraints necessitate the
downsizing of high-resolution data for various tasks, including semantic segmenta-
tion, causing a loss in critical information.
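To make the information loss concrete, the following sketch (our illustration, not from the paper; the image size, downsampling factor, and threshold are arbitrary) shows how a one-pixel-wide crack disappears under uniform area downsampling:

```python
import numpy as np

# Illustration (ours, not from the paper): a hairline crack, one pixel
# wide, in a 2048 x 2048 binary mask.
H = W = 2048
mask = np.zeros((H, W), dtype=np.float32)
mask[:, 1024] = 1.0                      # 2048 crack pixels in total

# Uniform 4x downsizing by area averaging (2048 -> 512 per side).
s = 4
down = mask.reshape(H // s, s, W // s, s).mean(axis=(1, 3))

# The crack's intensity is diluted to 4/16 = 0.25 per block, so a hard
# 0.5 threshold erases it completely in the low-resolution mask.
survivors = (down > 0.5).sum()           # no crack pixels survive
```

The crack occupies four of the sixteen pixels in each averaged block, so its score never exceeds the threshold and the entire defect vanishes before segmentation even begins.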
Different tasks may need different levels of context in visual structural inspections.
In this study, we propose a twin-model framework to tackle the challenges arising
from the different natures of these tasks. Trainable Resizing for high-resolution
Segmentation Network (TRS-Net) is developed for component detection and damage
severity segmentation, tasks that rely heavily on global contextual information. DmgFormer
is dedicated to crack, rebar, and spalling segmentation, tasks that are highly sensitive
to resolution loss. The proposed twin models are customized based on the latest ad-
vances in computer vision research on super-resolution, image segmentation, and
transformer architectures.
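As a rough sketch of the grid-crop inference idea with shift-augmented averaging (our reconstruction under assumptions, not the paper's implementation; `segment` is a stand-in for a model such as DmgFormer, and the crop size and shift are illustrative):

```python
import numpy as np

def segment(crop):
    # Stand-in for a real per-crop segmentation model; any predictor
    # returning a score map of the same size as the crop works here.
    return (crop > 0.5).astype(np.float32)

def grid_inference(img, crop=256, offset=0):
    """Segment an image as a grid of crops, optionally with a shifted grid."""
    H, W = img.shape
    pad = np.pad(img, ((offset, crop), (offset, crop)))  # room for the shift
    out = np.zeros_like(pad, dtype=np.float32)
    for y in range(0, H + offset, crop):
        for x in range(0, W + offset, crop):
            out[y:y + crop, x:x + crop] = segment(pad[y:y + crop, x:x + crop])
    return out[offset:offset + H, offset:offset + W]    # undo the padding

img = np.random.rand(512, 512).astype(np.float32)
plain = grid_inference(img)                    # regular crop grid
shifted = grid_inference(img, offset=128)      # grid shifted by half a crop
fused = 0.5 * (plain + shifted)                # averaged inference output
```

With a real model, the shifted grid places crop boundaries at different positions, so averaging the two passes softens seam artifacts from grid cropping; the dummy pixelwise predictor above makes the two passes agree exactly.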
2 Autonomous visual inspections framework
2.1 Components and damage state segmentation
Fig. 1. TRS-Net architecture. (Conv: 2D convolution, BN: batch normalization, ReLU: rectified linear unit, E-i: encoder block, D-i,j: decoder block, n: classes).
It is a common practice to uniformly downsample high-resolution images before in-
serting them into deep learning segmentation models [4], and if needed, interpolation
is used for upsampling the predicted masks. By giving equal significance to all pixels
in the image, these operations may result in an unsatisfactory segmentation
performance. In contrast, feeding high-resolution images directly to state-of-the-art segmentation …
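The common downsample-segment-upsample pipeline described here might be sketched as follows (our minimal illustration with nearest-neighbor resampling and a dummy predictor; not the paper's method):

```python
import numpy as np

def resize_nearest(arr, out_h, out_w):
    # Nearest-neighbor resampling by integer index mapping.
    h, w = arr.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return arr[rows][:, cols]

image = np.random.rand(2048, 2048)

small = resize_nearest(image, 512, 512)    # 16x fewer pixels to segment
pred = (small > 0.5).astype(np.uint8)      # stand-in low-res prediction
mask = resize_nearest(pred, 2048, 2048)    # interpolate mask back to full size

# Every pixel is treated equally, so structures narrower than ~4 px at
# full resolution can vanish in `small` and never reach `mask`.
```

This is exactly the uniform treatment criticized above: the resizer has no notion of which pixels carry fine structural detail, which motivates replacing it with a learnable downsampler-upsampler pair.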