Multi-Scale Wavelet Transformer for Face Forgery Detection

Jie Liu*, Jingjing Wang*, Peng Zhang, Chunmao Wang, Di Xie, and Shiliang Pu (✉)

Hikvision Research Institute

{liujie54, wangjingjing9, zhangpeng45, wangchunmao, xiedi, pushiliang.hri}@hikvision.com

Abstract. Currently, many face forgery detection methods aggregate spatial and frequency features to enhance generalization and achieve promising performance in the cross-dataset scenario. However, these methods leverage only single-level frequency information, which limits their expressive ability. To overcome this limitation, we propose a multi-scale wavelet transformer framework for face forgery detection. Specifically, to take full advantage of the multi-scale and multi-frequency wavelet representation, we gradually aggregate the multi-scale wavelet representation at different stages of the backbone network. To better fuse the frequency features with the spatial features, frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces. Meanwhile, cross-modality attention is proposed to fuse the frequency features with the spatial features. These two attention modules are computed through a unified transformer block for efficiency. A wide variety of experiments demonstrate that the proposed method is efficient and effective in both within-dataset and cross-dataset settings.

1 Introduction

Owing to the variety of image-editing software and publicly available deep generative models, it is easy to manipulate existing faces and make forged faces realistic and indistinguishable from genuine ones. These photo-realistic fake faces may be abused for malicious purposes, raising severe security and privacy issues in our society. It is therefore necessary to develop effective methods for face forgery detection. To defend against the possible malicious usage of face forgery, various face forgery detection methods have been proposed. Previous researchers [1,2] mainly designed methods based on texture artifacts caused by face forgery techniques in the spatial domain. Owing to the fast evolution of face forgery techniques, these artifacts are gradually being concealed. Therefore, although these methods achieved high within-dataset detection accuracy, their performance dropped severely in the cross-dataset scenario, especially when confronted with new face forgery methods.

* Equal contribution.

arXiv:2210.03899v1 [cs.CV] 8 Oct 2022


Level    Sub-band  Deepfakes (DF)  Face2Face (F2F)  FaceSwap (FS)  NeuralTextures (NT)
-        Ori-Img   1.301           1.092            1.307          1.296
Level-1  LL        1.281           1.008            1.265          1.208
Level-1  LH        2.688           2.709            2.970          2.959
Level-1  HL        2.716           2.778            2.720          2.857
Level-1  HH        2.582           2.914            3.258          2.758
Level-2  LL        1.208           0.958            1.165          1.162
Level-2  LH        2.817           2.840            2.882          3.106
Level-2  HL        2.598           2.686            2.549          2.871
Level-2  HH        3.184           2.929            3.162          3.127
Level-3  LL        1.189           1.055            1.136          1.246
Level-3  LH        2.473           2.493            2.510          2.826
Level-3  HL        2.135           2.409            2.166          2.837
Level-3  HH        2.774           2.917            2.936          2.985

Table 1: EMD of multi-level frequency components. We crop the face in the first frame of every video in the FF++ dataset and then calculate the EMD between each fake image and its corresponding real image, both for the original images and for each sub-band of frequency features. The sub-bands are obtained by a three-level discrete wavelet transform.
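The comparison described in the caption can be sketched in a few lines. This is only an illustration under stated assumptions: the paper uses the EMD of [6], which is defined for multidimensional distributions, whereas the hypothetical helper `emd_1d` below treats the values of a sub-band as a one-dimensional sample and computes the 1-Wasserstein distance between two samples of equal size. It is not claimed to reproduce the numbers in Table 1.

```python
import numpy as np


def emd_1d(a, b):
    """1-D Earth Mover's Distance between two equal-size samples.

    Assumption: every value carries the same weight, so the optimal
    transport plan simply matches the sorted values one to one.
    """
    a = np.sort(np.ravel(np.asarray(a, dtype=float)))
    b = np.sort(np.ravel(np.asarray(b, dtype=float)))
    if a.size != b.size:
        raise ValueError("both samples must contain the same number of values")
    return float(np.mean(np.abs(a - b)))


# Toy stand-ins for a real and a fake sub-band of the same face crop.
real_band = np.array([[0.0, 1.0], [2.0, 3.0]])
fake_band = real_band + 0.5  # every value shifted by 0.5
distance = emd_1d(fake_band, real_band)
```

With equal sample sizes, sorting both samples and averaging the absolute differences gives the exact one-dimensional EMD, so no optimization solver is needed for this simplified case.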

To make the algorithm generalize well to unseen forgery methods, many recent face forgery detection methods attempt to aggregate information from the frequency domain. Yu et al. [3] utilized channel difference images and the spectrum obtained by DCT to detect fake faces. Other researchers leveraged the Discrete Fourier Transform (DFT) [4], and the Discrete Cosine Transform (DCT) and block DCT [5], to extract frequency information. However, these methods utilized only single-level frequency information, whereas we found that multi-level frequency features carry more discriminative details between real and fake images. Using only one level of frequency may be less effective for extracting the abundant frequency information, which limits the expressive ability of the obtained features. Since the Discrete Wavelet Transform (DWT) is commonly used to obtain multi-level frequency information, we choose the Haar DWT to extract frequency features.

The filters $f_{LL}$, $f_{LH}$, $f_{HL}$, and $f_{HH}$ of the DWT are

$$
f_{LL} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\quad
f_{LH} = \frac{1}{2}\begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix},\quad
f_{HL} = \frac{1}{2}\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix},\quad
f_{HH} = \frac{1}{2}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},
$$

and they are used to calculate the frequency components (LL, LH, HL, HH) of an image $I$, which are defined as $LL = f_{LL} * I$, $LH = f_{LH} * I$, $HL = f_{HL} * I$, and $HH = f_{HH} * I$. DWT divides an image into four frequency components, each with half the resolution of the original image: a low-frequency component (LL) and three high-frequency components (LH, HL, HH). The LL component can be further decomposed into four frequency components recursively. In this way, we obtain multi-level wavelet representations. Earth Mover's Distance (EMD) [6], whose formula is defined in [6], is used to measure the dissimilarity between two multidimensional distributions. The total EMD on the FF++ dataset is calculated between the real and fake data for the three levels of frequency components, and the results are shown in Table 1. We observe that, at each level, the distance between real and fake facial images is larger for the high-frequency components than for the low-frequency one. This demonstrates that the high frequencies at every level are useful, so fusing multi-level high frequencies can make the representations more expressive for face forgery detection.
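The recursive decomposition described above can be sketched with NumPy alone. The filter values are taken from the text; everything else is an assumption made for illustration: the function names `haar_dwt2` and `multilevel_haar` are hypothetical, the filters are applied to non-overlapping 2x2 blocks (stride 2) without flipping the kernel, and odd trailing rows or columns are dropped.

```python
import numpy as np

# Haar filters as given in the text (2x2, scaled by 1/2).
HAAR_FILTERS = {
    "LL": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),
    "LH": 0.5 * np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "HL": 0.5 * np.array([[1.0, -1.0], [1.0, -1.0]]),
    "HH": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),
}


def haar_dwt2(image):
    """One DWT level: four sub-bands at half the input resolution."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    h, w = h - h % 2, w - w % 2  # drop an odd trailing row/column
    # blocks[i, j] is the 2x2 patch covering rows 2i..2i+1, cols 2j..2j+1.
    blocks = image[:h, :w].reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return {
        name: np.einsum("ijkl,kl->ij", blocks, f)
        for name, f in HAAR_FILTERS.items()
    }


def multilevel_haar(image, levels=3):
    """Recursively decompose the LL band, returning one dict per level."""
    pyramid = []
    ll = np.asarray(image, dtype=float)
    for _ in range(levels):
        bands = haar_dwt2(ll)
        pyramid.append(bands)
        ll = bands["LL"]  # only LL is decomposed further
    return pyramid


# A flat 8x8 image has no high-frequency content at any level.
flat_pyramid = multilevel_haar(np.ones((8, 8)), levels=3)
# Alternating rows of ones and zeros respond only in the LH band.
stripes = np.tile(np.array([[1.0], [0.0]]), (4, 8))
stripe_bands = haar_dwt2(stripes)
```

For an 8x8 input, the three levels have spatial sizes 4x4, 2x2, and 1x1, which mirrors the statement that each level halves the resolution of the band it decomposes.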

[Figure 1 panels. Rows: fake, real, fake, real. Columns: Gray, LH, HL, HH, Gray, LH, HL, HH.]

Fig. 1: High-frequency sub-bands obtained by DWT. The images in the 1st and 3rd rows are fake images, and the others are the corresponding real images. The 1st and 5th columns are the gray images, and columns 2 to 4 and 6 to 8 are the high-frequency sub-bands corresponding to the cropped red box. The forged pixels have fewer high-frequency details (LH, HL, HH) than the real ones.

We also visualize examples of the real and fake high-frequency sub-bands obtained by DWT in Figures 1 and 2. In Figure 1, we enlarge a local region of the first-level DWT, which shows that the low-level high-frequency sub-bands contain more details. Figure 2 shows the whole high-frequency sub-bands of the three-level DWT, where the high-level sub-bands contain more global semantic information. Thus both the low-level and high-level high-frequency features are important for facial forgery detection.

[Figure 2 panels. Rows: fake, real. Columns: Gray, LH L1, HL L1, HH L1, LH L2, HL L2, HH L2, LH L3, HL L3, HH L3.]

Fig. 2: The images in the 1st and 2nd rows are the fake and the real facial images, respectively. Columns 2 to 10 are the level-1, level-2, and level-3 high-frequency (LH, HL, and HH) sub-bands obtained by DWT. There are more details at lower levels and more global semantic structure at higher levels.

Based on the above considerations, we take the multi-scale analysis of wavelet decomposition into account and propose a multi-scale wavelet transformer framework for face forgery detection, named MSWT. Specifically, we gradually aggregate the multi-scale wavelet features at different stages of the backbone network to take full advantage of the multi-level high-frequency representation. To better fuse the frequency features with the spatial features, frequency-based spatial attention is designed to guide the spatial feature extractor to concentrate more on forgery traces. Meanwhile, cross-modality attention is proposed to fuse the RGB spatial features and the frequency features. These two attention modules are computed through a unified transformer block for efficiency, named the frequency and spatial feature fusion (FSF) module. The main contributions are summarized as follows:

– To make full use of frequency features, we are the first to utilize the multi-scale properties of wavelet decomposition to improve the feature fusion of the spatial and frequency domains, and we propose a multi-scale wavelet transformer framework for face forgery detection.

– To better capture manipulation traces, frequency-based spatial attention is designed to guide the spatial feature extractor to focus on forgery regions.

– To better fuse the frequency features with the RGB spatial features, cross-modality attention is introduced.

– Experiments demonstrate that the proposed method works well in both within-dataset and cross-dataset testing compared with other approaches.
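This part of the paper does not specify the internals of the FSF module, so the following is not the authors' design. It is a generic scaled dot-product cross-attention sketch, given only to illustrate what fusing two modalities through attention can look like. All names (`cross_attention`, `w_q`, `w_k`, `w_v`), the token shapes, and the choice of taking queries from spatial tokens and keys and values from frequency tokens are assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(spatial_tokens, freq_tokens, w_q, w_k, w_v):
    """Generic cross-attention (assumed layout, not the paper's FSF module).

    spatial_tokens: (n_s, d) array, source of the queries.
    freq_tokens:    (n_f, d) array, source of the keys and values.
    """
    q = spatial_tokens @ w_q
    k = freq_tokens @ w_k
    v = freq_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (n_s, n_f) similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
spatial = rng.normal(size=(16, d))   # e.g. 16 spatial tokens
frequency = rng.normal(size=(4, d))  # e.g. 4 frequency tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
fused, attn = cross_attention(spatial, frequency, w_q, w_k, w_v)
```

Each output row is a weighted mixture of frequency-token values, so every spatial position receives information from the frequency branch in proportion to its attention weights.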

2 Related Work

2.1 Forgery Detection

Forgery Detection based on Spatial Feature. To counter manipulated faces and protect media security, many forgery detection algorithms have been proposed. Because deep learning can learn good feature representations, some methods extract RGB spatial features with deep networks. These approaches mainly include consistency-based [7], attention-based [8], and domain generalization methods [9]. Zhao et al. [8] proposed a method based on multi-attention and textural feature enhancement to enlarge artifacts in shallow features and capture discriminative details for face forgery detection, fusing the low-level and high-level features by attention maps. Zhao et al. [7] proposed patch-wise consistency learning between patches from the feature maps, which utilizes a consistency loss to learn and optimize the consistency of patches from real or fake regions. Wodajo et al. [10] proposed a convolutional vision transformer for deepfake video detection, and the network consists
