UIA-ViT: Unsupervised Inconsistency-Aware
Method based on Vision Transformer
for Face Forgery Detection
Wanyi Zhuang, Qi Chu∗, Zhentao Tan, Qiankun Liu, Haojie Yuan,
Changtao Miao, Zixiang Luo, and Nenghai Yu
CAS Key Laboratory of Electromagnetic Space Information,
University of Science and Technology of China
wy970824@mail.ustc.edu.cn, qchu@ustc.edu.cn, {tzt,liuqk3,doubihj,
miaoct,zxluo}@mail.ustc.edu.cn, ynh@ustc.edu.cn.
Abstract. Intra-frame inconsistency has been proved to be effective for
the generalization of face forgery detection. However, learning to focus on
this inconsistency requires extra pixel-level forged location annotations,
and acquiring such annotations is non-trivial. Some existing methods
generate large-scale synthesized data with location annotations, but the
data is composed only of real images and cannot capture the properties
of forgery regions. Others generate forgery location labels by subtracting
paired real and fake images, yet such paired data is difficult to collect and
the generated labels are usually discontinuous. To overcome these
limitations, we propose a novel Unsupervised Inconsistency-Aware method
based on the Vision Transformer, called UIA-ViT, which makes use of
video-level labels only and can learn inconsistency-aware features without
pixel-level annotations. Thanks to the self-attention mechanism, the
attention map among patch embeddings naturally represents the consistency
relation, making the Vision Transformer well suited for consistency
representation learning. Based on the Vision Transformer, we propose two
key components: Unsupervised Patch Consistency Learning (UPCL) and
Progressive Consistency Weighted Assemble (PCWA). UPCL learns
consistency-related representations with progressively optimized pseudo
annotations. PCWA enhances the final classification embedding with the
earlier patch embeddings optimized by UPCL to further improve detection
performance. Extensive experiments demonstrate the effectiveness of the
proposed method.
1 Introduction
Face forgery technologies [6,4,29] have advanced rapidly with the development
of image generation and manipulation. Forged facial images can even deceive
human beings and may be abused for malicious purposes such as fake news and
fabricated evidence, leading to serious security and privacy concerns. It is
therefore of great significance to develop powerful techniques for detecting
fake faces.
∗ Corresponding author.
arXiv:2210.12752v1 [cs.CV] 23 Oct 2022
Early face forgery detection methods [2,20,31] regard this task as a binary
classification problem and, with the help of deep neural networks, achieve
admirable performance in intra-dataset detection. However, they fail easily
when generalizing to unseen forgery datasets in which the identities,
manipulation types, compression rates, etc. are quite different. To improve
the generalization of detection, recent methods explore common forged
artifacts or inconsistencies produced by face manipulation techniques, such
as eye-blinking frequency [13], affine warping [14], image blending [12],
temporal inconsistency [35,27], and intra-frame inconsistency [34,1]. Among
them, intra-frame inconsistency has been proved to effectively improve the
generalization performance of detection, since the common face forgery
pipeline (manipulation followed by blending) causes inconsistency between
the forged region and the original background. However, learning to focus on
this inconsistency requires extra pixel-level forged location annotations,
and acquiring such annotations is non-trivial.
Generating large-scale synthesized data (e.g., simulated stitched images [34])
with pixel-level forged location annotations seems an intuitive solution.
Although this produces accurate pixel-level location annotations, models
trained this way cannot capture the properties of forgery regions, since the
generated data is composed only of real images. Other works [1,27] attempt to
generate forged location labels by subtracting a forged image from its
corresponding real image. However, such paired images are usually unavailable,
especially in real-world scenarios. Even when paired data can be collected,
the derived forgery region annotations tend to be discontinuous and
inaccurate, which is sub-optimal for intra-frame consistency supervision.
Therefore, we propose an unsupervised inconsistency-aware method that
extracts intra-frame inconsistency cues without pixel-level forged location
annotations.
The key to unsupervised inconsistency-aware learning is forgery location
estimation. In this paper, we apply the widely used multivariate Gaussian
estimation (MVG) [23,9] to represent the real/fake features and generate
pseudo annotations from it. Based on this idea, we can force the model to
focus on intra-frame inconsistency using the pseudo annotations. In addition,
unlike previous works [27,1] that specially design a module to obtain a
consistency-related representation, we find that the Vision Transformer [5]
naturally provides a consistency representation through the attention map
among patch embeddings, thanks to its self-attention mechanism. Therefore,
we build the detection network on a Vision Transformer and propose two key
components: UPCL (Unsupervised Patch Consistency Learning) and PCWA
(Progressive Consistency Weighted Assemble).
UPCL is a training strategy for learning consistency-related representations
through an unsupervised forgery location method. We approximately estimate
forgery location maps by comparing the Mahalanobis distances between the
patch embeddings from a middle layer of the Vision Transformer (ViT) [5] and
the MVGs of real/fake features. During training, the forgery location maps
are progressively optimized. To model the consistency constraint, we use the
multi-head
attention maps that ViT itself provides as the representation and constrain
them across multiple layers for better learning, as sketched below.
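To make this concrete, here is a minimal sketch of the UPCL idea, not the
authors' released code: multivariate Gaussians are fitted to accumulated
real/fake patch embeddings, a pseudo location map is derived from Mahalanobis
distances, and the patch-to-patch attention map is pulled toward a pairwise
consistency target. The function names, the binary distance-comparison
thresholding, and the MSE form of the loss are our assumptions; the paper's
exact formulation may differ.

```python
# A hedged sketch of UPCL's unsupervised location + consistency constraint.
import torch

def fit_mvg(feats):
    """feats: (M, C) accumulated patch embeddings -> mean (C,), inverse covariance (C, C)."""
    mu = feats.mean(0)
    centered = feats - mu
    cov = centered.T @ centered / (feats.shape[0] - 1)
    cov += 1e-5 * torch.eye(cov.shape[0])           # regularize for invertibility
    return mu, torch.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    """x: (B, P, C) patch embeddings -> (B, P) distances to the Gaussian."""
    d = x - mu
    return torch.sqrt(torch.einsum('bpc,cd,bpd->bp', d, cov_inv, d).clamp(min=0))

def pseudo_location_map(F_P, real_mvg, fake_mvg):
    """Mark a patch as 'forged' where it is closer to the fake MVG than to the real one."""
    d_real = mahalanobis(F_P, *real_mvg)
    d_fake = mahalanobis(F_P, *fake_mvg)
    return (d_fake < d_real).float()                # (B, P) binary pseudo annotation

def consistency_loss(upsilon_P, loc_map):
    """Pairs of patches sharing a pseudo label should attend to each other consistently."""
    # Target c_ij = 1 if patches i and j share a label, else 0 (an assumed target form).
    m = loc_map.unsqueeze(2)                        # (B, P, 1)
    target = m @ m.transpose(1, 2) + (1 - m) @ (1 - m).transpose(1, 2)
    return ((upsilon_P - target) ** 2).mean()       # assumed MSE constraint on attention
```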
PCWA is a feature enhancement module that takes full advantage of the
consistency representation learned through the proposed UPCL. In detail, we
use the attention map between the classification embedding and the patch
embeddings to progressively compute a weighted average of the final-layer
patch embeddings, and we concatenate this average with the classification
embedding before feeding the result into the final MLP for forgery
classification. The layers providing these attention maps are optimized by
UPCL for further improvement.
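The sketch below illustrates one way this assembly could look. The progressive
weighting schedule applied during training is not detailed in this excerpt,
so a plain attention-weighted average stands in for it, and PCWAHead is a
hypothetical name, not the authors' module.

```python
# A rough, assumption-laden sketch of the PCWA assembly described above.
import torch
import torch.nn as nn

class PCWAHead(nn.Module):
    def __init__(self, dim=768, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_classes))

    def forward(self, cls_emb, patch_emb, upsilon_C):
        """cls_emb: (B, C) classification embedding; patch_emb: (B, P, C)
        final-layer patch embeddings; upsilon_C: (B, P) CLS-to-patch
        attention used as the consistency weights."""
        w = upsilon_C / upsilon_C.sum(dim=1, keepdim=True).clamp(min=1e-8)
        pooled = (w.unsqueeze(-1) * patch_emb).sum(dim=1)  # weighted average, (B, C)
        feature = torch.cat([cls_emb, pooled], dim=-1)     # consistency-aware feature F
        return self.mlp(feature)
```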
The main contributions of this work are summarized as follows:
– We propose an unsupervised patch consistency learning strategy based on the
Vision Transformer that enables face forgery detection to focus on intra-frame
inconsistency without pixel-level annotations. It greatly improves the
generalization of detection without additional overhead.
– We take full advantage of the feature representations learned under the
proposed strategy by progressively combining global classification features
and local patch features, weighted-averaging the latter with the attention
map between the classification embedding and the patch embeddings.
– Extensive experiments demonstrate the superior generalization ability of
the proposed method and the effectiveness of the unsupervised learning
strategy.
2 Related Work
2.1 Face Forgery Detection
Early face manipulation methods usually produce obvious artifacts or
inconsistencies in the generated face images, and such flaws are important
cues for early face forgery detection works. For example, Li et al. [13]
observed that the eye-blinking frequency in forged videos is lower than
normal. Later methods extended this idea by checking the inconsistency of 3D
head poses to help detect forged videos [32]. Similarly, Matern et al. [17]
used hand-crafted visual features of the eyes, nose, and teeth to distinguish
fake faces.
Besides seeking visual artifacts, frequency clues have also been introduced
into forgery detection to improve detection accuracy, as in Two-branch [16],
F3-Net [22], and FDFL [11]. Meanwhile, attention mechanisms have proved
effective in recent studies such as Xception+Reg [3] and Multi-attention
[33,19]. Although these methods achieve excellent performance in
intra-dataset detection, they suffer a large performance drop when
generalizing to unseen forgery datasets.
To overcome the difficulty of generalizing to unseen forgeries, efforts have
been made to discover universal properties shared by different forgery
methods. Some works focus on inevitable procedures in forgery, such as affine
warping (DSP-FWA [14]) and blending (Face X-ray [12]), while others observe
that certain types of inconsistency exist across different kinds of forged
videos, such as temporal inconsistency (LipForensics [7], FTCN+TT [35]) and
intra-frame inconsistency (Local-related [1], PCL+I2G [34], DCL [27]).
However, in order to learn the
corresponding artifacts or inconsistency cues, Face X-ray [12] and PCL+I2G
[34] generate large-scale datasets with annotated forged locations for
pixel-level supervised learning. This generation process is time-consuming,
and the resulting data cannot capture the properties of forgery regions.
Local-related [1] and DCL [27] instead generate annotated forged location
labels by subtracting a forged image from its corresponding real image.
However, such paired images are usually unavailable, especially in real-world
scenarios, and even when they can be collected, the forgery region
annotations tend to be discontinuous and inaccurate, which is sub-optimal for
intra-frame consistency supervision. To tackle these issues, we introduce an
unsupervised inconsistency-aware method to learn inconsistency cues for
general face forgery detection.
2.2 Transformer
Transformers [30] were proposed for machine translation and have become the
state-of-the-art method in NLP tasks owing to their strong ability to model
long-range context. The Vision Transformer (ViT) [5] adapted Transformers to
computer vision by modeling an image as a sequence of image patches. Several
works have leveraged Transformers to boost face forgery detection: Miao et
al. [18] extend the Transformer with bag-of-features to learn local forgery
features; Khan et al. [8] propose a video Transformer that extracts spatial
features along with temporal information for detecting forgeries; Zheng et
al. [35] design a light-weight temporal Transformer on top of their fully
temporal convolution network to explore temporal coherence for general
manipulated video detection. In this paper, we also extend the Transformer,
to explore the relationships among different regions and capture more local
consistency information for general forged image detection.
3 Method
In this section, we introduce the details of the proposed Vision
Transformer-based unsupervised inconsistency-aware face forgery detection.
As shown in Fig. 1, given an image I, our network splits I into fixed-size
patches, linearly embeds each of them, adds position embeddings, and feeds
the resulting sequence into the Transformer encoder. The patch embeddings
F_P from layer K are accumulated for unsupervised approximate forgery
location, and the estimated location maps are used for modeling the
consistency constraint. The attention maps Υ_P and Υ_C are averaged from
layer N−n to layer N: Υ_P is used for patch consistency learning, and Υ_C is
used as the consistency weighting matrix in PCWA. In the end, PCWA outputs
the consistency-aware feature F = UIA(I), and an MLP head makes the final
prediction.
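As a concrete reading of this pipeline, the following is a minimal PyTorch
sketch of how the intermediate quantities in Fig. 1 (F_P, Υ_P, Υ_C) can be
exposed from a ViT-style encoder. ViTBlock, encode, and the default values of
K, N, and n are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of extracting F_P, Upsilon_P, Upsilon_C.
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Standard pre-norm Transformer block that also returns its attention map."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        # need_weights=True yields the head-averaged attention map of shape (B, L, L)
        attn_out, attn_map = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_map

def encode(blocks, tokens, K=6, N=12, n=4):
    """Run tokens = [CLS; patch embeddings + positions] through the blocks.
    Returns F_P (patch embeddings after layer K) and the attention maps
    averaged over layers N-n..N, split into the patch-to-patch part
    (Upsilon_P) and the CLS-to-patch part (Upsilon_C)."""
    F_P, maps = None, []
    for i, blk in enumerate(blocks, start=1):
        tokens, attn = blk(tokens)
        if i == K:
            F_P = tokens[:, 1:]            # drop the CLS token
        if N - n <= i <= N:
            maps.append(attn)
    A = torch.stack(maps).mean(0)          # (B, L, L), averaged over layers
    upsilon_P = A[:, 1:, 1:]               # consistency among patch embeddings
    upsilon_C = A[:, 0, 1:]                # CLS-to-patch weights used by PCWA
    return tokens, F_P, upsilon_P, upsilon_C
```

For instance, `blocks = nn.ModuleList(ViTBlock() for _ in range(12))` with
K = 6 would take F_P from the middle of a 12-layer encoder, matching the
"middle layer" phrasing in Sec. 3.1.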
3.1 Unsupervised Patch Consistency Learning
Unsupervised Approximate Forgery Location. We apply the widely used
multivariate Gaussian estimation (MVG) to represent the real/fake image
patch embeddings.
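Since the layer-K patch embeddings are accumulated over training to estimate
the real/fake MVGs, one standard way to do this without storing every feature
is a running sum of outer products. The sketch below is our assumption about
the bookkeeping, not the paper's procedure; MVGAccumulator is a hypothetical
name.

```python
# A hedged sketch of online accumulation of patch embeddings for MVG fitting.
import torch

class MVGAccumulator:
    def __init__(self, dim):
        self.n = 0
        self.sum = torch.zeros(dim)
        self.outer = torch.zeros(dim, dim)

    @torch.no_grad()
    def update(self, feats):               # feats: (M, C) patch embeddings
        self.n += feats.shape[0]
        self.sum += feats.sum(0)
        self.outer += feats.T @ feats

    def mvg(self):
        """Return the current mean and covariance estimate."""
        mu = self.sum / self.n
        cov = self.outer / self.n - torch.outer(mu, mu)
        return mu, cov
```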