UIA-ViT: Unsupervised Inconsistency-Aware
Method based on Vision Transformer
for Face Forgery Detection
Wanyi Zhuang, Qi Chu∗, Zhentao Tan, Qiankun Liu, Haojie Yuan,
Changtao Miao, Zixiang Luo, and Nenghai Yu
CAS Key Laboratory of Electromagnetic Space Information,
University of Science and Technology of China
wy970824@mail.ustc.edu.cn, qchu@ustc.edu.cn, {tzt,liuqk3,doubihj,
miaoct,zxluo}@mail.ustc.edu.cn, ynh@ustc.edu.cn.
Abstract. Intra-frame inconsistency has been proved to be effective for
the generalization of face forgery detection. However, learning to focus on
this inconsistency requires extra pixel-level forged location annotations,
and acquiring such annotations is non-trivial. Some existing methods
generate large-scale synthesized data with location annotations, but the
data is composed only of real images and cannot capture the properties
of forgery regions. Others generate forgery location labels by subtracting
paired real and fake images, yet such paired data is difficult to collect and
the generated labels are usually discontinuous. To overcome these
limitations, we propose a novel Unsupervised Inconsistency-Aware method
based on the Vision Transformer, called UIA-ViT, which makes use of
video-level labels only and can learn inconsistency-aware features without
pixel-level annotations. Thanks to the self-attention mechanism, the
attention map among patch embeddings naturally represents the consistency
relation, making the Vision Transformer well suited for consistency
representation learning. Based on the Vision Transformer, we propose two
key components: Unsupervised Patch Consistency Learning (UPCL) and
Progressive Consistency Weighted Assemble (PCWA). UPCL learns
consistency-related representations with progressively optimized pseudo
annotations. PCWA enhances the final classification embedding with the
earlier patch embeddings optimized by UPCL to further improve detection
performance. Extensive experiments demonstrate the effectiveness of the
proposed method.
1 Introduction
Face forgery technologies [6,4,29] have advanced rapidly with the development
of image generation and manipulation. Forged facial images can even deceive
human beings and may be abused for malicious purposes such as fake news and
fabricated evidence, leading to serious security and privacy concerns. It is
therefore of great significance to develop powerful techniques for detecting
fake faces.
∗ Corresponding author.
arXiv:2210.12752v1 [cs.CV] 23 Oct 2022
Early face forgery detection methods [2,20,31] regard this task as a binary
classification problem and, with the help of deep neural networks, achieve
admirable performance in intra-dataset detection. However, they fail easily
when generalizing to unseen forgery datasets in which the identities,
manipulation types, compression rates, etc. are quite different. To improve
the generalization of detection, recent methods explore common forged
artifacts or inconsistencies produced by face manipulation techniques, such
as eye-blinking frequency [13], affine warping [14], image blending [12],
temporal inconsistency [35,27], and intra-frame inconsistency [34,1]. Among
them, intra-frame inconsistency has been proved to effectively improve the
generalization performance of detection, since the common face forgery
pipeline (manipulation followed by blending) causes inconsistency between
the forged region and the original background. However, learning to focus on
this inconsistency requires extra pixel-level forged location annotations,
and acquiring such annotations is non-trivial.
Generating large-scale synthesized data (e.g., simulated stitched images [34])
with pixel-level forged location annotations seems an intuitive solution.
Although this produces accurate pixel-level location annotations, models
trained this way cannot capture the properties of forgery regions, since the
generated data is composed only of real images. Other works [1,27] attempt to
generate forged location labels by subtracting a forged image from its
corresponding real image. However, such paired images are usually unavailable,
especially in real-world scenarios. Even when paired data can be collected,
the derived forgery region annotations tend to be discontinuous and
inaccurate, which is sub-optimal for intra-frame consistency supervision.
Therefore, we propose an unsupervised inconsistency-aware method that
extracts intra-frame inconsistency cues without pixel-level forged location
annotations.
The key to unsupervised inconsistency-aware learning is forgery location
estimation. In this paper, we apply the widely used multivariate Gaussian
estimation (MVG) [23,9] to represent the real/fake features and generate
pseudo annotations from it. Based on this idea, we can force the model to
focus on intra-frame inconsistency using the pseudo annotations. In addition,
unlike previous works [27,1] that specially design a module to obtain a
consistency-related representation, we find that the Vision Transformer [5]
naturally provides a consistency representation through the attention map
among patch embeddings, thanks to its self-attention mechanism. Therefore,
we build the detection network on a Vision Transformer and propose two key
components: UPCL (Unsupervised Patch Consistency Learning) and PCWA
(Progressive Consistency Weighted Assemble).
UPCL is a training strategy for learning consistency-related representations
through an unsupervised forgery location method. We approximately estimate
forgery location maps by comparing the Mahalanobis distances between the
patch embeddings from a middle layer of the Vision Transformer (ViT) [5] and
the MVGs of real/fake features. During training, the forgery location maps
are progressively optimized. To model the consistency constraint, we use the
multi-head
attention maps that ViT itself provides as the representation and constrain
them across multiple layers for better learning, as sketched below.
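To make this concrete, here is a minimal sketch of the UPCL idea, not the
authors' released code: multivariate Gaussians are fitted to accumulated
real/fake patch embeddings, a pseudo location map is derived from Mahalanobis
distances, and the patch-to-patch attention map is pulled toward a pairwise
consistency target. The function names, the binary distance-comparison
thresholding, and the MSE form of the loss are our assumptions; the paper's
exact formulation may differ.

```python
# A hedged sketch of UPCL's unsupervised location + consistency constraint.
import torch

def fit_mvg(feats):
    """feats: (M, C) accumulated patch embeddings -> mean (C,), inverse covariance (C, C)."""
    mu = feats.mean(0)
    centered = feats - mu
    cov = centered.T @ centered / (feats.shape[0] - 1)
    cov += 1e-5 * torch.eye(cov.shape[0])           # regularize for invertibility
    return mu, torch.linalg.inv(cov)

def mahalanobis(x, mu, cov_inv):
    """x: (B, P, C) patch embeddings -> (B, P) distances to the Gaussian."""
    d = x - mu
    return torch.sqrt(torch.einsum('bpc,cd,bpd->bp', d, cov_inv, d).clamp(min=0))

def pseudo_location_map(F_P, real_mvg, fake_mvg):
    """Mark a patch as 'forged' where it is closer to the fake MVG than to the real one."""
    d_real = mahalanobis(F_P, *real_mvg)
    d_fake = mahalanobis(F_P, *fake_mvg)
    return (d_fake < d_real).float()                # (B, P) binary pseudo annotation

def consistency_loss(upsilon_P, loc_map):
    """Pairs of patches sharing a pseudo label should attend to each other consistently."""
    # Target c_ij = 1 if patches i and j share a label, else 0 (an assumed target form).
    m = loc_map.unsqueeze(2)                        # (B, P, 1)
    target = m @ m.transpose(1, 2) + (1 - m) @ (1 - m).transpose(1, 2)
    return ((upsilon_P - target) ** 2).mean()       # assumed MSE constraint on attention
```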
PCWA is a feature enhancement module that takes full advantage of the
consistency representation learned through the proposed UPCL. In detail, we
use the attention map between the classification embedding and the patch
embeddings to progressively compute a weighted average of the final-layer
patch embeddings, and we concatenate this average with the classification
embedding before feeding the result into the final MLP for forgery
classification. The layers providing these attention maps are optimized by
UPCL for further improvement.
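The sketch below illustrates one way this assembly could look. The progressive
weighting schedule applied during training is not detailed in this excerpt,
so a plain attention-weighted average stands in for it, and PCWAHead is a
hypothetical name, not the authors' module.

```python
# A rough, assumption-laden sketch of the PCWA assembly described above.
import torch
import torch.nn as nn

class PCWAHead(nn.Module):
    def __init__(self, dim=768, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, num_classes))

    def forward(self, cls_emb, patch_emb, upsilon_C):
        """cls_emb: (B, C) classification embedding; patch_emb: (B, P, C)
        final-layer patch embeddings; upsilon_C: (B, P) CLS-to-patch
        attention used as the consistency weights."""
        w = upsilon_C / upsilon_C.sum(dim=1, keepdim=True).clamp(min=1e-8)
        pooled = (w.unsqueeze(-1) * patch_emb).sum(dim=1)  # weighted average, (B, C)
        feature = torch.cat([cls_emb, pooled], dim=-1)     # consistency-aware feature F
        return self.mlp(feature)
```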
The main contributions of this work are summarized as follows:
– We propose an unsupervised patch consistency learning strategy based on the
Vision Transformer that enables face forgery detection to focus on intra-frame
inconsistency without pixel-level annotations. It greatly improves the
generalization of detection without additional overhead.
– We take full advantage of the feature representations learned under the
proposed strategy by progressively combining global classification features
and local patch features, weighted-averaging the latter with the attention
map between the classification embedding and the patch embeddings.
– Extensive experiments demonstrate the superior generalization ability of
the proposed method and the effectiveness of the unsupervised learning
strategy.
2 Related Work
2.1 Face Forgery Detection
Early face manipulation methods usually produce obvious artifacts or
inconsistencies in the generated face images, and such flaws are important
cues for early face forgery detection works. For example, Li et al. [13]
observed that the eye-blinking frequency in forged videos is lower than
normal. Later methods extended this idea by checking the inconsistency of 3D
head poses to help detect forged videos [32]. Similarly, Matern et al. [17]
used hand-crafted visual features of the eyes, nose, and teeth to distinguish
fake faces.
Besides seeking visual artifacts, frequency clues have also been introduced
into forgery detection to improve detection accuracy, as in Two-branch [16],
F3-Net [22], and FDFL [11]. Meanwhile, attention mechanisms have proved
effective in recent studies such as Xception+Reg [3] and Multi-attention
[33,19]. Although these methods achieve excellent performance in
intra-dataset detection, they suffer a large performance drop when
generalizing to unseen forgery datasets.
To overcome the difficulty of generalizing to unseen forgeries, efforts have
been made to discover universal properties shared by different forgery
methods. Some works focus on inevitable procedures in forgery, such as affine
warping (DSP-FWA [14]) and blending (Face X-ray [12]), while others observe
that certain types of inconsistency exist across different kinds of forged
videos, such as temporal inconsistency (LipForensics [7], FTCN+TT [35]) and
intra-frame inconsistency (Local-related [1], PCL+I2G [34], DCL [27]).
However, in order to learn the
corresponding artifacts or inconsistency cues, Face X-ray [12] and PCL+I2G
[34] generate large-scale datasets with annotated forged locations for
pixel-level supervised learning. This generation process is time-consuming,
and the resulting data cannot capture the properties of forgery regions.
Local-related [1] and DCL [27] instead generate annotated forged location
labels by subtracting a forged image from its corresponding real image.
However, such paired images are usually unavailable, especially in real-world
scenarios, and even when they can be collected, the forgery region
annotations tend to be discontinuous and inaccurate, which is sub-optimal for
intra-frame consistency supervision. To tackle these issues, we introduce an
unsupervised inconsistency-aware method to learn inconsistency cues for
general face forgery detection.
2.2 Transformer
Transformers [30] were proposed for machine translation and have become the
state-of-the-art method in NLP tasks owing to their strong ability to model
long-range context. The Vision Transformer (ViT) [5] adapted Transformers to
computer vision by modeling an image as a sequence of image patches. Several
works have leveraged Transformers to boost face forgery detection: Miao et
al. [18] extend the Transformer with bag-of-features to learn local forgery
features; Khan et al. [8] propose a video Transformer that extracts spatial
features along with temporal information for detecting forgeries; Zheng et
al. [35] design a light-weight temporal Transformer on top of their fully
temporal convolution network to explore temporal coherence for general
manipulated video detection. In this paper, we also extend the Transformer,
to explore the relationships among different regions and capture more local
consistency information for general forged image detection.
3 Method
In this section, we introduce the details of the proposed Vision
Transformer-based unsupervised inconsistency-aware face forgery detection.
As shown in Fig. 1, given an image I, our network splits I into fixed-size
patches, linearly embeds each of them, adds position embeddings, and feeds
the resulting sequence into the Transformer encoder. The patch embeddings
F_P from layer K are accumulated for unsupervised approximate forgery
location, and the estimated location maps are used for modeling the
consistency constraint. The attention maps Υ_P and Υ_C are averaged from
layer N−n to layer N: Υ_P is used for patch consistency learning, and Υ_C is
used as the consistency weighting matrix in PCWA. In the end, PCWA outputs
the consistency-aware feature F = UIA(I), and an MLP head makes the final
prediction.
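As a concrete reading of this pipeline, the following is a minimal PyTorch
sketch of how the intermediate quantities in Fig. 1 (F_P, Υ_P, Υ_C) can be
exposed from a ViT-style encoder. ViTBlock, encode, and the default values of
K, N, and n are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of extracting F_P, Upsilon_P, Upsilon_C.
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    """Standard pre-norm Transformer block that also returns its attention map."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        # need_weights=True yields the head-averaged attention map of shape (B, L, L)
        attn_out, attn_map = self.attn(h, h, h, need_weights=True)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x, attn_map

def encode(blocks, tokens, K=6, N=12, n=4):
    """Run tokens = [CLS; patch embeddings + positions] through the blocks.
    Returns F_P (patch embeddings after layer K) and the attention maps
    averaged over layers N-n..N, split into the patch-to-patch part
    (Upsilon_P) and the CLS-to-patch part (Upsilon_C)."""
    F_P, maps = None, []
    for i, blk in enumerate(blocks, start=1):
        tokens, attn = blk(tokens)
        if i == K:
            F_P = tokens[:, 1:]            # drop the CLS token
        if N - n <= i <= N:
            maps.append(attn)
    A = torch.stack(maps).mean(0)          # (B, L, L), averaged over layers
    upsilon_P = A[:, 1:, 1:]               # consistency among patch embeddings
    upsilon_C = A[:, 0, 1:]                # CLS-to-patch weights used by PCWA
    return tokens, F_P, upsilon_P, upsilon_C
```

For instance, `blocks = nn.ModuleList(ViTBlock() for _ in range(12))` with
K = 6 would take F_P from the middle of a 12-layer encoder, matching the
"middle layer" phrasing in Sec. 3.1.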
3.1 Unsupervised Patch Consistency Learning
Unsupervised Approximate Forgery Location. We apply the widely used
multivariate Gaussian estimation (MVG) to represent the real/fake image
patch embeddings.
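Since the layer-K patch embeddings are accumulated over training to estimate
the real/fake MVGs, one standard way to do this without storing every feature
is a running sum of outer products. The sketch below is our assumption about
the bookkeeping, not the paper's procedure; MVGAccumulator is a hypothetical
name.

```python
# A hedged sketch of online accumulation of patch embeddings for MVG fitting.
import torch

class MVGAccumulator:
    def __init__(self, dim):
        self.n = 0
        self.sum = torch.zeros(dim)
        self.outer = torch.zeros(dim, dim)

    @torch.no_grad()
    def update(self, feats):               # feats: (M, C) patch embeddings
        self.n += feats.shape[0]
        self.sum += feats.sum(0)
        self.outer += feats.T @ feats

    def mvg(self):
        """Return the current mean and covariance estimate."""
        mu = self.sum / self.n
        cov = self.outer / self.n - torch.outer(mu, mu)
        return mu, cov
```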