T2CI-GAN: Text to Compressed Image generation using Generative Adversarial Network

Bulla Rajesh1,2[0000-0002-5731-9755], Nandakishore Dusa1, Mohammed Javed1[0000-0002-3019-7401], Shiv Ram Dubey1[0000-0002-4532-8996], and P. Nagabhushan1,2

1 Department of IT, IIIT Allahabad, Prayagraj, U.P., 211015, India
2 Department of CSE, Vignan University, Guntur, A.P., 522213, India
{rsi2018007, iwm2016002, javed, srdubey, pnagabhushan}@iiita.ac.in
Abstract. The problem of generating textual descriptions for visual data has gained research attention in recent years. In contrast, the problem of generating visual data from textual descriptions is still very challenging, because it requires a combination of Natural Language Processing (NLP) and Computer Vision techniques. Existing methods utilize Generative Adversarial Networks (GANs) to generate uncompressed images from textual descriptions. However, in practice, most visual data are processed and transmitted in a compressed representation. Hence, the proposed work attempts to generate visual data directly in the compressed representation using Deep Convolutional GANs (DCGANs), in order to achieve storage and computational efficiency. We propose two GAN models for compressed image generation from text. The first model is trained directly with JPEG compressed DCT images (compressed domain) to generate compressed images from text descriptions. The second model is trained with RGB images (pixel domain) to generate JPEG compressed DCT representations from text descriptions. The proposed models are tested on the open-source benchmark dataset of Oxford-102 Flower images, using both the RGB and JPEG compressed versions, and achieve state-of-the-art performance in the JPEG compressed domain. The code will be publicly released on GitHub after acceptance of the paper.
Keywords: Compressed Domain · Deep Learning · DCT Coefficients · T2CI-GAN · JPEG Compression · Compressed Domain Pattern Recognition · Text to Compressed Image.
1 Introduction
Generating visually realistic images from natural text descriptions is an interesting research problem that warrants knowledge of both language processing and computer vision. Unlike the problem of image captioning, which generates text descriptions from an image, the challenge here is to generate semantically suitable images based on a proper understanding of the text descriptions. Many interesting techniques have been proposed in the literature to explore the problem of generating pixel images from given input texts [20], [27], [26], [16]. Moreover, a very recent attempt [11] aims to generate images directly in the compressed format; the whole idea is to avoid the synthesis of RGB images and the subsequent compression stage. In fact, in the current digital scenario, more and more images and image frames (videos) are being stored and transmitted in compressed representation, and compressed data now accounts for more than 90% of internet traffic [19]. On the other hand, different compressed domain technologies that can directly process and analyse compressed data without decompression and re-compression are being explored both by software giants, such as Uber [4] and Xerox [17], and by academia [9], [13], [23], [2]. Some of the prominent works on compressed document images are discussed in [7,8,10] and [19,18]. This gives us strong motivation for exploring the idea of generating compressed images directly from natural text descriptions, which is attempted in this research paper.

Fig. 1. JPEG compression and decompression architecture, and extraction of the JPEG compressed DCT image used in the proposed approach.
Recently, Generative Adversarial Network (GAN) models have been successfully used for generating realistic images from diverse inputs such as layouts [5], texts [25], and scenes [1]. However, early GAN models [20] generated images of low resolution from the input text. In [20], the GAN model was used to generate an image from a single sentence. This method was implemented in two stages. Initially, the text sentence was encoded into a feature matrix using deep CNNs and RNNs to extract the significant features. Then those features were utilized to generate a picture. In order to improve the quality, a stacked GAN was reported in [27]. It generated the output picture using two GANs. In the first stage, GAN-1 produced a low resolution image with basic shape and colors, along with the background, generated from a random noise vector. In the second stage, GAN-2 refined the produced image by adding details and making the required corrections. MirrorGAN was reported in [16] for text-to-image translation through re-description; this model reported improved semantic consistency between the text and the produced output image. In [26], the authors proposed a Semantics Disentangling Generative Adversarial Network (SD-GAN) which exploited the semantics of the text description. However, all the GAN based techniques discussed above were trained using RGB pixel images and are meant to generate RGB images. Hence, our work is focused on employing the significant features of GANs for generating compressed images directly from the given text descriptions.
In the recent literature, a GAN model was proposed for generating compressed images directly from a noise vector [11]. Since JPEG is the most widely used compression format, the authors attempted to generate JPEG compressed images directly, rather than generating RGB images and compressing them separately. Their GAN framework consists of Generator, Decoder and Discriminator sub-networks. The Generator consists of locally connected layers, quantization layers, and chroma subsampling layers. The locally connected layers perform block based operations, similar to the JPEG compression method, to generate JPEG compressed images. In between the Generator and the Discriminator, a Decoder is used to decompress the image to facilitate the comparison with the ground truth RGB image by the Discriminator network. Specifically, this decoder performs de-quantization and the Inverse Discrete Cosine Transformation (IDCT), followed by a YCbCr to RGB transformation, on the compressed images generated by the Generator. Unlike [11], which generates compressed images from noise, our model generates compressed images based on the given input text descriptions.
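To make this decoding step concrete, below is a minimal NumPy/SciPy sketch of such a decoder for a single 8×8 block. The quantization table and function names are illustrative assumptions for exposition, not the implementation of [11].

```python
import numpy as np
from scipy.fft import idctn

# Illustrative assumption: the standard JPEG luminance quantization table.
Q_LUMA = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]], dtype=np.float32)

def decode_block(quantized: np.ndarray, q_table: np.ndarray) -> np.ndarray:
    """De-quantize one 8x8 block of DCT coefficients and map it back to pixels."""
    dct_block = quantized * q_table              # de-quantization
    pixels = idctn(dct_block, norm="ortho")      # 2-D inverse DCT
    return np.clip(pixels + 128.0, 0, 255)       # undo the JPEG level shift

def ycbcr_to_rgb(y, cb, cr):
    """Standard YCbCr -> RGB transform (inverse of Eqs. (1)-(3));
    cb and cr are assumed already upsampled to the luma resolution."""
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)
```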
Overall, this research paper proposes two novel GAN models for generating compressed images from text descriptions. The first GAN model is trained directly with JPEG compressed DCT images to generate compressed images from text descriptions. The second GAN model is trained with RGB images to generate compressed images from text descriptions. The proposed models have been tested on the Oxford-102 Flower images benchmark dataset using both the RGB and JPEG compressed versions, reporting state-of-the-art performance in the compressed domain. The rest of the paper is organized as follows: Section 2 presents the preliminaries of the concepts used. Section 3 discusses the proposed methodology and GAN architectures. Section 4 reports the detailed experimental results and analysis. Finally, Section 5 concludes the paper with a summary.
2 Preliminaries
In this section, a brief description of JPEG compression, the GAN model, and the GloVe model is presented.
2.1 JPEG Compression
The JPEG compression algorithm achieves compression by discarding high frequency components. First, the RGB channels of the image are converted into the YCbCr format to separate the luminance (Y) and chrominance (Cb, Cr) channels as follows:
$Y = 0.299\,r + 0.587\,g + 0.114\,b$   (1)

$Cb = -0.1687\,r - 0.3313\,g + 0.5\,b + 128$   (2)

$Cr = 0.5\,r - 0.4187\,g - 0.0813\,b + 128$   (3)
Then each channel is divided into non-overlapping 8×8 pixel blocks. The forward Discrete Cosine Transform (DCT) is applied on each block of each channel to convert the 8×8 pixel block, say P(x, y), from the spatial domain to the frequency domain. Each DCT block, i.e., F(u, v), is quantized to keep only the low frequency coefficients. Then Differential Pulse Code Modulation (DPCM) is applied on the DC components and Run Length Encoding (RLE) on the AC components. Huffman coding is used to encode the DC and AC components in a smaller number of bits. In order to perform decompression, entropy decoding, de-quantization, and the inverse DCT (IDCT) are applied, in that order, on the compressed image to obtain the uncompressed image. The compression and decompression stages are illustrated in Fig. 1. In the proposed work, the JPEG compressed DCT images are directly extracted from the JPEG compressed stream and used for training the deep learning model. Decompression is performed only for the performance analysis; it is not required in practice.
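To illustrate the representation consumed by our models, the following is a minimal NumPy/SciPy sketch of the forward path up to the quantized DCT coefficients. Note that in this work the coefficients are extracted directly from the JPEG stream; this sketch recomputes them from pixels purely for exposition, the quantization table is supplied by the caller, and the entropy coding stages (DPCM, RLE, Huffman) are omitted.

```python
import numpy as np
from scipy.fft import dctn

def rgb_to_y(rgb: np.ndarray) -> np.ndarray:
    """Luminance channel per Eq. (1); rgb is an HxWx3 uint8 array."""
    r = rgb[..., 0].astype(np.float32)
    g = rgb[..., 1].astype(np.float32)
    b = rgb[..., 2].astype(np.float32)
    return 0.299 * r + 0.587 * g + 0.114 * b

def blockwise_quantized_dct(channel: np.ndarray, q_table: np.ndarray) -> np.ndarray:
    """Split a channel into 8x8 blocks, apply the 2-D DCT, then quantize.

    Returns the quantized DCT coefficients with the same shape as the
    input channel (height and width assumed to be multiples of 8).
    """
    h, w = channel.shape
    coeffs = np.empty((h, w), dtype=np.int32)
    shifted = channel - 128.0                    # level shift to [-128, 127]
    for i in range(0, h, 8):
        for j in range(0, w, 8):
            block = dctn(shifted[i:i+8, j:j+8], norm="ortho")
            coeffs[i:i+8, j:j+8] = np.round(block / q_table).astype(np.int32)
    return coeffs

# Example usage with a flat quantization table (illustrative only):
# y = rgb_to_y(image); dct_img = blockwise_quantized_dct(y, np.full((8, 8), 16.0))
```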
2.2 Generative Adversarial Network (GAN)
A Generative Adversarial Network (GAN) [3] is a deep learning model built with two sub-networks, a Generator and a Discriminator. The Generator (G) generates new images following the distribution of the training images, and the Discriminator (D) classifies actual and generated images into real and fake categories, respectively. These two sub-models are trained alternately, such that the Generator tries to fool the Discriminator by generating data similar to the real domain, whereas the Discriminator is optimized to distinguish the generated images from the real images. Overall, the Generator and the Discriminator play a two-player min-max game. The objective function of the GAN is given as follows:
$\min_G \max_D F(G, D) = \mathbb{E}_{y \sim k_d}[\log D(y)] + \mathbb{E}_{z \sim k_z}[\log(1 - D(G(z)))]$   (4)

where $y$ indicates a real image sampled from $k_d$ (the true data distribution) and $z$ indicates a noise vector sampled from $k_z$ (a uniform or Gaussian distribution).
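To make the min-max game of Eq. (4) concrete, here is a minimal PyTorch sketch of one alternating training step. The tiny fully connected networks, dimensions, and learning rates are illustrative assumptions, not the architecture used in this paper; the generator update uses the common non-saturating variant of the objective.

```python
import torch
import torch.nn as nn

# Illustrative assumption: tiny fully connected G and D for demonstration.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real: torch.Tensor):
    """One alternating update implementing the game of Eq. (4)."""
    n = real.size(0)
    z = torch.randn(n, 100)                      # z ~ k_z (Gaussian)

    # --- Discriminator step: maximize log D(y) + log(1 - D(G(z))) ---
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(n, 1)) + \
             bce(D(G(z).detach()), torch.zeros(n, 1))
    loss_d.backward()
    opt_d.step()

    # --- Generator step: fool D into predicting "real" (non-saturating) ---
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(n, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```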
The Conditional GAN model [12] makes use of some additional information along with the noise. Both the Generator (G) and the Discriminator (D) use this additional information, referred to as the conditioning variable 'c', which can be text or any other data. Thus, the Generator of the Conditional GAN generates images conditioned on the variable 'c', as depicted in Fig. 2.
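A common way to implement such conditioning, sketched below under the assumption that c is a fixed-size text embedding, is to concatenate c with the noise vector z at the Generator input; all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Generator conditioned on an embedding c (illustrative dimensions)."""

    def __init__(self, z_dim: int = 100, c_dim: int = 128, out_dim: int = 784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + c_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),
        )

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # Conditioning: concatenate noise and condition, then generate.
        return self.net(torch.cat([z, c], dim=1))

# Usage: g = ConditionalGenerator(); x = g(torch.randn(4, 100), torch.randn(4, 128))
```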