GREEN LEARNING: INTRODUCTION, EXAMPLES AND OUTLOOK
A PREPRINT
C.-C. Jay Kuo
University of Southern California
Los Angeles, California, USA
Azad M. Madni
University of Southern California
Los Angeles, California, USA
October 4, 2022
ABSTRACT
Rapid advances in artificial intelligence (AI) in the last decade have largely been built upon the
wide applications of deep learning (DL). However, the high carbon footprint yielded by larger and
larger DL networks becomes a concern for sustainability. Furthermore, the DL decision mechanism is somewhat obscure and can only be verified by test data. Green learning (GL) has been proposed
as an alternative paradigm to address these concerns. GL is characterized by low carbon footprints,
small model sizes, low computational complexity, and logical transparency. It offers energy-efficient solutions in cloud centers as well as on mobile/edge devices. GL also provides a clear and logical
decision-making process to gain people’s trust. Several statistical tools have been developed to
achieve this goal in recent years. They include unsupervised representation learning, supervised
feature learning, and supervised decision learning. We have seen a few successful GL examples with
performance comparable to that of state-of-the-art DL solutions. This paper offers an introduction to GL,
its demonstrated applications, and future outlook.
Keywords machine learning, green learning, trust learning, deep learning.
1 Introduction
There have been rapid advances in artificial intelligence (AI) and machine learning (ML) in the last decade. The
breakthrough has largely been built upon the construction of larger datasets and the design of complex neural networks.
Representative neural networks include the convolutional neural network (CNN) [Gu et al., 2018], the recurrent neural
network (RNN) [Pascanu et al., 2013, Salehinejad et al., 2017, Su et al., 2022], the long short-term memory network
(LSTM) [Greff et al., 2016, Hochreiter and Schmidhuber, 1997, Su and Kuo, 2019], etc. Deep neural networks (DNNs)
have attracted a lot of attention from academia and industry since their resurgence in 2012 [Krizhevsky et al., 2012].
As networks become larger and deeper, this discipline is named deep learning (DL) [LeCun et al., 2015]. DL has
made great impacts in various application domains, including computer vision, natural language processing, autonomous
driving, robotics navigation, etc.
DL is characterized by two design choices: the network architecture and the loss function. Once these two choices
are specified, model parameters can be automatically determined via an end-to-end optimization algorithm called
backpropagation. When the number of training samples is less than the number of model parameters, it is common to
adopt pre-trained networks to build larger networks for better performance; e.g., ResNet [He et al., 2015] or DenseNet
[Huang et al., 2017] pretrained on ImageNet [Deng et al., 2009]. Another emerging trend is the adoption of the
transformer architecture [Han et al., 2022, Jaderberg et al., 2015, Khan et al., 2021, Vaswani et al., 2017, Wolf et al.,
2020].
Despite its rapid advance, the DL paradigm faces several challenges. DL networks are mathematically intractable,
vulnerable to adversarial attacks [Akhtar and Mian, 2018], and demand heavy supervision. Efforts have been made
in explaining the behavior of DL networks with certain success, e.g., [Chan et al., 2022, Damian et al., 2022, Ma
et al., 2022, Oymak et al., 2021, Soltanolkotabi et al., 2018, Wright and Ma, 2022]. Adversarial training has been
developed to provide a tradeoff between robustness and accuracy [Zhang et al., 2019a]. Self-supervised [Misra and
Maaten, 2020] and semi-supervised learning [Sohn et al., 2020, Van Engelen and Hoos, 2020] have been explored to
reduce the supervision burden.
Two further concerns over DL technologies are less addressed. The first one is about its high carbon footprint
[Lannelongue et al., 2021, Schwartz et al., 2020, Wu et al., 2022, Xu et al., 2021]. The training of DL networks is
computationally intensive. The training of larger complex networks on huge datasets imposes a threat on sustainability
[Sanh et al., 2019, Sharir et al., 2020, Strubell et al., 2019]. The second one is related to its trustworthiness. The
application of blackbox DL models to high stakes decisions is questioned [Arrieta et al., 2020, Poursabzi-Sangdeh
et al., 2021, Rudin, 2019]. Conclusions drawn from a set of input-output relationships could be misleading and counter-intuitive. It is essential to justify an ML prediction procedure with logical reasoning to gain people's trust.
To tackle the first problem, one may optimize DL systems by taking performance and complexity into account jointly.
An alternative solution is to build a new learning paradigm of low carbon footprint from scratch. For the latter, since it
targets green ML systems by design, it is called green learning (GL). The early development of GL was initiated by
an effort to understand the operation of computational neurons of CNNs in [Kuo, 2017, 2016, Kuo and Chen, 2018,
Kuo et al., 2019]. Through a sequence of investigations, building blocks of GL have been gradually developed, and
more applications have been demonstrated in recent years. As to the second problem, a clear and logical description of
the decision-making process is emphasized in the development of GL. GL adopts a modularized design. Each module
is statistically rooted with local optimization. GL avoids end-to-end global optimization for logical transparency and
computational efficiency. On the other hand, GL exploits ensembles heavily in its learning system to boost the overall
decision performance. GL yields probabilistic ML models that allow trust and risk assessment with certain performance
guarantees.
GL attempts to address the following problems to make the learning process efficient and effective:
1. How to remove redundancy among source image pixels for concise representations?
2. How to generate more expressive representations?
3. How to select discriminant/relevant features based on labels?
4. How to achieve feature and decision combinations in the design of powerful classifiers/regressors?
5. How to design an architecture that enables rich ensembles for performance boosting?
New and powerful tools have been developed to address each of them in the last several years, e.g., the Saak [Kuo and
Chen, 2018] and Saab transforms [Kuo et al., 2019] for Problem 1, the PixelHop [Chen and Kuo, 2020], PixelHop++
[Chen et al., 2020a] and IPHop [Yang et al., 2022a] learning systems for Problem 2, the discriminant and relevant
feature tests [Yang et al., 2022b] for Problem 3, the subspace learning machine [Fu et al., 2022a] for Problem 4. The
original ideas are scattered across different papers; they will be systematically introduced here.
In this overview paper, we intend to elaborate on GL's development, building modules, and demonstrated applications.
We will also provide an outlook for future R&D opportunities. The rest of this paper is organized as follows. The
genesis of GL is reviewed in Sec. 2. A high-level sketch of GL is presented in Sec. 3. GL's methodology and its
building tools are detailed in Sec. 4. Illustrative application examples of GL are shown in Sec. 5. Future technological
outlook is discussed in Sec. 6. Finally, concluding remarks are given in Sec. 7.
2 Genesis of Green Learning
The proven success of DL in a wide range of applications gives a clear indication of its power, although its inner workings remain something of a mystery. Research on GL was initiated by providing a high-level understanding of the superior performance of DL [Kuo, 2016, 2017, Xu et al., 2017]. There was no attempt to give a rigorous treatment; rather, the aim was to obtain insights into a set of basic questions such as:
• What is the role of nonlinear activation [Kuo, 2016]?
• What are the individual roles played by the convolutional layers and the fully-connected (FC) layers [Kuo et al., 2019]?
• Is there a guideline for network architecture design [Kuo et al., 2019]?
• Is it possible to avoid the expensive backpropagation optimization process in filter weight determination [Kuo and Chen, 2018, Kuo et al., 2019, Lin et al., 2022]?
As this understanding grew, it became apparent that one can develop a new learning pipeline without nonlinear
activation and backpropagation.
Figure 1: Illustration of a computational neuron.
Figure 2: Illustration of the LeNet-5 architecture [LeCun et al., 1998], where the dimensions of intermediate layers are
shown under the network.
2.1 Anatomy of Computational Neurons
As shown in Fig. 1, a computational neuron consists of two stages in cascade: 1) an affine transformation that maps an n-dimensional input vector to a scalar, and 2) a nonlinear activation operator. The affine transformation is relatively easy to understand. Yet, nonlinear activation obscures the function of a computational neuron. If the function of nonlinear activation is well understood, one may remove it and compensate for it with another mechanism. As explained in Sec. 2.2, a neural network achieves two objectives simultaneously [Kuo et al., 2019]: 1) finding an expressive embedding of the input data, and 2) making decisions (i.e., classification or regression) based on the data embedding.
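As a toy illustration of this two-stage structure (our own sketch; the input, weight vector, and bias below are made-up values, not drawn from the cited works), a computational neuron can be written as an affine projection followed by a nonlinear activation such as the ReLU:

```python
import numpy as np

def neuron(x, w, b):
    """A single computational neuron: affine transformation followed by ReLU activation."""
    z = np.dot(w, x) + b   # stage 1: affine map (projection onto w, shifted by the bias b)
    return max(z, 0.0)     # stage 2: nonlinear activation (ReLU)

# Made-up 3-dimensional input, weight vector, and bias for illustration only.
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.8, -0.1])
b = 0.05
print(neuron(x, w, b))
```

Dropping the second stage leaves a purely linear projection plus a threshold, which is the part that GL retains and analyzes statistically.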
An affine transformation contains n weights and a bias b for an input vector of dimension n. According to the analysis in
[Fu et al., 2022a, Lin et al., 2022], weights and biases play different roles. To determine proper weights of an affine
transformation is equivalent to finding a discriminant 1D projection for partitioning. For example, one can use the linear
discriminant analysis (LDA) to find a hyper-plane that partitions the whole space into two halves. The weight vector is
the surface normal of the hyper-plane. In other words, the affine transformation defines a projection onto a line through
the inner product of an input vector and the weight vector. The bias term then defines the split point in the projected 1D
space (i.e., greater, equal and less than zero). The role of nonlinear activation was investigated for CNNs in [Kuo, 2016]
and multi-layer perceptrons (MLPs) in [Lin et al., 2022]. It is used to avoid the sign confusion problem caused by two
neurons in cascade.
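The sketch below (our own illustration using scikit-learn's LinearDiscriminantAnalysis on synthetic data; it is not code from [Fu et al., 2022a] or [Lin et al., 2022]) shows the weight vector acting as the surface normal of the LDA hyper-plane and the bias fixing the split point of the projected 1D values:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian classes in 2D (illustrative data only).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
x1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(200, 2))
X = np.vstack([x0, x1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
w = lda.coef_[0]          # weight vector = surface normal of the separating hyper-plane
b = lda.intercept_[0]     # bias = split point of the projected 1D values

# The affine transformation projects each sample onto a line; the sign of
# (w . x + b) tells on which side of the hyper-plane the sample falls.
proj = X @ w + b
pred = (proj > 0).astype(int)
print("training accuracy:", (pred == y).mean())
```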
2.2 Anatomy of Neural Networks
MLPs have been commonly used as classifiers. Convolutional neural networks (CNNs) can be viewed as the cascade of two sub-networks: the convolutional layers and the FC layers. One simple example called LeNet-5
[LeCun et al., 1998] is illustrated in Fig. 2. It is convenient to have a rough breakdown of their functions as done in
[Kuo et al., 2019]. That is, the first sub-network is used to find powerful embeddings while the second sub-network is
used for decision making. Admittedly, this anatomy is too simplistic for more complicated networks such as ResNet
[He et al., 2015], DenseNet [Huang et al., 2017] and Transformers [Vaswani et al., 2017]. Yet, this viewpoint is helpful
in deriving the GL framework. We will elaborate on the first and the second sub-networks in Secs. 2.2.1 and 2.2.2, respectively.
2.2.1 Feature Subnet
Modern DL networks have inputs, outputs, and intermediate embeddings. Their dimensions for image/video data are
generally large. Consider two examples below.
Example 1. An image in the MNIST dataset has gray-scale pixels of spatial resolution 28 × 28, zero-padded to 32 × 32 at the input of LeNet-5 [LeCun et al., 1998], which was designed for the MNIST dataset. Its input dimension is thus 32 × 32 = 1,024. LeNet-5 has two cascaded convolutional-pooling layers. At the output of the 2nd convolutional-pooling layer, one obtains an embedding space of dimension 5 × 5 × 16 = 400.
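The 400-dimensional embedding quoted above can be verified with a few lines of arithmetic. The sketch below (our own, using the standard output-size formula for valid convolution and pooling) traces the spatial dimension through LeNet-5's two convolutional-pooling stages:

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution or pooling operation."""
    return (size - kernel) // stride + 1

# LeNet-5 on a 32x32 input: two convolution(5x5) + pooling(2x2, stride 2) stages.
s = 32
s = conv_out(s, 5)            # 1st convolution: 32 -> 28
s = conv_out(s, 2, stride=2)  # 1st pooling:     28 -> 14
s = conv_out(s, 5)            # 2nd convolution: 14 -> 10
s = conv_out(s, 2, stride=2)  # 2nd pooling:     10 -> 5
print(s * s * 16)             # 16 channels -> embedding dimension 400
```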
Example 2. After image size normalization, an image in the ImageNet dataset has color pixels of spatial resolution 224 × 224. It has a raw dimension of 224 × 224 × 3 = 150,528. AlexNet [Krizhevsky et al., 2012] and VGG-16 [Simonyan and Zisserman, 2014] were proposed to solve the object classification problem on the ImageNet dataset. The last convolutional layers of AlexNet and VGG-16 have embedding spaces of 13 × 13 × 256 = 43,264 and 7 × 7 × 512 = 25,088 dimensions, respectively.
As the above two examples show, the embedding dimension of the last convolutional layer is significantly smaller than the input dimension. Dimension reduction is essential to the simplification of a decision pipeline. It is well known that
there are spatial correlations between neighboring pixels of images, which can be exploited for dimension reduction.
However, dimension reduction is only one of two main functions of the feature subnet. The feature subnet needs to find
discriminant dimensions at the same time.
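As a simple numerical illustration of this point (our own sketch on synthetic data, not part of any GL pipeline), applying PCA to small patches whose neighboring pixels are correlated shows that a small fraction of the components captures most of the variance:

```python
import numpy as np

# Synthetic "image" patches with correlated neighboring pixels (illustrative only):
# smooth each random patch so that adjacent pixels take similar values.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 8, 8))
smooth = (raw + np.roll(raw, 1, axis=1) + np.roll(raw, 1, axis=2)) / 3.0
patches = smooth.reshape(1000, 64)              # 8x8 patches flattened to 64-dim vectors

# PCA via the covariance matrix of zero-mean patches.
centered = patches - patches.mean(axis=0)
cov = centered.T @ centered / len(centered)
eigvals = np.linalg.eigvalsh(cov)[::-1]         # eigenvalues in descending order

energy = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(energy, 0.95)) + 1      # components needed for 95% of the variance
print(f"{k} of 64 components retain 95% of the variance")
```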
The dimension variation at different stages of LeNet-5 is shown in Fig. 2, where a red upward arrow denotes dimension
expansion and a blue downward arrow denotes dimension reduction. Dimensions increase with the convolution operations in the first two layers and decrease with the (2×2)-to-(1×1) pooling operations. The increase in the
dimension of an intermediate layer is determined by the network architecture. By adding more neurons and reducing
the stride number, the intermediate layer will have more dimensions. More embedding variables allow the generation of
more expressive embeddings. The design of network architecture highly depends on specific applications (or datasets).
The search for an optimal network architecture [Elsken et al., 2019] and the associated loss function for new problems is
costly.
2.2.2 Decision Subnet
MLPs have the universal approximation capability for arbitrary functions [Cybenko, 1989, Hornik et al., 1989]. Without loss of generality, consider the mapping from an n-dimensional input to an arbitrary 1D function. The latter
can be approximated by a union of piece-wise low-order polynomials. The basic unit of a piece-wise constant (or linear)
approximation is a box (or a triangular-shaped) function of finite support. The form of activation determines the shape,
e.g., two step activations can be used to synthesize a box function. Readers are referred to [Lin et al., 2021] for more
detail. For each partitioned interval, the mean value of observed output samples (or the labels of the majority class) can
be assigned in a regression problem (or a classification problem). Decision making with space partitioning is essential
to all ML classifiers/regressors, e.g., the support vector machine/regression (SVM/R), decision trees, random forests,
gradient boosting classifiers/regressors. Yet, unlike neural networks, none of them requires nonlinear activation. This
difference will be explained in Sec. 4.4.
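The construction can be made concrete with a short sketch (our own toy example in the spirit of the argument in [Lin et al., 2021]; the target function and interval choices are arbitrary): two step activations synthesize a box function, and a union of such boxes, each weighted by the mean of the target function on its interval, yields a piecewise-constant approximation.

```python
import numpy as np

def step(x):
    """Unit step activation."""
    return (x >= 0).astype(float)

def box(x, a, b):
    """A box (indicator) function of finite support [a, b), built from two step activations."""
    return step(x - a) - step(x - b)

# Approximate f(x) = sin(x) on [0, 2*pi] by a piecewise-constant function:
# partition the domain into intervals and assign the mean of f on each interval.
f = np.sin
edges = np.linspace(0.0, 2.0 * np.pi, 9)       # 8 intervals
x = np.linspace(0.0, 2.0 * np.pi, 1000)

approx = np.zeros_like(x)
for a, b in zip(edges[:-1], edges[1:]):
    mask = box(x, a, b).astype(bool)
    approx += box(x, a, b) * f(x[mask]).mean() # mean value of f on the interval [a, b)

print("max absolute error:", np.abs(approx - f(x)).max())
```

Refining the partition reduces the error, which is the sense in which an MLP with enough units can approximate an arbitrary function.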
2.2.3 Filter Weight Determination
One characteristic of DL systems is that filter weights can be automatically adjusted by propagating decision errors
backwards (i.e., backpropagation). As long as there exist negative gradients to lower the loss function, the training loss
will keep going down. Many advanced network architectures are devised to allow more paths to avoid the vanishing
gradient problem. Weight adjustment has different implications in the feature subnet and the decision subnet. For the
feature subnet, weights are adjusted to find the most expressive embeddings under the architecture constraint. For the
decision subnet, weights are fine-tuned to find discriminant 1D projections for space partitioning. Sometimes, there
may not be sufficient labeled data to train DL networks. To boost the overall performance, one may build a modularized
DL system, where filter weights of some modules are pre-trained on other datasets. One example is ResNet/DenseNet pretrained on ImageNet. These pre-trained modules offer embeddings to interact with other modules of the DL system.
2.3 Interpretable Feedforward-Designed CNN
The feedforward-designed convolutional neural network (FF-CNN) [Kuo et al., 2019] plays a transitional role from DL
to GL. It follows the standard CNN architecture but determines the filter weights in a feedforward one-pass manner.
Filter weights in the convolutional layers of FF-CNN are determined by the Saab transform without supervision.
Although a bias term is adopted by the Saab transform to address the sign confusion problem in [Kuo et al., 2019],
this term is removed in the later Saab transform implementation for the following reason. The sign confusion problem
actually only exists in neural network training because of the use of backpropagation optimization. In this context, one
determines both the embeddings from the previous layer and the filter weights of the current layer simultaneously based
on the desired output. In contrast, the input to the current layer is already given in the feedforward design, one need to
determine filter weights first based on the statistics of the current input and then compute the filter responses. As a
result, there is no sign confusion problem in the FF-CNN.
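A minimal sketch of this idea is given below. It derives a constant (DC) kernel plus PCA-based (AC) kernels from the second-order statistics of input patches, in the spirit of the Saab transform; it is our own simplified illustration, and the cited implementations differ in details such as bias handling and multi-stage cascading.

```python
import numpy as np

def saab_like_filters(patches, num_ac_kernels):
    """Derive filter weights from patch statistics without supervision (Saab-style sketch).

    patches: array of shape (num_patches, patch_dim), flattened input patches.
    Returns the DC kernel plus the leading PCA (AC) kernels.
    """
    n = patches.shape[1]
    dc_kernel = np.ones(n) / np.sqrt(n)                            # constant (DC) component
    ac_part = patches - patches @ np.outer(dc_kernel, dc_kernel)   # remove the patch-mean (DC) part
    ac_part -= ac_part.mean(axis=0)                                # zero-mean before PCA
    # PCA: eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
    cov = ac_part.T @ ac_part / len(ac_part)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    ac_kernels = eigvecs[:, order[:num_ac_kernels]].T
    return np.vstack([dc_kernel, ac_kernels])                      # (1 + num_ac_kernels, patch_dim)

# Example: 5x5 patches drawn from a toy image, six filters in total (illustrative only).
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
patches = np.array([image[i:i+5, j:j+5].ravel()
                    for i in range(0, 28) for j in range(0, 28)])
filters = saab_like_filters(patches, num_ac_kernels=5)
responses = patches @ filters.T                                    # feedforward filter responses
print(filters.shape, responses.shape)
```

Because the filters depend only on the statistics of the given input, they are computed in one feedforward pass without any backpropagation.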
Filter weights in the fully-connected (FC) layers of FF-CNN are determined by linear least-squares regression (LAG).
That is, in FF-CNN training, one clusters training samples of the same class into multiple sub-clusters and creates a pseudo-label for each sub-cluster. For example, there are 10 digit labels (i.e., 0, 1, ..., 9) in the MNIST dataset. One can have 12 sub-clusters for each digit and create 120 pseudo-labels. The first FC layer maps the 400 latent variables in the last convolutional layer to the 120 pseudo-labels. The determination of filter weights can be formulated as a least-squares
regression problem. FF-CNN has been used in the design of a privacy preservation framework in [Hu et al., 2020, Wang
et al., 2022a] and integrated with ensemble learning [Chen et al., 2019a] and semi-supervised learning [Chen et al.,
2019b].
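The following sketch illustrates the pseudo-label least-squares idea (our own simplified version using scikit-learn's KMeans and a plain least-squares solve; it is not the authors' implementation): each class is split into sub-clusters, one-hot pseudo-label targets are formed, and the FC weights are obtained by solving a linear least-squares problem.

```python
import numpy as np
from sklearn.cluster import KMeans

def fc_weights_by_least_squares(features, labels, num_classes, subclusters_per_class):
    """Determine FC-layer weights via pseudo-labels and least-squares regression (sketch).

    features: (num_samples, feature_dim) embeddings from the last convolutional layer.
    labels:   (num_samples,) integer class labels.
    """
    num_pseudo = num_classes * subclusters_per_class
    pseudo = np.zeros(len(features), dtype=int)
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=subclusters_per_class, n_init=10, random_state=0)
        sub = km.fit_predict(features[idx])
        pseudo[idx] = c * subclusters_per_class + sub       # global pseudo-label index

    targets = np.eye(num_pseudo)[pseudo]                    # one-hot pseudo-label targets
    X = np.hstack([features, np.ones((len(features), 1))])  # append 1 to absorb the bias term
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)         # least-squares solution
    return W                                                # shape: (feature_dim + 1, num_pseudo)

# Toy example mirroring the 400 -> 120 mapping described above; the features here
# are random placeholders rather than real LeNet-5 embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 400))
labs = rng.integers(0, 10, size=500)
W = fc_weights_by_least_squares(feats, labs, num_classes=10, subclusters_per_class=12)
print(W.shape)   # (401, 120)
```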
3 High-Level Sketch of GL
3.1 Overview
As mentioned in Sec. 1, GL tools have been devised to achieve the following objectives:
1. Remove redundancy among source samples for concise representations.
2. Generate expressive representations.
3. Select discriminant/relevant features based on supervision (i.e., training labels).
4. Allow feature and decision combinations in classifier/regressor design.
5. Propose a system architecture that enables ensembles for performance boosting.
These techniques are summarized in Table 1.
Table 1: A set of developed GL techniques.

Learning Technique                   | Need of Supervision | Linear Operation | Examples
Subspace Approximation               | No                  | Yes              | Saak Transform [Kuo and Chen, 2018], Saab Transform [Kuo et al., 2019]
Expressive Representation Generation | Maybe               | Maybe            | Attention, Multi-Stage Transform [Chen et al., 2020a]
Ensemble-enabled Architecture        | Maybe               | Maybe            | PixelHop [Chen and Kuo, 2020], PixelHop++ [Chen et al., 2020a]
Discriminant Feature Selection       | Yes                 | No               | Discriminant Feature Test [Yang et al., 2022b]
Feature Space Partitioning           | Yes                 | No               | Subspace Learning Machine [Fu et al., 2022a]
It is worthwhile to comment on the differences and relationships among GL, DL, and classical ML. Traditional ML consists
of two building blocks: feature design and classification. Feature design is typically based on human intuition and
domain knowledge. Feature extraction and decision are integrated without a clear boundary in DL. Once the parameters
of DL networks are determined by end-to-end optimization, feature design becomes a byproduct. Techniques No. 1-3
in Table 1 correspond to the feature design in GL. Techniques 1 and 2 can be automated with little human involvement.
Only hyper-parameters are provided by humans, which is similar to the network architecture design in DL. Technique 3
provides feedback from labels to the learned representations so as to zoom into the most powerful subset. Unlike
traditional ML, it does not demand human intuition or domain knowledge. Finally, the last module in GL is the same as
the classifier in traditional ML.