GREEN LEARNING: INTRODUCTION, EXAMPLES AND OUTLOOK
A PREPRINT
C.-C. Jay Kuo
University of Southern California
Los Angeles, California, USA
Azad M. Madni
University of Southern California
Los Angeles, California, USA
October 4, 2022
ABSTRACT
Rapid advances in artificial intelligence (AI) in the last decade have largely been built upon the
wide applications of deep learning (DL). However, the high carbon footprint yielded by larger and
larger DL networks becomes a concern for sustainability. Furthermore, the DL decision mechanism is somewhat obscure and can only be verified by test data. Green learning (GL) has been proposed
as an alternative paradigm to address these concerns. GL is characterized by low carbon footprints,
small model sizes, low computational complexity, and logical transparency. It offers energy-efficient solutions in cloud centers as well as on mobile/edge devices. GL also provides a clear and logical
decision-making process to gain people’s trust. Several statistical tools have been developed to
achieve this goal in recent years. They include unsupervised representation learning, supervised
feature learning, and supervised decision learning. We have seen a few successful GL examples with
performance comparable to that of state-of-the-art DL solutions. This paper offers an introduction to GL,
its demonstrated applications, and future outlook.
Keywords machine learning, green learning, trust learning, deep learning.
1 Introduction
There have been rapid advances in artificial intelligence (AI) and machine learning (ML) in the last decade. The
breakthrough has largely been built upon the construction of larger datasets and the design of complex neural networks.
Representative neural networks include the convolutional neural network (CNN) [Gu et al., 2018], the recurrent neural
network (RNN) [Pascanu et al., 2013, Salehinejad et al., 2017, Su et al., 2022], the long short-term memory network
(LSTM) [Greff et al., 2016, Hochreiter and Schmidhuber, 1997, Su and Kuo, 2019], etc. Deep neural networks (DNNs)
have attracted a lot of attention from academia and industry since their resurgence in 2012 [Krizhevsky et al., 2012].
As networks become larger and deeper, this discipline is named deep learning (DL) [LeCun et al., 2015]. DL has
made great impacts in various application domains, including computer vision, natural language processing, autonomous
driving, robotics navigation, etc.
DL is characterized by two design choices: the network architecture and the loss function. Once these two choices
are specified, model parameters can be automatically determined via an end-to-end optimization algorithm called
backpropagation. When the number of training samples is less than the number of model parameters, it is common to
adopt pre-trained networks to build larger networks for better performance; e.g., ResNet [He et al., 2015] or DenseNet
[Huang et al., 2017] pretrained on ImageNet [Deng et al., 2009]. Another emerging trend is the adoption of the
transformer architecture [Han et al., 2022, Jaderberg et al., 2015, Khan et al., 2021, Vaswani et al., 2017, Wolf et al.,
2020].
Despite its rapid advance, the DL paradigm faces several challenges. DL networks are mathematically intractable,
vulnerable to adversarial attacks [Akhtar and Mian, 2018], and demand heavy supervision. Efforts have been made
in explaining the behavior of DL networks with certain success, e.g., [Chan et al., 2022, Damian et al., 2022, Ma
et al., 2022, Oymak et al., 2021, Soltanolkotabi et al., 2018, Wright and Ma, 2022]. Adversarial training has been
developed to provide a tradeoff between robustness and accuracy [Zhang et al., 2019a]. Self-supervised [Misra and
Maaten, 2020] and semi-supervised learning [Sohn et al., 2020, Van Engelen and Hoos, 2020] have been explored to
reduce the supervision burden.
Two further concerns over DL technologies are less addressed. The first one is about its high carbon footprint
[Lannelongue et al., 2021, Schwartz et al., 2020, Wu et al., 2022, Xu et al., 2021]. The training of DL networks is
computationally intensive. The training of larger complex networks on huge datasets imposes a threat on sustainability
[Sanh et al., 2019, Sharir et al., 2020, Strubell et al., 2019]. The second one is related to its trustworthiness. The
application of blackbox DL models to high stakes decisions is questioned [Arrieta et al., 2020, Poursabzi-Sangdeh
et al., 2021, Rudin, 2019]. Conclusions drawn from a set of input-output relationships could be misleading and counter-intuitive. It is essential to justify an ML prediction procedure with logical reasoning to gain people's trust.
To tackle the first problem, one may optimize DL systems by taking performance and complexity into account jointly.
An alternative solution is to build a new learning paradigm of low carbon footprint from scratch. For the latter, since it
targets green ML systems by design, it is called green learning (GL). The early development of GL was initiated by
an effort to understand the operation of computational neurons of CNNs in [Kuo, 2017, 2016, Kuo and Chen, 2018,
Kuo et al., 2019]. Through a sequence of investigations, building blocks of GL have been gradually developed, and
more applications have been demonstrated in recent years. As to the second problem, a clear and logical description of
the decision-making process is emphasized in the development of GL. GL adopts a modularized design. Each module
is statistically rooted with local optimization. GL avoids end-to-end global optimization for logical transparency and
computational efficiency. On the other hand, GL exploits ensembles heavily in its learning system to boost the overall
decision performance. GL yields probabilistic ML models that allow trust and risk assessment with certain performance
guarantees.
GL attempts to address the following problems to make the learning process efficient and effective:
1. How to remove redundancy among source image pixels for concise representations?
2. How to generate more expressive representations?
3. How to select discriminant/relevant features based on labels?
4. How to achieve feature and decision combinations in the design of powerful classifiers/regressors?
5. How to design an architecture that enables rich ensembles for performance boosting?
New and powerful tools have been developed to address each of them in the last several years, e.g., the Saak [Kuo and
Chen, 2018] and Saab transforms [Kuo et al., 2019] for Problem 1, the PixelHop [Chen and Kuo, 2020], PixelHop++
[Chen et al., 2020a] and IPHop [Yang et al., 2022a] learning systems for Problem 2, the discriminant and relevant
feature tests [Yang et al., 2022b] for Problem 3, the subspace learning machine [Fu et al., 2022a] for Problem 4. The
original ideas are scattered across different papers; they will be systematically introduced here.
In this overview paper, we intend to elaborate on GL's development, building modules, and demonstrated applications.
We will also provide an outlook for future R&D opportunities. The rest of this paper is organized as follows. The
genesis of GL is reviewed in Sec. 2. A high-level sketch of GL is presented in Sec. 3. GL's methodology and its
building tools are detailed in Sec. 4. Illustrative application examples of GL are shown in Sec. 5. Future technological
outlook is discussed in Sec. 6. Finally, concluding remarks are given in Sec. 7.
2 Genesis of Green Learning
The proven success of DL in a wide range of applications gives a clear indication of its power, although its inner workings remain something of a mystery. Research on GL was initiated by providing a high-level understanding of the superior performance of DL [Kuo, 2016, 2017, Xu et al., 2017]. There was no attempt to give a rigorous treatment; rather, the aim was to obtain insights into a set of basic questions such as:
• What is the role of nonlinear activation [Kuo, 2016]?
• What are the individual roles played by the convolutional layers and the fully-connected (FC) layers [Kuo et al., 2019]?
• Is there a guideline for network architecture design [Kuo et al., 2019]?
• Is it possible to avoid the expensive backpropagation optimization process in filter weight determination [Kuo and Chen, 2018, Kuo et al., 2019, Lin et al., 2022]?
As this understanding grew, it became apparent that one can develop a new learning pipeline without nonlinear
activation and backpropagation.
Figure 1: Illustration of a computational neuron.
Figure 2: Illustration of the LeNet-5 architecture [LeCun et al., 1998], where the dimensions of intermediate layers are
shown under the network.
2.1 Anatomy of Computational Neurons
As shown in Fig. 1, a computational neuron consists of two stages in cascade: 1) an affine transformation that maps an n-dimensional input vector to a scalar, and 2) a nonlinear activation operator. The affine transformation is relatively easy to understand. Yet, nonlinear activation obscures the function of a computational neuron. If the function of nonlinear activation is well understood, one may remove it and compensate for it with another mechanism. As explained in Sec. 2.2, a neural network achieves two objectives simultaneously [Kuo et al., 2019]: 1) finding an expressive embedding of the input data, and 2) making decisions (i.e., classification or regression) based on the data embedding.
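As a toy illustration of this two-stage structure (our own sketch; the input, weight vector, and bias below are made-up values, not drawn from the cited works), a computational neuron can be written as an affine projection followed by a nonlinear activation such as the ReLU:

```python
import numpy as np

def neuron(x, w, b):
    """A single computational neuron: affine transformation followed by ReLU activation."""
    z = np.dot(w, x) + b   # stage 1: affine map (projection onto w, shifted by the bias b)
    return max(z, 0.0)     # stage 2: nonlinear activation (ReLU)

# Made-up 3-dimensional input, weight vector, and bias for illustration only.
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.8, -0.1])
b = 0.05
print(neuron(x, w, b))
```

Dropping the second stage leaves a purely linear projection plus a threshold, which is the part that GL retains and analyzes statistically.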
An affine transformation contains n weights and a bias b for an input vector of dimension n. According to the analysis in
[Fu et al., 2022a, Lin et al., 2022], weights and biases play different roles. To determine proper weights of an affine
transformation is equivalent to finding a discriminant 1D projection for partitioning. For example, one can use the linear
discriminant analysis (LDA) to find a hyper-plane that partitions the whole space into two halves. The weight vector is
the surface normal of the hyper-plane. In other words, the affine transformation defines a projection onto a line through
the inner product of an input vector and the weight vector. The bias term then defines the split point in the projected 1D
space (i.e., greater, equal and less than zero). The role of nonlinear activation was investigated for CNNs in [Kuo, 2016]
and multi-layer perceptrons (MLPs) in [Lin et al., 2022]. It is used to avoid the sign confusion problem caused by two
neurons in cascade.
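The sketch below (our own illustration using scikit-learn's LinearDiscriminantAnalysis on synthetic data; it is not code from [Fu et al., 2022a] or [Lin et al., 2022]) shows the weight vector acting as the surface normal of the LDA hyper-plane and the bias fixing the split point of the projected 1D values:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic Gaussian classes in 2D (illustrative data only).
rng = np.random.default_rng(0)
x0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
x1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(200, 2))
X = np.vstack([x0, x1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
w = lda.coef_[0]          # weight vector = surface normal of the separating hyper-plane
b = lda.intercept_[0]     # bias = split point of the projected 1D values

# The affine transformation projects each sample onto a line; the sign of
# (w . x + b) tells on which side of the hyper-plane the sample falls.
proj = X @ w + b
pred = (proj > 0).astype(int)
print("training accuracy:", (pred == y).mean())
```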
2.2 Anatomy of Neural Networks
MLPs have been commonly used as classifiers. Convolutional neural networks (CNNs) can be viewed as the cascade of two sub-networks: the convolutional layers and the FC layers. One simple example called LeNet-5
[LeCun et al., 1998] is illustrated in Fig. 2. It is convenient to have a rough breakdown of their functions as done in
[Kuo et al., 2019]. That is, the first sub-network is used to find powerful embeddings while the second sub-network is
used for decision making. Admittedly, this anatomy is too simplistic for more complicated networks such as ResNet
[He et al., 2015], DenseNet [Huang et al., 2017] and Transformers [Vaswani et al., 2017]. Yet, this viewpoint is helpful
in deriving the GL framework. We will elaborate on the first and the second sub-networks in Secs. 2.2.1 and 2.2.2, respectively.
2.2.1 Feature Subnet
Modern DL networks have inputs, outputs, and intermediate embeddings. Their dimensions for image/video data are
generally large. Consider two examples below.
Example 1. An image in the MNIST dataset has gray-scale pixels of spatial resolution 28 × 28, zero-padded to 32 × 32 at the input of LeNet-5 [LeCun et al., 1998], which was designed for the MNIST dataset. Its input dimension is thus 32 × 32 = 1,024. LeNet-5 has two cascaded convolutional-pooling layers. At the output of the 2nd convolutional-pooling layer, one obtains an embedding space of dimension 5 × 5 × 16 = 400.
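The 400-dimensional embedding quoted above can be verified with a few lines of arithmetic. The sketch below (our own, using the standard output-size formula for valid convolution and pooling) traces the spatial dimension through LeNet-5's two convolutional-pooling stages:

```python
def conv_out(size, kernel, stride=1):
    """Spatial output size of a 'valid' convolution or pooling operation."""
    return (size - kernel) // stride + 1

# LeNet-5 on a 32x32 input: two convolution(5x5) + pooling(2x2, stride 2) stages.
s = 32
s = conv_out(s, 5)            # 1st convolution: 32 -> 28
s = conv_out(s, 2, stride=2)  # 1st pooling:     28 -> 14
s = conv_out(s, 5)            # 2nd convolution: 14 -> 10
s = conv_out(s, 2, stride=2)  # 2nd pooling:     10 -> 5
print(s * s * 16)             # 16 channels -> embedding dimension 400
```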
Example 2. After image size normalization, an image in the ImageNet dataset has color pixels of spatial resolution 224 × 224. It has a raw dimension of 224 × 224 × 3 = 150,528. AlexNet [Krizhevsky et al., 2012] and VGG-16 [Simonyan and Zisserman, 2014] were proposed to solve the object classification problem on the ImageNet dataset. The last convolutional layers of AlexNet and VGG-16 have embedding spaces of 13 × 13 × 256 = 43,264 and 7 × 7 × 512 = 25,088 dimensions, respectively.
As the above two examples show, the embedding dimension of the last convolutional layer is significantly smaller than the input dimension. Dimension reduction is essential to the simplification of a decision pipeline. It is well known that
there are spatial correlations between neighboring pixels of images, which can be exploited for dimension reduction.
However, dimension reduction is only one of two main functions of the feature subnet. The feature subnet needs to find
discriminant dimensions at the same time.
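As a simple numerical illustration of this point (our own sketch on synthetic data, not part of any GL pipeline), applying PCA to small patches whose neighboring pixels are correlated shows that a small fraction of the components captures most of the variance:

```python
import numpy as np

# Synthetic "image" patches with correlated neighboring pixels (illustrative only):
# smooth each random patch so that adjacent pixels take similar values.
rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 8, 8))
smooth = (raw + np.roll(raw, 1, axis=1) + np.roll(raw, 1, axis=2)) / 3.0
patches = smooth.reshape(1000, 64)              # 8x8 patches flattened to 64-dim vectors

# PCA via the covariance matrix of zero-mean patches.
centered = patches - patches.mean(axis=0)
cov = centered.T @ centered / len(centered)
eigvals = np.linalg.eigvalsh(cov)[::-1]         # eigenvalues in descending order

energy = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(energy, 0.95)) + 1      # components needed for 95% of the variance
print(f"{k} of 64 components retain 95% of the variance")
```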
The dimension variation at different stages of LeNet-5 is shown in Fig. 2, where a red upward arrow denotes dimension
expansion and a blue downward arrow denotes dimension reduction. Dimensions increase with the convolution operations in the first two layers and decrease with the (2×2)-to-(1×1) pooling operations. The increase in the
dimension of an intermediate layer is determined by the network architecture. By adding more neurons and reducing
the stride number, the intermediate layer will have more dimensions. More embedding variables allow the generation of
more expressive embeddings. The design of network architecture highly depends on specific applications (or datasets).
The search for an optimal network architecture [Elsken et al., 2019] and the associated loss function for new problems is
costly.
2.2.2 Decision Subnet
MLPs have the universal approximation capability for arbitrary functions [Cybenko, 1989, Hornik et al., 1989]. Without loss of generality, consider the mapping from an n-dimensional input to an arbitrary 1D function. The latter
can be approximated by a union of piece-wise low-order polynomials. The basic unit of a piece-wise constant (or linear)
approximation is a box (or a triangular-shaped) function of finite support. The form of activation determines the shape,
e.g., two step activations can be used to synthesize a box function. Readers are referred to [Lin et al., 2021] for more
detail. For each partitioned interval, the mean value of observed output samples (or the labels of the majority class) can
be assigned in a regression problem (or a classification problem). Decision making with space partitioning is essential
to all ML classifiers/regressors, e.g., the support vector machine/regression (SVM/R), decision trees, random forests,
gradient boosting classifiers/regressors. Yet, unlike neural networks, none of them requires nonlinear activation. This
difference will be explained in Sec. 4.4.
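The construction can be made concrete with a short sketch (our own toy example in the spirit of the argument in [Lin et al., 2021]; the target function and interval choices are arbitrary): two step activations synthesize a box function, and a union of such boxes, each weighted by the mean of the target function on its interval, yields a piecewise-constant approximation.

```python
import numpy as np

def step(x):
    """Unit step activation."""
    return (x >= 0).astype(float)

def box(x, a, b):
    """A box (indicator) function of finite support [a, b), built from two step activations."""
    return step(x - a) - step(x - b)

# Approximate f(x) = sin(x) on [0, 2*pi] by a piecewise-constant function:
# partition the domain into intervals and assign the mean of f on each interval.
f = np.sin
edges = np.linspace(0.0, 2.0 * np.pi, 9)       # 8 intervals
x = np.linspace(0.0, 2.0 * np.pi, 1000)

approx = np.zeros_like(x)
for a, b in zip(edges[:-1], edges[1:]):
    mask = box(x, a, b).astype(bool)
    approx += box(x, a, b) * f(x[mask]).mean() # mean value of f on the interval [a, b)

print("max absolute error:", np.abs(approx - f(x)).max())
```

Refining the partition reduces the error, which is the sense in which an MLP with enough units can approximate an arbitrary function.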
2.2.3 Filter Weight Determination
One characteristic of DL systems is that filter weights can be automatically adjusted by propagating decision errors
backwards (i.e., backpropagation). As long as there exist negative gradients to lower the loss function, the training loss
will keep going down. Many advanced network architectures are devised to allow more paths to avoid the vanishing
gradient problem. Weight adjustment has different implications in the feature subnet and the decision subnet. For the
feature subnet, weights are adjusted to find the most expressive embeddings under the architecture constraint. For the
decision subnet, weights are fine-tuned to find discriminant 1D projections for space partitioning. Sometimes, there
may not be sufficient labeled data to train DL networks. To boost the overall performance, one may build a modularized
DL system, where filter weights of some modules are pre-trained on other datasets. One example is ResNet/DenseNet pretrained on ImageNet. These pre-trained modules offer embeddings to interact with other modules of the DL system.
2.3 Interpretable Feedforward-Designed CNN
The feedforward-designed convolutional neural network (FF-CNN) [Kuo et al., 2019] plays a transitional role from DL
to GL. It follows the standard CNN architecture but determines the filter weights in a feedforward one-pass manner.
Filter weights in the convolutional layers of FF-CNN are determined by the Saab transform without supervision.
Although a bias term is adopted by the Saab transform to address the sign confusion problem in [Kuo et al., 2019],
this term is removed in the later Saab transform implementation for the following reason. The sign confusion problem
actually only exists in neural network training because of the use of backpropagation optimization. In this context, one
determines both the embeddings from the previous layer and the filter weights of the current layer simultaneously based
on the desired output. In contrast, the input to the current layer is already given in the feedforward design, one need to
determine filter weights first based on the statistics of the current input and then compute the filter responses. As a
result, there is no sign confusion problem in the FF-CNN.
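A minimal sketch of this idea is given below. It derives a constant (DC) kernel plus PCA-based (AC) kernels from the second-order statistics of input patches, in the spirit of the Saab transform; it is our own simplified illustration, and the cited implementations differ in details such as bias handling and multi-stage cascading.

```python
import numpy as np

def saab_like_filters(patches, num_ac_kernels):
    """Derive filter weights from patch statistics without supervision (Saab-style sketch).

    patches: array of shape (num_patches, patch_dim), flattened input patches.
    Returns the DC kernel plus the leading PCA (AC) kernels.
    """
    n = patches.shape[1]
    dc_kernel = np.ones(n) / np.sqrt(n)                            # constant (DC) component
    ac_part = patches - patches @ np.outer(dc_kernel, dc_kernel)   # remove the patch-mean (DC) part
    ac_part -= ac_part.mean(axis=0)                                # zero-mean before PCA
    # PCA: eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.
    cov = ac_part.T @ ac_part / len(ac_part)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    ac_kernels = eigvecs[:, order[:num_ac_kernels]].T
    return np.vstack([dc_kernel, ac_kernels])                      # (1 + num_ac_kernels, patch_dim)

# Example: 5x5 patches drawn from a toy image, six filters in total (illustrative only).
rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32))
patches = np.array([image[i:i+5, j:j+5].ravel()
                    for i in range(0, 28) for j in range(0, 28)])
filters = saab_like_filters(patches, num_ac_kernels=5)
responses = patches @ filters.T                                    # feedforward filter responses
print(filters.shape, responses.shape)
```

Because the filters depend only on the statistics of the given input, they are computed in one feedforward pass without any backpropagation.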
Filter weights in the fully-connected (FC) layers of FF-CNN are determined by linear least-squares regression (LAG).
That is, in FF-CNN training, one clusters training samples of the same class into multiple sub-clusters and creates a pseudo-label for each sub-cluster. For example, there are 10 digit labels (i.e., 0, 1, ..., 9) in the MNIST dataset. One can have 12 sub-clusters for each digit and create 120 pseudo-labels. The first FC layer maps the 400 latent variables in the last convolutional layer to the 120 pseudo-labels. The determination of filter weights can be formulated as a least-squares
regression problem. FF-CNN has been used in the design of a privacy preservation framework in [Hu et al., 2020, Wang
et al., 2022a] and integrated with ensemble learning [Chen et al., 2019a] and semi-supervised learning [Chen et al.,
2019b].
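The following sketch illustrates the pseudo-label least-squares idea (our own simplified version using scikit-learn's KMeans and a plain least-squares solve; it is not the authors' implementation): each class is split into sub-clusters, one-hot pseudo-label targets are formed, and the FC weights are obtained by solving a linear least-squares problem.

```python
import numpy as np
from sklearn.cluster import KMeans

def fc_weights_by_least_squares(features, labels, num_classes, subclusters_per_class):
    """Determine FC-layer weights via pseudo-labels and least-squares regression (sketch).

    features: (num_samples, feature_dim) embeddings from the last convolutional layer.
    labels:   (num_samples,) integer class labels.
    """
    num_pseudo = num_classes * subclusters_per_class
    pseudo = np.zeros(len(features), dtype=int)
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        km = KMeans(n_clusters=subclusters_per_class, n_init=10, random_state=0)
        sub = km.fit_predict(features[idx])
        pseudo[idx] = c * subclusters_per_class + sub       # global pseudo-label index

    targets = np.eye(num_pseudo)[pseudo]                    # one-hot pseudo-label targets
    X = np.hstack([features, np.ones((len(features), 1))])  # append 1 to absorb the bias term
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)         # least-squares solution
    return W                                                # shape: (feature_dim + 1, num_pseudo)

# Toy example mirroring the 400 -> 120 mapping described above; the features here
# are random placeholders rather than real LeNet-5 embeddings.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 400))
labs = rng.integers(0, 10, size=500)
W = fc_weights_by_least_squares(feats, labs, num_classes=10, subclusters_per_class=12)
print(W.shape)   # (401, 120)
```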
3 High-Level Sketch of GL
3.1 Overview
As mentioned in Sec. 1, GL tools have been devised to achieve the following objectives:
1. Remove redundancy among source samples for concise representations.
2. Generate expressive representations.
3. Select discriminant/relevant features based on supervision (i.e., training labels).
4. Allow feature and decision combinations in classifier/regressor design.
5. Propose a system architecture that enables ensembles for performance boosting.
These techniques are summarized in Table 1.
Table 1: A set of developed GL techniques.

Learning Technique                   | Need of Supervision | Linear Operation | Examples
Subspace Approximation               | No                  | Yes              | Saak Transform [Kuo and Chen, 2018], Saab Transform [Kuo et al., 2019]
Expressive Representation Generation | Maybe               | Maybe            | Attention, Multi-Stage Transform [Chen et al., 2020a]
Ensemble-enabled Architecture        | Maybe               | Maybe            | PixelHop [Chen and Kuo, 2020], PixelHop++ [Chen et al., 2020a]
Discriminant Feature Selection       | Yes                 | No               | Discriminant Feature Test [Yang et al., 2022b]
Feature Space Partitioning           | Yes                 | No               | Subspace Learning Machine [Fu et al., 2022a]
It is worthwhile to comment on the differences and relationships among GL, DL, and classical ML. Traditional ML consists
of two building blocks: feature design and classification. Feature design is typically based on human intuition and
domain knowledge. Feature extraction and decision are integrated without a clear boundary in DL. Once the parameters
of DL networks are determined by end-to-end optimization, feature design becomes a byproduct. Techniques No. 1-3
in Table 1 correspond to the feature design in GL. Techniques 1 and 2 can be automated with little human involvement.
Only hyper-parameters are provided by humans, which is similar to the network architecture design in DL. Technique 3
provides feedback from labels to the learned representations so as to zoom into the most powerful subset. Unlike
traditional ML, it does not demand human intuition or domain knowledge. Finally, the last module in GL is the same as
the classifier in traditional ML.