a gated peripheral–foveal convolutional neural network. It is a double-subnet neural network: the
peripheral subnet encodes the holistic information and provides the attended regions, while the foveal
subnet extracts fine-grained features from these key regions. A gated information fusion network is then
employed for image aesthetic prediction. In [14], the authors propose a novel multimodal recurrent
attention CNN, which combines visual information with textual information. This method
employs a recurrent attention network to focus on key regions when extracting visual features. In [29,
30], the contributions of different object-level regions to aesthetics are adaptively predicted. However,
our preliminary experiments indicate that feeding weighted key regions to a CNN to train the IAA model
degrades prediction performance, because aesthetic assessment is influenced by the holistic
information in the image. Weakening some regions discards information that is needed for
aesthetic assessment.
In [31], a hierarchical layout-aware graph convolutional network is used to capture layout
information for unified IAA. However, although there is a strong correlation between image layouts
and perceived image quality, the image layout is neither a sufficient nor a necessary condition for
determining the image’s aesthetic quality. In fact, the typical failure cases presented in Figure 5 of
[31] support this statement. Some pictures have good layouts that appear to follow the rule of thirds
and are predicted to have high ratings, yet their ground-truth (GT) ratings are low. Conversely, one
picture that does not appear to follow photographic composition principles is assigned a low rating,
yet its GT rating is high.
Generally, IAA is modeled as a supervised learning problem. Most studies use the aesthetic labels of
images in public photo datasets, such as CUHK-PQ [1] or AVA [28], to train the
model. However, these aesthetic labels are mostly provided by amateurs. It is not clear whether the
labels embody the latent principles of aesthetics, and it is therefore also unclear whether the IAA
models trained on these datasets are meaningful. To make the labeled data embody the aesthetic
principles of photography, the author of [25] aims to establish a photo dataset called XiheAA, which is
scored by an experienced photographer, on the assumption that experienced photographers are better
able to reflect the latent principles of aesthetics when they assess photos. These labeled
images are used to train the IAA model. However, the scores exhibit a highly skewed distribution.
To address this imbalance issue in aesthetic assessment, the author of that paper proposes a method
of repetitive self-revised learning (RSRL), which repeatedly retrains the CNN-based aesthetic score
prediction model by transfer learning, so as to improve the classification performance degraded
by the overconcentrated distribution of the scores. Moreover, in [32], the author focuses on
CNN-based RSRL to explore suitable metrics for establishing an optimal model of IAA. Further,
the learned feature maps of the model are used to define the first fixation perspective (FFP) and the
assessment interest region (AIR), so as to analyze whether aesthetic features are learned by the
optimal model. Although RSRL shows its effectiveness on imbalanced classification by several