Photo-Realistic Super-Resolution GAN: Photo-Realistic Single Image Super-Resolution Using a GAN

Feb 13, 2022

Super-resolved image (left) is almost indistinguishable from original (right). [4× upscaling]

How can an upscaled image recover finer texture detail?

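SRGAN is a classic super-resolution (SR) method that was state-of-the-art (SOTA) when it was published. It raises the resolution of images under 4× upscaling and is especially good at producing results that look photo-realistic to human viewers. SRGAN is built on a deep residual network (ResNet) with skip connections, whose baseline variant is trained with a mean squared error (MSE) loss. What sets SRGAN apart is its perceptual loss function, which combines an adversarial loss, driven by a discriminator trained to tell generated images from real ones, with a content loss that compares high-level features extracted by VGG instead of relying only on pixel-level similarity. In the experiments the mean opinion score (MOS) improves significantly: SRGAN scores higher than the compared SOTA methods and comes closer to the original images.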

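SRGAN makes three contributions:

SRResNet: a 16-block-deep ResNet optimized for MSE, which achieves the best SSIM and PSNR for 4× upscaling.
SRGAN: a GAN trained with a perceptual loss. The MSE content loss is replaced by a loss computed on VGG feature maps, which measures the gap between the generated image and the original in feature space.
To find out whether the GAN's results look more realistic to human vision, the experiments collect MOS ratings from human raters. In the MOS evaluation, SRGAN performs best for 4× upscaling and comes closest to the original images.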

From left to right: bicubic interpolation, deep residual network optimized for MSE, deep residual generative
adversarial network optimized for a loss more sensitive to human perception, original HR image. Corresponding PSNR and
SSIM are shown in brackets. [4× upscaling]

Illustration of patches from the natural image
manifold (red) and super-resolved patches obtained with
MSE (blue) and GAN (orange). The MSE-based solution
appears overly smooth due to the pixel-wise average of
possible solutions in the pixel space, while GAN drives the
reconstruction towards the natural image manifold producing perceptually more convincing solutions.

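The SRGAN Method

1. Defining the objective

The goal of the paper is to train a CNN generator G with weights θ such that, over pairs of low-resolution (LR) and high-resolution (HR) images, the specially designed perceptual loss between G's output and the HR image is as small as possible.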
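Written out, this objective takes the following form in the paper's notation. The symbols N (number of training pairs), l^SR (the perceptual loss) and I^LR_n, I^HR_n (the n-th LR/HR pair) come from the paper and are not otherwise used in this post.

```latex
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N}
    l^{SR}\!\left( G_{\theta_G}\!\left(I^{LR}_n\right),\; I^{HR}_n \right)
```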

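2. SRGAN architecture

SRGAN's optimization problem (a min-max problem):

- Maximize the ability of the discriminator D to tell original images apart from generated high-resolution images.
- Minimize the gap between the original image and the high-resolution image that G generates from the low-resolution input LR.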
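The two bullet points above correspond to the adversarial min-max problem stated in the paper, reproduced here for reference. The subscripts θ_G and θ_D denote the generator and discriminator weights, and p_train and p_G are the paper's names for the distributions of HR training images and of LR inputs.

```latex
\min_{\theta_G} \max_{\theta_D}\;
  \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}
    \left[ \log D_{\theta_D}\!\left(I^{HR}\right) \right]
  + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}
    \left[ \log\!\left( 1 - D_{\theta_D}\!\left( G_{\theta_G}\!\left(I^{LR}\right) \right) \right) \right]
```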

The architecture of Generator and Discriminator Network with corresponding kernel size (k), number of feature maps
(n) and stride (s) indicated for each convolutional layer

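The SRGAN architecture is shown in the figure above:
- Generator:
The output of the first convolution is carried by a skip connection and added to the feature map produced by the chain of residual blocks. The result is then upsampled by 2× twice (with sub-pixel convolution layers in the paper), each stage followed by a ParametricReLU activation, and a final convolution produces the 4× upscaled high-resolution image.
- Discriminator:
A series of Convolution + Batch Normalization + Leaky ReLU (α = 0.2) blocks progressively down-samples the input. The resulting feature map goes through fully connected layers to produce a decision value, and a final sigmoid squashes it into a real-versus-fake score between 0 and 1.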
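The generator's two 2× upsampling stages can be illustrated with a small shape-level sketch. In the paper each stage is a sub-pixel convolution: a convolution produces r² times as many channels, and a pixel-shuffle step rearranges those channels into a grid that is r times larger. The pure-Python `pixel_shuffle` helper below is a hypothetical illustration on nested lists, not the authors' code. The convolutions are left out, so the channel count simply shrinks at each stage, and the channel ordering is an assumed depth-to-space convention.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Input channel c*r*r + dy*r + dx supplies the pixel at offset (dy, dx)
    inside each r x r output cell (assumed channel ordering).
    """
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    assert channels_in % (r * r) == 0, "channel count must be divisible by r*r"
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for dy in range(r):
            for dx in range(r):
                src = x[c * r * r + dy * r + dx]
                for i in range(height):
                    for j in range(width):
                        out[c][i * r + dy][j * r + dx] = src[i][j]
    return out


def shape(x):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(x), len(x[0]), len(x[0][0]))


# Toy 16-channel 3x3 feature map; every value encodes its channel index.
features = [[[float(ch)] * 3 for _ in range(3)] for ch in range(16)]
once = pixel_shuffle(features, 2)   # 16 x 3 x 3 -> 4 x 6 x 6
twice = pixel_shuffle(once, 2)      # 4 x 6 x 6 -> 1 x 12 x 12
print(shape(once), shape(twice))
```

Applying the 2× rearrangement twice turns a 3×3 grid into a 12×12 grid, which is the 4× factor the generator targets.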

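3. Perceptual Loss Function

SRGAN's Perceptual Loss = Content Loss + 10⁻³ × Adversarial Loss

Content Loss: either the MSE loss or the VGG loss (SRGAN uses the VGG loss)
- MSE Loss: the mean of the squared differences between corresponding pixels of the two images.
- VGG Loss: the mean of the squared differences between the high-level feature maps that VGG extracts from the two images.

Adversarial Loss: computed from the discriminator D's output on generated images, that is, the probability that D judges a generated image to be real. The loss is small when the generator's images fool D.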
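To make the weighting concrete, here is a minimal sketch of how the loss terms combine, using plain Python lists in place of images and feature maps. It follows the paper's formulation, in which the adversarial term is −log D(G(I^LR)) and is scaled by 10⁻³ before being added to the content loss. The function names and toy numbers are assumptions for illustration, and the VGG feature extractor is left out, so the content term simply receives two already-extracted feature vectors.

```python
import math


def mean_squared_error(a, b):
    """Mean of the squared element-wise differences between two flat lists."""
    assert len(a) == len(b) and a, "inputs must be non-empty and equal length"
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def adversarial_loss(d_scores):
    """Generator-side loss: sum of -log D(G(LR)) over generated samples.

    d_scores are discriminator outputs in (0, 1] for generated images.
    """
    return sum(-math.log(p) for p in d_scores)


def perceptual_loss(content, adversarial, weight=1e-3):
    """Content loss plus the down-weighted adversarial loss."""
    return content + weight * adversarial


# Toy stand-ins (hypothetical values, not real VGG features).
sr_features = [0.2, 0.8, 0.5, 0.1]
hr_features = [0.0, 1.0, 0.5, 0.3]
content = mean_squared_error(sr_features, hr_features)  # VGG-style content term
adv = adversarial_loss([0.5])  # D is undecided about one generated image
total = perceptual_loss(content, adv)
print(content, adv, total)
```

With the 10⁻³ weight the adversarial term acts as a small correction on top of the content loss, nudging outputs toward images the discriminator accepts without letting it dominate training.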

Experiments

Datasets

Training Set: A random sample of 350,000 images from the ImageNet database
Testing Set: Set5, Set14, and BSD100 (the testing set of BSD300)

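Hardware

NVIDIA Tesla M40 GPU

Evaluation Metrics

PSNR: peak signal-to-noise ratio, a pixel-level measure of how far the reconstruction deviates from the reference (higher means less distortion).
SSIM: the similarity between two images in luminance, contrast, and structure.
MOS: the score obtained by having human raters grade each image from 1 to 5 and averaging the ratings.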
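As a rough illustration of two of these metrics, the sketch below computes PSNR from the mean squared error and MOS as a plain average of ratings. It works on flat lists of 8-bit pixel values. The function names and sample data are assumptions for illustration, and it does not reproduce the paper's exact evaluation protocol. SSIM is omitted because it needs windowed local statistics.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    assert len(reference) == len(estimate) and reference
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_opinion_score(ratings):
    """Arithmetic mean of integer ratings from 1 (bad) to 5 (excellent)."""
    assert ratings and all(1 <= r <= 5 for r in ratings)
    return sum(ratings) / len(ratings)


hr = [52, 55, 61, 59]
sr = [53, 54, 62, 58]                    # every pixel off by one, so MSE = 1
print(round(psnr(hr, sr), 2))            # 48.13
print(mean_opinion_score([4, 5, 3, 4]))  # 4.0
```

Because PSNR depends only on the pixel-wise MSE, a model trained to minimize MSE is directly optimizing this metric, which is one reason it can disagree with human ratings.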

Results

Ablation test: measures how much the discriminator and the different loss-function combinations each contribute to the architecture.

Table 1: Performance of different loss functions for SR
ResNet and the adversarial networks on Set5 and Set14
benchmark data.

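On Set5 and Set14, SRResNet + MSE achieves the better PSNR and SSIM.
SRGAN, in contrast, does better on MOS. Because the VGG loss targets high-level features when producing the high-resolution result, it has an advantage under MOS, which is scored by human perception. Among the VGG variants, VGG54, which uses the feature map at relu5_4 in VGG, works better than VGG22, so SRGAN adopts VGG54.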

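Comparison with SOTA:
- The paper compares against related super-resolution methods: nearest neighbor, bicubic, SRCNN, SelfExSR, DRCN, and ESPCN.
- SRResNet performs best in the quantitative PSNR and SSIM analysis.
- SRGAN performs best on the MOS rating of image realism.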

Color-coded distribution of MOS scores on
BSD100. For each method 2600 samples (100 images ×
26 raters) were assessed. Mean shown as red marker, where
the bins are centered around value i.

SRResNet (left: a,b), SRGAN-MSE (middle left: c,d), SRGAN-VGG22 (middle: e,f) and SRGAN-VGG54
(middle right: g,h) reconstruction results and corresponding reference HR image (right: i,j). [4× upscaling]

Table 2: Comparison of NN, bicubic, SRCNN [9], SelfExSR [31], DRCN [34], ESPCN [48], SRResNet, SRGAN-VGG54
and the original HR on benchmark data. Highest measures (PSNR [dB], SSIM, MOS) in bold. [4× upscaling]

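Takeaways

According to the paper's MOS results, the VGG loss in the perceptual loss, which builds the loss from high-level feature maps extracted from the image, greatly improves how the output looks to human viewers, whether the deep learning model is a plain network or a GAN. This is an important point of reference!
The paper is also a useful guide to which VGG layer's feature map gives the larger improvement. In general, deeper layers capture higher-level features.
The ablation test shows that adding the discriminator and the perceptual loss does not necessarily improve SSIM or PSNR, yet its effect on human visual perception is positive. It also shows that higher SSIM and PSNR values do not necessarily mean better image quality.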

Authors and Links

Authors: Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi

Published in: the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2017, pp. 4681–4690

Paper: https://openaccess.thecvf.com/content_cvpr_2017/html/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.html