Photo-Realistic Super-Resolution GAN (SRGAN): Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Super-resolved image (left) is almost indistinguishable from original (right). [4× upscaling]


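SRGAN is a classic super-resolution (SR) method that was state of the art (SOTA) when it was published. It raises image resolution at 4× upscaling and is particularly strong on photo-realistic quality as judged by human viewers. SRGAN builds on a deep residual network (ResNet) with skip connections; trained with a mean squared error (MSE) loss, this network is the SRResNet baseline. What sets SRGAN apart is its perceptual loss function. It combines an adversarial loss, supplied by a discriminator trained to tell generated images from real ones, with a content loss that compares high-level features extracted by VGG instead of relying only on pixel-level similarity. In the experiments, SRGAN significantly improves the mean opinion score (MOS): it scores higher than the other SOTA methods and comes closest to the original images.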


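  1. SRResNet: a 16-block deep ResNet trained with MSE, which achieves the best SSIM and PSNR at 4× upscaling.
  2. SRGAN: a GAN trained with a perceptual loss. It replaces MSE with a VGG-based loss that measures the gap between the generated image and the original image in feature space.
  3. To find out whether GAN outputs look more realistic to human observers, the experiments use MOS, an average of scores given by human raters. In the MOS evaluation, SRGAN performs best at 4× upscaling and comes closest to the original images.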
From left to right: bicubic interpolation, deep residual network optimized for MSE, deep residual generative adversarial network optimized for a loss more sensitive to human perception, original HR image. Corresponding PSNR and SSIM are shown in brackets. [4× upscaling]

Illustration of patches from the natural image manifold (red) and super-resolved patches obtained with MSE (blue) and GAN (orange). The MSE-based solution appears overly smooth due to the pixel-wise average of possible solutions in the pixel space, while GAN drives the reconstruction towards the natural image manifold producing perceptually more convincing solutions.
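
The over-smoothing described in the caption above can be reproduced with a toy example. The sketch below is not from the paper: the two 1-D "patches" are hypothetical stand-ins for equally plausible sharp HR solutions. It shows that their pixel-wise average has a lower expected MSE than either sharp candidate, even though the average contains a blurred edge.

```python
def mse(a, b):
    """Mean squared error between two equally long pixel sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two hypothetical, equally likely sharp edges for the same LR input.
candidates = [[0, 0, 1, 1], [0, 1, 1, 1]]

# Pixel-wise average of the candidates: the edge pixel becomes 0.5 (blur).
mean_patch = [sum(px) / len(candidates) for px in zip(*candidates)]

def expected_mse(pred):
    """Average MSE of a prediction over all plausible HR candidates."""
    return sum(mse(pred, c) for c in candidates) / len(candidates)

print(mean_patch)
print(expected_mse(mean_patch), [expected_mse(c) for c in candidates])
```

An MSE-trained network is pushed towards `mean_patch`, which is why a loss that rewards landing on a single sharp solution is needed for photo-realistic output.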


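1. Objective

The goal of the paper is to train a CNN generator G with weights θ. The weights are chosen so that, over pairs of low-resolution (LR) and high-resolution (HR) images, the specially designed perceptual loss between G's output for the LR image and the corresponding HR image is minimized.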
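
In symbols, this objective can be written as below. The notation follows the paper rather than this summary: $N$ is the number of training pairs, $I^{LR}_n$ and $I^{HR}_n$ are the n-th LR/HR pair, and $l^{SR}$ is the perceptual loss described in section 3.

```latex
\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N}
  l^{SR}\big(G_{\theta_G}(I^{LR}_n),\, I^{HR}_n\big)
```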

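2. SRGAN Architecture

  • The SRGAN optimization problem (min-max problem):

- Maximize the ability of the discriminator D to tell original HR images from generated HR images.
- Minimize the gap between the HR image that G generates from an LR input and a real HR image, so that D finds the two hard to distinguish.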
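
The two goals above correspond to the standard GAN min-max objective, written here in the paper's notation ($\theta_D$ and $\theta_G$ are the discriminator and generator weights). The maximization over $\theta_D$ is the first goal and the minimization over $\theta_G$ is the second.

```latex
\min_{\theta_G} \max_{\theta_D}\;
  \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}
    \big[\log D_{\theta_D}(I^{HR})\big]
  + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}
    \big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big]
```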

The architecture of Generator and Discriminator Network with corresponding kernel size (k), number of feature maps
(n) and stride (s) indicated for each convolutional layer
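  • The SRGAN architecture is shown in the figure above:
    - Generator:
    The output of the first convolution is carried over a skip connection and added to the feature map produced by the sequence of residual blocks. The result then passes through two 2× convolutional upsampling stages with a ParametricReLU activation function, yielding a generated HR image at 4× the input resolution.
    - Discriminator:
    A sequence of Convolution + Batch Normalization + Leaky ReLU (α = 0.2) blocks down-samples the input. The resulting feature map is passed through fully connected layers to produce a score, and a final sigmoid maps that score to a value between 0 and 1 indicating whether the image is judged real or fake.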
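
To make the resolution bookkeeping concrete, here is a minimal pure-Python sketch of how spatial sizes change through both networks. Several details are assumptions not stated in this summary: the 2× upsampling is modeled as a sub-pixel (pixel-shuffle) rearrangement, the discriminator's stride pattern is taken from the paper's architecture figure as best recalled, and a padding of 1 with 3×3 kernels is assumed. All function names are hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Each group of r*r input channels fills one r x r output cell.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out

def generator_output_size(h, w, upsample_blocks=2, r=2):
    """Two 2x upsampling stages give a 4x larger image."""
    for _ in range(upsample_blocks):
        h, w = h * r, w * r
    return h, w

def conv_out(n, k=3, s=1, p=1):
    """Output length of a convolution (assumed kernel 3, padding 1)."""
    return (n + 2 * p - k) // s + 1

def discriminator_feature_size(n, strides=(1, 2, 1, 2, 1, 2, 1, 2)):
    """Spatial size left after the assumed stride-1/stride-2 conv stack."""
    for s in strides:
        n = conv_out(n, s=s)
    return n

print(generator_output_size(24, 24))   # LR crop size is an example value
print(discriminator_feature_size(96))
```

With these assumptions, a 24×24 LR input becomes a 96×96 generated image, and the discriminator reduces a 96×96 image to a 6×6 feature map before the fully connected layers.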

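3. Perceptual Loss Function

SRGAN's Perceptual Loss = Content Loss + 10⁻³ × Adversarial Loss

  • Content Loss: either the MSE Loss or the VGG Loss (SRGAN uses the VGG Loss in place of MSE)
    - MSE Loss: the squared difference between corresponding pixels of the two images, averaged over all pixels. No square root is taken.
    - VGG Loss: the squared difference between the high-level feature maps that VGG extracts from the two images, averaged over the feature map.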
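MSE Loss: $l^{SR}_{MSE} = \frac{1}{r^2 W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \big(I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y}\big)^2$
VGG Loss: $l^{SR}_{VGG/i.j} = \frac{1}{W_{i,j} H_{i,j}} \sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}} \big(\phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}(G_{\theta_G}(I^{LR}))_{x,y}\big)^2$
(Notation as in the paper: r is the upscaling factor, W × H is the LR image size, and φ_{i,j} is the VGG feature map of size W_{i,j} × H_{i,j}.)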
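  • Adversarial Loss: computed from the discriminator's output for a generated image, D(G(LR)), which is the probability that D judges the generated image to be a real HR image. The loss is low when this probability is high, that is, when the generator's output fools D.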
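
The way the terms combine can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: images or VGG feature maps are represented as flat lists of numbers, `d_probs` holds hypothetical discriminator outputs D(G(LR)) for a batch, and the adversarial term uses the −log D(G(LR)) form with the 10⁻³ weight from the paper.

```python
import math

def mse_loss(sr, hr):
    """Mean of squared differences (pixels for MSE loss, features for VGG loss)."""
    assert len(sr) == len(hr) and len(sr) > 0
    return sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(sr)

def adversarial_loss(d_probs, eps=1e-12):
    """Sum of -log D(G(LR)) over a batch; eps guards against log(0)."""
    return sum(-math.log(max(p, eps)) for p in d_probs)

def perceptual_loss(sr_feat, hr_feat, d_probs, adv_weight=1e-3):
    """Content loss plus the down-weighted adversarial loss."""
    content = mse_loss(sr_feat, hr_feat)
    return content + adv_weight * adversarial_loss(d_probs)

# Hypothetical feature values and discriminator output for one sample.
print(perceptual_loss([1.0, 2.0], [1.0, 4.0], [0.5]))
```

Because the same squared-difference function serves both content losses, switching between the MSE variant and the VGG variant only changes what is passed in: raw pixels or VGG feature maps.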



  • Training Set: a random sample of 350,000 images from the ImageNet database
  • Testing Set: BSD300, Set5, Set14, BSD100


  • NVIDIA Tesla M40 GPU


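  • PSNR (peak signal-to-noise ratio): how much pixel-level error (noise) an image contains relative to the reference, in dB. Higher means closer to the reference.
  • SSIM (structural similarity): how similar an image is to the reference in luminance, contrast, and structure.
  • MOS (mean opinion score): the average of 1-to-5 scores given by human raters who view the images.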
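
As a rough illustration of the two simplest metrics, the sketch below computes PSNR from the MSE between two flat pixel lists and MOS as a plain average of ratings. The function names and the 8-bit peak value of 255 are assumptions; the paper's exact evaluation protocol (color channel, border cropping) is not covered in this summary.

```python
import math

def psnr(ref, test, max_value=255.0):
    """PSNR in dB: 10 * log10(MAX^2 / MSE); infinite for identical images."""
    assert len(ref) == len(test) and len(ref) > 0
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def mean_opinion_score(ratings):
    """Average of human ratings, each expected to lie between 1 and 5."""
    assert ratings and all(1 <= r <= 5 for r in ratings)
    return sum(ratings) / len(ratings)

print(psnr([10, 20, 30, 40], [11, 21, 31, 41]))
print(mean_opinion_score([5, 4, 3]))
```

PSNR depends only on pixel-wise error, so it cannot reward plausible texture that differs pixel by pixel from the reference, which is the gap MOS is meant to cover.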


  • Ablation test: measures how much each combination of discriminator and loss function in the architecture contributes.
Table 1: Performance of different loss functions for SRResNet and the adversarial networks on Set5 and Set14 benchmark data.

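SRGAN, by contrast, does better on MOS. Because the VGG Loss drives the HR output to match high-level features, it has an advantage on a metric that is scored by human perception. Among the VGG variants, VGG54, which uses the relu5_4 feature map of VGG, works better than VGG22, so SRGAN adopts VGG54.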

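  • Comparison with SOTA:
    - The paper compares against related super-resolution methods: nearest neighbor, bicubic, SRCNN, SelfExSR, DRCN, and ESPCN.
    - SRResNet performs best in the quantitative analysis (PSNR and SSIM).
    - SRGAN performs best on MOS, the score for perceived realism.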
Color-coded distribution of MOS scores on BSD100. For each method 2600 samples (100 images × 26 raters) were assessed. Mean shown as red marker, where the bins are centered around value i.

SRResNet (left: a,b), SRGAN-MSE (middle left: c,d), SRGAN-VGG2.2 (middle: e,f) and SRGAN-VGG54 (middle right: g,h) reconstruction results and corresponding reference HR image (right: i,j). [4× upscaling]

Table 2: Comparison of NN, bicubic, SRCNN [9], SelfExSR [31], DRCN [34], ESPCN [48], SRResNet, SRGAN-VGG54 and the original HR on benchmark data. Highest measures (PSNR [dB], SSIM, MOS) in bold. [4× upscaling]


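  1. According to the paper's MOS results, building the loss from high-level feature maps that VGG extracts from the image (the VGG Loss within the Perceptual Loss) substantially improves how the output looks to human viewers. This is a useful reference for deep learning models in general, whether a plain network or a GAN.
  2. The paper also shows which VGG layer's feature map gives the larger improvement. In general, the deeper the layer, the higher-level the features it captures.
  3. The ablation test shows that adding the discriminator and the Perceptual Loss does not necessarily improve SSIM or PSNR, yet it has a positive effect on human perception. Higher SSIM or PSNR values therefore do not always mean better image quality.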


Authors: Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi

Published in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2017, pp. 4681–4690

Paper: https://openaccess.thecvf.com/content_cvpr_2017/html/Ledig_Photo-Realistic_Single_Image_CVPR_2017_paper.html



