Fourier Spectrum Discrepancies in Deep Network
Generated Images
Tarik Dzanic
Department of Ocean Engineering
Texas A&M University
College Station, TX 77843
tdzanic@tamu.edu
Karan Shah
Department of Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332
shah@gatech.edu
Freddie D. Witherden
Department of Ocean Engineering
Texas A&M University
College Station, TX 77843
fdw@tamu.edu
Abstract
Advancements in deep generative models such as generative adversarial networks
and variational autoencoders have resulted in the ability to generate realistic images
that are visually indistinguishable from real images, which raises concerns about
their potential malicious usage. In this paper, we present an analysis of the high-
frequency Fourier modes of real and deep network generated images and show that
deep network generated images share an observable, systematic shortcoming in
replicating the attributes of these high-frequency modes. Using this, we propose a
detection method based on the frequency spectrum of the images which is able to
achieve an accuracy of up to 99.2% in classifying real and deep network generated
images from various GAN and VAE architectures on a dataset of 5000 images
with as few as 8 training examples. Furthermore, we show the impact of image
transformations such as compression, cropping, and resolution reduction on the
classification accuracy and suggest a method for modifying the high-frequency
attributes of deep network generated images to mimic real images.
1 Introduction
Figure 1: Left to right: real, StyleGAN [1], StyleGAN2 [2], PGGAN [3], VQ-VAE2 [4], and ALAE
[5] generated images.
In recent years, advances in deep generative models for image synthesis have caused widespread
concern regarding their potential malicious uses. Current state-of-the-art models can generate hyper-
realistic images that are visually indistinguishable from real images, as shown in Fig. 1. These
models can be used for unethical purposes, such as misinformation campaigns, fabricating evidence,
or attacking biometric security systems, and as a result, recent research efforts have been focused on
developing methods for detecting such images [6, 7]. Various detection strategies have been employed,
from traditional image forensics methods such as analyzing encoding or acquisition fingerprints to,
more recently, machine learning based approaches such as statistical modeling for disparities in color
components, textures, or features [8, 9, 10, 11, 12]. Boulkenafet et al. developed a method for face
spoofing detection by analyzing the statistical distributions of the image spectra in other color spaces
such as HSV and YCbCr [9]. Li et al. expanded upon this method to detect images generated by
generative adversarial networks [10]. Marra et al. showed that generative adversarial networks leave
unique artificial fingerprints in the noise of their generated images that are dependent on the network
architecture which can be used to detect if an image was generated by a particular network [11]. In
concurrent work, Wang et al. trained a deep neural network for classifying deep network generated
images and showed that a classifier trained on one generative model can reasonably generalize to
other models as well [13]. Although these methods have shown promise in terms of accuracy with
minimal user input, they require large amounts of data to train, something which may not be feasible
in many applications. As such, there is a need for detection methods that can achieve similar levels of
accuracy with minimal training data while generalizing to unknown generative models.
In this work, we explore the high-frequency characteristics of real and deep network generated
images and propose a simple yet robust detection method for deep network generated images based
on these characteristics. We compare real images to images generated by various generative models
– generative adversarial networks [14] such as StyleGAN [1], StyleGAN2 [2], and PGGAN [3] as
well as variational and hybrid autoencoders [15] such as VQ-VAE2 [4] and ALAE [5]. Based on
a reduced-order model of the high-frequency spectrum of the images, we show that the relative
magnitude and the decay rate of the high-frequency spectrum are clearly distinguishable between real
and deep network generated images, indicating that the low-level attributes of images generated by
generative models display different properties. Furthermore, we show that at higher resolutions, these
differences are more easily observed, and we consider the impact of common image transformation
methods such as compression and cropping which can significantly affect the high-frequency spectra,
causing difficulties in detecting deep network generated images. We also show the results of binary
classification experiments based on these low-level attributes of images at different resolutions and
compression levels from datasets such as Flickr-Faces-HQ (FFHQ) and the aforementioned generative
models. Finally, we present a method for modifying the spectra of deep network generated images to
mimic the high-frequency characteristics of real images as a way of deceiving such classifiers.
2 Methodology
2.1 Fourier spectrum analysis
To analyze the characteristics of real and deep network generated images in the frequency domain, a
Fourier transform is required. For a discrete two-dimensional signal f(p, q) representing individual
color channels of an image of size m × n, the discrete Fourier transform F(kx, ky) is defined as

F(k_x, k_y) = \frac{1}{mn} \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} f(p, q) \, e^{-i 2\pi (k_x p / m + k_y q / n)},    (1)
which is of the same dimension as the input signal. To construct a scale and rotation invariant
threshold for the highest frequencies, a transform in wavenumber space can be performed from
Cartesian coordinates kx, ky to normalized polar coordinates kr ∈ [0, 1] and θ ∈ [0, 2π):

F(k_r, \theta) = F(k_x, k_y) : \quad k_r = \sqrt{\frac{k_x^2 + k_y^2}{\frac{1}{4}(m^2 + n^2)}}, \quad \theta = \operatorname{atan2}(k_y, k_x).    (2)
Furthermore, the dimensionality can be reduced without significant loss in information by azimuthally
averaging the magnitude of the Fourier coefficients to obtain the reduced spectrum c(kr), a quantitative
representation of the strength of the signal with respect to the radial wavenumber kr:

c(k_r) = \frac{1}{2\pi} \int_0^{2\pi} \left| F(k_r, \theta) \right| \, d\theta.    (3)
In practice, this averaging is approximated with binning along the radial direction to smooth the large
fluctuations in the Fourier spectrum at high frequencies. Although a classifier can be trained on the
reduced spectrum directly, a simpler and more robust classifier can be built by fitting a decay function
to the reduced spectrum and classifying using the parameters of this function. As the spectra of natural
images tend to follow a power law [16], classification based on a power law decay function is
considered in this work, modeled as

c(k_r) \approx b_1 \left( \frac{k_r}{k_T} \right)^{-b_2}, \quad k_r \in [k_T, 1],    (4)

where the parameter kT denotes a threshold wavenumber above which the fitting is performed. With
this approximate form, the high-frequency spectrum is represented by two independent parameters:
b1, which represents the magnitude of the high-frequency content, and b2, which represents the decay
rate of the high-frequency spectrum. These parameters, along with the reduced spectra, are used to
highlight differences in the high-frequency characteristics of real and deep network generated images.
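Taken together, Eqs. (1)-(4) define the spectral features used throughout this work. A minimal NumPy/SciPy sketch is given below; the function names, the number of radial bins, and the use of scipy.optimize.curve_fit for the fit are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def reduced_spectrum(img, n_bins=64):
    """Azimuthally averaged magnitude spectrum of a single-channel image (Eqs. 1-3)."""
    m, n = img.shape
    F = np.fft.fft2(img) / (m * n)                    # Eq. (1)
    F = F / np.abs(F[0, 0])                           # normalize by the DC gain
    kx = np.fft.fftfreq(m) * m                        # signed integer wavenumbers
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))   # Eq. (2), kr in [0, 1]

    # Approximate the azimuthal average by binning |F| along the radial direction.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(kr.ravel(), bins) - 1, 0, n_bins - 1)
    counts = np.maximum(np.bincount(idx, minlength=n_bins), 1)
    c = np.bincount(idx, weights=np.abs(F).ravel(), minlength=n_bins) / counts
    k = 0.5 * (bins[:-1] + bins[1:])                  # bin centers
    return k, c

def fit_decay(k, c, k_T=0.75):
    """Fit c(kr) ~ b1 * (kr / kT)**(-b2) over kr in [kT, 1] (Eq. 4)."""
    mask = k >= k_T
    model = lambda kr, b1, b2: b1 * (kr / k_T) ** (-b2)
    (b1, b2), _ = curve_fit(model, k[mask], c[mask], p0=(c[mask][0], 1.0))
    return b1, b2
```

With the default n_bins = 64 and kT = 0.75, the fit uses roughly the last sixteen bins of the reduced spectrum; both values are illustrative choices rather than prescribed settings.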
2.2 Image transformations
In order to minimize storage and bandwidth requirements, deep network generated images are
typically resized, cropped, and/or compressed, procedures which can change the characteristics
of the image spectra. The image resolution dictates the maximum frequency in the frequency
domain, and higher resolutions yield more information at the highest frequencies. Compression may
particularly affect the high-frequency spectrum of an image since the high-frequency components of
an image correspond to small-scale features whereas the low-frequency components correspond to
large-scale features. Therefore, compression algorithms generally tolerate losses in the high-frequency
components as these have less impact on perceived image quality than the low-frequency
components [17]. Common compression methods such as quantization and subsampling
can spuriously introduce or reduce high-frequency content, respectively [18].
Images with varying resolutions – both native and cropped – and compression levels were analyzed
in this work. Datasets of different resolutions were used in the experiments, and compression levels
were varied using lossy JPEG compression with Python Imaging Library (Pillow). A quality metric
is given to indicate the amount of compression. For the 100% quality images, the original provided
images were used, consisting of either lossless PNG or 100% quality JPEG images. Although the
latter is not considered lossless, negligible differences in the reduced spectra between the two image
formats were seen, and therefore both of these are referred to as uncompressed in this work. Two
additional compression levels were chosen corresponding to high quality (95%) and medium quality
(85%) compression. The latter was chosen qualitatively based on the visually noticeable presence of
compression artifacts while the former was chosen as it is a default setting in many applications.
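As an illustrative example (not the exact pipeline used here), the two lossy compression levels can be produced with Pillow along the following lines; the file names are placeholders.

```python
from PIL import Image

# Re-encode an image at the JPEG qualities used in the experiments (95% and 85%).
# "sample.png" is a placeholder file name.
img = Image.open("sample.png").convert("RGB")
for quality in (95, 85):
    img.save(f"sample_q{quality}.jpg", format="JPEG", quality=quality)
```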
2.3 Classification
2.3.1 Datasets
A binary classification task was performed between various real and deep network generated images.
Image samples were taken from datasets of real images and images generated by StyleGAN,
StyleGAN2, PGGAN, VQ-VAE2, and ALAE architectures at compression qualities of 100%, 95%, and
85%. These datasets, shown in Table 1, are denoted by R, G, S, P, V, and A, respectively, with the
subscript denoting the resolution. Additional datasets, denoted with the subscript 768, were created
by taking the native 1024² resolution datasets and cropping them to a resolution of 768².
For the majority of the datasets, 10% of the images were used for training while the remaining
90% were used for testing to highlight the relatively low number of training examples required for
classification. For the high-resolution VQ-VAE2 datasets (V1024/V768), only a small number of
high-resolution images were presented in the work by Razavi et al. [4], and therefore only 8 images
were available for training and 9 for testing. For the low-resolution VQ-VAE2 dataset (V256), a larger
number of low-resolution images were provided, and 100 of the 364 images were used for training
while the remaining were used for testing. In both cases, these images were duplicated to match the
size of the other datasets to give equal weight to the training and testing metrics.
2.3.2 Classifier
To emphasize the postulate that the low-level properties of real and deep network generated images
are fundamentally different, the classification was performed using only "simple" classifiers. A
k-nearest neighbors (KNN) classifier with k = 5 was used for classification between real and deep
network generated images with respect to the decay parameters (b1, b2) of the grayscale component of
the images. Since the data was easily separable in many cases, negligible differences in classification
accuracy were obtained with other KNN hyperparameter choices and with different classifiers such
as a support vector machine with a variety of kernel choices. As the classification was performed
with respect to only two parameters, minimal training data was required and the computational cost
of training and classifying was insignificant.
Classification accuracy was determined by the ability of the classifier to predict if an image was real
or fake, and no weight was placed on discerning between the architecture that generated the images.
Overall classification accuracy was calculated using the real image datasets and the datasets from
each generative model. Individual classification accuracies for each generative model were separately
calculated from a subset of training and testing data using only real images and images from the
respective model. The pipeline for the classification task was as follows (a minimal sketch of the
final classification step is given after the list):
1. Perform the discrete Fourier transform of the image and normalize by the DC gain.
2. Transform from Cartesian coordinates to normalized polar coordinates in the frequency domain.
3. Bin the magnitudes of the Fourier coefficients along the radial direction and average azimuthally to obtain the reduced spectrum.
4. Fit the decay parameters b1, b2 to the reduced spectrum above a threshold wavenumber kT.
5. Train/apply the binary classifier to the decay parameters of the image to predict if the image is real or fake.
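A minimal scikit-learn sketch of step 5 is shown below; the (b1, b2) features are assumed to come from a routine such as the fit_decay example in Section 2.1, and the placeholder data and variable names are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X_train holds one (b1, b2) pair per training image, y_train holds labels
# (1 = real, 0 = deep network generated). Random placeholder data is used here.
rng = np.random.default_rng(0)
X_train = rng.random((20, 2))
y_train = rng.integers(0, 2, 20)

clf = KNeighborsClassifier(n_neighbors=5)   # k = 5, as used in this work
clf.fit(X_train, y_train)

# Classify a new image from its fitted decay parameters (placeholder values).
b1_new, b2_new = 0.01, 2.5
print(clf.predict([[b1_new, b2_new]]))      # 1 -> predicted real, 0 -> predicted fake
```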
3 Experiments and results
In this section, the reduced spectrum for the images from the datasets in Table 1 is shown as well as
the effects of resolution, cropping, and compression on the spectra. Additionally, experimental results
for the classification task between real and deep network generated images at different resolutions,
cropping levels, and compression qualities are presented.
3.1 Reduced spectrum
A comparison of the reduced spectrum statistics of the grayscale-converted 1024² pixel images from
the datasets in Table 1 is shown in Fig. 2, normalized by the spectrum at a threshold wavenumber of
kT = 0.75.
Table 1: Experimental datasets

Dataset | Origin                 | Dataset Type | Resolution | Compression Quality | Training Samples | Testing Samples
R1024   | FFHQ                   | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
G1024   | Karras et al. [1]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
S1024   | Karras et al. [2]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
P1024   | Karras et al. [3]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
V1024   | Razavi et al. [4]      | Faces        | 1024²      | [100, 95, 85]       | 8                | 9
A1024   | Pidhorskyi et al. [5]  | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
R256    | Zhang et al. [19]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
G256    | Karras et al. [1]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
S256    | Karras et al. [2]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
P256    | Karras et al. [3]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
V256    | Razavi et al. [4]      | Animals      | 256²       | [100, 95, 85]       | 100              | 264
A256    | Pidhorskyi et al. [5]  | Faces        | 256²       | [100, 95, 85]       | 100              | 900
At the threshold wavenumber, the real images show a decay initially proportional to approximately
kr^-4 before leveling off near the end of the spectrum. In contrast, the deep network generated
images – with the exception of images generated by StyleGAN2 – do not show such decay, exhibiting
decay exponents of less than 1. As the threshold wavenumber was increased, the StyleGAN2 images
behaved similarly to the other deep network generated images. Similar results were observed with
the spectra of the individual color channels as with the grayscale-converted images.
Figure 2: Normalized reduced spectra: mean (left) and ±1 standard deviation (right).
3.1.1 Resolution and cropping
Figure 3: Mean normalized reduced spectra: 1024² (left), cropped 768² (middle), and 256² (right).
The reduced spectrum was computed for the cropped 768² and 256² pixel images sampled from the
datasets in Table 1. These spectra were compared to the 1024² pixel image spectra from Fig. 2. In
comparison to the 1024² pixel images, the 768² pixel image spectra behaved almost identically, as
shown in Fig. 3. However, the 256² pixel image spectra behaved noticeably differently, with the deep
network generated image spectra exhibiting lower decay rates whereas the real image spectra were
qualitatively similar to those of the higher resolution images. As the resolution was lowered, it became
more difficult to distinguish between the real and deep network generated image spectra as the maximum
frequency was reduced. However, the same observations as with the higher resolution images could be
drawn with the lower resolution images if the threshold wavenumber was increased, as the tail of the
deep network generated image spectra began to flatten at the highest wavenumbers.
3.1.2 Compression
Figure 4: Mean normalized reduced spectra: 100% (left), 95% (middle), and 85% quality (right).
The effects of image compression on the reduced spectra of the 1024² pixel image datasets in
Table 1 are shown in Fig. 4 for compression qualities of 100%, 95%, and 85%. Even at a mild
compression level (95% quality), the high-frequency reduced spectra of the deep network generated images
were significantly modified, and their decay rate converged to the decay rate of the real images. At 85%
compression, the StyleGAN, StyleGAN2, and ALAE image spectra were essentially indistinguishable
from the real image spectra, with only slightly lower decay rates than the compressed real images.
In contrast, the reduced spectra of the VQ-VAE2 and PGGAN images were less affected by the
compression, with VQ-VAE2 images showing clearly distinguishable spectra even for compression
qualities as low as 60%. The relative effects of compression on the decay rates of the high-frequency
spectra can be directly attributed to the amount of high-frequency content. Lossy compression
methods, whose effects are proportional to the frequency and effectively modify the decay rate, would
have negligible impact if there was very little high-frequency content.
These observations indicate that compression, even in small amounts, acts to homogenize the spectral
content of images generated by certain architectures. In a model-unaware scenario where the classifier
does not know if the images are compressed, it would not be able to robustly distinguish between
uncompressed real images, compressed real images, and compressed StyleGAN, StyleGAN2, and
ALAE images, but would be able to easily distinguish their uncompressed counterparts. However,
VQ-VAE2 and certain PGGAN images would remain easily distinguishable regardless of compression
as their spectra’s decay rates are less affected, and a different method for mimicking the spectrum of
real images is needed.
3.2 Classification
The results of the KNN classifier for image resolutions of 1024², 768² (cropped), and 256² with
compression qualities of 100% (uncompressed), 95%, and 85% are shown in Table 2. When
classifying uncompressed 1024² images (experiment A), the classifier obtained a 99.2%
accuracy across all image types, with a minimum and maximum accuracy of 97.4% (PGGAN) and
99.9% (StyleGAN), respectively, when classifying images generated by a single architecture.
For the uncompressed 1024² images (experiment A), the distribution of the data along the b1–b2
axes displayed distinct clusters corresponding to the various image types, as shown in Fig. 5a. Real
images exhibited a range of high-frequency content (b1) with notably high decay rates (b2). All
deep network generated images had significantly smaller decay rates than the real images, but the
high-frequency content of images from each generative model varied, with VQ-VAE2 and ALAE
producing the lowest and highest amounts of high-frequency content, respectively. In contrast to the
reduced spectrum statistics shown in Fig. 2, where the StyleGAN2 images were most similar to the
real images, the PGGAN images were instead more likely to be misclassified. This can be attributed
to the increased high-frequency content of the StyleGAN2 images, which allowed the classifier to
distinguish them from real images even though the decay rates were more similar. The lower
level of high-frequency content in the PGGAN images made them more similar to certain real
images with low decay rates, which caused instances of misclassification.
As the images were compressed (experiments B-C), the distributions of the decay rates of the real
and deep network generated images converged, and many of the clusters were indistinguishable
at 85% compression quality, as shown in Fig. 5c, where the overall classification accuracy was
reduced to 83.9%.
Table 2: Classification experiments and results

Experiment | Resolution | Compression Quality | Overall Class. Acc. | StyleGAN Class. Acc. | StyleGAN2 Class. Acc. | PGGAN Class. Acc. | VQ-VAE2 Class. Acc. | ALAE Class. Acc.
A          | 1024²      | 100                 | 99.2%               | 99.9%                | 99.5%                 | 97.4%             | 99.8%               | 99.8%
B          | 1024²      | 95                  | 94.4%               | 99.2%                | 88.5%                 | 88.5%             | 100%                | 99.7%
C          | 1024²      | 85                  | 83.9%               | 78.9%                | 65.9%                 | 78.7%             | 99.6%               | 87.4%
D          | 768²       | 100                 | 98.5%               | 100%                 | 99.1%                 | 95.9%             | 99.9%               | 99.9%
E          | 768²       | 95                  | 93.0%               | 97.9%                | 85.4%                 | 87.3%             | 100%                | 99.5%
F          | 768²       | 85                  | 84.6%               | 77.1%                | 68.6%                 | 79.3%             | 99.6%               | 85.7%
G          | 256²       | 100                 | 88.8%               | 85.0%                | 87.4%                 | 69.0%             | 92.0%               | 90.7%
H          | 256²       | 95                  | 88.1%               | 81.7%                | 83.4%                 | 68.2%             | 92.2%               | 87.7%
I          | 256²       | 85                  | 87.4%               | 67.8%                | 79.3%                 | 64.8%             | 87.7%               | 80.6%
Nearly identical observations were drawn from the cropped 768² pixel images
(experiments D-F, not shown), and the overall and individual classification accuracies were generally
within 1-2% of their native resolution counterparts. Furthermore, as the decay fitting method in Eq. 4
is independent of the resolution, the classifier trained on 1024² pixel images was able to maintain a
similar classification accuracy when classifying 768² pixel images, demonstrating the robustness of
the method.
Due to their low level of high-frequency content, the effects of compression on the VQ-VAE2 images were
minimal, and as such, the VQ-VAE2 images were correctly classified regardless of compression
quality for both 1024² and 768² pixel images. The reason for the notable lack of high-frequency
content in VQ-VAE2 images is not immediately evident. It is hypothesized that VAEs tend to
distribute probability mass diffusely over the data space, and thus their generated images tend to be
blurry [20, 21, 22]. Although this is not visually noticeable, it is apparent in the frequency domain
as the high-frequency content associated with sharp edges is dramatically reduced.
When the classifier was tested on the 256² images (experiments G-I), the classification accuracy was
significantly lower. Even for the uncompressed images, the data was not as clearly separable as with
the 1024² images, and the classifier was only able to obtain an 88.8% overall classification accuracy.
However, the overall distribution trend along the b1–b2 axes of the various image types was similar
at low and high resolutions, as shown in Fig. 5a and Fig. 6a. Compression did not have as large an
effect on the 256² images, and thus the classifier performed only slightly worse (87.4%) at 85%
compression quality than on the uncompressed images. In contrast to the high-resolution experiments,
the effects of compression on the decay rates were minimal.
(a) A: 1024², 100% quality. (b) B: 1024², 95% quality. (c) C: 1024², 85% quality.
Figure 5: Experiments A-C at 1024² resolution.
(a) G: 256², 100% quality. (b) H: 256², 95% quality. (c) I: 256², 85% quality.
Figure 6: Experiments G-I at 256² resolution.
3.3 Discussion
The cause of the disparities in the decay rates of the high-frequency content of deep network generated
images is of particular interest as this issue is evident in each one of the investigated algorithms. One
might believe that regularization has an effect on the high-frequency attributes as deep generative
networks are not incentivised to learn the high-frequency components (i.e. noise) of their input
data to discourage overfitting. However, we continued training the ALAE model with and without
regularization, and we observed negligible differences in the decay rates of images generated by the
model. In the work of Durall et al., discrepancies in the high-frequency content of deep network
generated images were attributed to the effects of up-convolution [23]. Although this is shown to
have an impact on the spectral composition of the images, it did not necessarily affect the decay rates
(see Fig. 5). Instead, these discrepancies are most likely explained by the work of Khayatkhoei and
Elgammal, which presents an analysis of the spectral bias of convolution layers in a deep generative
network [24]. They showed that linear dependencies exist in a convolution layer's filter spectrum
which result in correlations between frequency components that are generally more pronounced at
high frequencies. Consequently, these correlations can cause the flatlining of the frequency spectrum
at high frequencies as seen in Fig. 2.
4 Spectrum synthesis
As the results in the classification experiments show, classifiers based on the high-frequency
characteristics of images can easily distinguish between real and deep network generated images in many cases.
In some scenarios, compression can effectively disguise these deep network generated images from the
classifier, whereas in other scenarios, it has little effect. However, in scenarios where compression is
a viable spoofing tool, the amount of compression required generally introduces noticeable visual
artifacts.
To robustly disguise deep network generated images from classifiers based on high-frequency spectrum
characteristics, a post-processing method for modifying the spectra of the deep network generated
images to behave as real image spectra is proposed. Given the spectrum of a real image as a target,
the high-frequency components of a deep network generated image were scaled to match the real
image to produce a spoofed spectrum F̄(kr, θ), which was then transformed back to a spoofed image.
The scaling factor was defined as the ratio of the fitted decay functions of the target (real) and source
(deep network generated) images:

\bar{F}(k_r, \theta) = F(k_r, \theta) \left[ \left( 1 - \varphi(k_r) \right) + \varphi(k_r) \, \frac{b_{1,t}}{b_{1,s}} \left( \frac{k_r}{k_T} \right)^{-(b_{2,t} - b_{2,s})} \right].    (5)
A smooth hyperbolic tangent blending function φ(kr) was used to leave the low-frequency components
of the image unaffected without introducing visual artifacts:

\varphi(k_r) = \frac{1}{2} \left( \tanh(k_r - k_T) + 1 \right).    (6)
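A NumPy sketch of this scaling is given below, following Eqs. (5)-(6) as written; the function name and argument layout are assumptions, and the fitted parameters are taken to come from a routine such as the fit_decay example in Section 2.1.

```python
import numpy as np

def spoof_spectrum(src_img, b_target, b_source, k_T=0.75):
    """Scale the high-frequency spectrum of a generated image toward a real target.
    b_target = (b1_t, b2_t) and b_source = (b1_s, b2_s) are the fitted decay
    parameters (Eq. 4) of the target (real) and source (generated) images."""
    m, n = src_img.shape
    F = np.fft.fft2(src_img)
    kx = np.fft.fftfreq(m) * m
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))

    phi = 0.5 * (np.tanh(kr - k_T) + 1.0)                 # Eq. (6) blending function
    (b1_t, b2_t), (b1_s, b2_s) = b_target, b_source
    ratio = (b1_t / b1_s) * np.where(kr > 0, kr / k_T, 1.0) ** (-(b2_t - b2_s))
    scale = (1.0 - phi) + phi * ratio                     # Eq. (5) scaling factor
    return np.real(np.fft.ifft2(F * scale))               # back to image space
```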
The effects of the spectrum synthesis method on the reduced spectrum of the example VQ-VAE2
image are shown in Fig. 7. Using the real image in Fig. 1 as the target, the spoofed spectrum matched
the real image spectrum very closely in both decay and magnitude and was visually indistinguishable
from the original image. When compared to its compressed counterpart, the spoofed image was
of noticeably higher quality since the spectrum synthesis method did not introduce compression
artifacts, as shown by the pixel difference plot in Fig. 7b. The spoofed image fell well within the
classification boundary for real images and effectively disguised the image to the classifier whereas
compression could not. Similar results were obtained with other generative models.
(a) Spoofed image. (b) Original/spoofed pixel difference (scaled ×100). (c) Reduced spectra normalized by DC gain.
Figure 7: Spectrum synthesis method.
5 Conclusion
In this work, we presented an analysis of the high-frequency modes of real images and images
generated by various generative models. We showed that the Fourier modes of deep network
generated images at the highest frequencies did not decay as seen in real images but instead stayed
approximately constant. By modeling the decay of the Fourier spectrum at high frequencies, we
observed that the high-frequency spectra of real and deep network generated images had distinct
characteristics: real images showed large decay rates and a range of magnitudes, whereas deep
network generated images showed small decay rates and the magnitude varied depending on the
generative model. These differences were more noticeable at higher resolutions, but lossy image
compression algorithms modified the high-frequency spectra and reduced those differences. When
highly compressed, images generated by certain architectures were indistinguishable from real images
in the frequency domain, but images generated by other architectures like VAEs were not affected
due to their low levels of high-frequency content.
We proposed a detection method for identifying deep network generated images based on their high-
frequency characteristics and performed binary classification experiments on datasets of real images
and images generated by StyleGAN, StyleGAN2, PGGAN, VQ-VAE2, and ALAE architectures.
This detection method achieved an accuracy of 99.2% on uncompressed, high-resolution images with
minimal training data, but the accuracy decreased with highly compressed and/or low-resolution
images, although the classifier was able to robustly classify images at resolutions on which it was not
trained. Finally, we presented a method for modifying the high-frequency spectra of a deep network
generated image to mimic the spectra of real images, effectively deceiving the classifier without any
visually noticeable changes in the image itself. In the future, these detection and synthesis methods
will be applied to videos manipulated by deep generative models (i.e. deepfakes) to evaluate their
effectiveness.
Broader Impact
The most apparent impact of the present work is in the application of combating unethical uses of
deep network generated images. Since the proposed approach can robustly generalize to unknown
generative models and requires minimal training data, it can be easily implemented in browser plugins
and mobile applications to warn users that an image is likely fake. This work can be expanded upon
for detecting manipulated videos, a trending topic in current research, or towards adversarial training
of vision tasks. However, systematic implementation of the proposed method for spoofing images
would effectively nullify the capabilities of the classifier, and as a result, could create even more
realistic (and harder to detect) deep network generated images for malicious purposes.
Similarly, the current work can be used as a basis for improving the training process of generative
models; for example, a metric can be given for generated images in the frequency domain and used
for evaluating generative models, and a loss function in Fourier space, weighted towards the highest
modes, can be introduced to aid in improving these networks. Generative models for non-image data
could benefit from analysis in other domains as well.
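As a purely illustrative sketch (not something implemented or evaluated in this work), such a Fourier-space loss weighted toward the highest modes might take the following form; the quadratic weighting is an assumption.

```python
import numpy as np

def fourier_weighted_loss(generated, target):
    """Illustrative spectral loss that penalizes differences more heavily at
    high radial wavenumbers. Not used or evaluated in this work."""
    m, n = generated.shape
    Fg = np.fft.fft2(generated)
    Ft = np.fft.fft2(target)
    kx = np.fft.fftfreq(m) * m
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))  # normalized radius
    weight = kr**2                                           # emphasize the highest modes
    return np.mean(weight * np.abs(Fg - Ft)**2)
```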
A more general result of this work is the conclusion that generative models can have systematic
shortcomings that are not immediately evident until observed in other domains (e.g. frequency). In
fields where data is scarce and high-quality synthetic data is required to train models, fundamental
flaws in synthetic data can have hidden detrimental effects. This opens up questions about the
structure of synthetic data and what we perceive to be "high-quality" synthetic data.
Acknowledgments and Disclosure of Funding
The authors do not acknowledge any outside funding sources or competing interests.
References
[1] Tero Karras, Samuli Laine, and Timo Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks". In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 4401–4410.
[2] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. "Analyzing and Improving the Image Quality of StyleGAN". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[3] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation". In: International Conference on Learning Representations. 2018.
[4] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. "Generating Diverse High-Fidelity Images With VQ-VAE-2". In: Advances in Neural Information Processing Systems. 2019, pp. 14866–14876.
[5] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. "Adversarial Latent Autoencoders". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[6] Paweł Korus. "Digital Image Integrity – A Survey of Protection and Verification Techniques". In: Digital Signal Processing 71 (Dec. 2017), pp. 1–26.
[7] Javier Galbally, Sebastien Marcel, and Julian Fierrez. "Biometric Antispoofing Methods: A Survey in Face Recognition". In: IEEE Access 2 (2014), pp. 1530–1552.
[8] Alessandro Piva. "An Overview on Image Forensics". In: ISRN Signal Processing 2013 (2013), pp. 1–22.
[9] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. "Face Spoofing Detection Using Colour Texture Analysis". In: IEEE Transactions on Information Forensics and Security 11.8 (Aug. 2016), pp. 1818–1830.
[10] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. "Identification of Deep Network Generated Images Using Disparities in Color Components". In: Signal Processing 174 (Sept. 2020), p. 107616.
[11] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. "Do GANs Leave Artificial Fingerprints?" In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, Mar. 2019.
[12] Xinsheng Xuan, Bo Peng, Wei Wang, and Jing Dong. "On the Generalization of GAN Image Forensics". In: Biometric Recognition. Springer International Publishing, 2019, pp. 134–141.
[13] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. "CNN-Generated Images Are Surprisingly Easy to Spot... for Now". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative Adversarial Nets". In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014, pp. 2672–2680.
[15] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. arXiv: 1312.6114 [cs.ML].
[16] A. van der Schaaf and J.H. van Hateren. "Modelling the Power Spectra of Natural Images: Statistics and Information". In: Vision Research 36.17 (Sept. 1996), pp. 2759–2770.
[17] Michael Parker. "Image and Video Compression Fundamentals". In: Digital Signal Processing 101 (Second Edition). 2017. Chap. 25, pp. 329–346.
[18] David Salomon and Giovanni Motta. Handbook of Data Compression. Springer London, 2010.
[19] Weiwei Zhang, Jian Sun, and Xiaoou Tang. "Cat Head Detection - How to Effectively Exploit Shape and Texture Features". In: Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2008, pp. 802–816.
[20] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially Learned Inference. 2016. arXiv: 1606.00704 [cs.LG].
[21] Wenzhe Shi, Jose Caballero, Lucas Theis, Ferenc Huszar, Andrew Aitken, Christian Ledig, and Zehan Wang. Is The Deconvolution Layer The Same As A Convolutional Layer? 2016. arXiv: 1609.07009 [cs.CV].
[22] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. "Autoencoding Beyond Pixels Using a Learned Similarity Metric". In: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. ICML. 2016, pp. 1558–1566.
[23] Ricard Durall, Margret Keuper, and Janis Keuper. "Watch Your Up-Convolution: CNN Based Generative Deep Neural Networks Are Failing to Reproduce Spectral Distributions". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[24] Mahyar Khayatkhoei and Ahmed Elgammal. Spatial Frequency Bias in Convolutional Generative Adversarial Networks. 2020. arXiv: 2010.01473 [cs.LG].