Fourier Spectrum Discrepancies in Deep Network
Generated Images
Tarik Dzanic
Department of Ocean Engineering
Texas A&M University
College Station, TX 77843
tdzanic@tamu.edu
Karan Shah
Department of Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332
shah@gatech.edu
Freddie D. Witherden
Department of Ocean Engineering
Texas A&M University
College Station, TX 77843
fdw@tamu.edu
Abstract
Advancements in deep generative models such as generative adversarial networks
and variational autoencoders have resulted in the ability to generate realistic images
that are visually indistinguishable from real images, which raises concerns about
their potential malicious usage. In this paper, we present an analysis of the high-
frequency Fourier modes of real and deep network generated images and show that
deep network generated images share an observable, systematic shortcoming in
replicating the attributes of these high-frequency modes. Using this, we propose a
detection method based on the frequency spectrum of the images which is able to
achieve an accuracy of up to 99.2% in classifying real and deep network generated
images from various GAN and VAE architectures on a dataset of 5000 images
with as few as 8 training examples. Furthermore, we show the impact of image
transformations such as compression, cropping, and resolution reduction on the
classification accuracy and suggest a method for modifying the high-frequency
attributes of deep network generated images to mimic real images.
1 Introduction
Figure 1: Left to right: real, StyleGAN [1], StyleGAN2 [2], PGGAN [3], VQ-VAE2 [4], and ALAE
[5] generated images.
In recent years, advances in deep generative models for image synthesis have caused widespread
concern regarding their potential malicious uses. Current state-of-the-art models can generate hyper-
realistic images that are visually indistinguishable from real images, as shown in Fig. 1. These
models can be used for unethical purposes, such as misinformation campaigns, fabricating evidence,
or attacking biometric security systems, and as a result, recent research efforts have been focused on
developing methods for detecting such images [6, 7]. Various detection strategies have been employed,
from traditional image forensics methods such as analyzing encoding or acquisition fingerprints to,
more recently, machine learning based approaches such as statistical modeling for disparities in color
components, textures, or features [8, 9, 10, 11, 12]. Boulkenafet et al. developed a method for face
spoofing detection by analyzing the statistical distributions of the image spectra in other color spaces
such as HSV and YCbCr [9]. Li et al. expanded upon this method to detect images generated by
generative adversarial networks [10]. Marra et al. showed that generative adversarial networks leave
unique artificial fingerprints in the noise of their generated images that are dependent on the network
architecture which can be used to detect if an image was generated by a particular network [11]. In
concurrent work, Wang et al. trained a deep neural network for classifying deep network generated
images and showed that a classifier trained on one generative model can reasonably generalize to
other models as well [13]. Although these methods have shown promise in terms of accuracy with
minimal user input, they require large amounts of data to train, something which may not be feasible
in many applications. As such, there is a need for detection methods that can achieve similar levels of
accuracy with minimal training data while generalizing to unknown generative models.
In this work, we explore the high-frequency characteristics of real and deep network generated
images and propose a simple yet robust detection method for deep network generated images based
on these characteristics. We compare real images to images generated by various generative models
– generative adversarial networks [14] such as StyleGAN [1], StyleGAN2 [2], and PGGAN [3] as
well as variational and hybrid autoencoders [15] such as VQ-VAE2 [4] and ALAE [5]. Based on
a reduced-order model of the high-frequency spectrum of the images, we show that the relative
magnitude and the decay rate of the high-frequency spectrum are clearly distinguishable between real
and deep network generated images, indicating that the low-level attributes of images generated by
generative models display different properties. Furthermore, we show that at higher resolutions, these
differences are more easily observed, and we consider the impact of common image transformation
methods such as compression and cropping which can significantly affect the high-frequency spectra,
causing difficulties in detecting deep network generated images. We also show the results of binary
classification experiments based on these low-level attributes of images at different resolutions and
compression levels from datasets such as Flickr-Faces-HQ (FFHQ) and the aforementioned generative
models. Finally, we present a method for modifying the spectra of deep network generated images to
mimic the high-frequency characteristics of real images as a way of deceiving such classifiers.
2 Methodology
2.1 Fourier spectrum analysis
To analyze the characteristics of real and deep network generated images in the frequency domain, a
Fourier transform is required. For a discrete two-dimensional signal f(p, q) representing individual
color channels of an image of size m × n, the discrete Fourier transform F(kx, ky) is defined as

F(k_x, k_y) = \frac{1}{mn} \sum_{p=0}^{m-1} \sum_{q=0}^{n-1} f(p, q) \, e^{-i 2\pi (k_x p / m + k_y q / n)},    (1)
which is of the same dimension as the input signal. To construct a scale and rotation invariant
threshold for the highest frequencies, a transform in wavenumber space can be performed from
Cartesian coordinates kx, ky to normalized polar coordinates kr ∈ [0, 1] and θ ∈ [0, 2π):

F(k_r, \theta) = F(k_x, k_y) : \quad k_r = \sqrt{\frac{k_x^2 + k_y^2}{\frac{1}{4}(m^2 + n^2)}}, \quad \theta = \operatorname{atan2}(k_y, k_x).    (2)
Furthermore, the dimensionality can be reduced without significant loss in information by azimuthally
averaging the magnitude of the Fourier coefficients to obtain the reduced spectrum c(kr), a quantitative
representation of the strength of the signal with respect to the radial wavenumber kr:

c(k_r) = \frac{1}{2\pi} \int_0^{2\pi} \left| F(k_r, \theta) \right| \, d\theta.    (3)
In practice, this averaging is approximated with binning along the radial direction to smooth the large
fluctuations in the Fourier spectrum at high frequencies. Although a classifier can be trained on the
reduced spectrum directly, a simpler and more robust classifier can be built by fitting a decay function
to the reduced spectrum and classifying using the parameters of this function. As the spectra of natural
images tend to follow a power law [16], classification based on a power law decay function is
considered in this work, modeled as

c(k_r) \approx b_1 \left( \frac{k_r}{k_T} \right)^{-b_2}, \quad k_r \in [k_T, 1],    (4)

where the parameter kT denotes a threshold wavenumber above which the fitting is performed. With
this approximate form, the high-frequency spectrum is represented by two independent parameters:
b1, which represents the magnitude of the high-frequency content, and b2, which represents the decay
rate of the high-frequency spectrum. These parameters, along with the reduced spectra, are used to
highlight differences in the high-frequency characteristics of real and deep network generated images.
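Taken together, Eqs. (1)-(4) define the spectral features used throughout this work. A minimal NumPy/SciPy sketch is given below; the function names, the number of radial bins, and the use of scipy.optimize.curve_fit for the fit are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np
from scipy.optimize import curve_fit

def reduced_spectrum(img, n_bins=64):
    """Azimuthally averaged magnitude spectrum of a single-channel image (Eqs. 1-3)."""
    m, n = img.shape
    F = np.fft.fft2(img) / (m * n)                    # Eq. (1)
    F = F / np.abs(F[0, 0])                           # normalize by the DC gain
    kx = np.fft.fftfreq(m) * m                        # signed integer wavenumbers
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))   # Eq. (2), kr in [0, 1]

    # Approximate the azimuthal average by binning |F| along the radial direction.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(kr.ravel(), bins) - 1, 0, n_bins - 1)
    counts = np.maximum(np.bincount(idx, minlength=n_bins), 1)
    c = np.bincount(idx, weights=np.abs(F).ravel(), minlength=n_bins) / counts
    k = 0.5 * (bins[:-1] + bins[1:])                  # bin centers
    return k, c

def fit_decay(k, c, k_T=0.75):
    """Fit c(kr) ~ b1 * (kr / kT)**(-b2) over kr in [kT, 1] (Eq. 4)."""
    mask = k >= k_T
    model = lambda kr, b1, b2: b1 * (kr / k_T) ** (-b2)
    (b1, b2), _ = curve_fit(model, k[mask], c[mask], p0=(c[mask][0], 1.0))
    return b1, b2
```

With the default n_bins = 64 and kT = 0.75, the fit uses roughly the last sixteen bins of the reduced spectrum; both values are illustrative choices rather than prescribed settings.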
2.2 Image transformations
In order to minimize storage and bandwidth requirements, deep network generated images are
typically resized, cropped, and/or compressed, procedures which can change the characteristics
of the image spectra. The image resolution dictates the maximum frequency in the frequency
domain, and higher resolutions yield more information at the highest frequencies. Compression may
particularly affect the high-frequency spectrum of an image since the high-frequency components of
an image correspond to small-scale features whereas the low-frequency components correspond to
large-scale features. Therefore, compression algorithms generally tolerate losses in the high-frequency
components as these have less impact on perceived image quality than the low-frequency
components [17]. Common compression methods such as quantization and subsampling
can spuriously introduce or reduce high-frequency content, respectively [18].
Images with varying resolutions – both native and cropped – and compression levels were analyzed
in this work. Datasets of different resolutions were used in the experiments, and compression levels
were varied using lossy JPEG compression with Python Imaging Library (Pillow). A quality metric
is given to indicate the amount of compression. For the 100% quality images, the original provided
images were used, consisting of either lossless PNG or 100% quality JPEG images. Although the
latter is not considered lossless, negligible differences in the reduced spectra between the two image
formats were seen, and therefore both of these are referred to as uncompressed in this work. Two
additional compression levels were chosen corresponding to high quality (95%) and medium quality
(85%) compression. The latter was chosen qualitatively based on the visually noticeable presence of
compression artifacts while the former was chosen as it is a default setting in many applications.
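As an illustrative example (not the exact pipeline used here), the two lossy compression levels can be produced with Pillow along the following lines; the file names are placeholders.

```python
from PIL import Image

# Re-encode an image at the JPEG qualities used in the experiments (95% and 85%).
# "sample.png" is a placeholder file name.
img = Image.open("sample.png").convert("RGB")
for quality in (95, 85):
    img.save(f"sample_q{quality}.jpg", format="JPEG", quality=quality)
```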
2.3 Classification
2.3.1 Datasets
A binary classification task was performed between various real and deep network generated images.
Image samples were taken from datasets of real images and images generated by StyleGAN,
StyleGAN2, PGGAN, VQ-VAE2, and ALAE architectures at compression qualities of 100%, 95%, and
85%. These datasets, shown in Table 1, are denoted by R, G, S, P, V, and A, respectively, with the
subscript denoting the resolution. Additional datasets, denoted with the subscript 768, were created
by taking the native 1024² resolution datasets and cropping them to a resolution of 768².
For the majority of the datasets, 10% of the images were used for training while the remaining
90% were used for testing to highlight the relatively low number of training examples required for
classification. For the high-resolution VQ-VAE2 datasets (V1024/V768), only a small number of
high-resolution images were presented in the work by Razavi et al. [4], and therefore only 8 images
were available for training and 9 for testing. For the low-resolution VQ-VAE2 dataset (V256), a larger
number of low-resolution images were provided, and 100 of the 364 images were used for training
while the remaining were used for testing. In both cases, these images were duplicated to match the
size of the other datasets to give equal weight to the training and testing metrics.
2.3.2 Classifier
To emphasize the postulate that the low-level properties of real and deep network generated images
are fundamentally different, the classification was performed using only "simple" classifiers. A
k-nearest neighbors (KNN) classifier with k = 5 was used for classification between real and deep
network generated images with respect to the decay parameters (b1, b2) of the grayscale component of
the images. Since the data was easily separable in many cases, negligible differences in classification
accuracy were obtained with other KNN hyperparameter choices and with different classifiers such
as a support vector machine with a variety of kernel choices. As the classification was performed
with respect to only two parameters, minimal training data was required and the computational cost
of training and classifying was insignificant.
Classification accuracy was determined by the ability of the classifier to predict if an image was real
or fake, and no weight was placed on discerning between the architecture that generated the images.
Overall classification accuracy was calculated using the real image datasets and the datasets from
each generative model. Individual classification accuracies for each generative model were separately
calculated from a subset of training and testing data using only real images and images from the
respective model. The pipeline for the classification task was as follows (a minimal sketch of the
final classification step is given after the list):
1. Perform the discrete Fourier transform of the image and normalize by the DC gain.
2. Transform from Cartesian coordinates to normalized polar coordinates in the frequency domain.
3. Bin the magnitudes of the Fourier coefficients along the radial direction and average azimuthally to obtain the reduced spectrum.
4. Fit the decay parameters b1, b2 to the reduced spectrum above a threshold wavenumber kT.
5. Train/apply the binary classifier to the decay parameters of the image to predict if the image is real or fake.
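A minimal scikit-learn sketch of step 5 is shown below; the (b1, b2) features are assumed to come from a routine such as the fit_decay example in Section 2.1, and the placeholder data and variable names are illustrative only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# X_train holds one (b1, b2) pair per training image, y_train holds labels
# (1 = real, 0 = deep network generated). Random placeholder data is used here.
rng = np.random.default_rng(0)
X_train = rng.random((20, 2))
y_train = rng.integers(0, 2, 20)

clf = KNeighborsClassifier(n_neighbors=5)   # k = 5, as used in this work
clf.fit(X_train, y_train)

# Classify a new image from its fitted decay parameters (placeholder values).
b1_new, b2_new = 0.01, 2.5
print(clf.predict([[b1_new, b2_new]]))      # 1 -> predicted real, 0 -> predicted fake
```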
3 Experiments and results
In this section, the reduced spectrum for the images from the datasets in Table 1 is shown as well as
the effects of resolution, cropping, and compression on the spectra. Additionally, experimental results
for the classification task between real and deep network generated images at different resolutions,
cropping levels, and compression qualities are presented.
3.1 Reduced spectrum
A comparison of the reduced spectrum statistics of the grayscale-converted 1024² pixel images from
the datasets in Table 1 is shown in Fig. 2, normalized by the spectrum at a threshold wavenumber of
kT = 0.75.
Table 1: Experimental datasets

Dataset | Origin                 | Dataset Type | Resolution | Compression Quality | Training Samples | Testing Samples
R1024   | FFHQ                   | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
G1024   | Karras et al. [1]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
S1024   | Karras et al. [2]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
P1024   | Karras et al. [3]      | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
V1024   | Razavi et al. [4]      | Faces        | 1024²      | [100, 95, 85]       | 8                | 9
A1024   | Pidhorskyi et al. [5]  | Faces        | 1024²      | [100, 95, 85]       | 100              | 900
R256    | Zhang et al. [19]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
G256    | Karras et al. [1]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
S256    | Karras et al. [2]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
P256    | Karras et al. [3]      | Cats         | 256²       | [100, 95, 85]       | 100              | 900
V256    | Razavi et al. [4]      | Animals      | 256²       | [100, 95, 85]       | 100              | 264
A256    | Pidhorskyi et al. [5]  | Faces        | 256²       | [100, 95, 85]       | 100              | 900
At the threshold wavenumber, the real images show a decay initially proportional to approximately
kr^-4 before leveling off near the end of the spectrum. In contrast, the deep network generated
images – with the exception of images generated by StyleGAN2 – do not show such decay, exhibiting
decay exponents of less than 1. As the threshold wavenumber was increased, the StyleGAN2 images
behaved similarly to the other deep network generated images. Similar results were observed with
the spectra of the individual color channels as with the grayscale-converted images.
Figure 2: Normalized reduced spectra: mean (left) and ±1 standard deviation (right).
3.1.1 Resolution and cropping
Figure 3: Mean normalized reduced spectra: 1024² (left), cropped 768² (middle), and 256² (right).
The reduced spectrum was computed for the cropped 768² and 256² pixel images sampled from the
datasets in Table 1. These spectra were compared to the 1024² pixel image spectra from Fig. 2. In
comparison to the 1024² pixel images, the 768² pixel image spectra behaved almost identically, as
shown in Fig. 3. However, the 256² pixel image spectra behaved noticeably differently, with the deep
network generated image spectra exhibiting lower decay rates whereas the real image spectra were
qualitatively similar to those of the higher resolution images. As the resolution was lowered, it became
more difficult to distinguish between the real and deep network generated image spectra as the maximum
frequency was reduced. However, the same observations as with the higher resolution images could be
drawn with the lower resolution images if the threshold wavenumber was increased, as the tail of the
deep network generated image spectra began to flatten at the highest wavenumbers.
3.1.2 Compression
Figure 4: Mean normalized reduced spectra: 100% (left), 95% (middle), and 85% quality (right).
The effects of image compression on the reduced spectra of the 1024² pixel image datasets in
Table 1 are shown in Fig. 4 for compression qualities of 100%, 95%, and 85%. Even at a mild
compression level (95% quality), the high-frequency reduced spectra of the deep network generated images
were significantly modified, and their decay rate converged to the decay rate of the real images. At 85%
compression, the StyleGAN, StyleGAN2, and ALAE image spectra were essentially indistinguishable
from the real image spectra, with only slightly lower decay rates than the compressed real images.
In contrast, the reduced spectra of the VQ-VAE2 and PGGAN images were less affected by the
compression, with VQ-VAE2 images showing clearly distinguishable spectra even for compression
qualities as low as 60%. The relative effects of compression on the decay rates of the high-frequency
spectra can be directly attributed to the amount of high-frequency content. Lossy compression
methods, whose effects are proportional to the frequency and effectively modify the decay rate, would
have negligible impact if there was very little high-frequency content.
These observations indicate that compression, even in small amounts, acts to homogenize the spectral
content of images generated by certain architectures. In a model-unaware scenario where the classifier
does not know if the images are compressed, it would not be able to robustly distinguish between
uncompressed real images, compressed real images, and compressed StyleGAN, StyleGAN2, and
ALAE images, but would be able to easily distinguish their uncompressed counterparts. However,
VQ-VAE2 and certain PGGAN images would remain easily distinguishable regardless of compression
as their spectra’s decay rates are less affected, and a different method for mimicking the spectrum of
real images is needed.
3.2 Classification
The results of the KNN classifier for image resolutions of 1024², 768² (cropped), and 256² with
compression qualities of 100% (uncompressed), 95%, and 85% are shown in Table 2. When
classifying uncompressed 1024² images (experiment A), the classifier obtained a 99.2%
accuracy across all image types, with a minimum and maximum accuracy of 97.4% (PGGAN) and
99.9% (StyleGAN), respectively, when classifying images generated by a single architecture.
For the uncompressed 1024² images (experiment A), the distribution of the data along the b1–b2
axes displayed distinct clusters corresponding to the various image types, as shown in Fig. 5a. Real
images exhibited a range of high-frequency content (b1) with notably high decay rates (b2). All
deep network generated images had significantly smaller decay rates than the real images, but the
high-frequency content of images from each generative model varied, with VQ-VAE2 and ALAE
producing the lowest and highest amounts of high-frequency content, respectively. In contrast to the
reduced spectrum statistics shown in Fig. 2, where the StyleGAN2 images were most similar to the
real images, the PGGAN images were instead more likely to be misclassified. This can be attributed
to the increased high-frequency content of the StyleGAN2 images, which allowed the classifier to
distinguish them from real images even though the decay rates were more similar. The lower
level of high-frequency content in the PGGAN images made them more similar to certain real
images with low decay rates, which caused instances of misclassification.
As the images were compressed (experiments B-C), the distributions of the decay rates of the real
and deep network generated images converged, and many of the clusters were indistinguishable
at 85% compression quality, as shown in Fig. 5c, where the overall classification accuracy was
reduced to 83.9%.
Table 2: Classification experiments and results

Experiment | Resolution | Compression Quality | Overall Class. Acc. | StyleGAN Class. Acc. | StyleGAN2 Class. Acc. | PGGAN Class. Acc. | VQ-VAE2 Class. Acc. | ALAE Class. Acc.
A          | 1024²      | 100                 | 99.2%               | 99.9%                | 99.5%                 | 97.4%             | 99.8%               | 99.8%
B          | 1024²      | 95                  | 94.4%               | 99.2%                | 88.5%                 | 88.5%             | 100%                | 99.7%
C          | 1024²      | 85                  | 83.9%               | 78.9%                | 65.9%                 | 78.7%             | 99.6%               | 87.4%
D          | 768²       | 100                 | 98.5%               | 100%                 | 99.1%                 | 95.9%             | 99.9%               | 99.9%
E          | 768²       | 95                  | 93.0%               | 97.9%                | 85.4%                 | 87.3%             | 100%                | 99.5%
F          | 768²       | 85                  | 84.6%               | 77.1%                | 68.6%                 | 79.3%             | 99.6%               | 85.7%
G          | 256²       | 100                 | 88.8%               | 85.0%                | 87.4%                 | 69.0%             | 92.0%               | 90.7%
H          | 256²       | 95                  | 88.1%               | 81.7%                | 83.4%                 | 68.2%             | 92.2%               | 87.7%
I          | 256²       | 85                  | 87.4%               | 67.8%                | 79.3%                 | 64.8%             | 87.7%               | 80.6%
Nearly identical observations were drawn from the cropped 768² pixel images
(experiments D-F, not shown), and the overall and individual classification accuracies were generally
within 1-2% of their native resolution counterparts. Furthermore, as the decay fitting method in Eq. 4
is independent of the resolution, the classifier trained on 1024² pixel images was able to maintain a
similar classification accuracy when classifying 768² pixel images, demonstrating the robustness of
the method.
Due to their low level of high-frequency content, the effects of compression on the VQ-VAE2 images were
minimal, and as such, the VQ-VAE2 images were correctly classified regardless of compression
quality for both 1024² and 768² pixel images. The reason for the notable lack of high-frequency
content in VQ-VAE2 images is not immediately evident. It is hypothesized that VAEs tend to
distribute probability mass diffusely over the data space, and thus their generated images tend to be
blurry [20, 21, 22]. Although this is not visually noticeable, it is apparent in the frequency domain
as the high-frequency content associated with sharp edges is dramatically reduced.
When the classifier was tested on the 256² images (experiments G-I), the classification accuracy was
significantly lower. Even for the uncompressed images, the data was not as clearly separable as with
the 1024² images, and the classifier was only able to obtain an 88.8% overall classification accuracy.
However, the overall distribution trend along the b1–b2 axes of the various image types was similar
at low and high resolutions, as shown in Fig. 5a and Fig. 6a. Compression did not have as large an
effect on the 256² images, and thus the classifier performed only slightly worse (87.4%) at 85%
compression quality than on the uncompressed images. In contrast to the high-resolution experiments,
the effects of compression on the decay rates were minimal.
(a) A: 1024², 100% quality. (b) B: 1024², 95% quality. (c) C: 1024², 85% quality.
Figure 5: Experiments A-C at 1024² resolution.
(a) G: 256², 100% quality. (b) H: 256², 95% quality. (c) I: 256², 85% quality.
Figure 6: Experiments G-I at 256² resolution.
3.3 Discussion
The cause of the disparities in the decay rates of the high-frequency content of deep network generated
images is of particular interest as this issue is evident in each one of the investigated algorithms. One
might believe that regularization has an effect on the high-frequency attributes as deep generative
networks are not incentivised to learn the high-frequency components (i.e. noise) of their input
data to discourage overfitting. However, we continued training the ALAE model with and without
regularization, and we observed negligible differences in the decay rates of images generated by the
model. In the work of Durall et al., discrepancies in the high-frequency content of deep network
generated images were attributed to the effects of up-convolution [23]. Although this is shown to
have an impact on the spectral composition of the images, it did not necessarily affect the decay rates
(see Fig. 5). Instead, these discrepancies are most likely explained by the work of Khayatkhoei and
Elgammal, which presents an analysis of the spectral bias of convolution layers in a deep generative
network [24]. They showed that linear dependencies exist in a convolution layer's filter spectrum
which result in correlations between frequency components that are generally more pronounced at
high frequencies. Consequently, these correlations can cause the flatlining of the frequency spectrum
at high frequencies as seen in Fig. 2.
4 Spectrum synthesis
As the results in the classification experiments show, classifiers based on the high-frequency
characteristics of images can easily distinguish between real and deep network generated images in many cases.
In some scenarios, compression can effectively disguise these deep network generated images from the
classifier, whereas in other scenarios, it has little effect. However, in scenarios where compression is
a viable spoofing tool, the amount of compression required generally introduces noticeable visual
artifacts.
To robustly disguise deep network generated images from classifiers based on high-frequency spectrum
characteristics, a post-processing method for modifying the spectra of the deep network generated
images to behave as real image spectra is proposed. Given the spectrum of a real image as a target,
the high-frequency components of a deep network generated image were scaled to match the real
image to produce a spoofed spectrum F̄(kr, θ), which was then transformed back to a spoofed image.
The scaling factor was defined as the ratio of the fitted decay functions of the target (real) and source
(deep network generated) images:

\bar{F}(k_r, \theta) = F(k_r, \theta) \left[ \left( 1 - \varphi(k_r) \right) + \varphi(k_r) \, \frac{b_{1,t}}{b_{1,s}} \left( \frac{k_r}{k_T} \right)^{-(b_{2,t} - b_{2,s})} \right].    (5)
A smooth hyperbolic tangent blending function φ(kr) was used to leave the low-frequency components
of the image unaffected without introducing visual artifacts:

\varphi(k_r) = \frac{1}{2} \left( \tanh(k_r - k_T) + 1 \right).    (6)
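A NumPy sketch of this scaling is given below, following Eqs. (5)-(6) as written; the function name and argument layout are assumptions, and the fitted parameters are taken to come from a routine such as the fit_decay example in Section 2.1.

```python
import numpy as np

def spoof_spectrum(src_img, b_target, b_source, k_T=0.75):
    """Scale the high-frequency spectrum of a generated image toward a real target.
    b_target = (b1_t, b2_t) and b_source = (b1_s, b2_s) are the fitted decay
    parameters (Eq. 4) of the target (real) and source (generated) images."""
    m, n = src_img.shape
    F = np.fft.fft2(src_img)
    kx = np.fft.fftfreq(m) * m
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))

    phi = 0.5 * (np.tanh(kr - k_T) + 1.0)                 # Eq. (6) blending function
    (b1_t, b2_t), (b1_s, b2_s) = b_target, b_source
    ratio = (b1_t / b1_s) * np.where(kr > 0, kr / k_T, 1.0) ** (-(b2_t - b2_s))
    scale = (1.0 - phi) + phi * ratio                     # Eq. (5) scaling factor
    return np.real(np.fft.ifft2(F * scale))               # back to image space
```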
The effects of the spectrum synthesis method on the reduced spectrum of the example VQ-VAE2
image are shown in Fig. 7. Using the real image in Fig. 1 as the target, the spoofed spectrum matched
the real image spectrum very closely in both decay and magnitude and was visually indistinguishable
from the original image. When compared to its compressed counterpart, the spoofed image was
of noticeably higher quality since the spectrum synthesis method did not introduce compression
artifacts, as shown by the pixel difference plot in Fig. 7b. The spoofed image fell well within the
classification boundary for real images and effectively disguised the image to the classifier whereas
compression could not. Similar results were obtained with other generative models.
(a) Spoofed image. (b) Original/spoofed pixel difference (scaled ×100). (c) Reduced spectra normalized by DC gain.
Figure 7: Spectrum synthesis method.
5 Conclusion
In this work, we presented an analysis of the high-frequency modes of real images and images
generated by various generative models. We showed that the Fourier modes of deep network
generated images at the highest frequencies did not decay as seen in real images but instead stayed
approximately constant. By modeling the decay of the Fourier spectrum at high frequencies, we
observed that the high-frequency spectra of real and deep network generated images had distinct
characteristics: real images showed large decay rates and a range of magnitudes, whereas deep
network generated images showed small decay rates and the magnitude varied depending on the
generative model. These differences were more noticeable at higher resolutions, but lossy image
compression algorithms modified the high-frequency spectra and reduced those differences. When
highly compressed, images generated by certain architectures were indistinguishable from real images
in the frequency domain, but images generated by other architectures like VAEs were not affected
due to their low levels of high-frequency content.
We proposed a detection method for identifying deep network generated images based on their high-
frequency characteristics and performed binary classification experiments on datasets of real images
and images generated by StyleGAN, StyleGAN2, PGGAN, VQ-VAE2, and ALAE architectures.
This detection method achieved an accuracy of 99.2% on uncompressed, high-resolution images with
minimal training data, but the accuracy decreased with highly compressed and/or low-resolution
images, although the classifier was able to robustly classify images at resolutions on which it was not
trained. Finally, we presented a method for modifying the high-frequency spectra of a deep network
generated image to mimic the spectra of real images, effectively deceiving the classifier without any
visually noticeable changes in the image itself. In the future, these detection and synthesis methods
will be applied to videos manipulated by deep generative models (i.e. deepfakes) to evaluate their
effectiveness.
Broader Impact
The most apparent impact of the present work is in the application of combating unethical uses of
deep network generated images. Since the proposed approach can robustly generalize to unknown
generative models and requires minimal training data, it can be easily implemented in browser plugins
and mobile applications to warn users that an image is likely fake. This work can be expanded upon
for detecting manipulated videos, a trending topic in current research, or towards adversarial training
of vision tasks. However, systematic implementation of the proposed method for spoofing images
would effectively nullify the capabilities of the classifier, and as a result, could create even more
realistic (and harder to detect) deep network generated images for malicious purposes.
Similarly, the current work can be used as a basis for improving the training process of generative
models; for example, a metric can be given for generated images in the frequency domain and used
for evaluating generative models, and a loss function in Fourier space, weighted towards the highest
modes, can be introduced to aid in improving these networks. Generative models for non-image data
could benefit from analysis in other domains as well.
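As a purely illustrative sketch (not something implemented or evaluated in this work), such a Fourier-space loss weighted toward the highest modes might take the following form; the quadratic weighting is an assumption.

```python
import numpy as np

def fourier_weighted_loss(generated, target):
    """Illustrative spectral loss that penalizes differences more heavily at
    high radial wavenumbers. Not used or evaluated in this work."""
    m, n = generated.shape
    Fg = np.fft.fft2(generated)
    Ft = np.fft.fft2(target)
    kx = np.fft.fftfreq(m) * m
    ky = np.fft.fftfreq(n) * n
    KX, KY = np.meshgrid(kx, ky, indexing="ij")
    kr = np.sqrt((KX**2 + KY**2) / (0.25 * (m**2 + n**2)))  # normalized radius
    weight = kr**2                                           # emphasize the highest modes
    return np.mean(weight * np.abs(Fg - Ft)**2)
```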
A more general result of this work is the conclusion that generative models can have systematic
shortcomings that are not immediately evident until observed in other domains (e.g. frequency). In
fields where data is scarce and high-quality synthetic data is required to train models, fundamental
flaws in synthetic data can have hidden detrimental effects. This opens up questions about the
structure of synthetic data and what we perceive to be "high-quality" synthetic data.
Acknowledgments and Disclosure of Funding
The authors do not acknowledge any outside funding sources or competing interests.
References
[1] Tero Karras, Samuli Laine, and Timo Aila. "A Style-Based Generator Architecture for Generative Adversarial Networks". In: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 4401–4410.
[2] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. "Analyzing and Improving the Image Quality of StyleGAN". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[3] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. "Progressive Growing of GANs for Improved Quality, Stability, and Variation". In: International Conference on Learning Representations. 2018.
[4] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. "Generating Diverse High-Fidelity Images With VQ-VAE-2". In: Advances in Neural Information Processing Systems. 2019, pp. 14866–14876.
[5] Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. "Adversarial Latent Autoencoders". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[6] Paweł Korus. "Digital Image Integrity – A Survey of Protection and Verification Techniques". In: Digital Signal Processing 71 (Dec. 2017), pp. 1–26.
[7] Javier Galbally, Sebastien Marcel, and Julian Fierrez. "Biometric Antispoofing Methods: A Survey in Face Recognition". In: IEEE Access 2 (2014), pp. 1530–1552.
[8] Alessandro Piva. "An Overview on Image Forensics". In: ISRN Signal Processing 2013 (2013), pp. 1–22.
[9] Zinelabidine Boulkenafet, Jukka Komulainen, and Abdenour Hadid. "Face Spoofing Detection Using Colour Texture Analysis". In: IEEE Transactions on Information Forensics and Security 11.8 (Aug. 2016), pp. 1818–1830.
[10] Haodong Li, Bin Li, Shunquan Tan, and Jiwu Huang. "Identification of Deep Network Generated Images Using Disparities in Color Components". In: Signal Processing 174 (Sept. 2020), p. 107616.
[11] Francesco Marra, Diego Gragnaniello, Luisa Verdoliva, and Giovanni Poggi. "Do GANs Leave Artificial Fingerprints?" In: 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). IEEE, Mar. 2019.
[12] Xinsheng Xuan, Bo Peng, Wei Wang, and Jing Dong. "On the Generalization of GAN Image Forensics". In: Biometric Recognition. Springer International Publishing, 2019, pp. 134–141.
[13] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. "CNN-Generated Images Are Surprisingly Easy to Spot... for Now". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. "Generative Adversarial Nets". In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. 2014, pp. 2672–2680.
[15] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. 2013. arXiv: 1312.6114 [cs.ML].
[16] A. van der Schaaf and J.H. van Hateren. "Modelling the Power Spectra of Natural Images: Statistics and Information". In: Vision Research 36.17 (Sept. 1996), pp. 2759–2770.
[17] Michael Parker. "Image and Video Compression Fundamentals". In: Digital Signal Processing 101 (Second Edition). 2017. Chap. 25, pp. 329–346.
[18] David Salomon and Giovanni Motta. Handbook of Data Compression. Springer London, 2010.
[19] Weiwei Zhang, Jian Sun, and Xiaoou Tang. "Cat Head Detection - How to Effectively Exploit Shape and Texture Features". In: Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2008, pp. 802–816.
[20] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially Learned Inference. 2016. arXiv: 1606.00704 [cs.LG].
[21] Wenzhe Shi, Jose Caballero, Lucas Theis, Ferenc Huszar, Andrew Aitken, Christian Ledig, and Zehan Wang. Is The Deconvolution Layer The Same As A Convolutional Layer? 2016. arXiv: 1609.07009 [cs.CV].
[22] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. "Autoencoding Beyond Pixels Using a Learned Similarity Metric". In: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. ICML. 2016, pp. 1558–1566.
[23] Ricard Durall, Margret Keuper, and Janis Keuper. "Watch Your Up-Convolution: CNN Based Generative Deep Neural Networks Are Failing to Reproduce Spectral Distributions". In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, June 2020.
[24] Mahyar Khayatkhoei and Ahmed Elgammal. Spatial Frequency Bias in Convolutional Generative Adversarial Networks. 2020. arXiv: 2010.01473 [cs.LG].