Analyzing and Improving the Image Quality of StyleGAN
Tero Karras (NVIDIA), Samuli Laine (NVIDIA), Miika Aittala (NVIDIA), Janne Hellsten (NVIDIA),
Jaakko Lehtinen (NVIDIA and Aalto University), Timo Aila (NVIDIA)
Abstract
The style-based GAN architecture (StyleGAN) yields
state-of-the-art results in data-driven unconditional gener-
ative image modeling. We expose and analyze several of
its characteristic artifacts, and propose changes in both
model architecture and training methods to address them.
In particular, we redesign the generator normalization, re-
visit progressive growing, and regularize the generator to
encourage good conditioning in the mapping from latent
codes to images. In addition to improving image quality,
this path length regularizer yields the additional benefit that
the generator becomes significantly easier to invert. This
makes it possible to reliably attribute a generated image to
a particular network. We furthermore visualize how well
the generator utilizes its output resolution, and identify a
capacity problem, motivating us to train larger models for
additional quality improvements. Overall, our improved
model redefines the state of the art in unconditional image
modeling, both in terms of existing distribution quality met-
rics as well as perceived image quality.
1. Introduction
The resolution and quality of images produced by gen-
erative methods, especially generative adversarial networks
(GAN) [16], are improving rapidly [23,31,5]. The current
state-of-the-art method for high-resolution image synthesis
is StyleGAN [24], which has been shown to work reliably
on a variety of datasets. Our work focuses on fixing its char-
acteristic artifacts and improving the result quality further.
The distinguishing feature of StyleGAN [24] is its unconventional generator architecture. Instead of feeding the
input latent code z ∈ Z only to the beginning of the network, the mapping network f first transforms it to an intermediate latent code w ∈ W. Affine transforms then produce styles that control the layers of the synthesis network g
via adaptive instance normalization (AdaIN) [21, 9, 13, 8].
Additionally, stochastic variation is facilitated by providing
additional random noise maps to the synthesis network. It
has been demonstrated [24, 38] that this design allows the
intermediate latent space W to be much less entangled than
the input latent space Z. In this paper, we focus all analysis solely on W, as it is the relevant latent space from the
synthesis network’s point of view.
Many observers have noticed characteristic artifacts in
images generated by StyleGAN [3]. We identify two causes
for these artifacts, and describe changes in architecture and
training methods that eliminate them. First, we investigate
the origin of common blob-like artifacts, and find that the
generator creates them to circumvent a design flaw in its ar-
chitecture. In Section 2, we redesign the normalization used
in the generator, which removes the artifacts. Second, we
analyze artifacts related to progressive growing [23] that has
been highly successful in stabilizing high-resolution GAN
training. We propose an alternative design that achieves the
same goal — training starts by focusing on low-resolution
images and then progressively shifts focus to higher and
higher resolutions — without changing the network topol-
ogy during training. This new design also allows us to rea-
son about the effective resolution of the generated images,
which turns out to be lower than expected, motivating a ca-
pacity increase (Section 4).
Quantitative analysis of the quality of images produced
using generative methods continues to be a challenging
topic. Fr´
echet inception distance (FID) [20] measures dif-
ferences in the density of two distributions in the high-
dimensional feature space of an InceptionV3 classifier [39].
Precision and Recall (P&R) [36,27] provide additional vis-
ibility by explicitly quantifying the percentage of generated
images that are similar to training data and the percentage
of training data that can be generated, respectively. We use
these metrics to quantify the improvements.
Both FID and P&R are based on classifier networks that
have recently been shown to focus on textures rather than
shapes [12], and consequently, the metrics do not accurately
capture all aspects of image quality. We observe that the
perceptual path length (PPL) metric [24], originally intro-
duced as a method for estimating the quality of latent space
Figure 1. Instance normalization causes water droplet-like artifacts in StyleGAN images. These are not always obvious in the generated
images, but if we look at the activations inside the generator network, the problem is always there, in all feature maps starting from the
64×64 resolution. It is a systemic problem that plagues all StyleGAN images.
interpolations, correlates with consistency and stability of
shapes. Based on this, we regularize the synthesis network
to favor smooth mappings (Section 3) and achieve a clear
improvement in quality. To counter its computational ex-
pense, we also propose executing all regularizations less
frequently, observing that this can be done without com-
promising effectiveness.
Finally, we find that projection of images to the latent
space W works significantly better with the new, path-
length regularized StyleGAN2 generator than with the orig-
inal StyleGAN. This makes it easier to attribute a generated
image to its source (Section 5).
Our implementation and trained models are available at
https://github.com/NVlabs/stylegan2
2. Removing normalization artifacts
We begin by observing that most images generated by
StyleGAN exhibit characteristic blob-shaped artifacts that
resemble water droplets. As shown in Figure 1, even when
the droplet may not be obvious in the final image, it is
present in the intermediate feature maps of the generator.1
The anomaly starts to appear around 64×64 resolution,
is present in all feature maps, and becomes progressively
stronger at higher resolutions. The existence of such a con-
sistent artifact is puzzling, as the discriminator should be
able to detect it.
We pinpoint the problem to the AdaIN operation that
normalizes the mean and variance of each feature map sepa-
rately, thereby potentially destroying any information found
in the magnitudes of the features relative to each other. We
hypothesize that the droplet artifact is a result of the gener-
ator intentionally sneaking signal strength information past
instance normalization: by creating a strong, localized spike
that dominates the statistics, the generator can effectively
scale the signal as it likes elsewhere. Our hypothesis is sup-
ported by the finding that when the normalization step is
removed from the generator, as detailed below, the droplet
artifacts disappear completely.
¹ In rare cases (perhaps 0.1% of images) the droplet is missing, leading
to severely corrupted images. See Appendix A for details.
2.1. Generator architecture revisited
We will first revise several details of the StyleGAN
generator to better facilitate our redesigned normalization.
These changes have either a neutral or small positive effect
on their own in terms of quality metrics.
Figure 2a shows the original StyleGAN synthesis network g [24], and in Figure 2b we expand the diagram to full
detail by showing the weights and biases and breaking the
AdaIN operation into its two constituent parts: normalization
and modulation. This allows us to re-draw the conceptual
gray boxes so that each box indicates the part of the network
where one style is active (i.e., “style block”). Interestingly,
the original StyleGAN applies bias and noise within the
style block, causing their relative impact to be inversely pro-
portional to the current style’s magnitudes. We observe that
more predictable results are obtained by moving these op-
erations outside the style block, where they operate on nor-
malized data. Furthermore, we notice that after this change
it is sufficient for the normalization and modulation to op-
erate on the standard deviation alone (i.e., the mean is not
needed). The application of bias, noise, and normalization
to the constant input can also be safely removed without ob-
servable drawbacks. This variant is shown in Figure 2c, and
serves as a starting point for our redesigned normalization.
2.2. Instance normalization revisited
One of the main strengths of StyleGAN is the ability to
control the generated images via style mixing, i.e., by feed-
ing a different latent w to different layers at inference time.
In practice, style modulation may amplify certain feature
maps by an order of magnitude or more. For style mixing to
work, we must explicitly counteract this amplification on a
per-sample basis — otherwise the subsequent layers would
not be able to operate on the data in a meaningful way.
If we were willing to sacrifice scale-specific controls (see
video), we could simply remove the normalization, thus re-
moving the artifacts and also improving FID slightly [27].
We will now propose a better alternative that removes the
artifacts while retaining full controllability. The main idea
is to base normalization on the expected statistics of the in-
coming feature maps, but without explicit forcing.
(a) StyleGAN (b) StyleGAN (detailed) (c) Revised architecture (d) Weight demodulation
Figure 2. We redesign the architecture of the StyleGAN synthesis network. (a) The original StyleGAN, where A denotes a learned
affine transform from W that produces a style and B is a noise broadcast operation. (b) The same diagram with full detail. Here we have
broken the AdaIN into explicit normalization followed by modulation, both operating on the mean and standard deviation per feature map.
We have also annotated the learned weights (w), biases (b), and constant input (c), and redrawn the gray boxes so that one style is active
per box. The activation function (leaky ReLU) is always applied right after adding the bias. (c) We make several changes to the original
architecture that are justified in the main text. We remove some redundant operations at the beginning, move the addition of b and B to
be outside the active area of a style, and adjust only the standard deviation per feature map. (d) The revised architecture enables us to replace
instance normalization with a “demodulation” operation, which we apply to the weights associated with each convolution layer.
Recall that a style block in Figure 2c consists of modula-
tion, convolution, and normalization. Let us start by consid-
ering the effect of a modulation followed by a convolution.
The modulation scales each input feature map of the convo-
lution based on the incoming style, which can alternatively
be implemented by scaling the convolution weights:
$$w'_{ijk} = s_i \cdot w_{ijk}, \qquad (1)$$
where $w$ and $w'$ are the original and modulated weights, respectively, $s_i$ is the scale corresponding to the $i$th input feature map, and $j$ and $k$ enumerate the output feature maps and the spatial footprint of the convolution, respectively.
Now, the purpose of instance normalization is to essen-
tially remove the effect of s from the statistics of the con-
volution’s output feature maps. We observe that this goal
can be achieved more directly. Let us assume that the in-
put activations are i.i.d. random variables with unit standard
deviation. After modulation and convolution, the output ac-
tivations have standard deviation of
$$\sigma_j = \sqrt{\sum_{i,k} {w'_{ijk}}^2}, \qquad (2)$$
i.e., the outputs are scaled by the $L_2$ norm of the corresponding weights. The subsequent normalization aims to restore the outputs back to unit standard deviation. Based on Equation 2, this is achieved if we scale (“demodulate”) each output feature map $j$ by $1/\sigma_j$. Alternatively, we can
again bake this into the convolution weights:
$$w''_{ijk} = w'_{ijk} \Big/ \sqrt{\sum_{i,k} {w'_{ijk}}^2 + \epsilon}, \qquad (3)$$
where $\epsilon$ is a small constant to avoid numerical issues.
We have now baked the entire style block into a single convolution layer whose weights are adjusted based on s using
Equations 1 and 3 (Figure 2d). Compared to instance nor-
malization, our demodulation technique is weaker because
it is based on statistical assumptions about the signal in-
stead of actual contents of the feature maps. Similar statis-
tical analysis has been extensively used in modern network
initializers [14,19], but we are not aware of it being pre-
viously used as a replacement for data-dependent normal-
ization. Our demodulation is also related to weight normal-
ization [37] that performs the same calculation as a part of
reparameterizing the weight tensor. Prior work has iden-
tified weight normalization as beneficial in the context of
GAN training [43].
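To make the modulation and demodulation concrete, the following is a minimal NumPy sketch of Equations 1–3 applied to a weight tensor. The tensor layout [in, out, kh, kw] and the helper name are our own choices for illustration and do not mirror the official TensorFlow implementation.

```python
import numpy as np

def demodulated_weights(w, s, eps=1e-8):
    """Apply style modulation (Eq. 1) and demodulation (Eq. 3) to conv weights.

    w: convolution weights of shape [in_channels, out_channels, kh, kw]
    s: per-input-channel style scales of shape [in_channels]
    """
    # Eq. 1: scale each input feature map i by its style s_i.
    w_mod = s[:, None, None, None] * w
    # Eq. 2: expected output std per output feature map j is the L2 norm
    # of the corresponding weights (sum over input channels and spatial taps).
    sigma = np.sqrt(np.sum(w_mod ** 2, axis=(0, 2, 3)) + eps)
    # Eq. 3: demodulate so that each output feature map has unit expected std.
    return w_mod / sigma[None, :, None, None]

# Example: 512 input maps, 512 output maps, 3x3 kernel, one style vector.
w = np.random.randn(512, 512, 3, 3).astype(np.float32)
s = np.random.randn(512).astype(np.float32)
w_dd = demodulated_weights(w, s)
```

In the actual network, the resulting weights are then used in a single convolution per style block (Figure 2d).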
Our new design removes the characteristic artifacts (Fig-
ure 3) while retaining full controllability, as demonstrated
in the accompanying video. FID remains largely unaffected
(Table 1, rows A,B), but there is a notable shift from preci-
sion to recall. We argue that this is generally desirable, since
recall can be traded into precision via truncation, whereas
Configuration                          FFHQ, 1024×1024                        LSUN Car, 512×384
                                       FID↓   Path length↓  Precision↑  Recall↑   FID↓   Path length↓  Precision↑  Recall↑
A  Baseline StyleGAN [24]              4.40   212.1         0.721       0.399     3.27   1484.5        0.701       0.435
B  + Weight demodulation               4.39   175.4         0.702       0.425     3.04   862.4         0.685       0.488
C  + Lazy regularization               4.38   158.0         0.719       0.427     2.83   981.6         0.688       0.493
D  + Path length regularization        4.34   122.5         0.715       0.418     3.43   651.2         0.697       0.452
E  + No growing, new G & D arch.       3.31   124.5         0.705       0.449     3.19   471.2         0.690       0.454
F  + Large networks (StyleGAN2)        2.84   145.0         0.689       0.492     2.32   415.5         0.678       0.514
   Config A with large networks        3.98   199.2         0.716       0.422     –      –             –           –

Table 1. Main results. For each training run, we selected the training snapshot with the lowest FID. We computed each metric 10 times
with different random seeds and report their average. Path length corresponds to the PPL metric, computed based on path endpoints in W
[24], without the central crop used by Karras et al. [24]. The FFHQ dataset contains 70k images, and the discriminator saw 25M images
during training. For LSUN CAR the numbers were 893k and 57M. ↑ indicates that higher is better, and ↓ that lower is better.
Figure 3. Replacing normalization with demodulation removes the
characteristic artifacts from images and activations.
the opposite is not true [27]. In practice our design can be
implemented efficiently using grouped convolutions, as de-
tailed in Appendix B. To avoid having to account for the
activation function in Equation 3, we scale our activation
functions so that they retain the expected signal variance.
3. Image quality and generator smoothness
While GAN metrics such as FID or Precision and Recall
(P&R) successfully capture many aspects of the generator,
they continue to have somewhat of a blind spot for image
quality. For an example, refer to Figures 13 and 14 that
contrast generators with identical FID and P&R scores but
markedly different overall quality.2
² We believe that the key to the apparent inconsistency lies in the par-
ticular choice of feature space rather than the foundations of FID or P&R.
It was recently discovered that classifiers trained using ImageNet [35] tend
to base their decisions much more on texture than shape [12], while hu-
mans strongly focus on shape [28]. This is relevant in our context because
(a) Low PPL scores (b) High PPL scores
Figure 4. Connection between perceptual path length and image
quality using baseline StyleGAN (config A) with LSUN CAT. (a)
Random examples with low PPL (10th percentile). (b) Exam-
ples with high PPL (90th percentile). There is a clear correla-
tion between PPL scores and semantic consistency of the images.
(a) StyleGAN (config A) (b) StyleGAN2 (config F)
Figure 5. (a) Distribution of PPL scores of individual images
generated using baseline StyleGAN (config A) with LSUN CAT
(FID = 8.53, PPL = 924). The percentile ranges corresponding to
Figure 4 are highlighted in orange. (b) StyleGAN2 (config F) im-
proves the PPL distribution considerably (showing a snapshot with
the same FID = 8.53, PPL = 387).
We observe a correlation between perceived image qual-
ity and perceptual path length (PPL) [24], a metric that was
originally introduced for quantifying the smoothness of the
mapping from a latent space to the output image by measur-
ing average LPIPS distances [50] between generated images
under small perturbations in latent space. Again consulting
Figures 13 and 14, a smaller PPL (smoother generator map-
ping) appears to correlate with higher overall image qual-
FID and P&R use high-level features from InceptionV3 [39] and VGG-16
[39], respectively, which were trained in this way and are thus expected
to be biased towards texture detection. As such, images with, e.g., strong
cat textures may appear more similar to each other than a human observer
would agree, thus partially compromising density-based metrics (FID) and
manifold coverage metrics (P&R).
ity, whereas other metrics are blind to the change. Figure 4
examines this correlation more closely through per-image
PPL scores on LSUN CAT, computed by sampling the la-
tent space around w ∼ f(z). Low scores are indeed in-
dicative of high-quality images, and vice versa. Figure 5a
shows the corresponding histogram and reveals the long tail
of the distribution. The overall PPL for the model is sim-
ply the expected value of these per-image PPL scores. We
always compute PPL for the entire image, as opposed to
Karras et al. [24] who use a smaller central crop.
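For illustration, per-image PPL scores of the kind plotted in Figures 4 and 5 can be approximated with the following sketch. It is a simplified variant (one small random perturbation per image rather than the full endpoint-based formulation of the metric), and `G.mapping`, `G.synthesis`, and `lpips_distance` are placeholder names for the mapping network, synthesis network, and an LPIPS implementation.

```python
import torch

def per_image_ppl_scores(G, lpips_distance, num_images=1000, eps=1e-4):
    """Rough per-image PPL estimate: LPIPS between images from w and a nearby w."""
    scores = []
    with torch.no_grad():
        for _ in range(num_images):
            z = torch.randn(1, 512)
            w = G.mapping(z)                          # w ~ f(z)
            w_eps = w + eps * torch.randn_like(w)     # small step in W
            d = lpips_distance(G.synthesis(w), G.synthesis(w_eps))
            scores.append(float(d) / eps ** 2)        # PPL scaling by 1/eps^2
    return scores                                     # the model PPL is the mean of these
```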
It is not immediately obvious why a low PPL should
correlate with image quality. We hypothesize that during
training, as the discriminator penalizes broken images, the
most direct way for the generator to improve is to effectively
stretch the region of latent space that yields good images.
This would lead to the low-quality images being squeezed
into small latent space regions of rapid change. While this
improves the average output quality in the short term, the
accumulating distortions impair the training dynamics and
consequently the final image quality.
Clearly, we cannot simply encourage minimal PPL since
that would guide the generator toward a degenerate solution
with zero recall. Instead, we will describe a new regular-
izer that aims for a smoother generator mapping without this
drawback. As the resulting regularization term is somewhat
expensive to compute, we first describe a general optimiza-
tion that applies to any regularization technique.
3.1. Lazy regularization
Typically the main loss function (e.g., logistic loss [16])
and regularization terms (e.g., R1 [30]) are written as a sin-
gle expression and are thus optimized simultaneously. We
observe that the regularization terms can be computed less
frequently than the main loss function, thus greatly dimin-
ishing their computational cost and the overall memory us-
age. Table 1, row C shows that no harm is caused when R1
regularization is performed only once every 16 minibatches,
and we adopt the same strategy for our new regularizer as
well. Appendix B gives implementation details.
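Schematically, lazy regularization only changes where the regularizer enters the training loop. In the sketch below, `main_loss`, `r1_penalty`, and `optimizer_step` are placeholders for whatever loss, R1 penalty, and optimizer update the training code uses; the interval of 16 minibatches and the scaling of the term by the interval follow the settings described here and in Appendix B.

```python
R1_INTERVAL = 16   # evaluate the R1 regularizer once every 16 minibatches
R1_GAMMA = 10.0    # regularization weight

def discriminator_step(minibatch_idx, reals, fakes, main_loss, r1_penalty, optimizer_step):
    """One lazy-regularized discriminator update (schematic)."""
    loss = main_loss(reals, fakes)
    if minibatch_idx % R1_INTERVAL == 0:
        # Multiply by the interval so the average gradient magnitude of the
        # regularizer matches that of evaluating it on every minibatch.
        loss = loss + R1_INTERVAL * (R1_GAMMA / 2) * r1_penalty(reals)
    optimizer_step(loss)
```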
3.2. Path length regularization
We would like to encourage that a fixed-size step in W
results in a non-zero, fixed-magnitude change in the image.
We can measure the deviation from this ideal empirically
by stepping into random directions in the image space and
observing the corresponding wgradients. These gradients
should have close to an equal length regardless of w or the
image-space direction, indicating that the mapping from the
latent space to image space is well-conditioned [33].
At a single $w \in \mathcal{W}$, the local metric scaling properties of the generator mapping $g(w): \mathcal{W} \mapsto \mathcal{Y}$ are captured by the Jacobian matrix $J_w = \partial g(w)/\partial w$. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate our regularizer as
$$\mathbb{E}_{w, y \sim \mathcal{N}(0, I)} \left( \left\| J_w^T y \right\|_2 - a \right)^2, \qquad (4)$$
where $y$ are random images with normally distributed pixel intensities, and $w \sim f(z)$, where $z$ are normally distributed. We show in Appendix C that, in high dimensions, this prior is minimized when $J_w$ is orthogonal (up to a global scale) at any $w$. An orthogonal matrix preserves lengths and introduces no squeezing along any dimension.
To avoid explicit computation of the Jacobian matrix, we use the identity $J_w^T y = \nabla_w (g(w) \cdot y)$, which is efficiently computable using standard backpropagation [6]. The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $\|J_w^T y\|_2$, allowing the optimization to find a suitable global scale by itself.
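The regularizer can be sketched with a few lines of automatic differentiation. The snippet below uses PyTorch for brevity, whereas the official code is TensorFlow; `synthesis` stands for any differentiable mapping from w to images, and the decay of 0.01 corresponds to the β_pl = 0.99 moving average mentioned in Appendix B.

```python
import torch

def path_length_penalty(synthesis, w, pl_mean, decay=0.01):
    """One evaluation of the path length regularizer (Eq. 4), schematic.

    synthesis: differentiable function mapping w -> images of shape [N, C, H, W]
    w:         latent codes of shape [N, L] with requires_grad=True
    pl_mean:   running average of ||J_w^T y||_2 (the constant a), a scalar tensor
    """
    images = synthesis(w)
    # Random images with normally distributed pixel intensities.
    y = torch.randn_like(images)
    # J_w^T y = grad_w of (g(w) . y), computed by standard backprop.
    (grad,) = torch.autograd.grad(outputs=(images * y).sum(), inputs=w, create_graph=True)
    lengths = grad.square().sum(dim=1).sqrt()          # ||J_w^T y||_2 per sample
    # Track the target scale a as an exponential moving average of the lengths.
    pl_mean = pl_mean + decay * (lengths.mean().detach() - pl_mean)
    penalty = (lengths - pl_mean).square().mean()      # (||J_w^T y||_2 - a)^2
    return penalty, pl_mean
```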
Our regularizer is closely related to the Jacobian clamp-
ing regularizer presented by Odena et al. [33]. Practical dif-
ferences include that we compute the products $J_w^T y$ analytically, whereas they use finite differences for estimating $J_w \delta$ with $\delta \in \mathcal{Z}$, $\delta \sim \mathcal{N}(0, I)$. It should be noted that spec-
tral normalization [31] of the generator [46] only constrains
the largest singular value, posing no constraints on the oth-
ers and hence not necessarily leading to better conditioning.
We find that enabling spectral normalization in addition to
our contributions — or instead of them — invariably com-
promises FID, as detailed in Appendix E.
In practice, we notice that path length regularization
leads to more reliable and consistently behaving models,
making architecture exploration easier. We also observe
that the smoother generator is significantly easier to invert
(Section 5). Figure 5b shows that path length regularization
clearly tightens the distribution of per-image PPL scores,
without pushing the mode to zero. However, Table 1, row D
points toward a tradeoff between FID and PPL in datasets
that are less structured than FFHQ.
4. Progressive growing revisited
Progressive growing [23] has been very successful in sta-
bilizing high-resolution image synthesis, but it causes its
own characteristic artifacts. The key issue is that the pro-
gressively grown generator appears to have a strong location
preference for details; the accompanying video shows that
when features like teeth or eyes should move smoothly over
the image, they may instead remain stuck in place before
jumping to the next preferred location. Figure 6 shows a related artifact. We believe the problem is that in progressive
growing each resolution serves momentarily as the output
resolution, forcing it to generate maximal frequency details,
which then leads the trained network to have excessively
high frequencies in the intermediate layers, compromising
shift invariance [49]. Appendix A shows an example. These
Figure 6. Progressive growing leads to “phase” artifacts. In this
example the teeth do not follow the pose but stay aligned to the
camera, as indicated by the blue line.
(a) MSG-GAN (b) Input/output skips (c) Residual nets
Figure 7. Three generator (above the dashed line) and discrimi-
nator architectures. Up and Down denote bilinear up and down-
sampling, respectively. In residual networks these also include
1×1 convolutions to adjust the number of feature maps. tRGB
and fRGB convert between RGB and high-dimensional per-pixel
data. Architectures used in configs E and F are shown in green.
issues prompt us to search for an alternative formulation
that would retain the benefits of progressive growing with-
out the drawbacks.
4.1. Alternative network architectures
While StyleGAN uses simple feedforward designs in the
generator (synthesis network) and discriminator, there is a
vast body of work dedicated to the study of better network
architectures. Skip connections [34,22], residual networks
[18,17,31], and hierarchical methods [7,47,48] have
proven highly successful also in the context of generative
methods. As such, we decided to re-evaluate the network
design of StyleGAN and search for an architecture that pro-
duces high-quality images without progressive growing.
Figure 7a shows MSG-GAN [22], which connects the
matching resolutions of the generator and discriminator us-
ing multiple skip connections. The MSG-GAN generator
is modified to output a mipmap [42] instead of an image,
and a similar representation is computed for each real image as well.
FFHQ               D original        D input skips     D residual
                   FID     PPL       FID     PPL       FID     PPL
G original         4.32    265       4.18    235       3.58    269
G output skips     4.33    169       3.77    127       3.31    125
G residual         4.35    203       3.96    229       3.79    243

LSUN Car           D original        D input skips     D residual
                   FID     PPL       FID     PPL       FID     PPL
G original         3.75    905       3.23    758       3.25    802
G output skips     3.77    544       3.86    316       3.19    471
G residual         3.93    981       3.40    667       2.66    645

Table 2. Comparison of generator and discriminator architectures without progressive growing. The combination of generator with output skips and residual discriminator corresponds to configuration E in the main result table.
In Figure 7b we simplify this design by up-
sampling and summing the contributions of RGB outputs
corresponding to different resolutions. In the discriminator,
we similarly provide the downsampled image to each reso-
lution block of the discriminator. We use bilinear filtering in
all up and downsampling operations. In Figure 7c we fur-
ther modify the design to use residual connections.³ This
design is similar to LAPGAN [7] without the per-resolution
discriminators employed by Denton et al.
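The output path of the skip generator in Figure 7b can be summarized as follows. This is a schematic PyTorch-style sketch; `blocks` and `to_rgb` are placeholders for the per-resolution synthesis blocks and tRGB layers, which are not spelled out here.

```python
import torch
import torch.nn.functional as F

def skip_generator_output(x, w, blocks, to_rgb):
    """Sum upsampled tRGB contributions from every resolution (Figure 7b, schematic).

    x:      lowest-resolution feature maps, e.g. [N, 512, 4, 4]
    blocks: list of per-resolution synthesis blocks
    to_rgb: list of tRGB layers, one per resolution
    """
    img = None
    for block, trgb in zip(blocks, to_rgb):
        x = block(x, w)                 # features at this resolution
        rgb = trgb(x, w)                # RGB contribution of this resolution
        if img is None:
            img = rgb
        else:
            # Bilinear upsampling of the accumulated image before summing.
            img = F.interpolate(img, scale_factor=2, mode='bilinear',
                                align_corners=False) + rgb
    return img
```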
Table 2 compares three generator and three discrimina-
tor architectures: original feedforward networks as used
in StyleGAN, skip connections, and residual networks, all
trained without progressive growing. FID and PPL are pro-
vided for each of the 9 combinations. We can see two broad
trends: skip connections in the generator drastically im-
prove PPL in all configurations, and a residual discriminator
network is clearly beneficial for FID. The latter is perhaps
not surprising since the structure of discriminator resem-
bles classifiers where residual architectures are known to be
helpful. However, a residual architecture was harmful in
the generator — the lone exception was FID in LSUN CAR
when both networks were residual.
For the rest of the paper we use a skip generator and a
residual discriminator, without progressive growing. This
corresponds to configuration E in Table 1, and it signifi-
cantly improves FID and PPL.
4.2. Resolution usage
The key aspect of progressive growing, which we would
like to preserve, is that the generator will initially focus on
low-resolution features and then slowly shift its attention to
finer details. The architectures in Figure 7 make it possible
for the generator to first output low resolution images that
are not affected by the higher-resolution layers in a signif-
icant way, and later shift the focus to the higher-resolution
³ In residual network architectures, the addition of two paths leads to a doubling of signal variance, which we cancel by multiplying with 1/√2. This is crucial for our networks, whereas in classification resnets [18] the issue is typically hidden by batch normalization.
(a) StyleGAN-sized (config E) (b) Large networks (config F)
Figure 8. Contribution of each resolution to the output of the
generator as a function of training time. The vertical axis shows
a breakdown of the relative standard deviations of different reso-
lutions, and the horizontal axis corresponds to training progress,
measured in millions of training images shown to the discrimina-
tor. We can see that in the beginning the network focuses on low-
resolution images and progressively shifts its focus on larger res-
olutions as training progresses. In (a) the generator basically outputs a 512² image with some minor sharpening for 1024², while in
(b) the larger network focuses more on the high-resolution details.
layers as the training proceeds. Since this is not enforced in
any way, the generator will do it only if it is beneficial. To
analyze the behavior in practice, we need to quantify how
strongly the generator relies on particular resolutions over
the course of training.
Since the skip generator (Figure 7b) forms the image by
explicitly summing RGB values from multiple resolutions,
we can estimate the relative importance of the correspond-
ing layers by measuring how much they contribute to the
final image. In Figure 8a, we plot the standard deviation of
the pixel values produced by each tRGB layer as a function
of training time. We calculate the standard deviations over
1024 random samples of w and normalize the values so that
they sum to 100%.
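The measurement itself is straightforward to script. The sketch below assumes a generator object exposing `mapping` and a `synthesis` call that can return the individual tRGB contributions via a hypothetical `return_rgb_contributions` flag; the official code obtains the same information differently.

```python
import numpy as np
import torch

def resolution_contributions(G, num_samples=1024, batch=32):
    """Relative std of per-resolution tRGB outputs, normalized to sum to 100%."""
    stds = None
    with torch.no_grad():
        for _ in range(num_samples // batch):
            z = torch.randn(batch, 512)
            w = G.mapping(z)
            # Hypothetical flag: a list of tRGB outputs, one per resolution.
            rgbs = G.synthesis(w, return_rgb_contributions=True)
            batch_stds = np.array([r.std().item() for r in rgbs])
            stds = batch_stds if stds is None else stds + batch_stds
    stds = stds / stds.sum()
    return 100.0 * stds   # percentage contribution of each resolution
```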
At the start of training, we can see that the new skip
generator behaves similar to progressive growing — now
achieved without changing the network topology. It would
thus be reasonable to expect the highest resolution to dom-
inate towards the end of the training. The plot, however,
shows that this fails to happen in practice, which indicates
that the generator may not be able to “fully utilize” the tar-
get resolution. To verify this, we inspected the generated
images manually and noticed that they generally lack some
of the pixel-level detail that is present in the training data—
the images could be described as being sharpened versions
of 512² images instead of true 1024² images.
This leads us to hypothesize that there is a capacity prob-
lem in our networks, which we test by doubling the number
of feature maps in the highest-resolution layers of both net-
works.4This brings the behavior more in line with expecta-
⁴ We double the number of feature maps in resolutions 64²–1024² while keeping other parts of the networks unchanged. This increases the total number of trainable parameters in the generator by 22% (25M → 30M) and in the discriminator by 21% (24M → 29M).
Dataset        Resolution   StyleGAN (A)       StyleGAN2 (F)
                            FID      PPL       FID      PPL
LSUN CAR       512×384      3.27     1485      2.32     416
LSUN CAT       256×256      8.53     924       6.93     439
LSUN CHURCH    256×256      4.21     742       3.86     342
LSUN HORSE     256×256      3.83     1405      3.43     338

Table 3. Improvement in LSUN datasets measured using FID and PPL. We trained CAR for 57M images, CAT for 88M, CHURCH for 48M, and HORSE for 100M images.
tions: Figure 8b shows a significant increase in the contri-
bution of the highest-resolution layers, and Table 1, row F
shows that FID and Recall improve markedly. The last row
shows that baseline StyleGAN also benefits from additional
capacity, but its quality remains far below StyleGAN2.
Table 3 compares StyleGAN and StyleGAN2 in four
LSUN categories, again showing clear improvements in
FID and significant advances in PPL. It is possible that fur-
ther increases in the size could provide additional benefits.
5. Projection of images to latent space
Inverting the synthesis network g is an interesting problem that has many applications. Manipulating a given image in the latent feature space requires finding a matching
latent code w for it first. Previous research [1, 10] suggests
that instead of finding a common latent code w, the results
improve if a separate w is chosen for each layer of the gen-
erator. The same approach was used in an early encoder im-
plementation [32]. While extending the latent space in this
fashion finds a closer match to a given image, it also enables
projecting arbitrary images that should have no latent rep-
resentation. Instead, we concentrate on finding latent codes
in the original, unextended latent space, as these correspond
to images that the generator could have produced.
Our projection method differs from previous methods
in two ways. First, we add ramped-down noise to the la-
tent code during optimization in order to explore the latent
space more comprehensively. Second, we also optimize the
stochastic noise inputs of the StyleGAN generator, regular-
izing them to ensure they do not end up carrying coherent
signal. The regularization is based on enforcing the auto-
correlation coefficients of the noise maps to match those of
unit Gaussian noise over multiple scales. Details of our pro-
jection method can be found in Appendix D.
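The noise regularization can be sketched as a multi-scale penalty on spatial autocorrelation; the exact loss and weighting in Appendix D may differ from this sketch, and `noise_maps` stands for the per-layer noise inputs being optimized.

```python
import torch
import torch.nn.functional as F

def noise_regularization(noise_maps):
    """Penalize spatial correlations in the optimized noise maps at multiple scales."""
    loss = 0.0
    for n in noise_maps:                       # each n has shape [1, 1, H, W]
        while True:
            # Lag-1 autocorrelation along x and y; zero in expectation for unit Gaussian noise.
            loss = loss + (n * torch.roll(n, shifts=1, dims=3)).mean() ** 2
            loss = loss + (n * torch.roll(n, shifts=1, dims=2)).mean() ** 2
            if n.shape[2] <= 8:                # stop at a small resolution
                break
            n = F.avg_pool2d(n, kernel_size=2) * 2.0   # downsample, restore unit variance
    return loss
```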
5.1. Attribution of generated images
Detection of manipulated or generated images is a very
important task. At present, classifier-based methods can
quite reliably detect generated images, regardless of their
exact origin [29,45,40,51,41]. However, given the rapid
pace of progress in generative methods, this may not be a
lasting situation. Besides general detection of fake images,
we may also consider a more limited form of the problem:
StyleGAN — generated images StyleGAN2 — generated images StyleGAN2 — real images
Figure 9. Example images and their projected and re-synthesized counterparts. For each configuration, top row shows the target images
and bottom row shows the synthesis of the corresponding projected latent vector and noise inputs. With the baseline StyleGAN, projection
often finds a reasonably close match for generated images, but especially the backgrounds differ from the originals. The images generated
using StyleGAN2 can be projected almost perfectly back into generator inputs, while projected real images (from the training set) show
clear differences to the originals, as expected. All tests were done using the same projection method and hyperparameters.
(Panels: LSUN CAR, StyleGAN; FFHQ, StyleGAN; LSUN CAR, StyleGAN2; FFHQ, StyleGAN2. Each panel shows histograms for Generated and Real images over LPIPS distances 0.0–0.5.)
Figure 10. LPIPS distance histograms between original and pro-
jected images for generated (blue) and real images (orange). De-
spite the higher image quality of our improved generator, it is
much easier to project the generated images into its latent space
W. The same projection method was used in all cases.
being able to attribute a fake image to its specific source [2].
With StyleGAN, this amounts to checking if there exists a
$w \in \mathcal{W}$ that re-synthesizes the image in question.
We measure how well the projection succeeds by computing the LPIPS [50] distance between the original and re-synthesized image as $D_{\mathrm{LPIPS}}[x, g(\tilde{g}^{-1}(x))]$, where $x$ is the image being analyzed and $\tilde{g}^{-1}$ denotes the approximate projection operation. Figure 10 shows histograms of these dis-
tances for LSUN CAR and FFHQ datasets using the origi-
nal StyleGAN and StyleGAN2, and Figure 9shows exam-
ple projections. The images generated using StyleGAN2
can be projected into W so well that they can be almost
unambiguously attributed to the generating network. How-
ever, with the original StyleGAN, even though it should
technically be possible to find a matching latent code, it ap-
pears that the mapping from Wto images is too complex
for this to succeed reliably in practice. We find it encour-
aging that StyleGAN2 makes source attribution easier even
though the image quality has improved significantly.
6. Conclusions and future work
We have identified and fixed several image quality is-
sues in StyleGAN, improving the quality further and con-
siderably advancing the state of the art in several datasets.
In some cases the improvements are more clearly seen in
motion, as demonstrated in the accompanying video. Ap-
pendix Aincludes further examples of results obtainable us-
ing our method. Despite the improved quality, StyleGAN2
makes it easier to attribute a generated image to its source.
Training performance has also improved. At 1024²
resolution, the original StyleGAN (config A in Table 1)
trains at 37 images per second on NVIDIA DGX-1 with
8 Tesla V100 GPUs, while our config E trains 40% faster
at 61 img/s. Most of the speedup comes from simplified
dataflow due to weight demodulation, lazy regularization,
and code optimizations. StyleGAN2 (config F, larger net-
works) trains at 31 img/s, and is thus only slightly more
expensive to train than original StyleGAN. Its total training
time was 9 days for FFHQ and 13 days for LSUN CAR.
The entire project, including all exploration, consumed
132 MWh of electricity, of which 0.68 MWh went into
training the final FFHQ model. In total, we used about
51 single-GPU years of computation (Volta class GPU). A
more detailed discussion is available in Appendix F.
In the future, it could be fruitful to study further improve-
ments to the path length regularization, e.g., by replacing
the pixel-space L2 distance with a data-driven feature-space
metric. Considering the practical deployment of GANs, we
feel that it will be important to find new ways to reduce the
training data requirements. This is especially crucial in ap-
plications where it is infeasible to acquire tens of thousands
of training samples, and with datasets that include a lot of
intrinsic variation.
Acknowledgements We thank Ming-Yu Liu for an early
review, Timo Viitanen for help with the public release,
David Luebke for in-depth discussions and helpful com-
ments, and Tero Kuosmanen for technical support with the
compute infrastructure.
References
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Im-
age2StyleGAN: How to embed images into the StyleGAN
latent space? In ICCV, 2019. 7
[2] Michael Albright and Scott McCloskey. Source generator
attribution via inversion. In CVPR Workshops, 2019. 8
[3] Carl Bergstrom and Jevin West. Which face is
real? http://www.whichfaceisreal.com/learn.html, Accessed
November 15, 2019. 1
[4] Christopher M. Bishop. Pattern Recognition and Machine
Learning. Springer, 2006. 17
[5] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large
scale GAN training for high fidelity natural image synthesis.
CoRR, abs/1809.11096, 2018. 1
[6] Yann N. Dauphin, Harm de Vries, and Yoshua Bengio. Equi-
librated adaptive learning rates for non-convex optimization.
CoRR, abs/1502.04390, 2015. 5
[7] Emily L. Denton, Soumith Chintala, Arthur Szlam, and
Robert Fergus. Deep generative image models using
a Laplacian pyramid of adversarial networks. CoRR,
abs/1506.05751, 2015. 6
[8] Vincent Dumoulin, Ethan Perez, Nathan Schucher, Flo-
rian Strub, Harm de Vries, Aaron Courville, and Yoshua
Bengio. Feature-wise transformations. Distill, 2018.
https://distill.pub/2018/feature-wise-transformations. 1
[9] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kud-
lur. A learned representation for artistic style. CoRR,
abs/1610.07629, 2016. 1
[10] Aviv Gabbay and Yedid Hoshen. Style generator in-
version for image enhancement and animation. CoRR,
abs/1906.11880, 2019. 7
[11] R. Ge, X. Feng, H. Pyla, K. Cameron, and W.
Feng. Power measurement tutorial for the Green500
list. https://www.top500.org/green500/resources/tutorials/,
Accessed March 1, 2020. 21
[12] Robert Geirhos, Patricia Rubisch, Claudio Michaelis,
Matthias Bethge, Felix A. Wichmann, and Wieland Brendel.
ImageNet-trained CNNs are biased towards texture; increas-
ing shape bias improves accuracy and robustness. CoRR,
abs/1811.12231, 2018. 1,4
[13] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent
Dumoulin, and Jonathon Shlens. Exploring the structure of a
real-time, arbitrary neural artistic stylization network. CoRR,
abs/1705.06830, 2017. 1
[14] Xavier Glorot and Yoshua Bengio. Understanding the diffi-
culty of training deep feedforward neural networks. In Pro-
ceedings of the Thirteenth International Conference on Arti-
ficial Intelligence and Statistics, pages 249–256, 2010. 3
[15] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns
Hopkins Studies in the Mathematical Sciences. Johns Hop-
kins University Press, 2013. 17
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial networks. In NIPS,
2014. 1,5,11
[17] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent
Dumoulin, and Aaron C. Courville. Improved training of
Wasserstein GANs. CoRR, abs/1704.00028, 2017. 6
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian
Sun. Deep residual learning for image recognition. CoRR,
abs/1512.03385, 2015. 6
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Delving deep into rectifiers: Surpassing human-level perfor-
mance on ImageNet classification. CoRR, abs/1502.01852,
2015. 3
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler, and Sepp Hochreiter. GANs trained by
a two time-scale update rule converge to a local Nash equi-
librium. In Proc. NIPS, pages 6626–6637, 2017. 1
[21] Xun Huang and Serge J. Belongie. Arbitrary style trans-
fer in real-time with adaptive instance normalization. CoRR,
abs/1703.06868, 2017. 1
[22] Animesh Karnewar and Oliver Wang. MSG-GAN: multi-
scale gradients for generative adversarial networks. In Proc.
CVPR, 2020. 6
[23] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen.
Progressive growing of GANs for improved quality, stability,
and variation. CoRR, abs/1710.10196, 2017. 1,5,11
[24] Tero Karras, Samuli Laine, and Timo Aila. A style-based
generator architecture for generative adversarial networks. In
Proc. CVPR, 2018. 1,2,4,5,11,13,16,20
[25] Diederik P. Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In ICLR, 2015. 11,19
[26] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.
ImageNet classification with deep convolutional neural net-
works. In NIPS, pages 1097–1105. 2012. 11
[27] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko
Lehtinen, and Timo Aila. Improved precision and recall met-
ric for assessing generative models. In Proc. NeurIPS, 2019.
1,2,4
[28] Barbara Landau, Linda B. Smith, and Susan S. Jones. The
importance of shape in early lexical learning. Cognitive De-
velopment, 3(3), 1988. 4
[29] Haodong Li, Han Chen, Bin Li, and Shunquan Tan. Can
forensic detectors identify GAN generated images? In Proc.
Asia-Pacific Signal and Information Processing Association
Annual Summit and Conference (APSIPA ASC), 2018. 7
[30] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin.
Which training methods for GANs do actually converge?
CoRR, abs/1801.04406, 2018. 5,11
[31] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and
Yuichi Yoshida. Spectral normalization for generative ad-
versarial networks. CoRR, abs/1802.05957, 2018. 1,5,6,
20
[32] Dmitry Nikitko. StyleGAN – Encoder for official Ten-
sorFlow implementation. https://github.com/Puzer/stylegan-
encoder/, 2019. 7
[33] Augustus Odena, Jacob Buckman, Catherine Olsson, Tom B.
Brown, Christopher Olah, Colin Raffel, and Ian Goodfellow.
Is generator conditioning causally related to GAN perfor-
mance? CoRR, abs/1802.08768, 2018. 5,18
[34] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-
Net: Convolutional networks for biomedical image segmen-
tation. In Proc. Medical Image Computing and Computer-
Assisted Intervention (MICCAI), pages 234–241, 2015. 6
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael S. Bernstein, Alexander C. Berg,
and Fei-Fei Li. ImageNet large scale visual recognition chal-
lenge. In Proc. CVPR, 2015. 4
[36] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier
Bousquet, and Sylvain Gelly. Assessing generative models
via precision and recall. CoRR, abs/1806.00035, 2018. 1
[37] Tim Salimans and Diederik P. Kingma. Weight normaliza-
tion: A simple reparameterization to accelerate training of
deep neural networks. CoRR, abs/1602.07868, 2016. 3
[38] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Inter-
preting the latent space of GANs for semantic face editing.
CoRR, abs/1907.10786, 2019. 1
[39] Karen Simonyan and Andrew Zisserman. Very deep convo-
lutional networks for large-scale image recognition. CoRR,
abs/1409.1556, 2014. 1,4
[40] Run Wang, Lei Ma, Felix Juefei-Xu, Xiaofei Xie, Jian Wang,
and Yang Liu. FakeSpotter: A simple baseline for spotting
AI-synthesized fake faces. CoRR, abs/1909.06122, 2019. 7
[41] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew
Owens, and Alexei A. Efros. CNN-generated images are
surprisingly easy to spot... for now. CoRR, abs/1912.11035,
2019. 7
[42] Lance Williams. Pyramidal parametrics. SIGGRAPH Com-
put. Graph., 17(3):1–11, 1983. 6
[43] Sitao Xiang and Hao Li. On the effects of batch and weight
normalization in generative adversarial networks. CoRR,
abs/1704.03971, 2017. 3
[44] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianx-
iong Xiao. LSUN: Construction of a large-scale image
dataset using deep learning with humans in the loop. CoRR,
abs/1506.03365, 2015. 11
[45] Ning Yu, Larry Davis, and Mario Fritz. Attributing fake im-
ages to GANs: Analyzing fingerprints in generated images.
CoRR, abs/1811.08180, 2018. 7
[46] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augus-
tus Odena. Self-attention generative adversarial networks.
CoRR, abs/1805.08318, 2018. 5
[47] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei
Huang, Xiaogang Wang, and Dimitris N. Metaxas. Stack-
GAN: text to photo-realistic image synthesis with stacked
generative adversarial networks. In ICCV, 2017. 6
[48] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiao-
gang Wang, Xiaolei Huang, and Dimitris N. Metaxas. Stack-
GAN++: realistic image synthesis with stacked generative
adversarial networks. CoRR, abs/1710.10916, 2017. 6
[49] Richard Zhang. Making convolutional networks shift-
invariant again. In Proc. ICML, 2019. 5,11
[50] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shecht-
man, and Oliver Wang. The unreasonable effectiveness of
deep features as a perceptual metric. In Proc. CVPR, 2018.
4,8,19
[51] Xu Zhang, Svebor Karaman, and Shih-Fu Chang. Detect-
ing and simulating artifacts in GAN fake images. CoRR,
abs/1907.06515, 2019. 7
A. Image quality
We include several large images that illustrate various as-
pects related to image quality. Figure 11 shows hand-picked
examples illustrating the quality and diversity achievable
using our method in FFHQ, while Figure 12 shows uncu-
rated results for all datasets mentioned in the paper.
Figures 13 and 14 demonstrate cases where FID and
P&R give non-intuitive results, but PPL seems to be more
in line with human judgement.
We also include images relating to StyleGAN artifacts.
Figure 15 shows a rare case where the blob artifact fails
to appear in StyleGAN activations, leading to a seriously
broken image. Figure 16 visualizes the activations inside
Table 1 configurations A and F. It is evident that progres-
sive growing leads to higher-frequency content in the inter-
mediate layers, compromising shift invariance of the net-
work. We hypothesize that this causes the observed uneven
location preference for details when progressive growing is
used.
B. Implementation details
We implemented our techniques on top of the official TensorFlow implementation of StyleGAN⁵ corresponding to configuration A in Table 1. We kept most of the details unchanged, including the dimensionality of Z and W (512), mapping network architecture (8 fully connected layers, 100× lower learning rate), equalized learning rate for all trainable parameters [23], leaky ReLU activation with α = 0.2, bilinear filtering [49] in all up/downsampling layers [24], minibatch standard deviation layer at the end of the discriminator [23], exponential moving average of generator weights [23], style mixing regularization [24], non-saturating logistic loss [16] with R1 regularization [30], Adam optimizer [25] with the same hyperparameters (β1 = 0, β2 = 0.99, ε = 10⁻⁸, minibatch = 32), and training datasets [24, 44]. We performed all training runs on NVIDIA DGX-1 with 8 Tesla V100 GPUs using TensorFlow 1.14.0 and cuDNN 7.4.2.
Generator redesign In configurations B–F we replace the original StyleGAN generator with our revised architecture. In addition to the changes highlighted in Section 2, we initialize components of the constant input c1 using N(0, 1) and simplify the noise broadcast operations to use a single shared scaling factor for all feature maps. Similar to Karras et al. [24], we initialize all weights using N(0, 1) and all biases and noise scaling factors to zero, except for the biases of the affine transformation layers, which we initialize to one. We employ weight modulation and demodulation in all convolution layers, except for the output layers (tRGB in Figure 7) where we leave out the demodulation. With 1024² output resolution, the generator contains a total of 18 affine transformation layers where the first one corresponds to 4² resolution, the next two correspond to 8², and so forth.

⁵ https://github.com/NVlabs/stylegan
Weight demodulation Considering the practical implementation of Equations 1 and 3, it is important to note that the resulting set of weights will be different for each sample in a minibatch, which rules out direct implementation using standard convolution primitives. Instead, we choose to employ grouped convolutions [26] that were originally proposed as a way to reduce computational costs by dividing the input feature maps into multiple independent groups, each with their own dedicated set of weights. We implement Equations 1 and 3 by temporarily reshaping the weights and activations so that each convolution sees one sample with N groups — instead of N samples with one group. This approach is highly efficient because the reshaping operations do not actually modify the contents of the weight and activation tensors.
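The reshaping trick can be sketched as follows, here in PyTorch rather than the official TensorFlow, with the tensor layout chosen for illustration rather than copied from the released code.

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, styles, demodulate=True, eps=1e-8):
    """Per-sample modulated convolution via grouped convolution (schematic).

    x:      input activations, shape [N, in_ch, H, W]
    weight: shared conv weights, shape [out_ch, in_ch, kh, kw]
    styles: per-sample input scales s, shape [N, in_ch]
    """
    N, in_ch, H, W = x.shape
    out_ch, _, kh, kw = weight.shape
    # Eq. 1: per-sample modulated weights, shape [N, out_ch, in_ch, kh, kw].
    w = weight.unsqueeze(0) * styles.reshape(N, 1, in_ch, 1, 1)
    if demodulate:
        # Eq. 3: normalize each output feature map to unit expected std.
        sigma_inv = torch.rsqrt(w.square().sum(dim=[2, 3, 4], keepdim=True) + eps)
        w = w * sigma_inv
    # Reshape so that one convolution with N groups handles the whole minibatch.
    x = x.reshape(1, N * in_ch, H, W)
    w = w.reshape(N * out_ch, in_ch, kh, kw)
    y = F.conv2d(x, w, padding=kh // 2, groups=N)
    return y.reshape(N, out_ch, y.shape[2], y.shape[3])
```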
Lazy regularization In configurations C–F we employ lazy regularization (Section 3.1) by evaluating the regularization terms (R1 and path length) in a separate regularization pass that we execute once every k training iterations. We share the internal state of the Adam optimizer between the main loss and the regularization terms, so that the optimizer first sees gradients from the main loss for k iterations, followed by gradients from the regularization terms for one iteration. To compensate for the fact that we now perform k+1 training iterations instead of k, we adjust the optimizer hyperparameters $\lambda' = c \cdot \lambda$, $\beta_1' = (\beta_1)^c$, and $\beta_2' = (\beta_2)^c$, where $c = k/(k+1)$. We also multiply the regularization term by k to balance the overall magnitude of its gradients. We use k = 16 for the discriminator and k = 8 for the generator.
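For concreteness, the hyperparameter adjustment is a direct transcription of the formulas above; the learning-rate value in the example is illustrative only.

```python
def lazy_adam_hyperparams(lr, beta1, beta2, k):
    """Adjust Adam hyperparameters when regularization runs every k-th iteration."""
    c = k / (k + 1)
    return lr * c, beta1 ** c, beta2 ** c

# Example: discriminator settings with k = 16.
lr_d, b1_d, b2_d = lazy_adam_hyperparams(lr=0.002, beta1=0.0, beta2=0.99, k=16)
```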
Path length regularization Configurations D–F include our new path length regularizer (Section 3.2). We initialize the target scale a to zero and track it on a per-GPU basis as the exponential moving average of $\|J_w^T y\|_2$ using decay coefficient $\beta_{pl} = 0.99$. We weight our regularization term by
$$\gamma_{pl} = \frac{\ln 2}{r^2 (\ln r - \ln 2)}, \qquad (5)$$
where $r$ specifies the output resolution (e.g., $r = 1024$). We have found these parameter choices to work reliably across all configurations and datasets. To ensure that our regularizer interacts correctly with style mixing regularization, we compute it as an average of all individual layers of the synthesis network. Appendix C provides detailed analysis of the effects of our regularizer on the mapping between W and image space.
Figure 11. Four hand-picked examples illustrating the image quality and diversity achievable using StyleGAN2 (config F).
FFHQ · LSUN CAR · LSUN CAT · LSUN CHURCH · LSUN HORSE
Figure 12. Uncurated results for each dataset used in Tables 1 and 3. The images correspond to random outputs produced by our generator (config F), with truncation applied at all resolutions using ψ = 0.5 [24].
Model 1: FID = 8.53, P = 0.64, R = 0.28, PPL = 924
Model 2: FID = 8.53, P = 0.62, R = 0.29, PPL = 387
Figure 13. Uncurated examples from two generative models trained on LSUN CAT without truncation. FID, precision, and recall are
similar for models 1 and 2, even though the latter produces cat-shaped objects more often. Perceptual path length (PPL) indicates a clear
preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F.
Model 1: FID = 3.27, P = 0.70, R = 0.44, PPL = 1485
Model 2: FID = 3.27, P = 0.67, R = 0.48, PPL = 437
Figure 14. Uncurated examples from two generative models trained on LSUN CAR without truncation. FID, precision, and recall are
similar for models 1 and 2, even though the latter produces car-shaped objects more often. Perceptual path length (PPL) indicates a clear
preference for model 2. Model 1 corresponds to configuration A in Table 3, and model 2 is an early training snapshot of configuration F.
Feature map 64² · Feature map 128² · Feature map 256² · Feature map 512² · Generated image
Figure 15. An example of the importance of the droplet artifact in StyleGAN generator. We compare two generated images, one successful
and one severely corrupted. The corresponding feature maps were normalized to the viewable dynamic range using instance normalization.
For the top image, the droplet artifact starts forming in 64² resolution, is clearly visible in 128², and increasingly dominates the feature
maps in higher resolutions. For the bottom image, 64² is qualitatively similar to the top row, but the droplet does not materialize in 128².
Consequently, the facial features are stronger in the normalized feature map. This leads to an overshoot in 256², followed by multiple
spurious droplets forming in subsequent resolutions. Based on our experience, it is rare that the droplet is missing from StyleGAN images,
and indeed the generator fully relies on its existence.
Generated image · Feature map 128² · Generated image · Feature map 128²
(a) Progressive growing (config A) (b) Without progressive growing (config F)
Figure 16. Progressive growing leads to significantly higher frequency content in the intermediate layers. This compromises shift-
invariance of the network and makes it harder to localize features precisely in the higher-resolution layers.
Progressive growing In configurations A–D we use progressive growing with the same parameters as Karras et al. [24] (start at 8² resolution and learning rate λ = 10⁻³, train for 600k images per resolution, fade in the next resolution for 600k images, increase the learning rate gradually by 3×). In configurations E–F we disable progressive growing and set the learning rate to a fixed value λ = 2·10⁻³, which we found to provide the best results. In addition, we
use output skips in the generator and residual connections
in the discriminator as detailed in Section 4.1.
Dataset-specific tuning Similar to Karras et al. [24], we
augment the FFHQ dataset with horizontal flips to effec-
tively increase the number of training images from 70k to
140k, and we do not perform any augmentation for the
LSUN datasets. We have found that the optimal choices
for the training length and R1 regularization weight γ tend to vary considerably between datasets and configurations. We use γ = 10 for all training runs except for configuration E in Table 1, as well as LSUN CHURCH and LSUN HORSE in Table 3, where we use γ = 100. It is possible that further tuning of γ could provide additional benefits.
Performance optimizations We profiled our training
runs extensively and found that — in our case — the default
primitives for image filtering, up/downsampling, bias ad-
dition, and leaky ReLU had surprisingly high overheads in
terms of training time and GPU memory footprint. This mo-
tivated us to optimize these operations using hand-written
CUDA kernels. We implemented filtered up/downsampling
as a single fused operation, and bias and activation as an-
other one. In configuration E at 1024² resolution, our opti-
mizations improved the overall training time by about 30%
and memory footprint by about 20%.
C. Effects of path length regularization
The path length regularizer described in Section 3.2 is of
the form:
$$L_{pl} = \mathbb{E}_w \mathbb{E}_y \left( \left\| J_w^T y \right\|_2 - a \right)^2, \qquad (6)$$
where $y \in \mathbb{R}^M$ is a unit normal distributed random variable in the space of generated images (of dimension $M = 3wh$, namely the RGB image dimensions), $J_w \in \mathbb{R}^{M \times L}$ is the Jacobian matrix of the generator function $g: \mathbb{R}^L \mapsto \mathbb{R}^M$ at a latent space point $w \in \mathbb{R}^L$, and $a \in \mathbb{R}$ is a global value that expresses the desired scale of the gradients.
C.1. Effect on pointwise Jacobians
The value of this prior is minimized when the inner expectation over y is minimized at every latent space point w separately. In this subsection, we show that the inner expectation is (approximately) minimized when the Jacobian matrix $J_w$ is orthogonal, up to a global scaling factor. The general strategy is to use the well-known fact that, in high dimensions L, the density of a unit normal distribution is concentrated on a spherical shell of radius $\sqrt{L}$. The inner expectation is then minimized when the matrix $J_w^T$ scales the function under expectation to have its minima at this radius. This is achieved by any orthogonal matrix (with suitable global scale that is the same at every w).
We begin by considering the inner expectation
$$L_w := \mathbb{E}_{y}\left( \left\lVert J_w^T y \right\rVert_2 - a \right)^2.$$
We first note that the radial symmetry of the distribution of y, as well as of the $l_2$ norm, allows us to focus on diagonal matrices only. This is seen using the Singular Value Decomposition $J_w^T = U \tilde{\Sigma} V^T$, where $U \in \mathbb{R}^{L \times L}$ and $V \in \mathbb{R}^{M \times M}$ are orthogonal matrices, and $\tilde{\Sigma} = [\,\Sigma \;\; \mathbf{0}\,]$ is a horizontal concatenation of a diagonal matrix $\Sigma \in \mathbb{R}^{L \times L}$ and a zero matrix $\mathbf{0} \in \mathbb{R}^{L \times (M-L)}$ [15]. Because rotating a unit normal random variable by an orthogonal matrix leaves the distribution unchanged, and rotating a vector leaves its norm unchanged, the expression simplifies to
$$L_w = \mathbb{E}_{y}\left( \left\lVert U \tilde{\Sigma} V^T y \right\rVert_2 - a \right)^2 = \mathbb{E}_{y}\left( \left\lVert \tilde{\Sigma} y \right\rVert_2 - a \right)^2.$$
Furthermore, the zero matrix in $\tilde{\Sigma}$ drops the dimensions of y beyond L, effectively marginalizing its distribution over those dimensions. The marginalized distribution is again a unit normal distribution over the remaining L dimensions. We are then left to consider the minimization of the expression
$$L_w = \mathbb{E}_{\tilde{y}}\left( \left\lVert \Sigma \tilde{y} \right\rVert_2 - a \right)^2$$
over diagonal square matrices $\Sigma \in \mathbb{R}^{L \times L}$, where $\tilde{y}$ is unit normal distributed in dimension L. To summarize, all matrices $J_w^T$ that share the same singular values with $\Sigma$ produce the same value for the original loss.
Next, we show that this expression is minimized when the diagonal matrix $\Sigma$ has a specific identical value at every diagonal entry, i.e., it is a constant multiple of an identity matrix. We first write the expectation as an integral over the probability density of $\tilde{y}$:
$$L_w = \int \left( \left\lVert \Sigma \tilde{y} \right\rVert_2 - a \right)^2 p_{\tilde{y}}(\tilde{y}) \, \mathrm{d}\tilde{y} = (2\pi)^{-\frac{L}{2}} \int \left( \left\lVert \Sigma \tilde{y} \right\rVert_2 - a \right)^2 \exp\left( -\frac{\tilde{y}^T \tilde{y}}{2} \right) \mathrm{d}\tilde{y}.$$
Observing the radially symmetric form of the density, we change into polar coordinates $\tilde{y} = r\varphi$, where $r \in \mathbb{R}_+$ is the distance from the origin, and $\varphi \in S^{L-1}$ is a unit vector, i.e., a point on the (L−1)-dimensional unit sphere. This change of variables introduces a Jacobian factor $r^{L-1}$:
$$L_w = (2\pi)^{-\frac{L}{2}} \int_{S} \int_0^{\infty} \left( r \left\lVert \Sigma \varphi \right\rVert_2 - a \right)^2 r^{L-1} \exp\left( -\frac{r^2}{2} \right) \mathrm{d}r \, \mathrm{d}\varphi.$$
The probability density $(2\pi)^{-L/2}\, r^{L-1} \exp\left(-\frac{r^2}{2}\right)$ is then an L-dimensional unit normal density expressed in polar coordinates, dependent only on the radius and not on the angle. A standard argument by Taylor approximation shows that when L is high, for any $\varphi$ the density is well approximated by the density $(2\pi e/L)^{-L/2} \exp\left(-\frac{(r-\mu)^2}{2\sigma^2}\right)$, which is an (unnormalized) one-dimensional normal density in r, centered at $\mu = \sqrt{L}$, with standard deviation $\sigma = 1/\sqrt{2}$ [4]. In other words, the density of the L-dimensional unit normal distribution is concentrated on a shell of radius $\sqrt{L}$. Substituting this density into the integral, the loss becomes
approximately
$$L_w \approx (2\pi e/L)^{-\frac{L}{2}} \int_{S} \int_0^{\infty} \left( r \left\lVert \Sigma \varphi \right\rVert_2 - a \right)^2 \exp\left( -\frac{(r-\sqrt{L})^2}{2\sigma^2} \right) \mathrm{d}r \, \mathrm{d}\varphi, \qquad (7)$$
where the approximation becomes exact in the limit of infi-
nite dimension L.
To minimize this loss, we set $\Sigma$ such that the function $\left( r \left\lVert \Sigma \varphi \right\rVert_2 - a \right)^2$ obtains minimal values on the spherical shell of radius $\sqrt{L}$. This is achieved by $\Sigma = \frac{a}{\sqrt{L}} I$, whereby the function becomes constant in $\varphi$ and the expression reduces to
$$L_w \approx (2\pi e/L)^{-\frac{L}{2}} A(S)\, \frac{a^2}{L} \int_0^{\infty} (r - \sqrt{L})^2 \exp\left( -\frac{(r-\sqrt{L})^2}{2\sigma^2} \right) \mathrm{d}r,$$
where $A(S)$ is the surface area of the unit sphere (and, like the other constant factors, irrelevant for minimization). Note that the zero of the parabola $(r - \sqrt{L})^2$ coincides with the maximum of the probability density, and therefore this choice of $\Sigma$ minimizes the inner integral in Eq. 7 separately for every $\varphi$.
In summary, we have shown that, assuming a high dimensionality L of the latent space, the value of the path length prior (Eq. 6) is minimized when all singular values of the Jacobian matrix of the generator are equal to a global constant at every latent space point w, i.e., when the Jacobians are orthogonal up to a globally constant scale.
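As a sanity check on this conclusion, the following NumPy snippet (an illustrative experiment, not from the paper) compares the inner expectation for an isotropic and an anisotropic diagonal Σ, with both rescaled so that the mean norm matches the target a; the isotropic choice attains the smaller value:

```python
import numpy as np

rng = np.random.default_rng(0)
L, a, n = 512, 1.0, 10_000
y = rng.standard_normal((n, L))                    # samples of a unit normal in R^L

def inner_expectation(diag):
    """Monte Carlo estimate of E_y (||diag(d) y||_2 - a)^2 for a diagonal Sigma."""
    norms = np.linalg.norm(y * diag, axis=1)
    norms *= a / norms.mean()                      # rescale so that E||Sigma y|| = a
    return np.mean((norms - a) ** 2)

iso = np.full(L, a / np.sqrt(L))                   # constant singular values
aniso = np.exp(rng.standard_normal(L))             # spread-out singular values

print(inner_expectation(iso), inner_expectation(aniso))
# The isotropic Sigma yields a value close to a^2/(2L); the anisotropic one is far larger.
```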
While in theory a merely scales the values of the mapping without changing its properties and could be set to a fixed value (e.g., 1), in practice it does affect the dynamics of the training. If the imposed scale does not match the scale induced by the random initialization of the network, the training spends its critical early steps pushing the weights towards the required overall magnitudes, rather than enforcing the actual objective of interest. This may degrade the internal state of the network weights and lead to sub-optimal performance in later training. Empirically, we find that setting a fixed scale reduces the consistency of the training results across training runs and datasets. Instead, we set a dynamically based on a running average of the existing scale of the Jacobians, namely $a \approx \mathbb{E}_{w,y} \left\lVert J_w^T y \right\rVert_2$. With this choice, the prior targets the scale of the local Jacobians towards whatever global average already exists, rather than forcing a specific global average. This also eliminates the need to measure the appropriate scale of the Jacobians explicitly, as is done by Odena et al. [33], who consider a related conditioning prior.
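One simple way to maintain such a running average is an exponential moving average of the observed gradient magnitudes; a minimal sketch follows, where the decay rate is an assumed value rather than one taken from the paper:

```python
class PathLengthTarget:
    """Tracks a as an exponential moving average of observed ||J_w^T y||_2."""

    def __init__(self, decay=0.01):   # decay rate is an assumption
        self.decay = decay
        self.a = 0.0

    def update(self, jt_y_norms):
        # jt_y_norms: per-sample magnitudes ||J_w^T y||_2 from the current batch
        self.a += self.decay * (float(jt_y_norms.mean()) - self.a)
        return self.a
```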
Figure 17. The mean and standard deviation of the magnitudes of sorted singular values of the Jacobian matrix evaluated at random latent space points w, with the largest eigenvalue normalized to 1, shown for FFHQ and Cars (configs A, C, D, and F). In both datasets, path length regularization (config D) and the novel architecture (config F) exhibit better conditioning; notably, the effect is more pronounced in the Cars dataset that contains much more variability, and where path length regularization has a relatively stronger effect on the PPL metric (Table 1).
Figure 17 shows empirically measured magnitudes of
singular values of the Jacobian matrix for networks trained
with and without path length regularization. While orthog-
onality is not reached, the eigenvalues of the regularized
network are closer to one another, implying better condi-
tioning, with the strength of the effect correlated with the
PPL metric (Table 1).
C.2. Effect on global properties of generator mapping
In the previous subsection, we found that the prior en-
courages the Jacobians of the generator mapping to be ev-
erywhere orthogonal. While Figure 17 shows that the map-
ping does not satisfy this constraint exactly in practice, it is
instructive to consider what global properties the constraint
implies for mappings that do. Without loss of generality,
we assume unit global scale for the matrices to simplify the
presentation.
The key property is that a mapping $g: \mathbb{R}^L \mapsto \mathbb{R}^M$ with everywhere orthogonal Jacobians preserves the lengths of curves. To see this, let $u: [t_0, t_1] \mapsto \mathbb{R}^L$ parametrize a curve in the latent space. Mapping the curve through the generator g, we obtain a curve $\tilde{u} = g \circ u$ in the space of images. Its arc length is
$$L = \int_{t_0}^{t_1} \left| \tilde{u}'(t) \right| \mathrm{d}t, \qquad (8)$$
where the prime denotes the derivative with respect to t. By the chain rule, this equals
$$L = \int_{t_0}^{t_1} \left| J_g(u(t))\, u'(t) \right| \mathrm{d}t, \qquad (9)$$
where $J_g \in \mathbb{R}^{M \times L}$ is the Jacobian matrix of g evaluated at u(t). By our assumption, the Jacobian is orthogonal, and
consequently it leaves the 2-norm of the vector $u'(t)$ unaffected:
$$L = \int_{t_0}^{t_1} \left| u'(t) \right| \mathrm{d}t. \qquad (10)$$
This is the length of the curve u in the latent space, prior to mapping with g. Hence, the lengths of u and $\tilde{u}$ are equal, and so g preserves the length of any curve.
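In the linear case this is easy to verify numerically: a matrix with orthonormal columns, the analogue of a mapping whose Jacobian is everywhere orthogonal, maps a discretized latent-space curve to an image-space curve of the same polyline length. A small NumPy illustration (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
L, M = 8, 64

# A matrix with orthonormal columns (Q^T Q = I), embedding R^L into R^M.
Q, _ = np.linalg.qr(rng.standard_normal((M, L)))

# Discretize a smooth curve u(t) in latent space and map it through Q.
t = np.linspace(0.0, 1.0, 1000)
u = np.stack([np.sin(2 * np.pi * (k + 1) * t) for k in range(L)], axis=1)  # (1000, L)
v = u @ Q.T                                                                 # (1000, M)

def polyline_length(points):
    return np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1))

print(polyline_length(u), polyline_length(v))   # the two lengths agree
```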
In the language of differential geometry, g isometrically embeds the Euclidean latent space $\mathbb{R}^L$ into a submanifold of $\mathbb{R}^M$, e.g., the manifold of images representing faces, embedded within the space of all possible RGB images. A consequence of isometry is that straight line segments in the latent space are mapped to geodesics, or shortest paths, on the image manifold: a straight line v that connects two latent space points cannot be made any shorter, so neither can there be a shorter on-manifold image-space path between the corresponding images than $g \circ v$. For example, a geodesic on the manifold of face images is a continuous morph between two faces that incurs the minimum total amount of change (as measured by $l_2$ difference in RGB space) when one sums up the image difference in each step of the morph.
Isometry is not achieved in practice, as demonstrated in
empirical experiments in the previous subsection. The full
loss function of the training is a combination of potentially
conflicting criteria, and it is not clear if a genuinely isomet-
ric mapping would be capable of expressing the image man-
ifold of interest. Nevertheless, a pressure to make the map-
ping as isometric as possible has desirable consequences. In
particular, it discourages unnecessary “detours”: in a non-
constrained generator mapping, a latent space interpolation
between two similar images may pass through any number
of distant images in RGB space. With regularization, the
mapping is encouraged to place distant images in different
regions of the latent space, so as to obtain short image paths
between any two endpoints.
D. Projection method details
Given a target image x, we seek to find the corresponding $w \in \mathcal{W}$ and per-layer noise maps, denoted $n_i \in \mathbb{R}^{r_i \times r_i}$, where i is the layer index and $r_i$ denotes the resolution of the ith noise map. The baseline StyleGAN generator at 1024×1024 resolution has 18 noise inputs, i.e., two for each resolution from 4×4 to 1024×1024 pixels. Our improved architecture has one fewer noise input because we do not add noise to the learned 4×4 constant (Figure 2).
Before optimization, we compute $\mu_w = \mathbb{E}_z f(z)$ by running 10 000 random latent codes z through the mapping network f. We also approximate the scale of $\mathcal{W}$ by computing $\sigma_w^2 = \mathbb{E}_z \left\lVert f(z) - \mu_w \right\rVert_2^2$, i.e., the average squared Euclidean distance to the center.
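A minimal sketch of this precomputation, assuming a callable mapping network `f` (a hypothetical stand-in) and an assumed latent dimensionality:

```python
import torch

@torch.no_grad()
def latent_statistics(f, num_samples=10_000, latent_dim=512):
    """Estimate mu_w and sigma_w^2 as described above (latent_dim is an assumption)."""
    z = torch.randn(num_samples, latent_dim)
    w = f(z)
    mu_w = w.mean(dim=0)                                  # mu_w = E_z f(z)
    sigma_sq_w = ((w - mu_w) ** 2).sum(dim=1).mean()      # E_z ||f(z) - mu_w||_2^2
    return mu_w, sigma_sq_w
```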
At the beginning of optimization, we initialize $w = \mu_w$ and $n_i = \mathcal{N}(0, I)$ for all i.
Figure 18. Effect of noise regularization in latent-space projection, where we also optimize the contents of the noise inputs of the synthesis network, shown for a generated target image and a real target image, each with and without noise regularization. Top to bottom: target image, re-synthesized image, contents of two noise maps at different resolutions. When regularization is turned off in this test, we only normalize the noise maps to zero mean and unit variance, which leads the optimization to sneak signal into the noise maps. Enabling the noise regularization prevents this. The model used here corresponds to configuration F in Table 1.
The trainable parameters are the components of w as well as all components in all noise maps $n_i$. The optimization is run for 1000 iterations using the Adam optimizer [25] with default parameters. The maximum learning rate is $\lambda_{\max} = 0.1$; it is ramped up from zero linearly during the first 50 iterations and ramped down to zero using a cosine schedule during the last 250 iterations. In the first three quarters of the optimization we add Gaussian noise to w when evaluating the loss function, as $\tilde{w} = w + \mathcal{N}(0, 0.05\,\sigma_w\, t^2)$, where t goes from one to zero during the first 750 iterations. This adds stochasticity to the optimization and stabilizes the search for the global optimum.
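The schedule can be summarized by the sketch below, which returns the learning rate and the latent-noise standard deviation (as a multiple of σ_w) for a given iteration; treat it as illustrative rather than as the reference implementation, since exact ramp shapes may differ in minor details:

```python
import numpy as np

def projection_schedule(step, num_steps=1000, lr_max=0.1,
                        rampup=50, rampdown=250, noise_steps=750):
    """Learning rate and latent-noise scale for one projection step."""
    # Linear ramp-up over the first 50 steps, cosine ramp-down over the last 250.
    lr = lr_max * min(1.0, step / rampup)
    if step > num_steps - rampdown:
        lr *= 0.5 * (1.0 + np.cos(np.pi * (step - (num_steps - rampdown)) / rampdown))
    # t goes from one to zero during the first 750 steps; noise std is 0.05 * sigma_w * t^2.
    t = max(0.0, 1.0 - step / noise_steps)
    noise_std_factor = 0.05 * t ** 2      # multiply by sigma_w to get the actual std
    return lr, noise_std_factor
```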
Given that we are explicitly optimizing the noise maps, we must be careful to prevent the optimization from sneaking actual signal into them. Thus we include several noise map regularization terms in our loss function, in addition to an image quality term. The image quality term is the LPIPS [50] distance between the target image x and the synthesized image: $L_{\mathrm{image}} = D_{\mathrm{LPIPS}}\left[ x,\, g(\tilde{w}, n_0, n_1, \ldots) \right]$. For increased performance and stability, we downsample both images to 256×256 resolution before computing the LPIPS distance. Regularization of the noise maps is performed on multiple resolution scales.
     SN-G  SN-D  Demod  P.reg    FID ↓   PPL ↓   Pre. ↑  Rec. ↑
1     –     –     ✓      ✓       2.83    145.0   0.689   0.492
2     –     ✓     ✓      ✓       2.98    131.4   0.700   0.469
3     ✓     ✓     ✓      ✓       3.40    130.9   0.720   0.435
4     ✓     ✓     ✓      –       3.38    162.6   0.705   0.468
5     ✓     ✓     –      –       3.33    394.9   0.705   0.463
6     ✓     –     –      ✓       3.36    217.1   0.695   0.464
7     ✓     –     –      –       3.22    394.4   0.692   0.489

Table 4. Effect of spectral normalization with FFHQ at 1024². The first row corresponds to StyleGAN2, i.e., config F in Table 1. In the subsequent rows, we enable spectral normalization in the generator (SN-G) and in the discriminator (SN-D). We also test the training without weight demodulation (Demod) and path length regularization (P.reg). All of these configurations are highly detrimental to FID, as well as to Recall. ↑ indicates that higher is better, and ↓ that lower is better.
For this purpose, we form, for each noise map greater than 8×8 in size, a pyramid down to 8×8 resolution by averaging 2×2 pixel neighborhoods and multiplying by 2 at each step to retain the expected unit variance. These downsampled noise maps are used for regularization only and have no part in synthesis.
Let us denote the original noise maps by $n_{i,0} = n_i$ and the downsampled versions by $n_{i,j>0}$. Similarly, let $r_{i,j}$ be the resolution of an original (j = 0) or downsampled (j > 0) noise map, so that $r_{i,j+1} = r_{i,j}/2$. The regularization term for noise map $n_{i,j}$ is then
$$L_{i,j} = \left( \frac{1}{r_{i,j}^2} \sum_{x,y} n_{i,j}(x,y)\, n_{i,j}(x-1,y) \right)^2 + \left( \frac{1}{r_{i,j}^2} \sum_{x,y} n_{i,j}(x,y)\, n_{i,j}(x,y-1) \right)^2,$$
where the noise map is considered to wrap at the edges. The regularization term is thus the sum of squares of the resolution-normalized autocorrelation coefficients at one-pixel shifts horizontally and vertically, which should be zero for a normally distributed signal. The overall loss term is then $L_{\mathrm{total}} = L_{\mathrm{image}} + \alpha \sum_{i,j} L_{i,j}$. In all our tests, we have used the noise regularization weight $\alpha = 10^5$. In addition, we renormalize all noise maps to zero mean and unit variance after each optimization step. Figure 18 illustrates the effect of noise regularization on the resulting noise maps.
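For concreteness, the pyramid construction and the per-map regularization terms can be sketched in NumPy as follows (illustrative only; the shift direction and indexing conventions are immaterial because of the wrap-around sum):

```python
import numpy as np

def noise_regularization(noise_map):
    """Sum of the L_{i,j} terms over the pyramid of one noise map.

    noise_map: square array of shape (r, r) with r a power of two >= 8.
    """
    n = noise_map
    total = 0.0
    while True:
        r = n.shape[0]
        # Resolution-normalized autocorrelation at one-pixel shifts (wrapping at edges).
        horiz = (n * np.roll(n, 1, axis=1)).sum() / r**2
        vert = (n * np.roll(n, 1, axis=0)).sum() / r**2
        total += horiz**2 + vert**2
        if r <= 8:
            break
        # Downsample by averaging 2x2 neighborhoods and multiply by 2
        # to retain the expected unit variance.
        n = 2.0 * n.reshape(r // 2, 2, r // 2, 2).mean(axis=(1, 3))
    return total
```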
E. Results with spectral normalization
Since spectral normalization (SN) is widely used in
GANs [31], we investigated its effect on StyleGAN2. Table 4 gives the results for a variety of configurations where
spectral normalization is enabled in addition to our tech-
niques (weight demodulation, path length regularization) or
instead of them.
Item                    GPU years (Volta)   Electricity (MWh)
Initial exploration          20.25                58.94
Paper exploration            13.71                31.49
FFHQ config F                 0.23                 0.68
Other runs in paper           7.20                16.77
Backup runs left out          4.73                12.08
Video, figures, etc.          0.31                 0.82
Public release                4.62                10.82
Total                        51.05               131.61
Table 5. Computational effort expenditure and electricity con-
sumption data for this project. The unit for computation is GPU-
years on a single NVIDIA V100 GPU — it would have taken ap-
proximately 51 years to execute this project using a single GPU.
See the text for additional details about the computation and en-
ergy consumption estimates. Initial exploration includes all train-
ing runs after the release of StyleGAN [24] that affected our de-
cision to start this project. Paper exploration includes all training
runs that were done specifically for this project, but were not in-
tended to be used in the paper as-is. FFHQ config F refers to the
training of the final network. This is approximately the cost of
training the network for another dataset without hyperparameter
tuning. Other runs in paper covers the training of all other net-
works shown in the paper. Backup runs left out includes the train-
ing of various networks that could potentially have been shown in
the paper, but were ultimately left out to keep the exposition more
focused. Video, figures, etc. includes computation that was spent
on producing the images and graphs in the paper, as well as on
the result video. Public release covers testing, benchmarking, and
large-scale image dumps related to the public release.
Interestingly, adding spectral normalization to our generator is almost a no-op. On an implementation level, SN scales the weight tensor of each layer with a scalar value 1/σ(w), where σ(w) is the largest singular value of the weights. The effect of such scaling, however, is overridden by Equation 3 for the main convolutional layers as well as the affine transformation layers. Thus, the only thing that SN adds on top of weight demodulation is its effect on the tRGB layers.
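To see why, note that a global scalar c applied to the weights (as SN does with c = 1/σ(w)) cancels in the demodulation step; sketching the argument with the notation of Equation 3, where $w'_{ijk} = s_i\, w_{ijk}$ and demodulation normalizes by the overall magnitude of each output feature map's weights,
$$w'_{ijk} = s_i\,(c\, w_{ijk}), \qquad w''_{ijk} = \frac{w'_{ijk}}{\sqrt{\sum_{i,k} {w'_{ijk}}^{2} + \epsilon}} = \frac{s_i\, w_{ijk}}{\sqrt{\sum_{i,k} (s_i\, w_{ijk})^{2} + \epsilon/c^{2}}},$$
so, up to the negligible change in the ε term, the demodulated weights $w''$ are independent of c.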
When we enable spectral normalization in the discrim-
inator, FID is slightly compromised. Enabling it in the
generator as well leads to significantly worse results, even
though its effect is isolated to the tRGB layers. Leaving SN
enabled, but disabling a subset of our contributions does not
improve the situation. Thus we conclude that StyleGAN2
gives better results without spectral normalization.
F. Energy consumption
Computation is a core resource in any machine learning
project: its availability and cost, as well as the associated
energy consumption, are key factors in both choosing re-
search directions and practical adoption. We provide a de-
tailed breakdown for our entire project in Table 5 in terms
of both GPU time and electricity consumption.
We report expended computational effort as single-GPU
years (Volta class GPU). We used a varying number of
NVIDIA DGX-1s for different stages of the project, and
converted each run to single-GPU equivalents by simply
scaling by the number of GPUs used.
The entire project consumed approximately 131.61
megawatt hours (MWh) of electricity. We followed the
Green500 power measurement guidelines [11] as follows.
For each job, we logged the exact duration, number of
GPUs used, and which of our two separate compute clus-
ters the job was executed on. We then measured the ac-
tual power draw of an 8-GPU DGX-1 when it was training
FFHQ config F. A separate estimate was obtained for the
two clusters because they use different DGX-1 SKUs. The
vast majority of our training runs used 8 GPUs, and for the
rest we approximated the power draw by scaling linearly
with n/8, where n is the number of GPUs.
Approximately half of the total energy was spent on early exploration and forming ideas. A quarter was subsequently spent on refining those ideas in more targeted experiments, and the final quarter on producing this paper and preparing the public release of code, trained models, and large sets of images. Training a single FFHQ network (config F) took approximately 0.68 MWh (0.5% of the total project expenditure). This is the cost that one would pay when training the network from scratch, possibly using a different dataset. In short, the vast majority of the electricity used went into shaping the ideas, testing hypotheses, and hyperparameter tuning. We did not use automated tools for finding hyperparameters or optimizing network architectures.