Why Deep Learning Works: Implicit Self-Regularization in DNNs, Michael W. Mahoney 20190225 - YouTube
My Collaborator did a great job giving a talk on our research at the local San Francisco Bay ACM Meetup
Michael W. Mahoney UC Berkeley
Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that—all else being equal—DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena. Joint work with Charles Martin of Calculation Consulting, Inc.
Michael W. Mahoney is at UC Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics. He has worked and taught at Yale University in the Math department, Yahoo Research, and Stanford University in the Math department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), and he was on the National Research Council’s Committee on the Analysis of Massive Data. He co-organized the Simons Institute’s fall 2013 program on the Theoretical Foundations of Big Data Analysis, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the lead PI for the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup re-imagining consumer video for billions of users.
My talk at ICSI-the International Computer Science Institute at UC Berkeley. ICSI is a leading independent, nonprofit center for research in computer science.
The idea…suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are
fine tuning a DNN for transfer learning, or
comparing a new architecture to an old one, or
we are just tuning our hyper-parameters.
Can we determine which DNN will generalize best–without peeking at the test data?
Theory actually suggests–yes we can!
An Unsupervised Test Metric for DNNs
We just need to measure the average log norm ⟨log ||W||_F⟩ of the layer weight matrices W,
where ||W||_F is the Frobenius norm.
The Frobenius norm is just the square root of the sum of the squares of the matrix elements. For example, it is easily computed in numpy as
np.linalg.norm(W,ord='fro')
where ‘fro’ is the default norm.
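A minimal numpy sketch of the metric (the helper name `avg_log_norm` and operating on a plain list of weight matrices are my choices for illustration; extracting the matrices from a framework is shown later in the post):

```python
import numpy as np

def avg_log_norm(weight_matrices):
    """Average log Frobenius norm <log ||W||_F> over the layer weight matrices."""
    log_norms = [np.log10(np.linalg.norm(W, ord='fro')) for W in weight_matrices]
    return np.mean(log_norms)
```

To compare two models, compute this over each model's layer weight matrices; the model with the smaller value is predicted to generalize better.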
It turns out that ⟨log ||W||_F⟩ is amazingly well correlated with the test accuracy of a DNN. How do we know ? We can plot ⟨log ||W||_F⟩ vs the reported test accuracy for the pretrained DNNs available in PyTorch. First, we look at the VGG models:
VGG and VGG_BN models
The plot shows the 4 VGG and VGG_BN models. Notice we do not need the ImageNet data to compute this; we simply compute the average log Norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller ⟨log ||W||_F⟩. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for …
the ResNet models
Across 4/5 pretrained ResNet models, with very different sizes, a smaller ⟨log ||W||_F⟩ generally implies a better Test Accuracy.
It is not perfect–ResNet 50 is an outlier–but it works amazingly well across numerous pretrained models, both in pyTorch and elsewhere (such as the OSMR sandbox). See the Appendix for more plots. What is more, notice that
the log Norm metric is completely Unsupervised
Recall that we have not peeked at the test data–or the labels. We simply computed ⟨log ||W||_F⟩ for the pretrained models directly from their weight files, and then compared this to the reported test accuracy.
Imagine being able to fine tune a neural network without needing test data. Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope this simple but powerful idea will help avoid this and advance the field forward.
Why does this work ?
Applying VC Theory of Product Norms
The theory shows that each layer weight matrix of a (well trained) DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent α.
The exponent α characterizes how well the layer weight matrix represents the correlations in the training data. Smaller α is better.
Smaller exponents correspond to more implicit regularization, and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent ⟨α⟩ would make a good overall unsupervised complexity metric for a DNN–and this is exactly what the last blog post showed.
The average power law metric is a weighted average,
⟨α⟩ = (1/L) Σ_l b_l α_l
where the layer weight factor b_l should depend on the scale of W_l. In other words, ‘larger’ weight matrices (in some sense) should contribute more to the weighted average.
Smaller ⟨α⟩ usually implies better generalization
For heavy tailed matrices, we can work out a relation between the log Norm of W and the power law exponent α, of the approximate form
log ||W||²_F ≈ α log λ_max
where we note that λ_max is the largest eigenvalue of the correlation matrix X = (1/N) WᵀW.
So the weight factor b is simply the log of the maximum eigenvalue λ_max associated with W.
In the paper we will show the math; below we present numerical results to convince the reader.
We argue here that we can approximate the average Power Law metric by simply computing the average log Norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy — without needing a test data set!
Discussion
Implications: Norms vs Power Laws
The Power Law metric is consistent with the recent theoretical results, but our approach and intent are different:
Unlike their result, our approach does not require modifying the loss function.
Moreover, they seek a Worst Case complexity bound. We seek Average Case metrics. Incredibly, the 2 approaches are completely compatible.
But the biggest difference is that we apply our Unsupervised metric to large, production quality DNNs. Here are …
We believe this result will have broad application in hyper-parameter tuning of DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.
WeightWatcher
We have built a python package for Jupyter Notebooks that does this for you–the weight watcher. It works on Keras and PyTorch. We will release it shortly.
For a given N × M weight matrix W (with N ≥ M), we form the correlation matrix
X = (1/N) WᵀW
and then compute the M eigenvalues λ_i of X.
We call the histogram of eigenvalues the Empirical Spectral Density (ESD). It can nearly always be fit to a power law
ρ(λ) ~ λ^(−α)
We call the Power Law Universal because 80-90% of the exponents α lie in the range [2, 4].
For fully connected layers, we just take W as is. For Conv2D layers with shape (N, M, i, j), we consider all i × j 2D feature maps of shape N × M. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
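A sketch of the Conv2D procedure (the (N, M, i, j) shape convention and the helper name are my assumptions for illustration):

```python
import numpy as np

def conv2d_esd(Wtensor):
    """Collect eigenvalues from all i x j 2D feature-map slices of a
    Conv2D weight tensor, assumed to have shape (N, M, i, j)."""
    N, M, imax, jmax = Wtensor.shape
    evals = []
    for i in range(imax):
        for j in range(jmax):
            W = Wtensor[:, :, i, j]          # one N x M feature map
            if W.shape[0] < W.shape[1]:      # ensure N >= M
                W = W.T
            # correlation matrix X = (1/N) W^T W
            X = np.dot(W.T, W) / W.shape[0]
            evals.extend(np.linalg.eigvalsh(X))
    return np.array(evals)
```

Pooling the eigenvalues of all the slices gives one ESD per Conv2D layer, which can then be fit to a power law exactly as for the FC layers.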
As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Although, compared to the FC layers, for the Conv2D layers we do see more exponents outside this range. We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is
Are power law exponents correlated with better generalization accuracies ? … YES they are!
We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including
The VGG models, with and without BatchNormalization, such as VGG11 vs VGG11_BN
Inception V3 vs V4
SqueezeNet V1.0 vs V1.1
The ResNext101 models
The sequence of Resnet models, including Resnet18, 34, 50, 101, & 152, as well as
2 other ResNet implementations, CaffeResnet101 and FbResnet152
To compare these model versions, we can simply compute the average power law exponent ⟨α⟩, averaged across all FC weight matrices and Conv2D feature maps. (Note I only consider sufficiently large matrices.) In nearly every case, smaller ⟨α⟩ is correlated with better test accuracy (i.e. generalization performance).
The only significant caveats are:
for the VGG16 and VGG19 models, we do not include the last FC layer in the average–the layer that connects the model to the labels. In both models, this last layer has a higher power law exponent α that throws off the average for the model.
the InceptionResnetV2 is an outlier. It is unclear why at this time. It is not shown here but will be discussed when these results are published.
Let’s first look at the VGG models, plus a couple of others, not including the final FC layer in the average (again, this only changes the results for VGG16 and VGG19).
In all cases, the pre-trained model with the better Test Accuracy has, on average, a smaller power law exponent ⟨α⟩. This is an easy comparison because we are looking at 2 versions of the same architectures, with only slight improvements. For example, VGG11_BN only differs from VGG11 because it has Batch Normalization.
The Inception models show similar behavior: InceptionV3 has smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 ⟨α⟩ is larger than that of InceptionV4.
Now consider the Resnet models, which are increasing in size and have more architectural differences between them:
Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger ⟨α⟩ than FbResnet152, but they are close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller ⟨α⟩ across a wide range of architectures.
Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better–without peeking at the test data.
I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you to see how useful this is in practice.
We can learn a lot about Why Deep Learning Works by studying the properties of the layer weight matrices of pre-trained neural networks. And, hopefully, by doing this, we can get some insight into what a well trained DNN looks like–even without peeking at the training data.
One broad question we can ask is:
How is information concentrated in Deep Neural Networks (DNNs)?
To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in pyTorch.
In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC), layers. And we saw that nearly all the FC layers display Power Law behavior. And, in fact, this behavior is Universal across both ImageNet and NLP models.
But this is only part of the story. Here, we ask a related question–do the weight matrices of well trained DNNs lose Rank ?
Matrix Rank:
Let’s say W is an N × M matrix, with N ≥ M. We can form the Singular Value Decomposition (SVD):
W = UΣVᵀ
The Matrix Rank R(W), or Hard Rank, is simply the number of non-zero singular values σ_i,
R(W) = #{σ_i > 0} ≤ M
which expresses any decrease from Full Rank M.
Notice the Hard Rank of the rectangular matrix W is at most M, the dimension of the square correlation matrix WᵀW.
In python, this can be computed using
rank = numpy.linalg.matrix_rank(W)
Of course, being a numerical method, we really mean the number of singular values above some tolerance…and we can get different results depending on the tolerance we use.
Here, we will compute the rank ourselves, use an extremely loose tolerance, and count any singular value above it as non-zero. As we shall see, DNNs are so good at concentrating information that it will not matter.
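A minimal sketch of this hand-rolled rank computation (the helper name and tolerance value are illustrative):

```python
import numpy as np

def hard_rank(W, tol=1e-10):
    """Count singular values above a (very loose) numerical tolerance."""
    svals = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(svals > tol))
```

A full-rank random matrix returns its smaller dimension; a rank-one outer product returns 1, since all its other singular values sit at round-off level, far below any reasonable tolerance.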
Rank Collapse and Regularization
If all the singular values σ_i are non-zero, we say W is Full Rank. If one or more σ_i = 0, then we say W is Singular. It has lost expressiveness, and the model has undergone Rank Collapse.
When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression
Ax = b
The simple solution is to use a little linear algebra to get the optimal values for the unknown x:
x = (AᵀA)⁻¹ Aᵀb
But when AᵀA is Singular, we can not form the matrix inverse. To fix this, we simply add some small constant ε to the diagonal of AᵀA,
x = (AᵀA + εI)⁻¹ Aᵀb
so that all the singular values will now be greater than zero, and we can form a generalized pseudo-inverse, called the Moore-Penrose Inverse.
This procedure is also called Tikhonov Regularization. The constant ε, or Regularizer, sets the Noise Scale for the model. The information in A is concentrated in the singular vectors associated with the larger singular values σ_i, and the noise is left over in those associated with the smaller singular values:
Information: singular vectors where σ_i is large
Noise: singular vectors where σ_i is small
In cases where is Singular, regularization is absolutely necessary. But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept)
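A minimal numpy sketch of the Tikhonov-regularized solution (assuming the standard least-squares setup Ax = b; the helper name and names A, b, eps are mine):

```python
import numpy as np

def tikhonov_solve(A, b, eps=1e-3):
    """Solve min ||b - A x||^2 with Tikhonov regularization:
    x = (A^T A + eps*I)^{-1} A^T b.
    The small constant eps shifts every singular value away from zero,
    so the inverse exists even when A^T A is singular."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + eps * np.eye(d), A.T @ b)
```

With a duplicated column in A, the plain normal equations fail because AᵀA is singular, but the regularized solve still returns a sensible answer.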
Do the weight matrices of well trained DNNs undergo Rank Collapse ?
Answer: They DO NOT — as we now see:
Analyzing Pre-Trained pyTorch Models
We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value and compute a histogram of the minimums across different models.
import numpy as np
import torch

# model = a pretrained torchvision model, e.g. torchvision.models.alexnet(pretrained=True)
for im, m in enumerate(model.modules()):
    if isinstance(m, torch.nn.Linear):
        W = np.array(m.weight.data.clone().cpu())
        M, N = np.min(W.shape), np.max(W.shape)
        _, svals, _ = np.linalg.svd(W)
        minsval = np.min(svals)
        ...
We do this here for numerous models trained on ImageNet and available in pyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc.– as shown in this Jupyter Notebook.
We also examine the NLP models available in AllenNLP. This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run
This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.
Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular (i.e. N ≫ M), as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models.
Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio Q = N/M > 1, and really Q larger than 1.1. This is because Marchenko Pastur (MP) Random Matrix Theory (RMT) only strictly applies when Q > 1. We will review this in a future blog.
Minimum Singular Values of Pre-Trained Models
For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value σ_min. Only 6 of the 24 matrices looked at have a very small σ_min–and we have not carefully tested the numerical threshold–we are just eyeballing it here.
For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse. All of the singular values for every linear weight matrix are non-zero.
It is conjectured that fully optimized DNNs–those with the best generalization accuracy–will not show Rank Collapse in any of their linear weight matrices.
If you are training your own model and you see Rank Collapse, you are probably over-regularizing.
Inducing Rank Collapse is easy–just over-regularize
It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.
A Mini version of AlexNet, trained on CIFAR10, used to explore regularization and rank collapse in DNNs.
To induce rank collapse in our FC weight matrices, we can add strong weight norm constraints to the linear layers (in Keras, e.g., a kernel constraint such as max_norm(0.001))
We train this smaller MiniAlexnet model on CIFAR10 for 20 epochs, save the final weight matrix, and plot a histogram of the eigenvalues of the weight correlation matrix X = (1/N) WᵀW.
Rank Collapse Induced in a Mini AlexNet model, caused by adding weight norm constraints of 0.001
Recall that the eigenvalues are simply the square of the singular values. Here, most of them are nearly 0.
Adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero.
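A quick numerical check of the eigenvalue/singular-value identity, with the 1/N normalization used in this post (the matrix here is random, purely for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)
N, M = 100, 50
W = rng.normal(size=(N, M))

svals = np.linalg.svd(W, compute_uv=False)      # descending order
X = np.dot(W.T, W) / N                          # M x M correlation matrix
evals = np.sort(np.linalg.eigvalsh(X))[::-1]    # sort descending

# eigenvalues of X are the squared singular values of W, scaled by 1/N
assert np.allclose(evals, (svals ** 2) / N)
```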
Well trained Deep Neural Networks do not display Rank Collapse
Implications
We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper
Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning
In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix W, we compute the eigenvalues λ_i of the correlation matrix
X = (1/N) WᵀW
For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD), can be fit to a power law
ρ(λ) ~ λ^(−α)
where the exponents α all lie in the range [2, 4].
Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices.
Heavy Tailed Random Matrices
We define a random matrix W by fixing its size N × M and drawing the matrix elements W_ij from a random distribution. We can choose a
Gaussian Random Matrix: W_ij ~ N(0, σ²), where N(0, σ²) is a Gaussian distribution
or a
Heavy Tailed Random Matrix: W_ij ~ P(x), where P(x) ~ x^(−1−μ) is a power law (e.g. Pareto) distribution
In either case, Random Matrix Theory tells us what the asymptotic form of ESD should look like. But first, let’s see what model works best.
AlexNet FC3
First, let’s look at the ESD for AlexNet layer FC3, and zoomed in:
Recall that AlexNet FC3 fits a power law with a fat tailed exponent α, so we also plot the ESD on a log-log scale
AlexNet Layer FC3 Log Log Histogram of ESD
Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of all the pretrained models we have looked at so far. We now ask…
What kind of Random Matrix would make a good model for this ESD ?
ESDs: Gaussian random matrices
We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.
import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 500
Q = N / M
W = np.random.normal(0, 1, size=(N, M))
# X shape is M x M
X = (1/N) * np.dot(W.T, W)
evals = np.linalg.eigvalsh(X)   # X is symmetric, so eigvalsh is appropriate
plt.hist(evals, bins=100, density=True)
Empirical Spectral Density (ESD) for Gaussian Random Matrices, with different Q values.
Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1)
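The tight bound is a quantitative RMT prediction: for unit-variance entries, the Marchenko Pastur bulk edges are λ± = (1 ± 1/√Q)². A quick numerical sketch (my code; the edge formula is the standard MP result):

```python
import numpy as np

rng = np.random.RandomState(0)
N, M = 1000, 500
Q = N / M
W = rng.normal(0, 1, size=(N, M))

X = np.dot(W.T, W) / N              # M x M correlation matrix
evals = np.linalg.eigvalsh(X)

# Marchenko Pastur bulk edges for variance sigma^2 = 1;
# up to small finite-size fluctuations, all eigenvalues
# should fall inside [lam_minus, lam_plus]
lam_plus = (1 + 1 / np.sqrt(Q)) ** 2
lam_minus = (1 - 1 / np.sqrt(Q)) ** 2
```

For Q = 2, λ+ ≈ 2.91, which matches the sharp right edge visible in the histograms above.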
ESDs: Power Laws and Log Log Histograms
We can generate a heavy, or fat-tailed, random matrix just as easily using the numpy Pareto function
W = np.random.pareto(mu, size=(N, M))
Heavy Tailed Random matrices have very different ESDs. They have very long tails–so long, in fact, that it is better to plot them on a log log Histogram.
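A sketch of building such a log log histogram with logarithmically spaced bins (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
N, M, mu = 1000, 500, 2.5
W = rng.pareto(mu, size=(N, M))     # heavy tailed random matrix

X = np.dot(W.T, W) / N              # correlation matrix
evals = np.linalg.eigvalsh(X)
evals = evals[evals > 0]            # drop any tiny negative round-off values

# logarithmically spaced bins; on log-log axes a power law tail
# shows up as an (approximately) straight line
bins = np.logspace(np.log10(evals.min()), np.log10(evals.max()), 100)
hist, edges = np.histogram(evals, bins=bins, density=True)
# in a notebook: plt.loglog(edges[:-1], hist, marker='.')
```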
Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet ?
the smallest exponent (blue) has a very long tail, extending over 11 orders of magnitude. This means the largest eigenvalues would be astronomically large. No real W would behave like this.
the largest exponent (red) has a very compact ESD, resembling more the Gaussian Ws above.
the fat tailed ESD (green), however, is just about right. The ESD is linear in the central region, suggesting a power law. Its tail extends a little too far for our eigenvalues, but it also cuts off sharply, which is expected for any finite W. So we are close.
AlexNet FC3
Let’s overlay the ESD of the fat-tailed W with the actual empirical ESD from AlexNet layer FC3.
We see a pretty good match to a Fat-tailed random matrix with a suitable exponent μ.
Turns out, there is something very special about μ being in the range 2-4.
Universality Classes:
Random Matrix Theory predicts the shape of the ESD, in the asymptotic limit, for several kinds of Random Matrices, called Universality Classes. The 3 different ranges of μ each represent a different Universality Class:
In particular, if we draw from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density is likewise a power law (PL), either globally, or at least locally.
What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of . And the amazing thing is that
the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory
But this is a little tricky to show, because we need to relate the empirical exponents α that we fit to the theoretical μ. We now look at the
Relations between α and μ
RMT tells us that, for 0 < μ < 2, the ESD takes the limiting form
ρ(λ) ~ λ^(−(1 + μ/2)), where α = 1 + μ/2
And this works pretty well in practice for the Heavy Tailed Universality Class, for 0 < μ < 2. But for any finite matrix, as soon as μ > 2, the finite size effects kick in, and we can not naively apply the infinite limit result.
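The relation α = 1 + μ/2 can be motivated with a quick change of variables (a heuristic sketch, not the full RMT derivation):

```latex
% entries have tail P(x > s) \sim s^{-\mu}; large eigenvalues scale as \lambda \sim x^{2}
P(\lambda > z) \;\sim\; P\!\left(x > \sqrt{z}\right) \;\sim\; z^{-\mu/2}
\quad\Longrightarrow\quad
\rho(\lambda) \;=\; -\frac{d}{dz} P(\lambda > z)\Big|_{z=\lambda}
\;\sim\; \lambda^{-(1+\mu/2)},
\qquad
\alpha \;=\; 1 + \frac{\mu}{2}.
```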
Statistics of the maximum eigenvalue(s)
RMT not only tells us about the shape of the ESD; it makes statements about the statistics of the edge and/or tails–the fluctuations in the maximum eigenvalue λ_max. Specifically, we have
Gaussian RMT: fluctuations of λ_max follow the Tracy-Widom law
Fat Tailed RMT: fluctuations of λ_max follow a Frechet distribution
For standard, Gaussian RMT, λ_max (near the bulk edge) is governed by the famous Tracy-Widom law. And for μ > 4, RMT is governed by the (Tao-Vu) Four Moment Theorem.
In particular, the effects of M and Q kick in as soon as μ > 2. If the tail is undersampled (small Q, large M), the power law will look weaker, and we will overestimate α in our fits.
And, for us, this affects how we estimate μ from α and assign the Universality Class.
Fat Tailed Matrices and the Finite Size Effects for μ > 2
Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with fixed M (left) or N (right), but different Q. We fit each ESD to a power law.
Why Does Deep Learning Work ? If we could get a better handle on this, we could solve some very hard problems in Deep Learning. A new class at Stanford is trying to address this. Here, I review some very new results from my own research (w/UC Berkeley) on the topic, which has direct practical application. I will show that
In pretrained, production quality DNNs, the weight matrices for the Fully Connected (FC ) layers display Fat Tailed Power Law behavior.
Setup: the Energy Landscape function
Deep Neural Networks (DNNs) minimize the Energy function defined by their architecture. We define the layer weight matrices W_l, biases b_l, and activation functions h_l, giving the energy
E(x) = h_L(W_L h_{L−1}(W_{L−1} ⋯ h_1(W_1 x + b_1) ⋯) + b_L)
We train the DNN on a labeled data set (d, y), giving the optimization problem
min over {W_l, b_l} of Σ_i Loss(E(d_i), y_i)
We call this the Energy Landscape because the DNN optimization problem is only parameterized by the weights and biases. Of course, in any real DNN problem, we do have other adjustable parameters, such as the amount of Dropout, the learning rate, the batch size, etc. But these regularization effects simply change the global Energy Landscape.
But here, for now, we only look at the final result. Once a DNN is trained, what is left are the weights (and biases). We can reuse the weights (and biases) of a pre-trained DNN to build new DNNs with transfer learning. And if we train a bunch of DNNs, we want to know which one is better ?
But, practically, we would really like to identify a very good DNN without peeking at the test data, since every time we peek, and retrain, we risk overtraining our DNN.
I now show we can at least start to do this by looking at the weight matrices themselves. So let us look at the weights of some pre-trained DNNs.
Pretrained Linear Models in Pytorch
Pytorch comes with several pretrained models, such as AlexNet. To start, we just examine the weight matrices of the Linear / Fully Connected (FC) layers.
import torch.nn as nn
import torchvision.models as models

pretrained_model = models.alexnet(pretrained=True)
for module in pretrained_model.modules():
    if isinstance(module, nn.Linear):
        ...
The Linear layers have the simplest weight matrices ; they are 2-dimensional tensors, or just rectangular matrices.
Let W be an N × M matrix, where N ≥ M. We can get the matrix from the pretrained model using:
import numpy as np
W = np.array(module.weight.data.clone().cpu())
M, N = np.min(W.shape), np.max(W.shape)
Empirical Spectral Density
How is information concentrated in W ? For any rectangular matrix, we can form the
Singular Value Decomposition (SVD)
W = UΣVᵀ
which is readily computed in scikit-learn. We will use the faster TruncatedSVD method, and compute the singular values σ_i:
(Technically, we do miss the smallest singular value doing this, but that’s ok. It won’t matter here, and we can always use the pure svd method to be exact)
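A sketch of this step (assuming scikit-learn’s TruncatedSVD; I use the exact ‘arpack’ solver here so the values can be checked against the pure svd, and the matrix is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
N, M = 100, 50
W = rng.normal(size=(N, M))

# TruncatedSVD computes only the top n_components singular values;
# with M-1 components we miss just the smallest one
svd = TruncatedSVD(n_components=M - 1, algorithm='arpack')
svd.fit(W)
svals = np.sort(svd.singular_values_)[::-1]

# sanity check against the exact ("pure") SVD
full = np.linalg.svd(W, compute_uv=False)
assert np.allclose(svals, full[:M - 1], rtol=1e-6)
```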
Eigenvalues of the Correlation Matrix
We can, alternatively, form the eigenvalues λ_i of the correlation matrix
X = (1/N) WᵀW
The eigenvalues are just the square of the singular values.
Notice here we normalize them by N.
evals = (1/N)*svals*svals
We now form the Empirical Spectral Density (ESD), which is, formally,
ρ(λ) = (1/M) Σ_i δ(λ − λ_i)
This notation just means: compute a histogram of the eigenvalues
import matplotlib.pyplot as plt
plt.hist(evals, bins=100, density=True)
We could also compute the spectral density using a Kernel Density Estimator (KDE); we save this for a future post.
We now look at the ESD of
Pretrained AlexNet
Here, we examine just FC3, the last Linear layer, connecting the model to the labels. The other linear layers, FC1 and FC2, look similar. Below is a histogram of the ESD. Notice it is very sharply peaked and has a long tail, extending out past 40.
We can get a better view of the heavy tailed behavior by zooming in.
The red curve is a fit of the ESD to the Marchenko Pastur (MP) Random Matrix Theory (RMT) result–it is not a very good fit. This means the ESD does not resemble that of a Gaussian Random matrix. Instead, it looks heavy tailed. Which leads to the question…
Do we have a Power Law ?
(Yes we do, as we shall see…)
Physicists love to claim they have discovered data that follows a power law:
p(x) ~ x^(−α)
The first thing to do: plot the data on a log-log histogram, and check that this plot is linear–at least in some region. Let’s look at our ESD for AlexNet FC3:
AlexNet FC3: Log log histogram of the Empirical Spectral Density (ESD). Notice the central region is mostly linear, suggesting a power law.
Yes, it is linear–in the central region, for eigenvalue frequencies between roughly ~1 and ~100–and that is most of the distribution.
Why is it not linear everywhere? Because the data is finite size–there are min and max cutoffs. In the infinite limit, a power law diverges as x → 0, and the tail extends indefinitely as x → ∞. In any finite size data set, there will be an x_min and an x_max.
Power Law Fits
Second, fit the data to a power law, with x_min and x_max in mind. The most widely available and accepted method is the Maximum Likelihood Estimator (MLE), developed by Clauset et al., and available in the python powerlaw package.
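For intuition, the continuous MLE has a simple closed form (a hand-rolled sketch; the powerlaw package does much more, e.g. selecting x_min automatically, which this sketch does not):

```python
import numpy as np

def mle_alpha(x, xmin):
    """Continuous MLE of the power law exponent (Clauset et al.):
    alpha_hat = 1 + n / sum(log(x_i / xmin)), over the x_i >= xmin."""
    x = x[x >= xmin]
    return 1.0 + x.size / np.sum(np.log(x / xmin))

# sanity check on synthetic data drawn from p(x) ~ x^{-alpha}, alpha = 2.5, xmin = 1
rng = np.random.RandomState(0)
u = rng.uniform(size=100000)
x = (1.0 - u) ** (-1.0 / (2.5 - 1.0))   # inverse transform sampling
```

Running `mle_alpha(x, 1.0)` on this synthetic data recovers an estimate close to the true exponent 2.5.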
import powerlaw
fit = powerlaw.Fit(evals, xmax=np.max(evals))
alpha, D = fit.alpha, fit.D
The D value is a quality metric, the Kolmogorov-Smirnov (KS) distance. There are other options as well. The smaller D is, the better. The table below shows typical values for good fits.
Linear Log Log Plots
The powerlaw package also makes some great plots. Below is a log log plot generated for our fit of FC3, for the central region of the ESD. The filled lines represent our fits, and the dotted lines are the actual power law PDF (blue) and CCDF (red). The filled lines look like straight lines and overlap the dotted lines–so this fit looks pretty good.
AlexNet FC3: Log Log plot of the central region of the Empirical Spectral Density and best Powerlaw fits: pdf (blue) and ccdf (red)
Is this enough ? Not yet…
Quality of the Estimator
We still need to know: do we have enough data to get a good estimate for α, what are our error bars, and what kind of systematic errors might we get?
This so-called statistically valid MLE estimator actually only works properly for a limited range of exponents α.
And we need a lot of data points…more than we have singular values
We can calibrate the estimator by generating some modest size (N=1000) random power law datasets using the numpy Pareto function, where the tail behaves as p(x) ~ x^(−1−μ),
and then fitting these with the PowerLaw package. We get the following curve
Powerlaw fits of exponent compared to data generated from a Pareto distribution, N=1000. Notice that the estimator works best for exponents between 1.5 and 3.5
The green line is a perfect estimate. The Powerlaw package overestimates small exponents and underestimates large ones. Fortunately, most of our fits lie in the good range.
Is a Power law the best model ?
A good fit is not enough. We also should ensure that no other obvious model is a better fit. The powerlaw package lets us test our fit against other common (long tailed) choices, namely
exponential (EXP)
stretched_exponential (S_EXP)
log normal (LOG_N)
truncated power law (TPL)
For example, to check if our data is better fit by a log normal distribution, we run
R, p = fit.distribution_compare('powerlaw', 'lognormal', normalized_ratio=True)
where R is the normalized log-likelihood ratio and p is the p-value. If R > 0 and p <= 0.05, then we can conclude that a power law is the better model.
Note that sometimes, for larger exponents, the best model may be a truncated power law (TPL).
Why Deep Learning Works: Self Regularization in DNNs
To be presented at UC Berkeley / NERSC Jun 8, 2018
Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models.
We show that the DNN training process itself implicitly implements a form of self-regularization, associated with entropy collapse / the information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization,
whereas large, modern deep networks display a new kind of heavy tailed self-regularization.
We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training.
Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training. In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomena unique to DNNs.
We argue that this heavy tailed self-regularization has practical implications for designing better DNNs, and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.