CALCULATED CONTENT by Charles H Martin, Phd - 2M ago

Big thanks to the team at This Week in Machine Learning and AI for my recent interview:

https://twimlai.com/meetups/implicit-self-regularization-in-deep-neural-networks/

Implicit Self Regularization in Deep Neural Networks @ TWiML Online Meetup Americas 20 February 2019 - YouTube
CALCULATED CONTENT by Charles H Martin, Phd - 2M ago

Why Deep Learning Works: Implicit Self-Regularization in DNNs, Michael W. Mahoney 20190225 - YouTube

My collaborator did a great job giving a talk on our research at the local San Francisco Bay ACM Meetup.

Michael W. Mahoney UC Berkeley

Random Matrix Theory (RMT) is applied to analyze the weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models and smaller models trained from scratch. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of self-regularization, implicitly sculpting a more regularized energy or penalty landscape. In particular, the empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, and applying them to these empirical results, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of implicit self-regularization. For smaller and/or older DNNs, this implicit self-regularization is like traditional Tikhonov regularization, in that there appears to be a “size scale” separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of heavy-tailed self-regularization, similar to the self-organization seen in the statistical physics of disordered systems. This implicit self-regularization can depend strongly on the many knobs of the training process. In particular, by exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that—all else being equal—DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena. Joint work with Charles Martin of Calculation Consulting, Inc.

Michael W. Mahoney is at UC Berkeley in the Department of Statistics and at the International Computer Science Institute (ICSI). He works on algorithmic and statistical aspects of modern large-scale data analysis. Much of his recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale University with a dissertation in computational statistical mechanics. He has worked and taught at Yale University in the Math department, Yahoo Research, and Stanford University in the Math department. Among other things, he is on the national advisory committee of the Statistical and Applied Mathematical Sciences Institute (SAMSI), and he was on the National Research Council’s Committee on the Analysis of Massive Data. He co-organized the Simons Institute’s fall 2013 program on the Theoretical Foundations of Big Data Analysis, and he runs the biennial MMDS Workshops on Algorithms for Modern Massive Data Sets. He is currently the lead PI for the NSF/TRIPODS-funded FODA (Foundations of Data Analysis) Institute at UC Berkeley. He holds several patents for work done at Yahoo Research and as Lead Data Scientist for Vieu Labs, Inc., a startup re-imagining consumer video for billions of users.

Long version of the paper (upon which the talk is based): https://arxiv.org/abs/1810.01075

http://www.meetup.com/SF-Bay-ACM/
http://www.sfbayacm.org/

CALCULATED CONTENT by Charles H Martin, Phd - 6M ago

My talk at ICSI, the International Computer Science Institute at UC Berkeley. ICSI is a leading independent, nonprofit center for research in computer science.

Why Deep Learning Works: Self Regularization in Neural Networks

Presented Thursday, December 13, 2018

Why Deep Learning Works: ICSI UC Berkeley 2018 - YouTube

The slides are available on my slideshare.

The supporting tool, WeightWatcher, can be installed using:

pip install weightwatcher

https://github.com/CalculatedContent/WeightWatcher

CALCULATED CONTENT by Charles H Martin, Phd - 7M ago

This is a followup to a previous post:

DON’T PEEK: DEEP LEARNING WITHOUT LOOKING … AT TEST DATA

The idea: suppose we want to compare 2 or more deep neural networks (DNNs). Maybe we are

• fine tuning a DNN for transfer learning, or
• comparing a new architecture to an old one, or
• just tuning our hyper-parameters.

Can we determine which DNN will generalize best–without peeking at the test data?

Theory actually suggests–yes we can!

An Unsupervised Test Metric for DNNs

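We just need to measure the average log norm of the layer weight matrices

$\langle \log \|W\| \rangle = \frac{1}{N_L} \sum_{l=1}^{N_L} \log \|W_l\|_F$

where $\|W_l\|_F$ is the Frobenius norm of the weight matrix of layer $l$, and $N_L$ is the number of layers.

The Frobenius norm is just the square root of the sum of the squares of the matrix elements. For example, it is easily computed in numpy as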

np.linalg.norm(W,ord='fro')

where ‘fro’ is the default norm.
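As a minimal sketch of the metric described above, the snippet below averages the log Frobenius norm over a list of layer matrices. The helper name `average_log_norm`, the random stand-in matrices, and the choice of a base-10 log are assumptions made for illustration; the post does not say which log base is used.

```python
import numpy as np

def average_log_norm(weight_matrices):
    """Mean of the log10 Frobenius norms of a list of 2-D weight matrices."""
    log_norms = [np.log10(np.linalg.norm(W, ord="fro")) for W in weight_matrices]
    return float(np.mean(log_norms))

# Hypothetical stand-ins for real layer weight matrices.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
avg_log_norm = average_log_norm(layers)

# The Frobenius norm is the square root of the sum of squared elements.
W = layers[0]
manual = np.sqrt(np.sum(W ** 2))
```

With real models, `layers` would instead hold the weight matrices pulled from the network, one per layer.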

It turns out that the average log norm is amazingly correlated with the test accuracy of a DNN. How do we know? We can plot it against the reported test accuracy for the pretrained DNNs available in PyTorch. First, we look at the VGG models:

VGG and VGG_BN models

The plot shows the 4 VGG and VGG_BN models. Notice we do not need the ImageNet data to compute this; we simply compute the average log norm and plot it against the (reported Top 5) Test Accuracy. For example, the orange dots show results for the pre-trained VGG13 and VGG13_BN ImageNet models. For each pair of models, the larger the Test Accuracy, the smaller the average log norm. Moreover, the correlation is nearly linear across the entire class of VGG models. We see similar behavior for …

the ResNet models

Across 4/5 pretrained ResNet models, with very different sizes, a smaller average log norm generally implies a better Test Accuracy.

It is not perfect–ResNet 50 is an outlier–but it works amazingly well across numerous pretrained models, both in pyTorch and elsewhere (such as the OSMR sandbox).  See the Appendix for more plots.  What is more, notice that

the log Norm metric is completely Unsupervised

Recall that we have not peeked at the test data or the labels. We simply computed the average log norm for the pretrained models directly from their weight files, and then compared this to the reported test accuracy.

Imagine being able to fine tune a neural network without needing test data.  Many times we barely have enough training data for fine tuning, and there is a huge risk of over-training. Every time you peek at the test data, you risk leaking information into the model, causing it to overtrain. It is my hope this simple but powerful idea will help avoid this and advance the field forward.

Why does this work ?  Applying VC Theory of Product Norms

A recent paper by Google X and MIT shows that there is A Surprising Linear Relationship [that] Predicts Test Performance in Deep Networks. The idea is to compute a VC-like, data dependent complexity metric $\mathcal{C}$ based on the Product Norm of the weight matrices:

$\mathcal{C} \sim \|W_1\| \times \|W_2\| \times \cdots \times \|W_L\|$

Usually we just take $\|W\|$ as the Frobenius norm (but any p-norm may do).

If we take the log of both sides, we get the sum

$\log \mathcal{C} \sim \log \|W_1\| + \log \|W_2\| + \cdots + \log \|W_L\|$

So here we just form the average log Frobenius Norm as a measure of DNN complexity, as suggested by current ML theory.
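The step from a product of norms to a sum of log norms can be checked numerically. This is only an illustrative sketch: the random matrices are hypothetical stand-ins for layer weights, and the base-10 log is an assumption (the identity holds in any base).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for the layer weight matrices W_1 ... W_L.
layers = [rng.normal(size=(20, 10)) for _ in range(4)]

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
norms = np.array([np.linalg.norm(W) for W in layers])

log_product = np.log10(np.prod(norms))   # log of the product norm
sum_of_logs = np.sum(np.log10(norms))    # sum of the per-layer log norms
average_log_norm = sum_of_logs / len(layers)
```

Dividing the sum by the number of layers gives the average log norm, which is convenient when comparing models of different depth.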

And it seems to work remarkably well in practice.

Log Norms and Power Laws

We can also understand this through our Theory of Heavy Tailed Implicit Self-Regularization in Deep Neural Networks.

The theory shows that each layer weight matrix of a well trained DNN resembles a random heavy tailed matrix, and we can associate with it a power law exponent $\alpha$.

The exponent $\alpha$ characterizes how well the layer weight matrix represents the correlations in the training data. Smaller is better.

Smaller exponents correspond to more implicit regularization, and, presumably, better generalization (if the DNN is not overtrained). This suggests that the average power law exponent would make a good overall unsupervised complexity metric for a DNN, and this is exactly what the last blog post showed.

The average power law metric is a weighted average,

$\hat{\alpha} = \frac{1}{N_L} \sum_{l} b_l \alpha_l$

where the layer weight factor $b_l$ should depend on the scale of $W_l$. In other words, ‘larger’ weight matrices (in some sense) should contribute more to the weighted average.

Smaller $\hat{\alpha}$ usually implies better generalization

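For heavy tailed matrices, we can work out a relation between the log norm of $W$ and the power law exponent $\alpha$:

$\log \|W\|_F^2 \approx \alpha \log \lambda_{max}$

where $\lambda_{max}$ is the maximum eigenvalue of the correlation matrix $X = W^T W$.

So the weight factor is simply the log of the maximum eigenvalue associated with $W$:

$b_l = \log \lambda_{max,l}$

In the paper we will show the math; below we present numerical results to convince the reader.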
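To make the weighted average concrete, here is a small sketch that weights each layer exponent by the log of that layer's maximum eigenvalue. The function name `weighted_alpha`, the normalization by the number of layers, and the base-10 log are assumptions for illustration, not details confirmed by the post.

```python
import numpy as np

def weighted_alpha(alphas, lambda_maxes):
    """Average of per-layer exponents, each weighted by log10(lambda_max)."""
    alphas = np.asarray(alphas, dtype=float)
    weights = np.log10(np.asarray(lambda_maxes, dtype=float))
    return float(np.mean(weights * alphas))

# Two hypothetical layers: exponents 2 and 4, maximum eigenvalues 10 and 100.
# The weights are log10(10) = 1 and log10(100) = 2, so the result is
# (1 * 2 + 2 * 4) / 2.
example = weighted_alpha([2.0, 4.0], [10.0, 100.0])
```

In practice the exponents would come from a power law fit of each layer's eigenvalue spectrum.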

This also explains why Spectral Norm Regularization Improv[e]s the Generalizability of Deep Learning. The smaller $\lambda_{max}$ gives a smaller power law contribution, and also a smaller log norm. We can now relate these 2 complexity metrics:

$\hat{\alpha} \approx \langle \log \|W\|_F^2 \rangle$

We argue here that we can approximate the average Power Law metric $\hat{\alpha}$ by simply computing the average log norm of the DNN layer weight matrices. And using this, we can actually predict the trends in generalization accuracy without needing a test data set!

Discussion Implications: Norms vs Power Laws

The Power Law metric is consistent with the recent theoretical results, but our approach and intent are different:

• Unlike their result, our approach does not require modifying the loss function.
• Moreover, they seek a Worst Case complexity bound. We seek Average Case metrics. Incredibly, the 2 approaches are completely compatible.

But the biggest difference is that we apply our Unsupervised metric to large, production quality DNNs.  Here are …

We believe this result will have wide application in hyper-parameter tuning and fine tuning of DNNs. Because we do not need to peek at the test data, it may prevent information from leaking from the test set into the model, thereby helping to prevent overtraining and making fine-tuned DNNs more robust.

WeightWatcher

We have built a python package for Jupyter Notebooks that does this for you–the weight watcher.  It works on Keras and PyTorch.  We will release it shortly.

More Results

We use the OSMR Sandbox to compute the average log Norm for a wide variety of DNN models, using pyTorch, and compare to the reported Top 1 Errors.  This notebook reproduces the results.

All the ResNet Models

DenseNet

SqueezeNet

DPN

CALCULATED CONTENT by Charles H Martin, Phd - 7M ago

Machine Learning and AI for the Lean Start Up

My recent talk at the French Tech Hub Startup Accelerator

AI and Machine Learning for the Lean Startup - YouTube
CALCULATED CONTENT by Charles H Martin, Phd - 9M ago

What is the purpose of a theory ?  To explain why something works. Sure.  But to also make predictions–testable predictions.

Recently we introduced the theory of Implicit Self-Regularization in Deep Neural Networks. Most notably, we observe that in all pretrained models, the layer weight matrices display near Universal power law behavior. That is, we can compute their eigenvalues, and fit the empirical spectral density (ESD) to a power law form:

For a given $N \times M$ weight matrix $W$, we form the correlation matrix

$X = \frac{1}{N} W^T W$

and then compute the M eigenvalues $\lambda_i$ of $X$.

We call the histogram of eigenvalues the Empirical Spectral Density (ESD), $\rho(\lambda)$. It can nearly always be fit to a power law

$\rho(\lambda) \sim \lambda^{-\alpha}$

We call the Power Law Universal because 80-90% of the exponents lie in the range

$2 \le \alpha \le 4$

For fully connected layers, we just take $W$ as is. For Conv2D layers, we consider all of the 2D feature maps contained in the 4D weight tensor. For any large, modern, pretrained DNN, this can give a large number of eigenvalues. The results on Conv2D layers have not yet been published except on my blog on Power Laws in Deep Learning, but the results are very easy to reproduce with this notebook.
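For readers who want to try a power law fit themselves, the sketch below uses the standard continuous maximum-likelihood estimator for the exponent of $p(x) \sim x^{-\alpha}$ above a cutoff. This is a swapped-in textbook estimator, not necessarily the fitting procedure used in the notebook; the function name `fit_power_law_alpha`, the cutoff `x_min`, and the synthetic samples are all assumptions for illustration.

```python
import numpy as np

def fit_power_law_alpha(values, x_min):
    """Continuous MLE of alpha for p(x) ~ x^(-alpha) on x >= x_min."""
    tail = np.asarray(values, dtype=float)
    tail = tail[tail >= x_min]
    return 1.0 + len(tail) / np.sum(np.log(tail / x_min))

# Synthetic check: draw samples with a known exponent by inverse-transform
# sampling, then see whether the estimator recovers it.
rng = np.random.default_rng(0)
true_alpha, x_min = 3.0, 1.0
u = rng.random(50_000)                     # uniform on [0, 1)
samples = x_min * (1.0 - u) ** (-1.0 / (true_alpha - 1.0))
alpha_hat = fit_power_law_alpha(samples, x_min)
```

On real eigenvalues the choice of `x_min` matters a great deal, since only the tail of the ESD is expected to follow a power law.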

As with the FC layers, we find that nearly all the ESDs can be fit to a power law, and 80-90% of the exponents lie between 2 and 4. Compared to the FC layers, though, the Conv2D layers do show more exponents outside this range. We will discuss the details and these results in a future paper. And while Universality is very theoretically interesting, a more practical question is

Are power law exponents correlated with better generalization accuracies ?  … YES they are!

We can see this by looking at 2 or more versions of several pretrained models, available in pytorch, including

• The VGG models, with and without BatchNormalization, such as VGG11 vs VGG11_BN
• Inception V3 vs V4
• SqueezeNet V1.0 vs V1.1
• The ResNext101 models
• The sequence of Resnet models, including Resnet18, 34, 50, 101, & 152, as well as
• 2 other  ResNet implementations, CaffeResnet101 and FbResnet152

To compare these model versions, we can simply compute the average power law exponent $\langle\alpha\rangle$, averaged across all FC weight matrices and Conv2D feature maps. (Note I only consider matrices with . ) In nearly every case, smaller $\langle\alpha\rangle$ is correlated with better test accuracy (i.e. generalization performance).

The only significant caveats are:

1. for the VGG16 and VGG19 models, we do not include the last FC layer in the average–the layer that connects the model to the labels.   In both models, this last layer has a higher power law exponent that throws off the average for the model.
2. the InceptionResnetV2 is an outlier.  It is unclear why at this time. It is not shown here but will be discussed when these results are published.

Let's first look at the VGG models, plus a couple others, not including the final FC layer in the average (again, this only changes the results for VGG16 and VGG19).

In all cases, the pre-trained model with the better Test Accuracy has, on average, smaller power law exponents, i.e. a smaller $\langle\alpha\rangle$. This is an easy comparison because we are looking at 2 versions of the same architectures, with only slight improvements. For example, VGG11_BN only differs from VGG11 because it has Batch Normalization.

The Inception models show similar behavior: InceptionV3 has smaller Test Accuracy than InceptionV4, and, likewise, the InceptionV3 $\langle\alpha\rangle$ is larger than that of InceptionV4.

Now consider the Resnet models, which are increasing in size and have more architectural differences between them:

Across all these Resnet models, the better Test Accuracies are strongly correlated with smaller average exponents. The correlation is not perfect; the smaller Resnet50 is an outlier, and Resnet152 has a slightly larger $\langle\alpha\rangle$ than FbResnet152, but they are close. Overall, I would argue the theory works pretty well, and better Test Accuracies are correlated with smaller $\langle\alpha\rangle$ across a wide range of architectures.

These results are easily reproduced with this notebook.

This is an amazing result!

Suppose you are training a DNN and trying to optimize the hyper-parameters. I believe by looking at the power law exponents of the layer weight matrices, you can predict which variation will perform better, without peeking at the test data.

I hope it is useful to you in training your own Deep Neural Networks. And I hope to get feedback from you to see how useful this is in practice.

CALCULATED CONTENT by Charles H Martin, Phd - 9M ago

We can learn a lot about Why Deep Learning Works by studying the properties of the layer weight matrices of pre-trained neural networks. And, hopefully, by doing this, we can get some insight into what a well trained DNN looks like, even without peeking at the training data.

One broad question we can ask is:

How is information concentrated in Deep Neural Networks (DNNs)?

To get a handle on this, we can run ‘experiments’ on the pre-trained DNNs available in pyTorch.

In a previous post, we formed the Singular Value Decomposition (SVD) of the weight matrices of the linear, or fully connected (FC) layers. And we saw that nearly all the FC Layers display Power Law behavior. And, in fact, this behavior is Universal across both ImageNet and NLP models.

But this is only part of the story. Here, we ask a related question: do the weight matrices of well trained DNNs lose Rank?

Matrix Rank

Let's say $W$ is an $N \times M$ matrix.
We can form the Singular Value Decomposition (SVD):

$W = U \Sigma V^T$

The Matrix Rank, or Hard Rank, is simply the number of non-zero singular values $\sigma_i$; any zero singular values express a decrease from the Full Rank M.

Notice that the Hard Rank of the rectangular matrix $W$ is at most the dimension of the square correlation matrix $X = W^T W$.

In python, this can be computed using

rank = numpy.linalg.matrix_rank(W)

Of course, being a numerical method, we really mean the number of singular values above some tolerance, and we can get different results depending on whether we use

• the default python tolerance
• the numerical recipes tolerance, which is tighter

See the numpy documentation on matrix_rank for details. Here, we will compute the rank ourselves, using an extremely loose bound on which singular values count as zero. As we shall see, DNNs are so good at concentrating information that it will not matter.

Rank Collapse and Regularization

If all the singular values are non-zero, we say $W$ is Full Rank. If one or more $\sigma_i \approx 0$, then we say $W$ is Singular. It has lost expressiveness, and the model has undergone Rank Collapse.

When a model undergoes Rank Collapse, it traditionally needs to be regularized. Say we are solving a simple linear system of equations / linear regression

$W x = y$

The simple solution is to use a little linear algebra to get the optimal values for the unknown $x$

$x = (W^T W)^{-1} W^T y$

But when $W$ is Singular, we can not form the matrix inverse. To fix this, we simply add some small constant $\gamma$ to the diagonal of $W^T W$

$W^T W \rightarrow W^T W + \gamma I$

so that all the eigenvalues of the shifted matrix are greater than zero, and we can form a generalized pseudo-inverse, which approaches the Moore-Penrose Inverse as $\gamma \rightarrow 0$

$W^{+} = (W^T W + \gamma I)^{-1} W^T$

This procedure is also called Tikhonov Regularization. The constant $\gamma$, or Regularizer, sets the Noise Scale for the model. The information in $W$ is concentrated in the singular vectors associated with the larger singular values, and the noise is left over in those associated with the smaller singular values:

• Information: vectors whose singular values lie above the noise scale
• Noise: vectors whose singular values lie below the noise scale

In cases where $W$ is Singular, regularization is absolutely necessary.
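The regularization recipe above can be sketched in a few lines of numpy. The rank-deficient matrix `W`, the target vector `y`, and the regularizer value `gamma` are hypothetical choices made only to show the mechanics: the duplicated column makes the Gram matrix singular, and adding a small constant to its diagonal makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a rank-deficient matrix: the third column duplicates the first.
A = rng.normal(size=(50, 2))
W = np.column_stack([A, A[:, 0]])
y = rng.normal(size=50)

rank = np.linalg.matrix_rank(W)      # one singular value is numerically zero
gram = W.T @ W                       # singular, so a plain inverse is unusable

gamma = 1e-3                         # hypothetical regularizer value
shifted = gram + gamma * np.eye(W.shape[1])
x_reg = np.linalg.solve(shifted, W.T @ y)

# After the shift, the smallest eigenvalue sits at roughly gamma, not zero.
min_eig = np.linalg.eigvalsh(shifted).min()
```

Larger values of `gamma` push more of the small singular directions into the "noise" side of the split.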
But even when it is not singular, Regularization can be useful in traditional machine learning. (Indeed, VC theory tells us that Regularization is a first class concept.)

But we know that Understanding deep learning requires rethinking generalization. Which leads to the question:

Do the weight matrices of well trained DNNs undergo Rank Collapse?

Answer: They DO NOT, as we now see:

Analyzing Pre-Trained pyTorch Models

We can easily examine the numerous pre-trained models available in PyTorch. We simply need to get the layer weight matrices and compute the SVD. We then compute the minimum singular value and compute a histogram of the minimums across different models.

for im, m in enumerate(model.modules()):
    if isinstance(m, torch.nn.Linear):
        W = np.array(m.weight.data.clone().cpu())
        M, N = np.min(W.shape), np.max(W.shape)
        _, svals, _ = np.linalg.svd(W)
        minsval = np.min(svals)
        ...

We do this here for numerous models trained on ImageNet and available in pyTorch, such as AlexNet, VGG16, VGG19, ResNet, DenseNet201, etc., as shown in this Jupyter Notebook.

We also examine the NLP models available in AllenNLP. This is a little bit trickier; we have to install AllenNLP from source, then create an analyze.py command class, and rebuild AllenNLP. Then, to analyze, say, the AllenNLP pre-trained NER model, we run

allennlp analyze https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.04.26.tar.gz

This prints out the ranks (and other information, like power law fits), and then plots the results. The code for all this is here.

Notice that many of the AllenNLP models include Attention matrices, which can be quite large and very rectangular, as compared to the smaller (and less rectangular) weight matrices used in the ImageNet models.

Note: We restrict our analysis to rectangular layer weight matrices with an aspect ratio $Q = N/M > 1$, and really larger than 1.1. This is because the Marchenko Pastur (MP) Random Matrix Theory (RMT) tells us that the minimum eigenvalue is bounded away from zero only when $Q > 1$.
We will review this in a future blog.

Minimum Singular Values of Pre-Trained Models

For the ImageNet models, most fully connected (FC) weight matrices have a large minimum singular value. Only 6 of the 24 matrices looked at have a minimum singular value near zero, and we have not carefully tested the numerical threshold; we are just eyeballing it here.

For the AllenNLP models, none of the FC matrices show any evidence of Rank Collapse. All of the singular values for every linear weight matrix are non-zero.

It is conjectured that fully optimized DNNs, those with the best generalization accuracy, will not show Rank Collapse in any of their linear weight matrices.

If you are training your own model and you see Rank Collapse, you are probably over-regularizing.

Inducing Rank Collapse is easy: just over-regularize

It is, in fact, very easy to induce Rank Collapse. We can do this in a Mini version of AlexNet, coded in Keras 2, and available here.

A Mini version of AlexNet, trained on CIFAR10, used to explore regularization and rank collapse in DNNs.

To induce rank collapse in our FC weight matrices, we can add large weight norm constraints to the linear layers, using kernel_regularizer=l2(1e-3)

...
model.add(Dense(384, kernel_initializer='glorot_normal',
                bias_initializer=Constant(0.1), activation='relu',
                kernel_regularizer=l2(1e-3)))
model.add(Dense(192, kernel_initializer='glorot_normal',
                bias_initializer=Constant(0.1), activation='relu',
                kernel_regularizer=l2(1e-3)))
...

We train this smaller MiniAlexnet model on CIFAR10 for 20 epochs, save the final weight matrix, and plot a histogram of the eigenvalues of the weight correlation matrix $X = W^T W$.

Rank Collapse Induced in a Mini AlexNet model, caused by adding weight norm constraints of 0.001

Recall that the eigenvalues are simply the squares of the singular values. Here, most of them are nearly 0.

Adding too much regularization causes nearly all of the eigenvalues/singular values to collapse to zero.
Well trained Deep Neural Networks do not display Rank Collapse

Implications

We believe this is a unique property of DNNs, and related to how Regularization works in these models. We will discuss this and more in an upcoming paper

Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning

by Charles H. Martin (Calculation Consulting) and Michael W. Mahoney (UC Berkeley).

Please stay tuned and subscribe to this blog for more updates.

CALCULATED CONTENT by Charles H Martin, Phd - 9M ago

Power Law Distributions in Deep Learning

In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix $W$, we compute the eigenvalues $\lambda_i$ of the correlation matrix

$X = \frac{1}{N} W^T W$

For every FC matrix, the eigenvalue frequencies, or Empirical Spectral Density (ESD), can be fit to a power law

$\rho(\lambda) \sim \lambda^{-\alpha}$

where the exponents $\alpha$ all lie in the range 2 to 4.

Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices.

Heavy Tailed Random Matrices

We define a random matrix by defining a matrix $W$ of size $N \times M$, and drawing the matrix elements $W_{ij}$ from a random distribution. We can choose a

• Gaussian Random Matrix: $W_{ij} \sim N(0, \sigma^2)$, where $N(0, \sigma^2)$ is a Gaussian distribution

or a

• Heavy Tailed Random Matrix: $W_{ij} \sim Pr(x)$, where $Pr(x)$ is a power law distribution

In either case, Random Matrix Theory tells us what the asymptotic form of the ESD should look like. But first, let's see what model works best.

AlexNet FC3

First, let's look at the ESD for AlexNet for layer FC3, and zoomed in:

Recall that AlexNet FC3 fits a power law with exponent $\alpha$, so we also plot the ESD on a log-log scale

AlexNet Layer FC3 Log Log Histogram of ESD

Notice that the distribution is linear in the central region, and the long tail cuts off sharply.  This is typical of the ESDs for the fully connected (FC) layers of the all the pretrained models we have looked at so far.  We now ask…

What kind of Random Matrix would make a good model for this ESD ?

ESDs: Gaussian random matrices

We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q,  and plot the histogram of their eigenvalues.

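import numpy as np
import matplotlib.pyplot as plt

N, M = 1000, 500
Q = N / M
# W is N x M, so the correlation matrix X = W^T W / N is M x M
W = np.random.normal(0, 1, size=(N, M))
X = (1/N)*np.dot(W.T, W)
# X is symmetric, so eigvalsh returns real eigenvalues
evals = np.linalg.eigvalsh(X)
plt.hist(evals, bins=100, density=True)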

Empirical Spectral Density (ESD) for Gaussian Random Matrices, with different Q values.

Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps, misleadingly for Q=1)
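The "tightly bounded" claim can be checked against the Marchenko Pastur bulk edges, which for unit-variance entries are $\lambda_{\pm} = (1 \pm 1/\sqrt{Q})^2$. The sketch below is an illustration under that assumption, with a fixed random seed chosen only for reproducibility.

```python
import numpy as np

N, M = 1000, 500
Q = N / M
rng = np.random.default_rng(0)

W = rng.normal(0.0, 1.0, size=(N, M))
X = (W.T @ W) / N                    # M x M correlation matrix
evals = np.linalg.eigvalsh(X)        # real eigenvalues, ascending order

# Marchenko Pastur bulk edges for unit-variance entries.
lambda_minus = (1.0 - 1.0 / np.sqrt(Q)) ** 2
lambda_plus = (1.0 + 1.0 / np.sqrt(Q)) ** 2
```

For a finite matrix the smallest and largest eigenvalues land close to these edges, which is why the Gaussian ESD shows effectively no tail.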

ESDs: Power Laws and Log Log Histograms

We can generate a heavy, or fat-tailed, random matrix as easily using the numpy Pareto function

W=np.random.pareto(mu,size=(N,M))

Heavy Tailed Random matrices have very different ESDs. They have very long tails; so long, in fact, that it is better to plot them on a log log Histogram
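A log log histogram is easiest to build with logarithmically spaced bins. The following sketch assumes a hypothetical tail exponent `mu = 3.0` and uses numpy's Pareto sampler, as in the line above. Note that Pareto draws are all positive, so the nonzero mean of the entries contributes one very large outlier eigenvalue on top of the heavy tail.

```python
import numpy as np

N, M = 1000, 500
mu = 3.0                                   # hypothetical tail exponent
rng = np.random.default_rng(0)

W_heavy = rng.pareto(mu, size=(N, M))
X = (W_heavy.T @ W_heavy) / N
evals = np.linalg.eigvalsh(X)
evals = evals[evals > 0]                   # guard before taking logs

# Log-spaced bins keep the long tail visible.
bins = np.logspace(np.log10(evals.min()), np.log10(evals.max()), 50)
density, edges = np.histogram(evals, bins=bins, density=True)

# How many orders of magnitude the spectrum spans.
decades = np.log10(evals.max() / evals.min())
```

Plotting `density` against the bin centers on log-log axes gives the kind of figure discussed here; a Gaussian matrix of the same shape spans far fewer decades.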

Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet ?

• the smallest exponent (blue) has a very long tail, extending over 11 orders of magnitude. This means the largest eigenvalues would be enormous. No real W would behave like this.
• the largest exponent (red) has a very compact ESD, resembling more the Gaussian Ws above.
• the fat tailed ESD in between (green), however, is just about right. The ESD is linear in the central region, suggesting a power law. It is a little too large for our eigenvalues, but the tail also cuts off sharply, which is expected for any finite W. So we are close.

AlexNet FC3

Let's overlay the ESD of a fat-tailed W with the actual empirical ESD from AlexNet for layer FC3

We see a pretty good match to a Fat-tailed random matrix with .

Turns out, there is something very special about being in the range 2-4.

Universality Classes:

Random Matrix Theory predicts the shape of the ESD, in the asymptotic limit, for several kinds of Random Matrix, called Universality Classes. The 3 different values of $\mu$ each represent a different Universality Class:

In particular, if we draw $W$ from any heavy tailed / power law distribution, the empirical (i.e. finite size) eigenvalue density $\rho(\lambda)$ is likewise a power law (PL), either globally, or at least locally.

What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of $\mu$. And the amazing thing is that

the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the Fat-Tailed Universality Classes of Random Matrix Theory

But this is a little tricky to show, because we need to relate the exponent $\alpha$ that we fit to the theoretical $\mu$. We now look at the

Relations between $\alpha$ and $\mu$

RMT tells us that, for $\mu < 2$, the ESD takes the limiting form

$\rho(\lambda) \sim \lambda^{-\alpha}$, where $\alpha = \frac{1}{2}\mu + 1$

And this works pretty well in practice for the Heavy Tailed Universality Class, for $\mu < 2$. But for any finite matrix, as soon as $\mu > 2$, the finite size effects kick in, and we can not naively apply the infinite limit result.

Statistics of the maximum eigenvalue(s)

RMT not only tells us about the shape of the ESD; it makes statements about the statistics of the edge and/or tails, i.e. the fluctuations in the maximum eigenvalue $\lambda_{max}$. Specifically, we have

• Gaussian RMT: $\lambda_{max}$ follows Tracy Widom statistics
• Fat Tailed RMT: $\lambda_{max}$ follows Frechet statistics

For standard, Gaussian RMT, the $\lambda_{max}$ (near the bulk edge) is governed by the famous Tracy Widom distribution. And for $\mu > 4$, RMT is governed by the Tao Four Moment Theorem.

But for $\mu < 4$, the tail fluctuations follow Frechet statistics, and the maximum eigenvalue has Power Law finite size effects

In particular, the effects of M and Q kick in as soon as $\mu > 2$. If we underestimate $\lambda_{max}$ (small Q, large M), the power law will look weaker, and we will overestimate $\alpha$ in our fits.

And, for us, this affects how we estimate $\mu$ from $\alpha$ and assign the Universality Class

Fat Tailed Matrices and the Finite Size Effects for

Here, we generate ESDs for 3 different Pareto Heavy tailed random matrices, with fixed M (left) or N (right), but different Q.  We fit each..

CALCULATED CONTENT by Charles H Martin, Phd - 10M ago

Why Does Deep Learning Work? If we could get a better handle on this, we could solve some very hard problems in Deep Learning. A new class at Stanford is trying to address this. Here, I review some very new results from my own research (w/UC Berkeley) on the topic, which have direct practical application. I will show that

In pretrained, production quality DNNs,  the weight matrices for the Fully Connected (FC ) layers display Fat Tailed Power Law behavior.

Setup: the Energy Landscape function

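Deep Neural Networks (DNNs) minimize the Energy function defined by their architecture. We define the layer weight matrices $W_l$, biases $b_l$, and activation functions $h_l$, giving

$E_{DNN} = h_L(W_L \times h_{L-1}(W_{L-1} \times h_{L-2}(\cdots) + b_{L-1}) + b_L)$

We train the DNN on a labeled data set (d,y), giving the optimization problem

$\min_{W_l, b_l} \mathcal{L}\left(\sum_i E_{DNN}(d_i) - y_i\right)$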

We call this the Energy Landscape because the DNN optimization problem is only parameterized by the weights and biases.  Of course, in any real DNN problem, we do have other adjustable parameters,  such as the amount of Dropout, the learning rate, the batch size, etc.   But these regularization effects simply change the global

The Energy Landscape function changes on each epoch, and we do care about how. In fact, I have argued that $E_{DNN}$ must form an Energy Funnel:

Energy Funnel

But here, for now, we only look at the final result.  Once a DNN is trained, what is left are the weights (and biases).   We can reuse the weights (and biases) of a pre-trained DNN to build new DNNs with transfer learning. And if we train a bunch of DNNs, we want to know which one is better ?

But, practically, we would really like to identify a very good DNN without peeking at the test data, since every time we peek, and retrain, we risk overtraining our DNN.

I now show we can at least start to do this by looking at the weight matrices themselves.  So let us look at the weights of some pre-trained DNNs.

Pretrained Linear Models in Pytorch

Pytorch comes with several pretrained models, such as AlexNet. To start, we just examine the weight matrices of the Linear / Fully Connected (FC) layers.

import torch.nn as nn
from torchvision import models

pretrained_model = models.alexnet(pretrained=True)
for module in pretrained_model.modules():
    if isinstance(module, nn.Linear):
        ...


The Linear layers have the simplest weight matrices $W$; they are 2-dimensional tensors, or just rectangular matrices.

Let $W$ be an $N \times M$ matrix, where $N \ge M$. We can get the matrix from the pretrained model using:

W = np.array(module.weight.data.clone().cpu())
M, N = np.min(W.shape), np.max(W.shape)

Empirical Spectral Density

How is information concentrated in $W$? For any rectangular matrix, we can form the

Singular Value Decomposition (SVD)

$W = U \Sigma V^T$

which is readily computed in scikit learn. We will use the faster TruncatedSVD method, and compute the singular values $\sigma_i$:

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=M-1, n_iter=7, random_state=10)
svd.fit(W)
svals = svd.singular_values_


(Technically, we do miss the smallest singular value doing this, but that’s ok. It won’t matter here, and we can always use the full SVD method to be exact.)
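
For reference, here is a minimal sketch of the exact alternative, using NumPy's full SVD. The random matrix is only a stand-in for a layer weight matrix (my assumption for illustration, not the actual AlexNet weights), and the name svals_exact is hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a layer weight matrix (not real AlexNet weights).
rng = np.random.default_rng(0)
W = rng.normal(size=(300, 100))
M, N = min(W.shape), max(W.shape)

# The full SVD returns all M singular values, sorted in descending order,
# so nothing is dropped (unlike TruncatedSVD with n_components=M-1).
svals_exact = np.linalg.svd(W, compute_uv=False)
print(len(svals_exact), M)
```

For matrices the size of a fully connected layer this is still cheap; the truncated method mainly pays off for much larger matrices.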

Eigenvalues of the Correlation Matrix

We can, alternatively, form the eigenvalues of the correlation matrix

$$X = \frac{1}{N} W^{T} W$$

The eigenvalues $\lambda_i$ are just the squares of the singular values $\sigma_i$ of $W$.

Notice here we normalize them by N, so that $\lambda_i = \sigma_i^2 / N$.

evals = (1/N)*svals*svals
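
The relation between the eigenvalues and the singular values can be checked numerically. This is a small sketch on a random matrix (a stand-in, not a trained layer); the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(200, 50))   # hypothetical N x M matrix with N >= M
N, M = W.shape

# Route 1: squared singular values, normalized by N
svals = np.linalg.svd(W, compute_uv=False)
evals_from_svd = np.sort(svals**2 / N)

# Route 2: eigenvalues of the M x M correlation matrix X = W^T W / N
X = W.T @ W / N
evals_direct = np.sort(np.linalg.eigvalsh(X))

max_diff = np.max(np.abs(evals_from_svd - evals_direct))
print(max_diff)
```

The two routes agree up to floating point error, so we never need to form the correlation matrix explicitly.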


We now form the Empirical Spectral Density (ESD), which is, formally

$$\rho(\lambda) = \frac{1}{M} \sum_{i=1}^{M} \delta(\lambda - \lambda_i)$$

This notation just means compute a histogram of the eigenvalues:

import matplotlib.pyplot as plt
plt.hist(evals, bins=100, density=True)


We could also compute the spectral density using a Kernel Density Estimator (KDE); we save this for a future post.

We now look at the ESD of

Pretrained AlexNet

Here, we examine just FC3, the last Linear layer, connecting the model to the labels. The other linear layers, FC1 and FC2, look similar. Below is a histogram of the ESD. Notice it is very sharply peaked and has a long tail, extending out past 40.

We can get a better view of the heavy tailed behavior by zooming in.

The red curve is a fit of the ESD to the Marchenko Pastur (MP) Random Matrix Theory (RMT) result; it is not a very good fit. This means the ESD does not resemble that of a Gaussian random matrix. Instead, it looks heavy tailed. Which leads to the question…
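
To see what a good MP fit would look like, here is a sketch that compares the ESD of a purely Gaussian random matrix to the standard Marchenko Pastur density. I am assuming the usual form with aspect ratio Q = N/M >= 1 and element variance sigma^2; the function name marchenko_pastur_pdf and the matrix sizes are my own choices for illustration.

```python
import numpy as np

def marchenko_pastur_pdf(lam, Q, sigma=1.0):
    """MP density for X = W^T W / N, with W an N x M Gaussian matrix, Q = N/M >= 1."""
    lam_minus = sigma**2 * (1 - 1 / np.sqrt(Q))**2
    lam_plus = sigma**2 * (1 + 1 / np.sqrt(Q))**2
    lam = np.asarray(lam, dtype=float)
    pdf = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    pdf[inside] = (Q / (2 * np.pi * sigma**2)
                   * np.sqrt((lam_plus - lam[inside]) * (lam[inside] - lam_minus))
                   / lam[inside])
    return pdf, lam_minus, lam_plus

# Hypothetical Gaussian random matrix (no training, no correlations)
rng = np.random.default_rng(3)
N, M = 2000, 500
W = rng.normal(size=(N, M))
evals = np.linalg.svd(W, compute_uv=False)**2 / N

grid = np.linspace(0.0, 3.0, 3001)
pdf, lam_minus, lam_plus = marchenko_pastur_pdf(grid, Q=N / M)
mp_area = np.sum(pdf) * (grid[1] - grid[0])   # the density should integrate to ~1

print(lam_minus, lam_plus, evals.min(), evals.max(), mp_area)
```

For a random matrix, the eigenvalues stay inside the MP bulk edges, up to small finite size fluctuations. A trained layer with eigenvalues far beyond the upper edge, like the tail past 40 above, cannot be described this way.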

Do we have a Power Law?

(Yes, we do, as we shall see…)

Physicists love to claim they have discovered data that follows a power law:

$$\rho(\lambda) \sim \lambda^{-\alpha}$$

But this is harder to do than it seems.  And statisticians love to point this out.  Don’t be fooled–we physicists knew this;  Sornette’s book has a whole chapter on it.  Still, we have to use best practices.

Log Log Histograms

The first thing to do: plot the data on a log-log histogram, and check that this plot is linear–at least in some region.  Let’s look at our ESD for AlexNet FC3:

AlexNet FC3: Log log histogram of the Empirical Spectral Density (ESD). Notice the central region is mostly linear, suggesting a power law.

Yes, it is linear in the central region, for eigenvalue frequencies between roughly 1 and 100, and that is most of the distribution.

Why is it not linear everywhere? Because it is finite size; there are min and max cutoffs. In the infinite limit, a power law diverges at $x = 0$, and the tail extends indefinitely as $x \to \infty$. In any finite size data set, there will be an $x_{min}$ and $x_{max}$.
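
As a sanity check on the log-log approach, here is a sketch on synthetic data where the exponent is known. It assumes a pure power law with exponent 2.5 and a lower cutoff of 1 (my choices, unrelated to the AlexNet ESD), uses logarithmically spaced bins, and reads the exponent off the slope of a straight line fit.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha_true = 2.5
x_min = 1.0

# Inverse-transform sampling from p(x) ~ x^(-alpha) for x >= x_min
u = rng.random(100_000)
x = x_min * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

# Logarithmically spaced bins give evenly spaced points on a log-log plot
edges = np.logspace(0, 2, 21)                 # 20 bins covering 1 to 100
density, _ = np.histogram(x, bins=edges, density=True)
centers = np.sqrt(edges[:-1] * edges[1:])     # geometric bin centers

# Straight line fit in log-log space; the slope estimates -alpha
mask = density > 0
slope, intercept = np.polyfit(np.log10(centers[mask]), np.log10(density[mask]), 1)
print(slope)
```

The slope comes out close to -alpha_true, but the estimate depends on the binning and on the range we keep, which is why a proper estimator is still needed.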

Power Law Fits

Second, fit the data to a power law, with $x_{min}$ and $x_{max}$ in mind. The most widely available and accepted method is the Maximum Likelihood Estimator (MLE), developed by Clauset et al., and available in the Python powerlaw package.

import powerlaw
fit = powerlaw.Fit(evals, xmax=np.max(evals))
alpha, D = fit.alpha, fit.D


The D value is a quality metric, the Kolmogorov-Smirnov (KS) distance between the data and the fitted distribution. There are other options as well. The smaller D, the better. The table below shows typical values of good fits.
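
To make alpha and D concrete, here is a sketch of the continuous power law MLE for a fixed lower cutoff, together with the KS distance. This is not the powerlaw package itself; the function fit_power_law_mle is hypothetical, it assumes x_min is known, and it ignores any upper cutoff.

```python
import numpy as np

def fit_power_law_mle(x, x_min):
    """Continuous power-law MLE for a fixed x_min, plus the KS distance."""
    tail = np.sort(x[x >= x_min])
    n = len(tail)
    # Closed-form MLE: alpha = 1 + n / sum(log(x_i / x_min))
    alpha = 1.0 + n / np.sum(np.log(tail / x_min))
    # CDF of the fitted power law at each data point
    model_cdf = 1.0 - (tail / x_min) ** (1.0 - alpha)
    # Empirical CDF just after and just before each data point
    ecdf_hi = np.arange(1, n + 1) / n
    ecdf_lo = np.arange(0, n) / n
    D = max(np.max(np.abs(ecdf_hi - model_cdf)),
            np.max(np.abs(model_cdf - ecdf_lo)))
    return alpha, D

# Synthetic power law data with a known exponent and x_min = 1
rng = np.random.default_rng(5)
alpha_true = 2.5
x = (1.0 - rng.random(5000)) ** (-1.0 / (alpha_true - 1.0))

alpha_hat, D = fit_power_law_mle(x, x_min=1.0)
print(alpha_hat, D)
```

On clean synthetic data the estimate lands near the true exponent and D is small. On real ESDs the choice of the cutoffs matters a lot.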

Linear Log Log Plots

The powerlaw package also makes some great plots. Below is a log log plot generated for our fit of FC3, for the central region of the ESD. The solid lines represent our fits, and the dotted lines are the actual power law PDF (blue) and CCDF (red). The solid lines look like straight lines and overlap the dotted lines, so this fit looks pretty good.

AlexNet FC3: Log Log plot of the central region of the Empirical Spectral Density and best Powerlaw fits: pdf (blue) and ccdf (red)

Is this enough? Not yet…

Quality of the Estimator

We still need to know: do we have enough data to get a good estimate for $\alpha$, what are our error bars, and what kind of systematic errors might we get?

1. This so-called statistically valid MLE estimator actually only works properly for $1.5 < \alpha < 3.5$.
2. And we need a lot of data points…more than we have singular values.

We can calibrate the estimator by generating some modest size (N=1000) random power law datasets using the numpy Pareto function, where a shape parameter $a$ gives a density with tail exponent

$$\alpha = a + 1$$

and then fitting these with the powerlaw package. We get the following curve

Powerlaw fits of exponent compared to data generated from a Pareto distribution, N=1000. Notice that the estimator works best for exponents between 1.5 and 3.5

The green line is a perfect estimate. The powerlaw package overestimates small $\alpha$ and underestimates large $\alpha$. Fortunately, most of our fits lie in the good range.
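
A calibration run of this kind can be sketched as follows. I am assuming NumPy's Generator.pareto, which draws from a Lomax distribution with shape a, so adding 1 gives a classical Pareto with lower cutoff 1 and density exponent a + 1. In place of the powerlaw package, the sketch uses the closed form MLE with the cutoff fixed at 1, so it will not necessarily reproduce the over and underestimation seen in the figure.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples = 1000          # modest dataset size, as in the text
results = {}

for alpha_true in [1.5, 2.0, 2.5, 3.0, 3.5]:
    # Lomax draw + 1 -> classical Pareto with x_min = 1, exponent alpha = a + 1
    x = rng.pareto(alpha_true - 1.0, size=n_samples) + 1.0
    # Closed-form MLE with the cutoff fixed at x_min = 1 (an assumption)
    alpha_hat = 1.0 + n_samples / np.sum(np.log(x))
    results[alpha_true] = alpha_hat

for alpha_true, alpha_hat in results.items():
    print(alpha_true, round(alpha_hat, 3))
```

Plotting alpha_hat against alpha_true gives a calibration curve of the same kind; swapping in powerlaw.Fit for the closed form would show how much the fitting procedure itself contributes.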

Is a Power law the best model?

A good fit is not enough. We also should ensure that no other obvious model is a better fit. The powerlaw package lets us test our fit against other common (long tailed) choices, namely

• exponential (EXP)
• stretched_exponential (S_EXP)
• log normal  (LOG_N)
• truncated power law (TPL)

For example, to check if our data is better fit by a log normal distribution, we run


R, p = fit.distribution_compare('power_law', 'lognormal', normalized_ratio=True)



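which returns the log likelihood ratio R and the p-value. R is positive when the data favor the first distribution named, here the power law. If R > 0 and p <= 0.05, then we can conclude that a power law is a better model.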
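
To show what R and p measure, here is a sketch of a normalized log likelihood ratio test of the kind described by Clauset et al. It is not the powerlaw package code. I compare a power law against an exponential rather than a log normal, only because both have closed form MLEs; the helper loglikelihood_ratio and the synthetic data are my own assumptions.

```python
import math
import numpy as np

def loglikelihood_ratio(ll1, ll2):
    """Normalized log-likelihood ratio R and two-sided p-value."""
    diff = ll1 - ll2
    n = len(diff)
    R = np.sum(diff)
    sigma = np.std(diff)
    p = math.erfc(abs(R) / (math.sqrt(2 * n) * sigma))
    return R / (math.sqrt(n) * sigma), p

rng = np.random.default_rng(7)
x_min = 1.0
x = rng.pareto(1.5, size=2000) + x_min      # synthetic power law, alpha = 2.5

# Pointwise log-likelihoods under each model, both fitted by MLE
alpha = 1.0 + len(x) / np.sum(np.log(x / x_min))
ll_power = np.log((alpha - 1.0) / x_min) - alpha * np.log(x / x_min)
rate = 1.0 / np.mean(x - x_min)
ll_exp = np.log(rate) - rate * (x - x_min)

# R > 0 favors the first model (power law); p says whether the sign is significant
R, p = loglikelihood_ratio(ll_power, ll_exp)
print(R, p)
```

Since the data really are power law, R comes out positive with a small p-value. A p-value above the threshold would mean the data cannot distinguish the two models.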

Note that sometimes, for , the best model may be a truncated power law (TPL). ..

CALCULATED CONTENT by Charles H Martin, Phd - 1y ago
Why Deep Learning Works: Self Regularization in DNNs To be presented at UC Berkeley / NERSC Jun 8, 2018

Empirical results, using the machinery of Random Matrix Theory (RMT), are presented that are aimed at clarifying and resolving some of the puzzling and seemingly-contradictory aspects of deep neural networks (DNNs). We apply RMT to several well known pre-trained models: LeNet5, AlexNet, and Inception V3, as well as 2 small, toy models.

We show that the DNN training process itself implicitly implements a form of self-regularization associated with the entropy collapse / information bottleneck. We find that the self-regularization in small models like LeNet5 resembles the familiar Tikhonov regularization

whereas large, modern deep networks display a new kind of heavy tailed self-regularization.

We characterize self-regularization using RMT by identifying a taxonomy of the 5+1 phases of training.

Then, with our toy models, we show that even in the absence of any explicit regularization mechanism, the DNN training process itself leads to more and more capacity-controlled models. Importantly, this phenomenon is strongly affected by the many knobs that are used to optimize DNN training.  In particular, we can induce heavy tailed self-regularization by adjusting the batch size in training, thereby exploiting the generalization gap phenomena unique to DNNs.

We argue that this heavy tailed self-regularization has both practical implications for designing better DNNs and deep theoretical implications for understanding the complex DNN Energy landscape / optimization problem.
