Knowing Something vs. Knowing the Name of Something: Some Points about Causal Analysis
Mad (Data) Scientist
by matloff
1M ago
The famed physicist Richard Feynman once said, “I learned very early the difference between knowing the name of something and knowing something,” a lesson from his father. I think too often we in the statistics/machine learning field are guilty of “only knowing the name of something.” Well, in most cases, we may know a bit more than the name, but not as much as we need in order to use the concept effectively. As in most of my stat posts, I hope there is something here for everyone. Some of this will be new and thought-provoking for those who are already familiar with causal analysis, but also ..read more
Quantile Regression with Random Forests
Mad (Data) Scientist
by matloff
3M ago
In my December 22 blog, I first introduced the classic parametric quantile regression (QR) concept. I then showed how one could use the qeML package to perform quantile regression nonparametrically, using the package’s qeKNN function for a k-Nearest Neighbors approach. A reader then asked if this could be applied to random forests (RFs). The answer is yes, and this will be the topic of the current post. My goals in this post, as in the previous one, are to introduce the capabilities of qeML and to point out some general ML issues. The key example of the latter here is the fact that leaves in a ..read more
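For readers who want to try the random-forest version of QR right away, here is a minimal sketch using the quantregForest package (Meinshausen's quantile regression forests). This is my own illustration, not the qeML-based code from the post, and the toy data are arbitrary:

# Quantile regression via random forests with the quantregForest package.
# Illustration only; not the qeML code discussed in the post.
library(quantregForest)

x <- mtcars[, c("wt", "hp")]   # features
y <- mtcars$mpg                # response

qrf <- quantregForest(x = x, y = y)

# estimated 10th, 50th, and 90th conditional percentiles for a few cases
predict(qrf, newdata = x[1:3, ], what = c(0.1, 0.5, 0.9))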
qeML Example: Nonparametric Quantile Regression
Mad (Data) Scientist
by matloff
4M ago
In this post, I will first introduce the concept of quantile regression (QR), a powerful technique that is rarely taught in stat courses. I’ll give an example from the quantreg package, and then show how qeML can be used to do model-free QR estimation. Along the way, I will also illustrate the use of closures in R. Notation: We are predicting a scalar Y (including the case of dummy/one-hot variables) from a feature vector X. In its simplest form, QR estimates the conditional median of Y given X, as opposed to the usual conditional mean, using a linear model. As we all know, the median is ..read more
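For those who haven't seen parametric QR before, here is a minimal quantreg sketch (my own toy example, not the one from the post), estimating a conditional median rather than a conditional mean:

# Linear quantile regression with the quantreg package: tau = 0.5 gives
# median regression, contrasted with an ordinary least-squares mean fit.
library(quantreg)

fit_median <- rq(mpg ~ wt + hp, tau = 0.5, data = mtcars)   # conditional median
fit_ols    <- lm(mpg ~ wt + hp, data = mtcars)              # conditional mean

coef(fit_median)
coef(fit_ols)   # compare the two sets of coefficients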
A Comparison of Several qeML Predictive Methods
Mad (Data) Scientist
by matloff
4M ago
Is machine learning overrated, with traditional methods being underrated these days? Yes, ML has had some celebrated successes, but these have come after huge amounts of effort, and it’s possible that similar effort with traditional methods might have produced similar results. A related issue concerns the type of data. Hard-core MLers tend to divide applications into tabular and nontabular data. The former consists of the classical observations-in-rows, variables-in-columns format, while the latter means image processing, NLP and the like. The MLers’ prescription: use XGBoost for tabular data, d ..read more
Data.table User Survey
Mad (Data) Scientist
by matloff
5M ago
The data.table 2023 user community survey is here, open until December 1st ..read more
The “Secret Sauce” Used in Many qeML Functions
Mad (Data) Scientist
by matloff
5M ago
In writing an R package, it is often useful to build up some function call in string form, then “execute” the string. To give a really simple example:

> s <- '1+1'
> eval(parse(text=s))
[1] 2

Quite a lot of trouble to go to just to find that 1+1 = 2? Yes, but this trick can be extremely useful, as we’ll see here.

data(svcensus)
z <- qePCA(svcensus,'wageinc','qeKNN',pcaProp=0.5)

This says, “Apply Principal Component Analysis to the ‘svcensus’ data, with enough PCs to get 0.5 of the total variance. Then do k-Nearest Neighbor Analysis, fitting qeKNN to the PCs to predict wage income ..read more
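To make the string-building idea a bit more concrete, here is a toy helper of my own; it is not actual qeML source code, just an illustration of assembling a call as a string and then executing it:

# Hypothetical illustration of the "build a call in string form, then run it"
# trick; my own toy example, not taken from qeML.
runModel <- function(modelFtn, formulaStr, dataName) {
   # e.g. builds the string "lm(mpg ~ wt, data = mtcars)"
   cmd <- paste0(modelFtn, "(", formulaStr, ", data = ", dataName, ")")
   eval(parse(text = cmd))
}

fit <- runModel("lm", "mpg ~ wt", "mtcars")
coef(fit)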
qeML Example: Issues of Overfitting, Dimension Reduction, Etc.
Mad (Data) Scientist
by matloff
5M ago
What about variable selection? Which predictor variables/features should we use? No matter what anyone tells you, this is an unsolved problem. But there are lots of useful methods. See the qeML vignettes on feature selection and overfitting for detailed background on the issues involved. We note at the outset what our concluding statement will be: Even a very simple, very clean-looking dataset like this one may be much more nuanced than it looks. Real life is not like those simplistic textbooks, eh? Here I’ll discuss qeML::qeLeaveOut1Var. (I usually omit parentheses in referring to function names ..read more
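I won't guess at qeLeaveOut1Var's exact interface here, but the underlying idea can be sketched by hand: drop each predictor in turn, refit, and compare holdout prediction error:

# Hand-rolled sketch of the leave-one-variable-out idea (not the qeML function).
set.seed(9999)
dat <- mtcars[, c("mpg", "wt", "hp", "disp", "drat")]
idx <- sample(nrow(dat), 25)            # training rows; the rest is the holdout
trn <- dat[idx, ]; tst <- dat[-idx, ]

preds <- setdiff(names(dat), "mpg")
errs <- sapply(preds, function(p) {
   fml <- reformulate(setdiff(preds, p), response = "mpg")   # all predictors but p
   fit <- lm(fml, data = trn)
   mean(abs(predict(fit, tst) - tst$mpg))   # mean absolute prediction error
})
sort(errs)   # a large error when a variable is left out suggests it matters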
New Package, New Book!
Mad (Data) Scientist
by matloff
5M ago
Sorry I haven’t been very active on this blog lately, but now that I have more time, that will change. I’ve got myriad things to say. To begin with, then, I’ll announce a major new R package, and my new book.

qeML package (“quick and easy machine learning”)

Featured aspects:
- Now on CRAN, https://cran.r-project.org/package=qeML. See the GitHub README for an intro, https://github.com/matloff/qeML.
- Extremely simple, “one liner” user interface.
- Ideal for teaching: very simple interface, lots of included datasets, included ML tutorials.
- Lots for the advanced MLer too: special ML methods, advanced graphics ..read more
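To make that “one liner” claim concrete, here is a guess at what a basic call looks like, modeled on the qePCA example shown elsewhere on this blog; the exact arguments, and whether svcensus ships with the package, should be checked against the qeML README:

# Guessed "one liner" usage pattern; consult the qeML README for the real interface.
library(qeML)
data(svcensus)                     # assumed to be a dataset included with the package

z <- qeKNN(svcensus, 'wageinc')    # fit k-NN, predicting wage income
z$testAcc                          # holdout accuracy, if provided by the package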
Use of Differential Privacy in the US Census–All for Nothing?
Mad (Data) Scientist
by matloff
1y ago
The field of data privacy has long been of broad interest. In a medical database, for instance, how can administrators enable statistical analysis by medical researchers, while at the same time protecting the privacy of individual patients? Over the years, many methods have been proposed and used. I’ve done some work in the area myself. But in 2006, an approach known as differential privacy (DP) was proposed, by a group of prominent cryptography researchers. With its catchy name and theoretical underpinnings, DP immediately attracted lots of attention. As it is more mathematical than many other ..read more
Base-R and Tidyverse Code, Side-by-Side
Mad (Data) Scientist
by matloff
1y ago
I have a new short writeup, showing common R design patterns, implemented side-by-side in base-R and Tidy. As readers of this blog know, I strongly believe that Tidy is a poor tool for teaching R learners who have no coding background. Relative to learning in a base-R environment, learners using Tidy take longer to become proficient, and once proficient, find that they are only equipped to work in a very narrow range of operations. As a result, we see a flurry of online questions from Tidy users asking “How do I do such-and-such,” when a base-R solution would be simple and straightforward. I b ..read more
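To give a flavor of what such side-by-side comparisons look like, here is one common pattern, grouped means, in both dialects (my own example, not necessarily one from the writeup):

# One common R design pattern, grouped means, in both dialects.

# base-R
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Tidyverse
library(dplyr)
mtcars %>% group_by(cyl) %>% summarize(meanMPG = mean(mpg))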
