STHDA
STHDA is a website for statistical data analysis and data visualization using R software. It provides many easy-to-follow R programming tutorials.
STHDA
4y ago
Linear regression (or linear model) is used to predict a quantitative outcome variable (y) on the basis of one or multiple predictor variables (x) (James et al. 2014; P. Bruce and Bruce 2017).
The goal is to build a mathematical formula that defines y as a function of the x variables. Once we have built a statistically significant model, we can use it to predict future outcomes on the basis of new x values.
When you build a regression model, you need to assess the performance of the predictive model. In other words, you need to evaluate how well the model predicts the outcome ..read more
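The workflow described above (fit a formula, predict from new x values, then check how well the model predicts) can be sketched in base R. This is a minimal illustration on simulated data, not the tutorial's own example: the data, the variable names and the choice of RMSE as the performance measure are all assumptions made for the sketch.

```r
# Simulated data (hypothetical): y depends linearly on x plus noise
set.seed(123)
x <- runif(100, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(100, sd = 1)
train <- data.frame(x = x, y = y)

# Fit the model y = b0 + b1*x
model <- lm(y ~ x, data = train)
coef(model)

# Predict the outcome for new x values
new_data <- data.frame(x = c(2.5, 5, 7.5))
predictions <- predict(model, newdata = new_data)
predictions

# One simple performance measure: root mean squared error on the training data
rmse <- sqrt(mean(residuals(model)^2))
rmse
```

Because the error is computed on the same data used for fitting, it tends to be optimistic; an error estimated on held-out data is usually a fairer measure.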
STHDA
4y ago
This chapter describes how to compute multiple linear regression with interaction effects.
Previously, we described how to build a multiple linear regression model (Chapter @ref(linear-regression)) for predicting a continuous outcome variable (y) based on multiple predictor variables (x).
For example, to predict sales based on the advertising budgets spent on youtube and facebook, the model equation is sales = b0 + b1*youtube + b2*facebook, where b0 is the intercept, and b1 and b2 are the regression coefficients associated with the predictor variables youtube and facebook, respectively.
The abo ..read more
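As a rough sketch of how an interaction term can be added to the equation above, the R formula youtube * facebook expands to youtube + facebook + youtube:facebook. The data below are simulated for illustration only; the coefficients and sample size are assumptions, not values from the tutorial.

```r
# Simulated advertising data (hypothetical values)
set.seed(42)
n <- 200
youtube <- runif(n, 0, 300)
facebook <- runif(n, 0, 60)
sales <- 3 + 0.02 * youtube + 0.03 * facebook +
  0.001 * youtube * facebook + rnorm(n, sd = 1)
ads <- data.frame(youtube, facebook, sales)

# Additive model: sales = b0 + b1*youtube + b2*facebook
additive <- lm(sales ~ youtube + facebook, data = ads)

# Interaction model: adds the term b3*(youtube*facebook)
with_interaction <- lm(sales ~ youtube * facebook, data = ads)
names(coef(with_interaction))

# F-test comparing the two nested models
anova(additive, with_interaction)
```

The coefficient named youtube:facebook is the interaction effect: it measures how the effect of one budget changes as the other budget increases.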
STHDA
4y ago
This chapter describes how to compute regression with categorical variables.
Categorical variables (also known as factor or qualitative variables) are variables that classify observations into groups. They have a limited number of different values, called levels. For example, the gender of individuals is a categorical variable that can take two levels: Male or Female.
Regression analysis requires numerical variables. So, when a researcher wishes to include a categorical variable in a regression model, supplementary steps are required to make the results interpretable.
In these steps, the cate ..read more
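One common way to include a categorical variable is dummy (treatment) coding, which is what R applies by default to a factor in lm(). Whether this is the exact procedure the truncated text goes on to describe is an assumption; the small data set below is made up for the illustration.

```r
# Made-up example data: salary (in thousands) by gender
people <- data.frame(
  salary = c(50, 62, 48, 71, 55, 66),
  gender = factor(c("Female", "Male", "Female", "Male", "Female", "Male"))
)

# Levels of the factor; the first level is the reference group
levels(people$gender)

# The 0/1 coding that R will use in the model
contrasts(people$gender)

# The intercept is the mean of the reference level (Female);
# genderMale is the difference between the Male and Female means
model <- lm(salary ~ gender, data = people)
coef(model)
```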
STHDA
4y ago
In some cases, the true relationship between the outcome and a predictor variable might not be linear.
There are different solutions extending the linear regression model (Chapter @ref(linear-regression)) for capturing these nonlinear effects, including:
Polynomial regression. This is the simplest approach to modeling non-linear relationships. It adds polynomial terms (squares, cubes, etc.) to a regression.
Spline regression. Fits a smooth curve with a series of polynomial segments. The values delimiting the spline segments are called knots.
Generalized additive models (GAM ..read more
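The first two approaches in the list can be sketched with base R and the splines package that ships with R. The simulated data, the polynomial degree and the knot positions below are assumptions chosen for the illustration.

```r
library(splines)

# Simulated non-linear data (hypothetical)
set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.3)
d <- data.frame(x, y)

# Straight-line fit, for reference
linear_fit <- lm(y ~ x, data = d)

# Polynomial regression with terms up to degree 3
poly_fit <- lm(y ~ poly(x, 3), data = d)

# Cubic spline regression with knots at the quartiles of x
knots <- quantile(d$x, probs = c(0.25, 0.5, 0.75))
spline_fit <- lm(y ~ bs(x, knots = knots), data = d)

# Training RMSE of each fit
rmse <- sapply(
  list(linear = linear_fit, polynomial = poly_fit, spline = spline_fit),
  function(m) sqrt(mean(residuals(m)^2))
)
rmse
```

A lower training RMSE for the more flexible fits is expected by construction, so the comparison is more informative when repeated on held-out data.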
STHDA
4y ago
Linear regression (Chapter @ref(linear-regression)) makes several assumptions about the data at hand. This chapter describes regression assumptions and provides built-in plots for regression diagnostics in the R programming language.
After performing a regression analysis, you should always check if the model works well for the data at hand.
A first step of this regression diagnostic is to inspect the significance of the regression beta coefficients, as well as the R2, which tells us how well the linear regression model fits the data. This has been described in the Chapters @ref(linear-regressi ..read more
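A minimal sketch of this first step, followed by the built-in diagnostic plots, might look as follows. The mtcars data set and the predictors wt and hp are assumptions picked because they ship with R; they are not taken from the tutorial.

```r
# Example model on a built-in data set (an assumption for illustration)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient estimates with their standard errors, t-values and p-values
coef_table <- summary(model)$coefficients
coef_table

# Proportion of the variance in the outcome explained by the model
r2 <- summary(model)$r.squared
r2

# Built-in diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model)
```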
STHDA
4y ago
In multiple regression (Chapter @ref(linear-regression)), two or more predictor variables might be correlated with each other. This situation is referred to as collinearity.
There is an extreme situation, called multicollinearity, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. This means that there is redundancy between predictor variables.
In the presence of multicollinearity, the solution of the regression model becomes unstable.
For a given predictor (p), multicollinearity can be assessed by computing a score called the ..read more
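The name of the score is cut off above. One widely used score of this kind is the variance inflation factor (VIF), and the sketch below assumes that is the kind of measure meant. For each predictor p, it regresses p on the other predictors and computes 1 / (1 - R2). The mtcars data and the three predictors are illustrative choices.

```r
# Predictors to check (illustrative choice from a built-in data set)
predictors <- c("wt", "hp", "disp")

# VIF for predictor p: regress p on the remaining predictors,
# then compute 1 / (1 - R2) of that auxiliary regression
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})
vif
```

A VIF of 1 means the predictor is uncorrelated with the others; larger values indicate that more of its variance is explained by the remaining predictors.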
STHDA
4y ago
A confounding variable is an important variable that should be included in the predictive model but has been omitted. Naive interpretation of such models can lead to invalid conclusions.
For example, consider that we want to model life expectancy in different countries based on the GDP per capita, using the gapminder data set:
library(gapminder)
lm(lifeExp ~ gdpPercap, data = gapminder)
In this example, it is clear that the continent is an important variable: countries in Europe are estimated to have a higher life expectancy compared to countries in Africa. Therefore, continent is a confounding v ..read more
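To show the effect without depending on an extra package, the sketch below simulates a hypothetical grouping variable that influences both the predictor and the outcome, then compares the slope with and without it. With the gapminder data, the analogous adjusted model would be lm(lifeExp ~ gdpPercap + continent, data = gapminder). All simulated values are assumptions.

```r
# Simulated data (hypothetical): 'group' affects both x and y
set.seed(7)
n <- 300
group <- factor(sample(c("A", "B"), n, replace = TRUE))
x <- rnorm(n, mean = ifelse(group == "B", 5, 0))
y <- 1 + 0.5 * x + ifelse(group == "B", 10, 0) + rnorm(n)
d <- data.frame(x, y, group)

# Naive model: the confounder is omitted
naive <- lm(y ~ x, data = d)

# Adjusted model: the confounder is included
adjusted <- lm(y ~ x + group, data = d)

# The naive slope absorbs the group effect; the adjusted slope
# is close to the true value of 0.5 used in the simulation
naive_slope <- coef(naive)[["x"]]
adjusted_slope <- coef(adjusted)[["x"]]
c(naive = naive_slope, adjusted = adjusted_slope)
```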
STHDA
4y ago
In this chapter we’ll describe different statistical regression metrics for measuring the performance of a regression model (Chapter @ref(linear-regression)).
Next, we’ll provide practical examples in R for comparing the performance of two models in order to select the best one for our data.
Contents:
Model performance metrics
Loading required R packages
Example of data
Building regression models
Assessing model quality
Comparing regression models performance
Discussion
Model performance metrics
In regression model, the most c ..read more
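The list of metrics is cut off above, so the selection in the following sketch (R2, adjusted R2, RMSE, MAE, AIC and BIC) is an assumption, as are the two example models on the built-in mtcars data. The helper name metrics is hypothetical.

```r
# Two candidate models on a built-in data set (illustrative choice)
model1 <- lm(mpg ~ wt + hp, data = mtcars)
model2 <- lm(mpg ~ wt, data = mtcars)

# Hypothetical helper collecting common performance metrics
metrics <- function(model) {
  res <- residuals(model)
  c(
    R2     = summary(model)$r.squared,
    adj_R2 = summary(model)$adj.r.squared,
    RMSE   = sqrt(mean(res^2)),   # root mean squared error
    MAE    = mean(abs(res)),      # mean absolute error
    AIC    = AIC(model),
    BIC    = BIC(model)
  )
}

# One row per model, one column per metric
comparison <- rbind(model1 = metrics(model1), model2 = metrics(model2))
round(comparison, 3)
```

Higher R2 and lower RMSE, MAE, AIC and BIC indicate a better fit; AIC and BIC also penalize the number of parameters, which makes them better suited to comparing models of different sizes.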
STHDA
4y ago
Cross-validation refers to a set of methods for measuring the performance of a given predictive model on new test data sets.
The basic idea behind cross-validation techniques is to divide the data into two sets:
The training set, used to train (i.e. build) the model;
and the testing set (or validation set), used to test (i.e. validate) the model by estimating the prediction error.
Cross-validation is also known as a resampling method because it involves fitting the same statistical method multiple times using different subsets of the data.
In this chapter, you’ll learn ..read more
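As an illustration of the train/test idea repeated over several subsets, here is a small k-fold cross-validation written in base R. The choice of k = 5, the mtcars data and the model formula are assumptions for the sketch; the chapter itself may use a dedicated package.

```r
set.seed(123)
k <- 5

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: train on the other folds, test on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  train <- mtcars[folds != i, ]
  test <- mtcars[folds == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2))  # prediction error on unseen rows
})

# Cross-validated estimate of the prediction error
cv_rmse <- mean(fold_rmse)
cv_rmse
```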
STHDA
4y ago
Similarly to cross-validation techniques (Chapter @ref(cross-validation)), the bootstrap resampling method can be used to measure the accuracy of a predictive model. Additionally, it can be used to measure the uncertainty associated with any statistical estimator.
Bootstrap resampling consists of repeatedly selecting a sample of n observations (with replacement) from the original data set, and evaluating the model on each copy. An average standard error is then calculated and the results provide an indication of the overall variance of the model performance.
This chapter describes the basics of bootstrapping a ..read more
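A minimal base R sketch of the resampling idea is shown below. It estimates the uncertainty of one regression coefficient, which is only one possible use of the bootstrap; the data set, the model and the number of resamples (B = 500) are assumptions.

```r
set.seed(123)
n <- nrow(mtcars)
B <- 500  # number of bootstrap resamples (arbitrary choice)

# Refit the model on B samples of n rows drawn with replacement
boot_slopes <- replicate(B, {
  idx <- sample(n, size = n, replace = TRUE)
  fit <- lm(mpg ~ wt, data = mtcars[idx, ])
  coef(fit)[["wt"]]
})

# Bootstrap standard error of the slope
boot_se <- sd(boot_slopes)
boot_se

# Percentile-based 95% interval for the slope
quantile(boot_slopes, probs = c(0.025, 0.975))
```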