Regression models
Data
Distributions
Fitting
Selection
Comparison
Interpretation
Through a simple data example
Models and statistical modelling
Assumptions
Regression Models
Distributional Regression
Example
“www.gamlss.com”
“All models are wrong, but some are useful.”
– George Box
Models should be parsimonious
Models should be fit for purpose and able to answer the question at hand
Statistical models have a stochastic component
All models are based on assumptions.
Assumptions are made to simplify things
Explicit assumptions
Implicit assumptions
It is easier to check the explicit assumptions than the implicit ones
flowchart TB
  A[model] --> B(assumptions)
  B --> C[fit]
  C --> D{check}
  D -->|adequate| E(stop)
  D -->|not good| B
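The fit–check cycle above can be sketched in R with gamlss (a minimal sketch, assuming the gamlss package and its abdom data are available; wp() draws a worm plot of the normalised quantile residuals used for model checking):

```r
library(gamlss)

# fit: a simple normal model for abdominal circumference
m1 <- gamlss(y ~ pb(x), data = abdom, trace = FALSE)

# check: worm plot of the normalised quantile residuals;
# points outside the bands suggest the assumptions are not adequate
wp(m1)

# not adequate? revise the assumptions, e.g. try a different
# response distribution, and fit again
m2 <- gamlss(y ~ pb(x), family = TF, data = abdom, trace = FALSE)
wp(m2)
```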
\[ \begin{equation} y_i= b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_p x_{pi}+ \epsilon_i \end{equation} \qquad(1)\]
\[ \begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim } & {N}(\mu_i, \sigma) \nonumber \\ \mu_i &=& b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_p x_{pi} \end{eqnarray} \qquad(2)\]
\[ \begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim } & {N}(\mu_i, \sigma) \nonumber \\ \mu_i &=& b_0 + s_1(x_{1i}) + s_2(x_{2i}) + \ldots + s_p(x_{pi}) \end{eqnarray} \qquad(3)\]
\[\begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim }& {N}(\mu_i, \sigma) \nonumber \\ \mu_i &=& ML(x_{1i},x_{2i}, \ldots, x_{pi}) \end{eqnarray} \qquad(4)\]
\[\begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim }& {E}(\mu_i, \phi) \nonumber \\ g(\mu_i) &=& b_0 + b_1 x_{1i} + b_2 x_{2i} + \ldots + b_p x_{pi} \end{eqnarray} \qquad(5)\]
\({E}(\mu_i, \phi)\): the exponential family
\(g(\mu_i)\): the link function
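Equation 5 can be fitted with base R's glm(). A minimal sketch with made-up count data (the variables and coefficients are hypothetical), using the Poisson member of the exponential family and its canonical log link:

```r
# illustrative data (hypothetical)
set.seed(1)
x1 <- runif(100)
x2 <- runif(100)
y  <- rpois(100, lambda = exp(0.5 + 1.2 * x1 - 0.8 * x2))

# Poisson GLM: g(mu) = log(mu) = b0 + b1*x1 + b2*x2
fit <- glm(y ~ x1 + x2, family = poisson(link = "log"))
coef(fit)
```

Here only the mean mu is modelled through the link; the dispersion phi is fixed by the Poisson assumption.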
\[ X \stackrel{\textit{M}(\boldsymbol{\theta})}{\longrightarrow} D\left(Y|\boldsymbol{\theta}(\textbf{X})\right) \]
All parameters \(\boldsymbol{\theta}\) could be functions of the explanatory variables \(\boldsymbol{\theta}(\textbf{X})\).
\(D\left(Y|\boldsymbol{\theta}(\textbf{X})\right)\) can be any \(k\)-parameter distribution
\[\begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim }& {D}( \theta_{1i}, \ldots, \theta_{ki}) \nonumber \\ g(\theta_{1i}) &=& b_{10} + s_1({x}_{1i}) + \ldots + s_p({x}_{pi}) \nonumber\\ \ldots &=& \ldots \nonumber\\ g({\theta}_{ki}) &=& b_{k0} + s_1({x}_{1i}) + \ldots + s_p({x}_{pi}) \end{eqnarray} \qquad(6)\]
\[\begin{eqnarray} y_i & \stackrel{\small{ind}}{\sim }& {D}( \theta_{1i}, \ldots, \theta_{ki}) \nonumber \\ g({\theta}_{1i}) &=& {ML}_1({x}_{1i},{x}_{2i}, \ldots, {x}_{pi}) \nonumber \\ \ldots &=& \ldots \nonumber\\ g({\theta}_{ki}) &=& {ML}_k({x}_{1i},{x}_{2i}, \ldots, {x}_{pi}) \end{eqnarray} \qquad(7)\]
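Models of the form (6) are what the gamlss package fits: every distribution parameter gets its own additive predictor. A minimal sketch (assuming the gamlss package and its abdom data are available), modelling both mu and sigma of a normal distribution as smooth functions of gestational age:

```r
library(gamlss)

# both parameters of N(mu, sigma) depend smoothly on x:
# identity link for mu, log link for sigma (the gamlss defaults for NO)
m <- gamlss(y ~ pb(x),                # predictor for mu
            sigma.formula = ~ pb(x),  # predictor for sigma
            family = NO, data = abdom, trace = FALSE)
summary(m)
```

The same idea extends to three- and four-parameter families, where skewness and kurtosis parameters get their own formulas.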
Figure 1: Abdominal circumference against gestational age.
library(gamlss)
library(ggplot2)
library(gamlss.ggplots)
library(gamlss.add)
# linear model
lm1 <- gamlss(y~x, data=abdom, trace=FALSE)
# additive smooth (P-splines)
am1 <- gamlss(y~pb(x), data=abdom, trace=FALSE)
# neural network
set.seed(123)
nn1 <- gamlss(y~nn(~x), size=5, data=abdom, trace=FALSE)
# regression tree
rt1 <- gamlss(y~tr(~x), data=abdom, trace=FALSE)
GAIC(lm1, am1, nn1, rt1)
           df      AIC
am1  6.508274 4948.869
nn1 12.000000 4965.171
lm1  3.000000 5008.453
rt1 14.000000 5305.390
The additive smooth model is the best parsimonious model
A kurtotic distribution is adequate for the data
No simple machine learning method will do, because the data show kurtosis and we are interested in centiles
Quantile regression could be used here, but in general its implicit assumptions are more difficult to check
Tip
Implicit assumptions are more difficult to check
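Since the data call for a kurtotic distribution and centiles are the quantity of interest, a gamlss fit can supply both. A minimal sketch (assuming the gamlss package and its abdom data; TF, the t-family, is one kurtotic choice among several):

```r
library(gamlss)

# kurtotic response distribution, smooth predictors for its parameters
kt1 <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
              family = TF, data = abdom, trace = FALSE)

# compare with a normal fit by GAIC
no1 <- gamlss(y ~ pb(x), sigma.formula = ~ pb(x),
              family = NO, data = abdom, trace = FALSE)
GAIC(kt1, no1)

# fitted centile curves against gestational age
centiles(kt1, xvar = abdom$x)
```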
The Books
www.gamlss.com