Discrete Distributions

Mikis Stasinopoulos

Bob Rigby

Fernanda De Bastiani

Gillian Heller

Niki Umlauf

Types

discrete count distributions defined on $0,1,\ldots,\infty$; Poison type or infinity count discrete distributions
discrete count distributions defined on $0,1,\ldots,n$; binomial type of finite count discrete distributions

infinite count

infinite count discrete

The three major problems encounter when modelling count data using the Poisson distribution.

overdispersion
excess (or shortage) of zero values
long tails (rare events)

flow chart

Discrete infinity count distribut.

No Param.	Modelling	Distributions
1	Location	`GEOM`, `LG`, `PO`, `YULE`
2	Location and scale	`DPO`, `GPO`, `NBI`, `NBII`, `PIG`, `WARING`
2	Location and zero probability	`ZALG`, `ZAP`, `ZAZIPF`, `ZIP`, `ZIP2`
3	Location, scale and skewness	`BNB`, `DEL`, `NBF`, `SI`, `SICHEL`
3	Location, scale and zero probability	`ZANBI`, `ZAPIG`, `ZINBI`, `ZIPIG`
4	Location, scale, skewness and zero probability	`ZANBF`, `ZABNB`, `ZASICHEL`, `ZINBF`, `ZIBNB`, `ZISICHEL`

Different types of count (ZIP)

library(gamlss.ggplots)
family_pdf(ZIP, mu=10, sigma=.2, from=0, to=20)

Different types of count (NBI)

family_pdf(NBI, mu=5, sigma=1, from=0, to=20)

Different types of count (NBItr)

library(gamlss.tr)
gen.trun(par=0, family="NBI")

A truncated family of distributions from NBI has been generated 
 and saved under the names:  
 dNBItr pNBItr qNBItr rNBItr NBItr 
The type of truncation is left 
 and the truncation parameter is 0

family_pdf(NBItr, mu=3, sigma=2.5, from=1, to=20)

Different types of count (PIG)

family_pdf(PIG, mu=5, sigma=2, from=1, to=20)

Zero inflated

Zero inflated distribution, $Y \sim {\bf ZID}$ is given by

$Y=0$ with probability $p$

$Y \sim {\bf D}$ with probability $1-p$.

Hence

\[ \begin{split} P(Y=y) &= p + (1-p)P(Y_1=0) & \mbox{if $y=0$} \\ &= (1-p) P(Y_1=y) & \mbox{if $y=1,2,3,...$} \\ \end{split}\]

where $Y_1 \sim {\bf D}$.

Zero adjusted

Zero adjusted distribution, $Y \sim {\bf ZAD}$ is given by

$Y=0$ with probability $p$

$Y \sim {\bf Dtr}$ with probability $1-p$,

where ${\bf Dtr}$ is a truncated distribution, ${\bf D}$ truncated at zero.

Hence

\[ \begin{split} P(Y=y) &= p & \mbox{if $y=0$} \\ &= (1-p) \frac{P(Y_1=y)}{1-P(Y_1=0)} & \mbox{if $y=1,2,3,...$} \\ \end{split}\]

where $Y_1 \sim {\bf D}$.

Oversispersion and underdispersion

Types of overdispersion

Ad-hoc solutions

(i) quasi-likelihood (QL),  Extended QL

(ii) Efron's Double Exponential

(iii) pseudo-likelihood (PL)

Discretized continuous distributions for example if $F_W(w)$ is the cdf a continuous random variable $W$ defined in $\Re^+$ then \[f_Y(y)=F_W(y+1)-F_W(y)\]
Random effect at the observation level solutions. $f_Y(y)=\int f(y|\gamma) f_{\gamma}(\gamma) d\gamma$.

random effect at observation level

when an an explicit continuous mixture distribution, $f_Y(y)$, exists.
when a continuous mixture distribution, $f_Y(y)$, is not explicit but is approximated by integrating out the random effect using approximations, e.g. Gaussian quadrature or Laplace approximation.
when a non-parametric mixture (effectively a finite mixture) is assumed for the response variable.

random effect at observation level

Explicit continuous mixture distribution

\[ \underbrace{f_Y(y)}_\text{discrete}= \int \underbrace{f(y|\gamma)}_\text{discrete} \underbrace{ f_{\gamma}(\gamma)}_\text{continuous} d\gamma \]

$Y\sim NBI(\mu,\sigma)$
$Y|\gamma \sim PO(\gamma \mu )$
$\gamma \sim GA(1, \sigma^{1/2})$

random effect at observation level II

Parametric mixture distribution

\[ \underbrace{f_Y(y)}_\text{discrete}= \int \underbrace{f(y|\gamma)}_\text{discrete} \underbrace{ f_{\gamma}(\gamma)}_\text{continuous} d\gamma \]

$Y\sim PO-Normal(\mu,\sigma)$
$Y|\gamma \sim PO(\gamma \mu)$
$\log(\gamma) \sim NO(1, \sigma)$

random effect at observation level III

Non-parametric mixture distribution {.smaller}

\[ \underbrace{f_Y(y)}_\text{discrete}= \sum_{k=1}^{K} \underbrace{f(y| \gamma_k)}_\text{discrete} \underbrace{ p(\gamma=\gamma_k)}_\text{continuous} \]

$Y\sim PO-NPFM(\mu,\sigma)$
$Y|\gamma \sim PO({\gamma \mu})$
$\log(\gamma) \sim NPFM(2)$

where NPFM(2) equals Non-Parametric Finite Mixture with 2 point probabilities

explicit continuous Mixtures

Distributions	R Name	mixing distribution for $\gamma$
Poisson	PO$(\mu)$	-
Neg. bin. I	NBI$(\mu,\sigma)$	GA$(1,\sigma^{\frac{1}{2}}$)
Neg. bin. II	NBII$(\mu,\sigma)$	GA$(1,\sigma^{\frac{1}{2}}/\mu)$
Poisson IG	PIG$(\mu,\sigma)$	IG$(1,\sigma^{\frac{1}{2}})$
Sichel	SICHEL$(\mu,\sigma,\nu)$	GIG$(1,\sigma^{\frac{1}{2}},\nu)$
Delaporte	DEL$(\mu,\sigma,\nu)$	SG$(1,\sigma^{\frac{1}{2}},\nu)$
Zero inf. Poisson	ZIP$(\mu,\sigma)$	BI$(1,1-\sigma)$
Zero inf. Poisson 2	ZIP2$(\mu,\sigma)$	$(1-\sigma)^{-1}$$(1,1-\sigma)$
Zero inf. neg. bin.	$(\mu,\sigma, \nu)$	zero inflated gamma
Poisson-Tweedie	-	Tweedie family

mean and variance

R Name	params	mean	variance
PO$(\mu)$	1	$\mu$	$\mu$
NBI$(\mu,\sigma)$	2	$\mu$	$\mu+\sigma\mu^2$
NBII$(\mu,\sigma)$	2	$\mu$	$\mu+\sigma\mu$
PIG$(\mu,\sigma)$	2	$\mu$	$\mu+\sigma\mu^2$
SICHEL$(\mu,\sigma,\nu)$	3	$\mu$	$\mu+h(\sigma,\nu)\mu^2$
DEL$(\mu,\sigma,\nu)$	3	$\mu$	$\mu+\sigma(1-\nu)^2\mu^2$
ZIP$(\mu,\sigma)$	2	$(1-\sigma)\mu$	$(1-\sigma)\mu+\sigma(1-\sigma)\mu^2$
ZIP2$(\mu,\sigma)$	2	$\mu$	$\mu+ \frac{\sigma}{(1-\sigma)} \mu^2$

nagative binomial Family

Negative Binomial type I: \[V \left[ Y \right]= \mu + \sigma \mu^2 \]
Negative Binomial type II: \[V \left[ Y \right]= \mu + \sigma \mu\]
Negative Binomial family \[V(Y)=\mu+\sigma \mu^{\nu}\]

finite count

Binomial type

Distribution	name	links
		$\mu$	$\sigma$	$\nu$	$\tau$
binomial	`BI`	logit	-	-	-
beta binomial	`BB`	logit	log	-	-
double binomial	`DBI`	logit	log	-	-
zero-adj beta binomial	`ZABB`	logit	log	logit	-
zero-adj binomial	`ZABI`	logit	logit	-	-
zero-inf beta binomial	`ZABB`	logit	log	logit	-
zero-inf binomial	`ZIBI`	logit	logit	-	-

examples

stylometric: data

Data summary: R data file: stylo in package gamlss.data of dimensions $ 64 $ - source: Dr Mario Corina-Borja

word: is the number of times a word appears in a single text
freq: the frequency of the number of times a word appears in a text
purpose: to demonstrate the fitting of a truncated discrete dist.
conclusion: the truncated SICHEL distributions fits best

stylometric: plot (con.)

library(ggplot2)
library(gamlss)
ggplot(data=stylo, aes(x=word, xend = word, y = 0, yend=freq))+
  geom_segment(lwd=2)

stylometric: truncation (con.)

library(gamlss.tr)
gen.trun(par = 0, family = PO, type = "left")

A truncated family of distributions from PO has been generated 
 and saved under the names:  
 dPOtr pPOtr qPOtr rPOtr POtr 
The type of truncation is left 
 and the truncation parameter is 0

gen.trun(par = 0, family = NBII, type = "left")

A truncated family of distributions from NBII has been generated 
 and saved under the names:  
 dNBIItr pNBIItr qNBIItr rNBIItr NBIItr 
The type of truncation is left 
 and the truncation parameter is 0

gen.trun(par = 0, family = DEL, type = "left")

A truncated family of distributions from DEL has been generated 
 and saved under the names:  
 dDELtr pDELtr qDELtr rDELtr DELtr 
The type of truncation is left 
 and the truncation parameter is 0

gen.trun(par = 0, family = SICHEL, type = "left",
     delta = 0.001)

A truncated family of distributions from SICHEL has been generated 
 and saved under the names:  
 dSICHELtr pSICHELtr qSICHELtr rSICHELtr SICHELtr 
The type of truncation is left 
 and the truncation parameter is 0

stylometric: fits (con.)

library(gamlss2)
mPO <- gamlss2(word ~ 1, weights = freq, data = stylo, family = POtr, 
trace = FALSE)
mNBII <- gamlss2(word ~ 1, weights = freq, data = stylo, family = NBIItr, 
n.cyc = 50, trace = FALSE)
mDEL <- gamlss2(word ~ 1, weights = freq, data = stylo, family = DELtr, 
n.cyc = 50, trace = FALSE)
mSI <- gamlss2(word ~ 1, weights = freq, data = stylo, family = SICHELtr, 
n.cyc = 50, trace = FALSE)
gamlss2::GAIC(mPO, mNBII, mDEL, mSI)

           AIC df
mSI   5149.054  3
mDEL  5160.706  3
mNBII 5322.344  2
mPO   9207.459  1

stylometric: fitted POtr (con.)

m1 <- histDist(word, data=stylo, freq=freq, family=POtr)

stylometric: fitted NBIItr (con.)

m1 <- histDist(word, data=stylo,  freq=freq, family=NBIItr)

stylometric: fitted DELtr (con.)

m1 <- histDist(word, data=stylo,  freq=freq, family=DELtr)

stylometric: fitted SICHELtr (con.)

m1 <- histDist(word, data=stylo,  freq=freq, family=SICHELtr)

fish species data

R data file: species in package gamlss.data of dimensions $70 \times 2$
variables
- fish: the number of different species in 70 lakes in the world
- lake: the lake area

fish species plot

fish species questions

How does the mean of $y$ depend on $x$?
Is $y$ overdispersed Poisson?
How does the variance $y$ depend on its mean?
What is the distribution of $y$ given $x$?
Do the scale and shape parameters of the distribution of $y$ depend on $x$?

fish species fits

Model	$f_Y(y)$	$\mu$	$\sigma$	$\nu$	GDEV	df	AIC	SBC
1	PO	poly(x,2)	-	-	1849.3	3	1855.3	1862.0
2	NBI	$x$	$1$	-	619.8	3	625.8	632.6
3	NBI	poly(x,2)	$1$	-	614.3	4	622.3	631.3
4	NBI	s(x)	$1$	-	611.9	6	623.9	637.4
5	NBI	poly(x,2)	x	-	605.0	5	615.0	626.2
6	NBI-fam	poly(x,2)	1	1	606.0	5	616.0	627.3
7	NBI-fam	poly(x,2)	x	1	604.9	6	616.9	630.4
8	PIG	poly(x,2)	$1$	-	613.3	4	621.3	630.3
9	SI	poly(x,2)	$1$	x	597.7	6	`609.7`	623.2

fish species fits (con.)

Model	$f_Y(y)$	$\mu$	$\sigma$	$\nu$	GDEV	df	AIC	SBC
10	DEL	poly(x,2)	1	x	600.6	6	612.6	626.1
11	DEL	poly(x,2)	-	x	600.6	5	610.6	621.9
12	PO-Normal	poly(x,2)	1	-	615.2	4	623.2	632.2
13	NBI-Normal	poly(x,2)	x	$1$	603.7	6	615.7	629.2
14	PO-NPFM(5)	poly(x,2)	-	$-$	601.9	13	627.9	657.2
15	NB-NPFM(2)	poly(x,2)	1	$-$	611.9	6	623.9	637.4
16	doublePO	poly(x,2)	x	-	616.4	5	626.4	637.6
17	IGdisc	poly(x,2)	1	-	603.3	4	611.3	`620.3`

fish species fitted mean

library(gamlss2)
m1 <- gamlss2(fish~log(lake), data=species, family=SI, trace=F)
ggplot(data=species, aes(x=log(lake), y=fish))+
geom_point()+
geom_line(aes(y=predict(m1,type="parameter", what="mu")))

#geom_line(aes(y=fitted(m1, type="parameter", what="mu")))

fish species fitted pdf

fish species fitted cdf

fish species diagnostics

the hospital stay data

library(gamlss.prepdata)
da <- aep[, -c(1,2,3,8,9)]
da$r <- with(aep, noinap/los*100)
data_xyplot(da, response=r)

100 % of data are ploted, 
that is, 1383 observations.

the hospital stay fits

library(gamlss2)
mI <- gamlss2(y ~ ward + year + loglos|year, family = BB, data = aep, trace=F)
mII <- gamlss2(y ~ ward + year + loglos|year + ward, family = BB, 
               data = aep, trace=FALSE)
mIII <- gamlss2(y ~ ward + year + s(loglos)| year+ward, 
              family = BB, data = aep,trace=FALSE)
mIV <- gamlss2(y ~ ward + year + s(loglos)+s(age)| year+ward, 
              family = BB, data = aep,trace=FALSE)
gamlss2::GAIC(mI, mII, mIII, mIV, k = 2)

          AIC       df
mIV  4476.796 15.86038
mIII 4477.800 14.09619
mII  4501.020  9.00000
mI   4533.441  7.00000

the hospital stay fits

END

back to the index

The Books

Distributions	R Name	mixing distribution for \(\gamma\)
Poisson	PO\((\mu)\)	-
Neg. bin. I	NBI\((\mu,\sigma)\)	GA\((1,\sigma^{\frac{1}{2}}\))
Neg. bin. II	NBII\((\mu,\sigma)\)	GA\((1,\sigma^{\frac{1}{2}}/\mu)\)
Poisson IG	PIG\((\mu,\sigma)\)	IG\((1,\sigma^{\frac{1}{2}})\)
Sichel	SICHEL\((\mu,\sigma,\nu)\)	GIG\((1,\sigma^{\frac{1}{2}},\nu)\)
Delaporte	DEL\((\mu,\sigma,\nu)\)	SG\((1,\sigma^{\frac{1}{2}},\nu)\)
Zero inf. Poisson	ZIP\((\mu,\sigma)\)	BI\((1,1-\sigma)\)
Zero inf. Poisson 2	ZIP2\((\mu,\sigma)\)	\((1-\sigma)^{-1}\)\((1,1-\sigma)\)
Zero inf. neg. bin.	\((\mu,\sigma, \nu)\)	zero inflated gamma
Poisson-Tweedie	-	Tweedie family

R Name	params	mean	variance
PO\((\mu)\)	1	\(\mu\)	\(\mu\)
NBI\((\mu,\sigma)\)	2	\(\mu\)	\(\mu+\sigma\mu^2\)
NBII\((\mu,\sigma)\)	2	\(\mu\)	\(\mu+\sigma\mu\)
PIG\((\mu,\sigma)\)	2	\(\mu\)	\(\mu+\sigma\mu^2\)
SICHEL\((\mu,\sigma,\nu)\)	3	\(\mu\)	\(\mu+h(\sigma,\nu)\mu^2\)
DEL\((\mu,\sigma,\nu)\)	3	\(\mu\)	\(\mu+\sigma(1-\nu)^2\mu^2\)
ZIP\((\mu,\sigma)\)	2	\((1-\sigma)\mu\)	\((1-\sigma)\mu+\sigma(1-\sigma)\mu^2\)
ZIP2\((\mu,\sigma)\)	2	\(\mu\)	\(\mu+ \frac{\sigma}{(1-\sigma)} \mu^2\)

Distribution	name	links
		\(\mu\)	\(\sigma\)	\(\nu\)	\(\tau\)
binomial	`BI`	logit	-	-	-
beta binomial	`BB`	logit	log	-	-
double binomial	`DBI`	logit	log	-	-
zero-adj beta binomial	`ZABB`	logit	log	logit	-
zero-adj binomial	`ZABI`	logit	logit	-	-
zero-inf beta binomial	`ZABB`	logit	log	logit	-
zero-inf binomial	`ZIBI`	logit	logit	-	-