The gamlss.prepdata package originated from the gamlss.ggplots package. As gamlss.ggplots became too large, for easy maintenance, it was split into two separate packages, and gamlss.prepdata was created.
Since gamlss.prepdata is still at an experimental stage, some of its functions remain hidden to allow time for thorough checking and validation. These hidden functions can still be accessed using the triple colon notation, for example: gamlss.prepdata:::.
The functions available in gamlss.prepdata are intended for pre-fitting — that is, to be used before applying the gamlss() or gamlss2() fitting functions. The available functions can be grouped into the following categories:
Information functions
These functions provide information about:
the size of the dataset;
the extent of missing values;
the structure of the dataset;
whether the class of variables is appropriate for analysis.
Plotting functions
These functions allow plotting of:
individual variables and
pairwise relationships of variables
Features functions
These functions can assist in:
detecting outliers;
applying transformations to variables and
scaling variables.
Data Partition functions
Functions that facilitate partitioning the data to improve inference and avoid overfitting during model selection.
Distributional Regression
The aim of this vignette is to demonstrate how to manipulate and prepare data before applying a distributional regression analysis.
The general form of a distributional regression model can be written as: \[
\begin{split}
\textbf{y} & \stackrel{\small{ind}}{\sim } \mathcal{D}( \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_k) \nonumber \\
g_1(\boldsymbol{\theta}_1) &= \mathcal{ML}_1(\textbf{x}_{11},\textbf{x}_{21}, \ldots, \textbf{x}_{p1}) \nonumber \\
\ldots &= \ldots \nonumber\\
g_k(\boldsymbol{\theta}_k) &= \mathcal{ML}_k(\textbf{x}_{1k},\textbf{x}_{2k}, \ldots, \textbf{x}_{pk}).
\end{split}
\tag{1}\] where we assume that the response variable \(y_i\), for \(i=1,\ldots, n\), is independently distributed with distribution \(\mathcal{D}(\theta_1, \ldots, \theta_k)\) having \(k\) parameters, and where all parameters can be affected by the explanatory variables \(\textbf{x}_{1},\textbf{x}_{2}, \ldots, \textbf{x}_{p}\). Here \(\mathcal{ML}\) represents any regression-type machine learning algorithm, e.g. LASSO, neural networks, etc.
When only additive smoothing terms are used in the fitting, the model can be written as \[\begin{split}
\textbf{y} & \stackrel{\small{ind}}{\sim } \mathcal{D}( \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_k) \nonumber \\
g_1( \boldsymbol{\theta}_1) &= b_{01} + s_1(\textbf{x}_{11}) + \cdots + s_p(\textbf{x}_{p1}) \nonumber\\
\ldots &= \ldots \nonumber\\
g_k( \boldsymbol{\theta}_k) &= b_{0k} + s_1(\textbf{x}_{1k}) + \cdots + s_p(\textbf{x}_{pk}).
\end{split}
\tag{2}\] which is the GAMLSS model introduced by Rigby and Stasinopoulos (2005).
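To make the notation of model (2) concrete, here is a minimal illustrative fit using the gamlss() function; this is our own sketch (not part of gamlss.prepdata), assuming the gamlss package and the rent99 data used throughout this vignette, and the choice of a Gamma response with P-spline smoothers pb() is purely for illustration.
library(gamlss)
# a GAMLSS model of form (2): smooth terms for mu and a smooth term for sigma of a Gamma response
m <- gamlss(rent ~ pb(area) + pb(yearc) + location,
            sigma.formula = ~ pb(area),
            family = GA, data = rent99)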
For all factors in the data, the first level becomes the level with the highest number of observations.
Next we demonstrate simple use of the functions.
data_dim()
This function provides detailed information about the dimensions of a data.frame. It is similar to the R function dim(), but with additional details. The output is the original data frame, allowing it to be used in a series of piping commands.
rent99 |> data_dim()
**************************************************************
**************************************************************
the R class of the data is: data.frame
the dimensions of the data are: 3082 by 9
number of observations with missing values: 0
% of NA's in the data: 0 %
**************************************************************
**************************************************************
If the variables in the dataset have very long names, they can be difficult to handle in formulae during modelling. The function data_shorter_names() abbreviates the names of the explanatory variables, making them easier to use in formulas.
rent99 |> data_shorter_names()
**************************************************************
**************************************************************
the names of variables
[1] "rent" "rents" "area" "yearc" "locat" "bath" "kitch" "cheat" "distr"
**************************************************************
**************************************************************
If no long variable names exist in the dataset, the function data_shorter_names() does nothing. However, when applicable, the function abbreviates long names and returns the original data frame with the new shortened names.
Warning
Note that there is a risk when using a small value for the max option, as it may result in identical names for different variables. This could lead to confusion or errors in the modelling process. It is important to carefully choose an appropriate value for max to avoid this issue.
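To see why a too small value of max can be risky, here is a small base-R illustration (not the package's code) of how aggressive truncation can collapse distinct names into the same string:
nms <- c("location_north", "location_south")
substr(nms, 1, 8)   # both names collapse to "location"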
The function data_omit() omits all observations with missing values.
rent99 |> data_omit()
**************************************************************
**************************************************************
the R class of the data is: data.frame
the dimensions of the data before omition are: 3082 x 9
the dimensions of the data saved after omition are: 3082 x 9
the number of observations omited: 0
**************************************************************
**************************************************************
The output of the function is a new data frame with all NA’s omitted.
Warning
It is important to select the relevant variables for the analysis before using the data_omit() function, as some unwanted variables may contain many missing values that could lead to unnecessary omissions.
Often, variables in datasets are read as character vectors, but for analysis, they may need to be treated as factors. This function transforms any character vector (with a relatively small number of distinct values) into a factor.
rent99 |> data_cha2fac() -> da
**************************************************************
not character vector was found
Since no character vectors were found, nothing has changed. The output of the function is a new data frame.
There are occasions when some variables have very few distinct observations, and it may be better to treat them as factors. The function data_few2fac() converts (numeric) vectors with a small number of distinct values into factors.
rent99 |> data_few2fac() -> da
**************************************************************
rent rentsqm area yearc location bath kitchen cheating
2723 3053 132 68 3 2 2 2
district
336
**************************************************************
4 vectors with fewer number of values than 5 were transformed to factors
**************************************************************
**************************************************************
Occasionally, we need to convert integer variables with a very large range of values into numeric vectors, especially for plots. The function data_int2num() performs this conversion.
rent99 |> data_int2num() -> da
**************************************************************
rent rentsqm area yearc location bath kitchen cheating
2723 3053 132 68 3 2 2 2
district
336
2 integer vectors with more number of values than 50 were transformed to numeric
**************************************************************
Conversely, the function data_fac2num() converts specified factors back into numeric vectors; the variables to be converted are given with the vars argument.
da <- data_fac2num(rent, vars=c("B"))
**************************************************************
**************************************************************
the R class of the data is: data.frame
the dimensions of the data before omition are: 1969 x 9
the dimensions of the data saved after omition are: 1969 x 9
the number of observations omited: 0
**************************************************************
**************************************************************
For analysing datasets, it’s easier to keep only the relevant variables. The function data_rm() allows you to remove unnecessary variables, making the dataset more manageable for analysis.
data_rm(rent99, c(2,9)) -> da
dim(rent99)
[1] 3082 9
dim(da)
[1] 3082 7
The output of the function is a new data frame. Note that this could also be done using the function select() of the package dplyr, or our own data_select() function.
Occasionally, when we read data from files in R, some extra variables are accidentally produced with no values but NA’s. The function data_rmNAvars() removes those variables from the data.
This function searches for variables with only a single distinct value (often factors left over from a previous subset() operation) and removes them from the dataset.
Searches for potential pairwise interactions between variables in the dataset to identify relationships or dependencies that may be useful for modelling
Plots the response variable against various power transformations of the continuous x-variables to explore potential relationships and model suitability
Next we give examples of the graphical functions.
data_plot()
The function data_plot() plots all the variables of the data individually. It plots continuous variables as histograms with a density plot superimposed, see the plots for rent and yearc. As an alternative, dot plots can be requested; see the example in the section on data_response(). For integers the function plots needle plots, see area below, and for categorical variables it plots bar plots, see location, bath, kitchen and cheating below.
da |> data_plot()
100 % of data are saved,
that is, 3082 observations.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
The function data_response() produces four different plots for a continuous response variable: i) a histogram (with the density function); ii) a dot plot of the response on the original scale; iii) a histogram (with the density function) of the response on the z-score scale; and iv) a dot plot on the z-score scale.
The z-score scale is defined by fitting a distribution to the variable (normal or SHASH) and then taking the residuals, see also the next section. The dot plots are good at identifying highly skewed variables and unusual observations. They display the median and interquartile range of the data. The y-axis of a dot plot is a randomised uniform variable (therefore the plot could look slightly different each time).
da |> data_response(response=rent)
100 % of data are saved,
that is, 3082 observations.
the class of the response is numeric is this correct?
a continuous distribution on (0,inf) could be used
One could fit any four-parameter (GAMLSS) distribution, defined on \(-\infty\) to \(\infty\), to any continuous variable where skewness and kurtosis are suspected, and take the quantile residuals (z-scores) as the transformed values of the variable. The function y_zscores() performs just this. It takes a continuous variable, fits a continuous (four-parameter SHASHo) distribution and returns the z-scores. The fitted distribution is specified by the user, but the default is SHASHo. The function data_zscores() takes all continuous variables and plots their z-scores. The methodology is also used to identify outliers in data_outliers().
da |> data_zscores()
100 % of data are saved,
that is, 3082 observations.
In order to see how the z-scores are calculated, consider the function y_zscores(), which takes an individual variable’s z-scores:
z <- y_zscores(rent99$rent, plot=FALSE)
The function is equivalent to fitting a constant model to all the parameters of a given distribution and then taking the quantile residuals (or z-scores) as the variable of interest. The default distribution is the four-parameter SHASHo distribution.
library(gamlss2)
m1 <- gamlssML(rent99$rent, family=SHASHo) # fitting a 4 parameter distribution
cbind(z, resid(m1))[1:5,]                  # and taking the residuals
The function data_xyplot() plots the response variable against each of the independent explanatory variables. It plots continuous against continuous as scatter plots and continuous against categorical variables as box plots.
Note
At the moment there is no provision for categorical response variables.
da |> data_xyplot(response=rent)
100 % of data are saved,
that is, 3082 observations.
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The output of the function saves the ggplot2 figures.
Note that the package gamlss.prepdata does not provide pairwise plots of the explanatory variables themselves, but the package GGally does. Here is an example:
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
dac <- gamlss.prepdata:::data_only_continuous(da)
ggpairs(dac,
        lower = list(continuous = "cor", combo = "box_no_facet", discrete = "count", na = "na"),
        upper = list(continuous = "points", combo = "box_no_facet", discrete = "count", na = "na")) +
  theme_bw(base_size = 15)
The function data_cor() takes a data.frame object and plots the correlation coefficients of all its continuous variables.
data_cor(da, lab=TRUE)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
Warning in data_cor(da, lab = TRUE):
A different type of plot can be produced if we use:
data_cor(da, method="circle", circle.size =40)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
Warning in data_cor(da, method = "circle", circle.size = 40):
To get the variables with correlation values higher than, say, \(0.4\), use:
Tabcor <- data_cor(da, plot=FALSE)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
Warning in data_cor(da, plot = FALSE):
high_val(Tabcor, val=0.4)
name1 name2 corrs
[1,] "rent" "area" "0.585"
We can plot the network of those correlations using the package corrr:
library(corrr)
network_plot(Tabcor)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the corrr package.
Please report the issue at <https://github.com/tidymodels/corrr/issues>.
Note that there are two functions associated with data_cor():
cor_perm_test(): this function tests the null hypothesis that the correlation coefficient of two continuous variables is zero;
cor_boot(): this function finds the bootstrap distribution of the correlation coefficient of two continuous variables. This bootstrap distribution can then be used to assess whether the correlation coefficient could be zero.
correlation between da$area and da$rent
observed: 0.591
sample mean: 0.593
bias: -0.00246
sample SD: 0.0134
sample size: 1000
plot(bb1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Note that the plot has the original value of the correlation marked as Observed by a red vertical line and the average of the bootstrap simulations marked as Sample mean by a blue vertical line. The difference between those two values can be considered an estimate of the bias.
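For intuition, the bootstrap distribution that cor_boot() reports can be sketched in a few lines of base R; the exact arguments of cor_boot() may differ, so this is only an illustration of the idea:
set.seed(123)
B <- 1000
boot_cor <- replicate(B, {
  i <- sample(nrow(da), replace = TRUE)        # re-sample the rows with replacement
  cor(da$area[i], da$rent[i])
})
c(observed = cor(da$area, da$rent),            # the "Observed" value
  mean = mean(boot_cor), sd = sd(boot_cor))    # the "Sample mean" and sample SD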
data_mcor()
The function data_mcor() takes a data.frame object and plots the maximal correlation (non-linear) coefficients of all its continuous variables.
The function data_association() takes a data.frame object and plots the pairwise association of all its variables. The pairwise association of two continuous variables is given by default by the absolute value of Spearman’s correlation coefficient; of two categorical variables by the (bias-adjusted) Cramér’s V coefficient; and of a continuous against a categorical variable by the \(\sqrt{R^2}\) obtained by regressing the continuous variable against the categorical variable.
data_association(da, lab=TRUE)
100 % of data are saved,
that is, 3082 observations.
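For intuition, the three association measures described above can be computed by hand for single pairs of variables; this is only a base-R sketch (the package, for instance, uses a bias-adjusted version of Cramér’s V):
# two continuous variables: absolute Spearman correlation
abs(cor(da$area, da$rent, method = "spearman"))
# continuous against categorical: sqrt(R^2) from a regression on the factor
sqrt(summary(lm(area ~ factor(location), data = da))$r.squared)
# two categorical variables: (unadjusted) Cramer's V from the chi-squared statistic
tab  <- table(factor(da$bath), factor(da$kitchen))
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))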
A different type of plot can be produced if we use:
This function is new and its theoretical foundations are not proven yet. The function needs testing and therefore should be used with caution.
The idea behind the function void() and its data.frame equivalent data_void() is to identify whether the data in the direction of two continuous variables, say \(x_i\) and \(x_j\), have a lot of empty spaces. The reason is that empty spaces affect prediction, since interpolation in empty regions is dangerous. The function data_void() takes a data.frame object and plots the percentage of empty space for all pairs of continuous variables. The function uses the foreach() function of the package foreach to allow parallel processing.
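The "empty space" idea can be sketched in base R by binning the two variables on a grid and counting the empty cells; data_void() is more elaborate, so this is only an illustration of the concept:
void_fraction <- function(x, y, bins = 10) {
  tab <- table(cut(x, bins), cut(y, bins))   # bins x bins grid of counts
  mean(tab == 0)                             # proportion of empty cells
}
void_fraction(da$area, da$yearc)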
registerDoParallel(cores = 9)
data_void(da)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
Warning in data_void(da):
A different type of plot can be produced if we use:
data_void(da, method="circle", circle.size =40)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
Warning in data_void(da, method = "circle", circle.size = 40):
stopImplicitCluster()
To get the variables with values higher than \(0.4\), use:
Tabvoid <- data_void(da, plot=FALSE)
100 % of data are saved,
that is, 3082 observations.
4 factors have been omited from the data
The function data_inter() takes a data.frame, fits all pairwise interactions of the explanatory variables against the response (using a normal model) and produces a graph displaying their significance levels. The idea behind this is to identify possible first order interactions at an early stage of the analysis.
da |> gamlss.prepdata:::data_inter(response= rent)
100 % of data are saved,
that is, 3082 observations.
100 % of data are saved,
that is, 3082 observations.
tinter
area yearc location bath kitchen cheating
area NA 0.014 0.001 0.119 0.000 0.000
yearc NA NA 0.000 0.027 0.154 0.226
location NA NA NA 0.000 0.000 0.578
bath NA NA NA NA 0.868 0.989
kitchen NA NA NA NA NA 0.719
cheating NA NA NA NA NA NA
To get the variable pairs with interaction significance levels lower than \(0.05\), use:
The function data_leverage() uses linear model methodology to identify possible unusual observations within all the explanatory variables taken as a group (not individually). It fits a linear (normal) model with the response of the data as the response and all the explanatory variables in the data as x’s. It then calculates the leverage values and plots them. A leverage is a number between zero and one. Large leverage values correspond to more extreme observations across all x’s.
rent99[, -c(2,9)] |> data_leverage(response=rent)
100 % of data are saved,
that is, 3082 observations.
Note
The horizontal line in the plot is at the point \(2 \times (r/n)\), which is the threshold suggested in the literature; values beyond this point could be identified as extreme. It appears that the point \(2 \times (r/n)\) is too low (at least for our data). Instead, the plot identifies only observations which are in the upper one per cent quantile, that is, leverage values above the 0.99 quantile.
Section 6.1.7 shows how the information from the function data_leverage() can be combined with the information given by data_outliers() in order to confirm high-leverage outliers in the data.
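As a reference point, the leverage values that data_leverage() works with can also be obtained directly from a linear model fit in base R; the sketch below, assuming the same variables as above, also shows the \(2 \times (r/n)\) threshold and the 0.99 quantile rule:
m <- lm(rent ~ ., data = rent99[, -c(2, 9)])   # normal linear model on all x's
h <- hatvalues(m)                              # leverages, each between 0 and 1
r <- length(coef(m)); n <- nrow(rent99)
2 * r / n                                      # the textbook threshold
head(which(h > quantile(h, 0.99)))             # observations in the top 1% of leverage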
The essence of this plot is to visually decide whether certain values of \(\lambda\) in a power transformation are appropriate or not. Section 6.4 describes the power transformation, \(T=X^\lambda\) for \(\lambda>0\), in more detail. For each continuous explanatory variable \(X\), the function shows the response against \(X\) (\(\lambda=1\)), the response against the square root of \(X\) (\(\lambda=1/2\)) and the response against the log of \(X\) (\(\lambda \rightarrow 0\)). The user can then decide whether any of those transformations are appropriate. Note that the square root and log transformations are appropriate for right-skewed variables.
gamlss.prepdata:::data_Ptrans_plot(da, rent)
100 % of data are saved,
that is, 3082 observations.
It looks like no transformation is needed for the continuous explanatory variables area or yearc.
Identifies outliers (using z-scores) by fitting the chosen family several times, each time eliminating observations identified as outliers in the previous fits.
Outliers in the response variable, within a distributional regression framework like GAMLSS, are better handled by selecting an appropriate distribution for the response. For example, an outlier under the assumption of a normal distribution for the response may no longer be considered an outlier if a distribution from the t-family is assumed. Contaminated response data are not considered here. The function data_response(), discussed in Section 5.0.3, uses the z-scores methodology and provides some guidelines for identifying outliers in the response. This section focuses on outliers in the explanatory variables.
Outliers in the continuous explanatory variables can potentially affect curve fitting, whether using linear or smooth non-parametric terms. In such cases, removing outliers from the x-variables can make the model more robust. Outliers are observations that deviate significantly from the rest of the data, but the concept also depends on the dimension being considered. A single observation might be an outlier on one explanatory variable. A pair of observations could be an outlier when examined within a pairwise relationship. Identification of extreme observations on continuous variables is based on how far the observation lies from the majority of the data. Pairwise plots of continuous variables can identify outliers in two dimensions. Leverage points are useful to identify outliers in \(r\) dimensions, where \(r\) is the number of explanatory variables.
Note
At this preliminary stage of the analysis (no model has been fitted yet), it is difficult to identify outliers among factors. This is because outliers in factors typically do not appear unusual unless they are examined in combination with the rest of the factors or variables. Identifying factor outliers is a task better suited to the modelling stage of the analysis.
The function y_outliers() in Section 6.1.1 is applied to continuous variables only, and it uses two different methodologies (option type):
the z-scores, type="zscorses" (the default) and
the quantile rule, type="quantile".
Both methodologies try to remove extreme skewness in the variables before looking for outliers by using the (internal) function y_Ptrans(). The function is applied to any variable with positive values. The assumption is that positive variables are more likely to be right-skewed and therefore a power transformation will correct that. The function y_Ptrans() attempts to find the power \(\lambda\), of a transformation \(T = X^\lambda\), that minimizes the Jarque-Bera test statistic, a measure of deviation from normality. The Jarque-Bera test statistic is always non-negative, and a value away from zero indicates strong evidence against normality. The goal of this methodology is to reduce extreme skewness in the data before attempting to identify the outliers. If the data are not very skewed, the power transformation will generally do no harm.
Note
At the moment we allow the power parameter of the transformation, \(\lambda\), to vary in the range \(0\) to \(1.5\). This range covers only right-skewed behaviour. More work has to be done to establish a suitable range.
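A minimal sketch of this idea in base R, choosing \(\lambda\) in the range \(0\) to \(1.5\) by minimizing a Jarque-Bera-type statistic; y_Ptrans() does something of this nature, but its details may differ:
jb_stat <- function(x) {                       # Jarque-Bera type statistic
  z <- (x - mean(x)) / sd(x); n <- length(x)
  S <- mean(z^3); K <- mean(z^4)               # sample skewness and kurtosis
  n / 6 * (S^2 + (K - 3)^2 / 4)
}
optimize(function(l) jb_stat(rent99$rent^l), interval = c(0.001, 1.5))$minimum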
Now, given that the positive variables are transformed (using the power transformation) and that all the remaining variables, containing negative values, remain untransformed, the z-scores methodology fits a parametric distribution to the variable and computes the residuals of the fitted model (z-scores). The default distribution is the four-parameter SHASHo distribution. The z-scores methodology identifies an outlier if a specific residual (z-score) has an absolute value bigger than a predetermined value, say \(K\). The quantile rule methodology works similarly: it identifies an observation if it is further than \(K \times MAD\) away from the median. The value of \(K\) is set by the option value of the function y_outliers(). If the value is missing, it is calculated using the formula \(\Phi^{-1}(1/(10 \times N))\), where \(N\) is the number of observations. The two methodologies are appropriate for identifying outliers in one dimension. To identify outliers in \(p\) dimensions the function data_leverage(), see Section 6.1.7, should be used.
Note
Several of the graphical functions described in Section 5 are good for identifying outliers visually. In fact we recommend that the function data_outliers() and its related functions data_outliers_both(), data_outliers_by() and data_outliers_z() be used in combination with functions like data_plot(), data_zscores() or y_dots() of the package gamlss.ggplots.
y_outliers()
The function y_outliers() identifies outliers in a single \(x\) variable using the two methodologies described above:
zscores: where a distribution is fitted to the data (default SHASHo), the residuals (z-scores) are taken, and observations with large residuals are identified as outliers.
quantile: this is more in line with the classical methodology based on sample quantiles. An observation is identified if it is far from the median; how far is usually taken as \(K\) times the Mean Absolute Deviation (MAD).
Here is an example using the zscores method:
y_outliers(da$rent)
the x was tansformed using the power 0.2412793
named integer(0)
and here using the quantile method:
y_outliers(da$rent, type="quantile")
the x was tansformed using the power 0.2412793
[1] 38 426
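For intuition, the quantile rule on an untransformed variable can be written directly in base R (here with an arbitrary cut-off K = 4; the package transforms the variable first and chooses K from the sample size, so the observations flagged will differ):
x <- da$rent
K <- 4                                          # illustrative cut-off only
head(which(abs(x - median(x)) > K * mad(x)))    # mad() is R's (scaled) median absolute deviation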
Note that the function y_outliers() is used by data_outliers() to identify outliers in all continuous variables in the data. There are several variations of the basic function y_outliers(), described next.
The function y_outliers_both() uses both the zscores and quantile methodologies and reports the observations appearing in the intersection of both sets, method="intersection" (the default), or in the union of both sets, method="union".
y_outliers_both(da$rent, method="union")
the x was tansformed using the power 0.2412793
the x was tansformed using the power 0.2412793
The function y_outliers_by() splits the data according to a given factor (option by) before it performs the detection of outliers. It is useful if the given factor splits the data in a meaningful way. A continuous variable can be split in advance to create a factor using the R function cut().
The function y_outliers_loop() tries to identify more observations than the y_outliers() function by using the following extra loop: it starts by fitting the distribution to the data, which helps to identify from the z-scores certain observations as outliers. Those outliers are then removed and the process starts again. It stops when no more observations are removed. This loop allows more observations to be identified and is designed to help when a bunch of data points masks other points from being identified as outliers.
The function y_outliers_z() tries to put the functions y_outliers_loop() and y_outliers_by() together. It works only for the z-scores methodology, but its options allow different combinations of the algorithm. For example, the options:
by: allows fitting by levels of a factor;
value: allows different values in the detection of z-score outliers;
family: allows different families of distribution to be fitted (instead of the SHASHo);
transform: allows switching off the transformation option;
loop: allows looping of the algorithm, TRUE or FALSE.
The function data_outliers() uses the z-scores technique described above in order to detect outliers in the continuous variables in the data. It fits a SHASHo distribution to each continuous variable in the data and uses the z-scores (quantile residuals) to identify (one-dimensional) outliers for those continuous variables.
data_outliers(da)
the x was tansformed using the power 0.2412793
the x was tansformed using the power 0.3872234
the x was tansformed using the power 1.499942
$rent
named integer(0)
$area
named integer(0)
$yearc
named integer(0)
As it happens, no individual-variable outliers were highlighted in the rent data using the z-scores methodology. In general we recommend that the function data_outliers() be used in combination with the graphical functions data_plot() and data_zscores().
The function data_leverage(), first described in Section 5.0.14, can be used to detect extremes in the continuous variables in the data by fitting a linear model with all the continuous variables in the data and then using the high leverage points to identify (multi-dimensional) outliers across all continuous variables. By default, it identifies the one percent of observations with the highest leverage in the data. Here we use the function as a way to identify the observations (not to plot them).
data_leverage(da, response=rent, plot=FALSE)
100 % of data are saved,
that is, 3082 observations.
The function data_leverage() always identifies the top one percent of observations with the highest leverage, which may or may not be outliers. These high-leverage points can have a strong influence on model estimates. To better assess their impact, the list of high-leverage observations returned by data_leverage() should be compared with the list of outliers identified by the data_outliers() function. Observations that appear in both lists are particularly noteworthy, as they may be influential outliers that warrant further investigation.
Next we try to identify whether observations with high leverage also coincide with outliers identified using the data_outliers() function. The function intersect() returns the intersection of the two sets of observations. First use data_outliers():
ul <- unlist(data_outliers(da))
the x was tansformed using the power 0.2412793
the x was tansformed using the power 0.3872234
the x was tansformed using the power 1.499942
then data_leverage();
ll <- data_leverage(da, response=rent, plot=FALSE)
100 % of data are saved,
that is, 3082 observations.
and finally we inspect the intersection of the two sets using intersect();
intersect(ll,ul)
integer(0)
Here we have zero outliers but in general we would expect to find common observations for further checking.
The idea here is to identify variables with high correlation or association, with the possibility of eliminating them from the data. The first function is appropriate for continuous variables while the second is for both continuous variables and factors. Note that the method used to identify the variables is identical to the method used in the package caret; see page 5 of Kuhn (2008).
data_index_cor()
The function data_index_cor() takes a data.frame as input, calculates the correlation coefficients of the continuous variables in the data, and creates an index of variables which could possibly be eliminated from further analysis.
gamlss.prepdata:::data_index_cor(rent99[,-1])
4 factors have been omited from the data
Warning in data_cor(data, type = type, plot = FALSE, percentage = percentage, :
The function data_index_association() takes a data.frame as input, calculates the association coefficients of all the variables in the data, and creates an index of variables which could possibly be eliminated from further analysis.
Scaling is another form of transformation for the continuous explanatory variables in the data which brings the explanatory variables to the same scale. It is a form of standardization. Standardization means bringing all continuous explanatory variables to a similar range of values. For some machine learning techniques, e.g. principal component regression or neural networks, standardization is mandatory; in others, like LASSO, it is recommended. There are two types of standardisation:
scaling to normality and
scaling to a range from zero to one.
Scaling to normality means that the variable should have mean zero and standard deviation one. Note that this is equivalent to fitting a Normal distribution to the variable and then taking the z-scores (residuals) of the fit as the transformed variable. The problem with this type of scaling is that skewness and kurtosis in the standardised data persist. One could go further and fit a four-parameter distribution instead of a normal, where the two extra parameters could account for skewness and kurtosis. The function data_scale(), described in Section 6.3.1, gives this option with the argument family, which can be used to specify a different distribution.
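The equivalence between scaling to normality and the usual z-standardisation is easy to verify in base R:
x <- rent99$area
z <- (x - mean(x)) / sd(x)              # the same as scale(x)[, 1]
round(c(mean = mean(z), sd = sd(z)), 3) # mean 0 and standard deviation 1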
data_scale()
The function data_scale() performs scaling of all continuous variables in a data set. Note that factors in the data set are left untouched, because when fitted within a model they will be transformed to dummy variables which take values 0 or 1 (a kind of standardization). The response is also left untouched, because it is assumed that an appropriate distribution will be fitted to it. The response variable has to be declared using the argument response. First we use data_scale() to scale to normality.
head(data_scale(da, response=rent))
area yearc location bath kitchen cheating
132 68 3 2 2 2
100 % of data are saved,
that is, 3082 observations.
As we mentioned before, scaling to normality is equivalent to fitting a Normal distribution first. The problem with scaling to normality is that if the variables are highly skewed or kurtotic, scaling them to normality does not correct for skewness or kurtosis. Using a parametric distribution with four parameters, some of which are skewness and kurtosis parameters, may correct that. Next we standardise using the SHASHo distribution:
In the context of outliers, the focus is on how far a single observation deviates from the rest of the data. However, when considering transformations, the goal shifts toward better capturing the relationship between a continuous predictor and the response variable. Specifically, we aim to model peaks and troughs in the relationship between a single x-variable and the response better. To achieve this, it is sometimes sufficient to stretch the x-axis so that the predictor variable is more evenly distributed within its range. A common approach is to use a power transformation of the form \(T = X^\lambda\). This transformation can reduce the curvature in the response and make the curve fitting easier. Importantly, the power transformation smoothly transitions into a logarithmic transformation as \(\lambda \to 0\), since \(\frac{X^{\lambda} - 1} {\lambda} \rightarrow \log(X) \quad \text{as } \lambda \to 0\). Thus, in the limit as \(\lambda \to 0\), the power transformation becomes the log transformation \(T = \log(X)\).
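The limiting behaviour is easy to check numerically: for a small \(\lambda\) the (scaled) power transformation is already almost identical to the log:
x <- 2.5; lam <- 1e-4
c(power = (x^lam - 1) / lam, log = log(x))   # both approximately 0.9163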
Note
The power transformation is a subclass of the shifted power transformation \(T = (\alpha + X)^\lambda\), which is not considered here.
The aim of the power transformation is to make the model fitting process more robust and reliable, maximizing the information extracted from the data. There are three different functions in gamlss.prepdata dealing with power transformations: xy_Ptrans(), data_Ptrans() and data_Ptrans_plot().
Note
It’s important to distinguish between the concepts of transformations (the subject of this section) and feature extraction, a term often used in machine learning. We refer to transformations as functions applied to individual explanatory variables, while we use feature extraction when multiple explanatory variables are involved. In both cases, the goal is to enhance the model’s capabilities.
Time series data sets consist of observations collected at consistent time intervals—e.g., daily, weekly, or monthly. These data require special treatment because observations that are closer together in time tend to be more similar, violating the usual assumption of independence between observations. A similar issue arises in spatial data sets, where observations that are geographically closer often exhibit spatial correlation, again violating independence.
Many data sets include spatial or temporal features, such as longitude and latitude or timestamps (e.g., 12/10/2012). Those variables can be used for modeling directly, but also for interpreting or visualizing the results. If temporal or spatial information exists in the data, it is essential to ensure these features are correctly read and interpreted in R. For instance, dates in R can be handled using the as.Date() function. To understand its functionality and the options associated with the function, please consult the help documentation via help("as.Date"). Below, we provide two simple examples of how to work with date-time variables.
time_dt2dh()
Suppose the variable dt contains both a date and a time. We may want to extract and separate this into two components: one containing the date, and another capturing the hour of the day.
The second example contains times given in their common format, c("12:35", "10:50"), and we would like to change them to numeric values which we can use in the model.
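Base R already covers both tasks; the sketch below (with made-up example values, not the package's own helpers) extracts a date and an hour from a date-time, and turns "HH:MM" strings into numeric hours:
dt <- as.POSIXct(c("2012-10-12 14:30", "2012-10-13 09:05"))
data.frame(date = as.Date(dt), hour = as.integer(format(dt, "%H")))
tt <- c("12:35", "10:50")
sapply(strsplit(tt, ":"),
       function(z) as.numeric(z[1]) + as.numeric(z[2]) / 60)   # 12.58 and 10.83 hours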
Partitioning the data is essential for comparing different models and checking for overfitting. Overfitting occurs when a model fails to generalize well to new data, typically because it is too closely fitted to the training data. In contrast, underfitting refers to a poor model that fails to capture the essential patterns in both the fitted and the new data, resulting in a model that is far from the population it is intended to represent. It is common to refer to the data used for fitting the model as training data and to the new data as test data.
Figure 1 shows different ways the data can be partitioned. A dataset can be split once, into holdout samples, or split multiple times using cross validation (CV) or the bootstrap. For large datasets, it is common to split the data into a training set and two holdout samples, the test set and, if needed, the validation set. The training set is used for fitting the models, the test set is used to evaluate the predictive power and also to compare models, and the validation set is used for tuning the hyperparameters. [Note that for additive smoothing regression models, the hyperparameters are the smoothing parameters.] Observations in the training set are often referred to as the in-bag data, while observations in the test or validation set are termed the out-of-bag data.
Partitioning the data into training, test, and validation sets should be done randomly. This brings us to two possible problems which can occur when splitting the data: the zero-variance-variable problem and the extreme-values problem. The zero-variance (or nearly zero-variance) problem occurs when a variable in the data (it could be a factor) has only a few distinct values and the frequency of one of them is very low. For example, if we have the factor gender with mostly female values and only a few male, and we randomly partition the data, there is a good chance that all males could end up in only one of the partitions. That is, one of the partitions would have the factor gender properly with two levels, while the other partition would have only one level. This would make the fitting and interpretation of the model difficult. One should check whether this is the case. The extreme-values problem is similar in nature but affects overfitting. Extreme values can occur in both the explanatory and the response variables. The first case was dealt with as outliers in Section 6.1; here we concentrate on extreme values in the response. Extreme values in the response (if they are genuine) are of great importance for risk analysis. There is a danger that, when splitting the data, such events can go to one partition and not to the other. Any evaluation of risk in this case could be potentially wrong. One possible solution is repeated partitioning, i.e. the bootstrap.
The bootstrap (middle of Figure 1) is a repeated-partition method. In the classical, non-parametric bootstrap, the index of the data, i.e. \(1,2,\ldots,n\), is sampled with replacement \(B\) times and the model is refitted on each re-sampled data set. In the Bayesian bootstrap the data are weighted using weighted samples from a multinomial distribution, and \(B\) re-weighted models are refitted. While both classical and Bayesian bootstraps are computationally expensive, they provide additional insight into the model and in particular into the variability of all of its parameters. A sample of length \(B\) is obtained for all the parameters of interest.
flowchart TB
A[Data] --> B{Holdout}
A --> C{K-fold-CV}
A --> D{Bootstrap}
B --> E[Training]
E --> F[Validate]
F --> G[Test]
D --> H[Non-parametric]
D --> K[Bayesian]
C --> L[LOO]
Figure 1: Different ways of splitting data in order to get more information.
In \(K\)-fold cross-validation (right part of Figure 1), the data are split into \(K\) sub-samples. The model is then fitted \(K\) times, each time using \(K-1\) sub-samples for training and the remaining sub-sample as the test data set. By the end of the process, all \(n\) observations will have been used as test data exactly once. The main advantage of \(K\)-fold cross-validation is that it provides reliable test data for model evaluation (at the cost of fitting \(K\) models). Leave-One-Out cross-validation (LOO) is a special case of \(K\)-fold cross-validation, where the model is trained on \(n-1\) observations and tested on the remaining single observation, iterating this process \(n\) times. Consequently, \(n\) different models are refitted.
Note
Repeated partitions (i.e. bootstrapping and \(K\)-fold cross validation) can mitigate some of the problems associated with a single partition.
Data partitioning helps in the selection of models, model diagnostics and the detection of overfitting. Here are some rules of thumb applied to distributional regression models:
if the residuals from the training and the test data sets behave well, one can be confident that the distributional assumption is adequate;
if the fitted model performs reasonably well (using a measure of goodness-of-fit) on both the training and the test data sets, one can be confident that overfitting has been avoided.
All methods in Figure 1 need performance metrics. Those metrics can: i) facilitate model comparison, ii) detect overfitting, and iii) select hyperparameters.
Various goodness-of-fit measures can be used to check and compare different distributional regression models. One of the most popular measures, which does not require partitioning of the data, is the Generalized Akaike Information Criterion (GAIC), see Akaike (1983). To calculate the GAIC, only the in-bag data are required. However, calculating the GAIC requires an accurate measure of the degrees of freedom used to fit the model. For mathematical (stochastic) models, the degrees of freedom are straightforward to define, as they correspond to the number of estimated parameters. In contrast, for over-parametrised algorithmic models like neural networks, estimating the degrees of freedom is challenging. Note that the degrees of freedom serve as a measure of the model’s complexity. We expect more complex models to be closer to the training data, which may reduce their ability to generalize well to new data (overfitting).
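As a minimal illustration (our own example, not part of gamlss.prepdata), two constant-only gamlss() fits to the rent response can be compared with GAIC(); setting k = log(n) gives a BIC-type penalty:
library(gamlss)
m_no <- gamlss(rent ~ 1, family = NO, data = rent99)   # normal response
m_ga <- gamlss(rent ~ 1, family = GA, data = rent99)   # gamma response
GAIC(m_no, m_ga, k = log(nrow(rent99)))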
Note
Note that the most common goodness-of-fit measures used in machine learning models requiring data partitioning are not appropriate for distributional regression models like GAMLSS. This is because those measures concentrate on the behaviour at the centre of the distribution of the response, which may or may not be the focus of the study.
The most common measures of goodness-of-fit used in machine learning models with partitioning are;
RMSE: root mean squared error;
MAE: mean absolute error;
\(R^2\): and
MAPE: mean absolute percentage error.
All those measures of goodness-of-fit are fine if someone is interested in how the middle part of the distribution of the response behaves, but they are not good if the interest is in the tails or if the distribution is very skewed. For distributional regression models the following are more appropriate measures of goodness-of-fit:
the log-likelihood, i.e. \(\ell=\log L\), or equivalently the deviance \(d=-2 \ell\); and
CRPS: the continuous ranked probability score.
Note that in general a goodness-of-fit measure can be applied to both the in-bag and the out-of-bag data sets.
Computationally, partitioning of the data for any distributional regression model can be achieved in different ways:
by splitting the original data frame in different subsets;
by using a factor which classifies the partitions of data sets;
by indexing the original data, i.e. data[index,],
by using prior weights in the fitting.
So while we would like to think of data partitioning as cutting the original data into different subsets (case 1 above), there is no inherent need to physically partition the data into subgroups when fitting the models, since the same results can be achieved with the other methods. For example, to use prior weights in a \(K\)-fold cross-validation, we need \(K\) dummy vectors \(i_k\). Each dummy vector takes the value 1 for the \(k\)-th set of observations and 0 for the rest.
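A minimal base-R sketch of these dummy vectors (the packaged data_Kfold_weights() is the convenient way to do this; here K and the random fold assignment are our own choices):
set.seed(1)
n <- nrow(rent99); K <- 5
fold <- sample(rep(1:K, length.out = n))            # random fold membership
W <- sapply(1:K, function(k) as.numeric(fold == k)) # column k is 1 for the k-th set, 0 otherwise
dim(W)                                              # an n x K matrix of dummy vectors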
The function data_part() creates a factor for holdout or K-fold CV partitions. The function data_part_list() creates a list with different data.frames for holdout or K-fold CV partitions. The functions data_boot_index() and data_Kfold_index() create appropriate vectors for indexing, while data_boot_weights() and data_Kfold_weights() generate the appropriate vectors of prior weights.
This is not a partition function, but it randomly selects a specified proportion of the data.
Here are the data partition functions.
data_part()
The function data_part() does not partition the data as such, but adds a factor called partition to the data indicating the partition. By default the data are partitioned into two sets: the training data, train, and the test data, test.
Note that the option partition has the following behaviour:
partition=2L (the default): the factor has two levels, train and test.
partition=3L: the factor has three levels, train, val (for validation) and test.
partition > 3L, say K: the levels are 1, 2, …, K. The factor can then be used for K-fold cross validation; see also the function data_Kfold_CV().
The function data_part_list() creates a list of data frames which can be used either for a single partition or for cross validation. The argument partition allows the splitting of the data into up to 20 subsets. The default is a list of 2 with elements train and test (a single partition).
Here is a single partition in train and test data sets.
allda <- data_part_list(rent)
length(allda)
[1] 2
dim(allda[["train"]]) # training data
[1] 1200 9
dim(allda[["test"]]) # test data
[1] 769 9
Here is a multiple partition for cross validation.
The function data_boot_index() creates two lists, the first called IB (for in-bag) and the second OOB (for out-of-bag). Each element of the list IB can be used to select a bootstrap sample from the original data.frame, while each element of OOB can be used to select the out-of-sample data points. Note that each bootstrap sample contains approximately 2/3 of the original data, so we expect on average about 1/3 of the data to be in the OOB sample.
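One in-bag/out-of-bag pair can be sketched in base R as follows (data_boot_index() returns B of these in the IB and OOB lists):
set.seed(1)
n   <- nrow(rent99)
ib  <- sample(n, replace = TRUE)      # in-bag indices, with repetitions
oob <- setdiff(seq_len(n), ib)        # out-of-bag indices
length(unique(ib)) / n                # roughly 2/3 of observations appear in-bag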
The function data_boot_weights() creates an \(n \times B\) matrix whose columns are possible weights for bootstrap fits. Note that each bootstrap sample contains approximately 0.666 of the original observations, so in general we expect about 0.333 of the data to have zero weights.
MM <- data_boot_weights(rent, B=10)
MM is a matrix whose columns can be used as weights in bootstrap fits. Here is a plot of the first column.
The function data_Kfold() creates a matrix whose columns can be used for cross validation, either as indices to select data for fitting or as prior weights.
The function data_cut() is not included in this section for its data-partitioning properties. It is designed to select a random subset of the data, specifically for plotting purposes. This is especially useful when working with large datasets, where plotting routines, such as those from the ggplot2 package, can become very slow.
The data_cut() function can either:
automatically reduce the data size based on the total number of observations, or
subset the data according to a user-specified percentage.
Here is an example of the function, assuming that the user requires \(50\%\) of the data:
da1 <- data_cut(rent99, percentage=.5)
50 % of data are saved,
that is, 1541 observations.
dim(rent99)
[1] 3082 9
dim(da1)
[1] 1541 9
The function data_cut() is used extensively in many plotting routines within the gamlss.ggplots package. When the percentage option is not specified, a default rule is applied to determine the proportion of the data used for plotting. This approach balances performance and visualization quality when working with large datasets.
Let \(n\) denote the total number of observations. Then:
if \(n \le 50,000\): All data are used for plotting.
if \(50,000 < n \le 100,000\): 50% of the data are used.
If \(100,000 < n \le 1,000,000\): 20% of the data are used.
If \(n>1,000,000\) : 10% of the data are used.
This default behaviour ensures that plotting remains efficient even with very large datasets.
All functions listed below belong to the gamlss.ggplots package, not the gamlss.prepdata package. However, since they can be useful during the pre-modeling stage, they are presented here. The latest version of gamlss.ggplots can be found at https://github.com/gamlss-dev/gamlss.ggplots.
In the following function, no fitted model is required—only values for the parameters need to be specified.
family_pdf()
The function family_pdf() plots individual probability density functions (PDFs) from distributions in the gamlss.family package. Although it typically requires the family argument, it defaults to the Normal distribution (NO) if none is provided.
family_pdf(from=-5,to=5, mu=0, sigma=c(.5,1,2))
Figure 2: Continuous response example: plotting the pdf of a normal random variable.
The following demonstrates how discrete distributions are displayed. We begin with a distribution that can take infinitely many count values, the Negative Binomial distribution.
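For reference, the same curve can also be drawn directly from the d-function of the gamlss.dist package; this is our own sketch, and the chosen mu and sigma values are arbitrary:
library(gamlss.dist)
x <- 0:15
plot(x, dNBI(x, mu = 1, sigma = 2), type = "h",
     xlab = "y", ylab = "pdf")        # Negative Binomial type I probabilities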
Figure 4: Beta binomial response example: plotting the pdf of a beta binomial.
family_cdf()
The function family_cdf() plots cumulative distribution functions (CDFs) from the gamlss.family distributions. The primary argument required is the family specifying the distribution to be used.
Figure 6: Count response example: plotting the cdf of a beta binomial.
family_cor()
The function family_cor() offers a basic method for examining the inter-correlation among the parameters of any distribution from the gamlss.family. It performs the following steps:
Generates 10,000 random values from the specified distribution.
Fits the same distribution to the generated data.
Extracts and plots the correlation coefficients of the distributional parameters.
These correlation coefficients are derived from the variance-covariance matrix of the fitted model.
Warning
This method provides only a rough indication of how the parameters are correlated at specific values of the distribution’s parameters. The correlation structure may vary significantly at different points in the parameter space, as the distribution can behave quite differently depending on those values.
Figure 7: Family correlation of a BCTo distribution at specified values of the parameters.
References
Akaike, H. 1983. “Information Measures and Model Selection.” Bulletin of the International Statistical Institute 50 (1): 277–90.
Kuhn, Max. 2008. “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software 28 (5): 1–26. https://doi.org/10.18637/jss.v028.i05.
Rigby, R. A., and D. M. Stasinopoulos. 2005. “Generalized Additive Models for Location, Scale and Shape (with Discussion).” Applied Statistics 54: 507–54.
Rigby, R. A., D. M. Stasinopoulos, G. Z. Heller, and F. De Bastiani. 2019. Distributions for Modeling Location, Scale, and Shape: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stasinopoulos, D. M., R. A. Rigby, G. Z. Heller, V. Voudouris, and F. De Bastiani. 2017. Flexible Regression and Smoothing: Using GAMLSS in R. Boca Raton: Chapman & Hall/CRC.
Stasinopoulos, M. D., T. Kneib, N. Klein, A. Mayr, and G. Z Heller. 2024. Generalized Additive Models for Location, Scale and Shape: A Distributional Regression Approach, with Applications. Vol. 56. Cambridge University Press.