Missing data is a common problem when analysing real-world data in many research fields, such as biostatistics, sociology, and economics. Three types of missing data are typically distinguished: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Discarding incomplete observations can lead to serious bias and loss of efficiency, especially when such cases make up a large share of the sample. One popular technique for handling missing data is multiple imputation (MI).
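The three mechanisms named above differ in what the probability of missingness is allowed to depend on. The following is a minimal sketch of that distinction, not the data generator used in this work; the function `make_missing`, the logistic form of the missingness probability, and the toy two-variable data are all assumptions made for illustration.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def make_missing(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows (list of (x, y) pairs) with some y set to None.

    MCAR: every y is dropped with the same probability `rate`.
    MAR:  the drop probability depends only on the always-observed x.
    MNAR: the drop probability depends on the value of y itself.
    For MAR and MNAR, `rate` only sets the baseline of the logistic curve;
    the realised share of missing values is not exactly `rate`.
    """
    rng = random.Random(seed)
    offset = math.log(rate / (1.0 - rate))
    out = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = sigmoid(offset + x)
        elif mechanism == "MNAR":
            p = sigmoid(offset + y)
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append((x, None if rng.random() < p else y))
    return out


# Toy data: y depends linearly on x plus noise (hypothetical example).
_rng = random.Random(1)
rows = []
for _ in range(2000):
    x = _rng.gauss(0, 1)
    rows.append((x, 0.5 * x + _rng.gauss(0, 1)))

for name in ("MCAR", "MAR", "MNAR"):
    masked = make_missing(rows, name)
    share = sum(y is None for _, y in masked) / len(masked)
    print(name, round(share, 3))
```

Under MAR the rows with missing y tend to have larger x, which is visible in the observed data; under MNAR the same selection happens on y and cannot be checked from the observed values alone.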

There are two general approaches to MI. One is joint modelling, which draws the missing values of all incomplete variables simultaneously from a multivariate distribution. The other is fully conditional specification (FCS, also known as MICE), which imputes the incomplete variables one at a time: for each of them, FCS draws from a univariate density conditional on the other variables included in the imputation model.
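The variable-by-variable cycle of FCS can be sketched as follows. This is a deliberately simplified illustration, not one of the methods evaluated here: it assumes only two numeric variables, uses a simple linear regression with added normal noise as the conditional model, and does not draw the regression parameters from their posterior as a proper MI procedure would. The names `fit_line` and `fcs_impute` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual s.d. for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def fcs_impute(data, n_iter=10, seed=0):
    """Return one completed copy of `data` (rows [a, b], None = missing)."""
    rng = random.Random(seed)
    miss = [[v is None for v in row] for row in data]
    filled = [list(row) for row in data]
    # Start by filling each gap with a randomly chosen observed value.
    for j in range(2):
        observed = [row[j] for row in data if row[j] is not None]
        for i, row in enumerate(filled):
            if miss[i][j]:
                row[j] = rng.choice(observed)
    # Cycle through the variables, re-imputing each from the other one.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j
            idx = [i for i in range(len(filled)) if not miss[i][j]]
            a, b, sd = fit_line([filled[i][k] for i in idx],
                                [filled[i][j] for i in idx])
            for i, row in enumerate(filled):
                if miss[i][j]:
                    row[j] = a + b * row[k] + rng.gauss(0, sd)
    return filled


# Toy data with values removed completely at random (hypothetical example).
_rng = random.Random(42)
data = []
for _ in range(200):
    a = _rng.gauss(0, 1)
    b = 2.0 * a + _rng.gauss(0, 0.5)
    u = _rng.random()
    if u < 0.15:
        a = None
    elif u < 0.30:
        b = None
    data.append([a, b])

# Multiple imputation: several completed data sets from different seeds.
imputations = [fcs_impute(data, seed=s) for s in range(5)]
```

Each completed data set would then be analysed separately and the results pooled; that pooling step is omitted from the sketch.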

In this work, we define a computationally efficient numerical simulation framework for data generation and for the evaluation of different imputation methods. We consider several FCS imputation methods alongside traditional ones under scenarios that vary the parameters of the models: the percentage of missingness, the data dimensionality, the combination of categorical and numerical predictors, and the correlation between the covariates. Our results are based on synthetic data generated on an HPC cluster and identify the best-performing imputation methods in each case according to two scoring techniques.
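A simulation design of this kind is often organised as a full grid over the scenario parameters. The sketch below shows one possible way to enumerate such a grid; the `Scenario` class and every level listed are placeholders chosen for illustration and are not the settings used in this study.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    missing_rate: float       # share of values removed
    n_features: int           # data dimensionality
    categorical_share: float  # fraction of predictors that are categorical
    correlation: float        # correlation between covariates


# Placeholder levels only; the actual values are not given in this text.
MISSING_RATES = (0.1, 0.3, 0.5)
N_FEATURES = (5, 20)
CATEGORICAL_SHARES = (0.0, 0.5)
CORRELATIONS = (0.2, 0.8)

scenarios = [
    Scenario(*levels)
    for levels in product(MISSING_RATES, N_FEATURES,
                          CATEGORICAL_SHARES, CORRELATIONS)
]
print(len(scenarios), "scenarios")
```

Because the scenarios are independent of one another, each can be dispatched as a separate job, which is what makes such a design well suited to a cluster.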

Springer Lecture Notes in Computer Science, Volume 13127