An important goal of translational bioinformatics is the construction of classifiers that can identify different sample classes from high-throughput measurements, such as RNA-seq. Genomic data sets are high-dimensional, with tens of thousands of features or more, while the number of tested samples is often restricted to a few hundred or fewer, due to the high per-sample cost or the rarity of the phenotype. This means that the training and testing of classifiers have to be performed on the same data set, where the high variance of the error estimate and its weak correlation with the true error can seriously impair error estimation. This problem underlines the importance of synthetic data generation models that resemble the behaviour of real gene expression data, which allow classification accuracy to be studied and the performance of various classification rules to be compared. In this work, we examine the influence of different parameters of a model generating synthetic data that resembles real RNA-seq data on several classifiers: linear discriminant analysis (LDA), k-nearest neighbours (KNN), support vector machines (SVM), and artificial neural networks (ANN). Classification accuracy is assessed using quantities derived from the classification confusion matrix and bolstered resubstitution error estimation (BRESUB), which is particularly useful when only a small number of samples is available. Our comparative results show that SVM is the most accurate classifier in the majority of the simulated data scenarios considered. However, the simplest classification rule, LDA, achieves similar, and in some cases even better, accuracy when the number of training samples is small. This suggests that when only a small number of real samples is available, LDA can be recommended as a simple and accurate classification rule.
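The abstract does not specify the generative model used for the synthetic data. As a point of reference only, the sketch below simulates two-class RNA-seq counts from a negative binomial distribution, a common choice for modelling RNA-seq data. The function name and all parameter values (fold change, dispersion, number of differentially expressed genes) are illustrative assumptions, not the paper's actual settings.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_counts(n_samples, n_genes, n_diff, fold_change=2.0, dispersion=0.2):
        """Simulate a two-class RNA-seq count matrix (classes of equal size).

        The first `n_diff` genes are differentially expressed: their mean in
        class 1 is multiplied by `fold_change`. Counts follow a negative
        binomial with gene-wise mean mu and dispersion phi, parameterised so
        that variance = mu + phi * mu**2.
        """
        base_mu = rng.lognormal(mean=4.0, sigma=1.0, size=n_genes)  # gene-wise baseline means
        labels = np.repeat([0, 1], n_samples // 2)
        mu = np.tile(base_mu, (n_samples, 1))
        mu[labels == 1, :n_diff] *= fold_change  # up-regulate DE genes in class 1
        n = 1.0 / dispersion                     # numpy uses the (n, p) parameterisation
        p = n / (n + mu)
        return rng.negative_binomial(n, p), labels

    X, y = simulate_counts(n_samples=100, n_genes=5000, n_diff=100)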
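Bolstered resubstitution replaces each training point with a "bolstering" kernel and scores the probability mass that the classifier assigns to the wrong class. The following is a minimal Monte Carlo sketch in the spirit of Braga-Neto and Dougherty's estimator, assuming spherical Gaussian kernels whose width is set from the mean within-class nearest-neighbour distance; the dimension-dependent calibration constant of the original method is omitted for brevity, and the toy data at the end are purely illustrative.

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def bresub_error(clf, X, y, n_mc=100, seed=0):
        """Bolstered resubstitution error of the fitted classifier `clf`.

        Each training point is replaced by `n_mc` Monte Carlo draws from a
        spherical Gaussian centred on it; the estimate is the mean fraction
        of draws the classifier assigns to the wrong class.
        """
        rng = np.random.default_rng(seed)
        n, p = X.shape
        sigma = {}
        for c in np.unique(y):               # one kernel width per class
            D = cdist(X[y == c], X[y == c])
            np.fill_diagonal(D, np.inf)
            sigma[c] = D.min(axis=1).mean()  # mean nearest-neighbour distance
        errors = np.empty(n)
        for i in range(n):
            mc = X[i] + sigma[y[i]] * rng.standard_normal((n_mc, p))
            errors[i] = np.mean(clf.predict(mc) != y[i])
        return errors.mean()

    # Toy usage: two Gaussian classes in five dimensions.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(1.0, 1.0, (20, 5))])
    y = np.repeat([0, 1], 20)
    clf = LinearDiscriminantAnalysis().fit(X, y)
    print(f"BRESUB error estimate: {bresub_error(clf, X, y):.3f}")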

Studies in Computational Intelligence, Springer