Data cleaning is the most time-consuming activity in data science projects. It aims to deliver high-quality datasets that ensure the accuracy of the models trained on them. Variability in data types, formats, origin and acquisition gives rise to different data quality problems, which has led to the development of a variety of cleaning techniques and tools. This paper provides a mapping between the nature, scope and dimension of data quality problems, together with a comparative analysis of widely used tools that deal with those problems. The existing data cleaning techniques serve as a basis for comparing the cleaning capabilities of the tools. Furthermore, a case study addressing the presented data quality problems and cleaning techniques is carried out using two commonly used software products, OpenRefine and Trifacta Wrangler. Although similar data cleaning techniques were applied to the same dataset, the results show that the tools perform differently.

Cite this paper as: Petrova-Antonova D., Tancheva R. (2020) Data Cleaning: A Case Study with OpenRefine and Trifacta Wrangler. In: Shepperd M., Brito e Abreu F., Rodrigues da Silva A., Pérez-Castillo R. (eds) Quality of Information and Communications Technology. QUATIC 2020. Communications in Computer and Information Science, vol 1266. Springer, Cham. https://doi.org/10.1007/978-3-030-58793-2_3


