This paper presents an approach for automatically associating diagnoses written in Bulgarian with ICD-10 codes. Since this task is currently performed manually by medical professionals, automating it would save time and allow doctors to focus more on patient care. The approach fine-tunes a pre-trained language model (BERT) as a multi-class classifier. As several types of BERT models exist, we conduct experiments to assess the applicability of domain-specific and language-specific model adaptation. To train our models, we use a large corpus of about 350,000 textual descriptions of diagnoses in Bulgarian, annotated with ICD-10 codes, and we compare the accuracy of ICD-10 code prediction across the different types of BERT language models. The results show that the MultilingualBERT model (Accuracy Top 1 – 81%; Macro F1 – 86%; MRR Top 5 – 88%) outperforms the other models. However, all models seem to suffer from the class imbalance in the training dataset. Given the large number of classes and the noisiness of the data, the achieved prediction accuracy can be considered very high. The results also provide evidence that the collected dataset and the proposed approach can be useful in building an application that helps medical practitioners with this task, and they encourage further research to improve the prediction accuracy of the models. By design, the proposed approach strives to be as language-independent as possible and can be easily adapted to other languages.
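The two ranking metrics quoted above, Accuracy Top 1 and MRR Top 5, can be illustrated with a minimal sketch. It assumes that each diagnosis receives a ranked list of candidate ICD-10 codes and that a gold code missing from the top 5 contributes a reciprocal rank of zero. That truncation rule is an assumption, since the abstract does not define the metric in detail. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top1_accuracy(ranked: Sequence[Sequence[str]], gold: Sequence[str]) -> float:
    """Fraction of examples whose highest-ranked code equals the gold code."""
    hits = sum(1 for preds, g in zip(ranked, gold) if preds and preds[0] == g)
    return hits / len(gold)


def mrr_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Mean reciprocal rank over the top-k predictions.

    Assumption: a gold code that does not appear in the top k scores 0.
    """
    total = 0.0
    for preds, g in zip(ranked, gold):
        for rank, code in enumerate(preds[:k], start=1):
            if code == g:
                total += 1.0 / rank
                break
    return total / len(gold)


# Hypothetical toy predictions: one ranked candidate list per diagnosis.
ranked = [
    ["J06.9", "J20.9", "J02.9"],  # gold code at rank 1
    ["I10", "I11.9", "I25.1"],    # gold code at rank 2
    ["E11.9", "E10.9", "E14.9"],  # gold code at rank 3
    ["K29.7", "K21.0", "K30"],    # gold code not among the candidates
]
gold = ["J06.9", "I11.9", "E14.9", "R10.4"]

print(top1_accuracy(ranked, gold))
print(mrr_at_k(ranked, gold, k=5))
```

On this toy input, only the first example counts toward top-1 accuracy, while the second and third still add partial credit to the MRR. This is consistent with the reported MRR Top 5 being higher than the reported Accuracy Top 1.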

Proc. of CSBio '20: Proceedings of the Eleventh International Conference on Computational Systems-Biology and Bioinformatics, Bangkok, Thailand, 19-21 November 2020, ACM Digital Library, 2020, ISBN:978-1-4503-8823-8/20/11