Automatically detecting disinformation is an important Natural Language Processing (NLP) task whose results can assist journalists and the general public. The European Commission defines “disinformation” as “false or misleading content that is spread with an intention to deceive”. Deception, and thus disinformation, can be identified by the presence of (psycho)linguistic markers, but some lower-resourced languages (e.g., Bulgarian) lack sufficient linguistic and psycholinguistic research on the topic, as well as lists of such markers and suitable datasets. This article introduces the first-ever resources for studying and detecting deception and disinformation in Bulgarian, some of which can be adapted to other languages. The resources can benefit linguists, psycholinguists and NLP researchers. They are accessible on Zenodo (subject to legal conditions) and include: 1) an extended hierarchical classification of linguistic markers signalling deception; 2) lists of Bulgarian expressions for recognizing some of the linguistic markers; 3) four large Bulgarian social media datasets on topics related to deception, which are not fact-checked but are automatically annotated with the markers; 4) Python scripts to automatically collect, clean, anonymize, and annotate new Bulgarian texts. The datasets can be used to develop machine learning methods or to study potential deception. The article describes the methods used to collect and process the datasets and linguistic markers, and presents some statistics.

Proceedings of the 10th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics. Poznań, Poland.