Multilingual Hate Speech Detection: Comparison of Transfer Learning Methods to Classify German, Italian, and Spanish Posts

Abstract

With the growth of digital communication, online hate speech has surged. Recent studies have concentrated on automated supervised detection of hate speech. However, there remains limited understanding of an effective strategy for identifying multilingual hate speech in social media posts. This study introduces an innovative experimental design for multilingual hate speech detection. It compares different approaches to automatically detecting multilingual hate speech through a series of experiments and creates a classification algorithm for hate speech in German, Italian, and Spanish text-based social media content. The study creates monolingual, multilingual, and translated datasets specific to the language triplet. Subsequently, the research explores suitable models for multilingual hate speech detection, evaluating a total of seven transformer-based models along with corresponding SVM models on the constructed datasets. The findings indicate that all chosen transformer-based models outperform the baseline SVM models. The research highlights the superiority of a multilingual approach, utilizing XLM-RoBERTa as a classifier model, over monolingual and translation-based approaches. Furthermore, the study demonstrates that translation-based methods in combination with the model DistilBERT can serve as viable alternatives to the multilingual XLM-RoBERTa approach, particularly in scenarios where computational resources are restricted and processing speed is of importance.
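As an illustration of the kind of SVM baseline the abstract refers to, the sketch below shows a minimal text-classification pipeline (TF-IDF character n-grams feeding a linear SVM). This is a hypothetical sketch, not the paper's actual implementation: the feature settings, labels, and toy sentences are assumptions for demonstration only.

```python
# Hypothetical minimal sketch of an SVM hate-speech baseline:
# TF-IDF character n-grams + linear SVM (not the paper's exact setup).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, invented examples (NOT from the paper's datasets): 1 = hateful, 0 = not
texts = [
    "Ich hasse diese Leute",    # German
    "Das ist ein schöner Tag",  # German
    "Odio a esa gente",         # Spanish
    "Qué día tan bonito",       # Spanish
]
labels = [1, 0, 1, 0]

# Character n-grams can generalize across related languages better than
# word-level tokens, which is one common choice for multilingual baselines.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(texts, labels)
preds = model.predict(["Odio este lugar"])
```

A transformer-based approach such as the XLM-RoBERTa classifier favored by the study would replace the TF-IDF features with contextual multilingual embeddings, at a higher computational cost.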

Publication
In 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 2023, pp. 5503-5511