- Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi
- Volume:30 Issue:5
- Performance comparison of data balancing techniques on hate speech detection in Turkish
Authors : Habibe Karayiğit, Ali Akdağli, Çiğdem İnan Acı
Pages : 610-621
Publication Date : 2024-10-30
Article Type : Research Paper
Abstract : Increasing hate speech on social media platforms causes psychological distress and other deep, negative effects. Automatic language classification models are needed to detect hate speech. Imbalanced datasets, in which one class is represented far more frequently than the other, are a common problem when training and testing language models for hate speech detection. When the dataset is imbalanced, the classifier may be biased towards the majority class and may not perform well on the minority class, which can lead to incorrect or unreliable classification results. To solve this problem, data-level balancing methods such as oversampling or undersampling are used to balance the class distribution before classifying the dataset. This study aims to find a successful classification model combination that detects hate speech by using data-level balancing methods. To this end, a comprehensive study was carried out by applying eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and clustering-based) to the Abusive Turkish Comments (ATC) dataset, which was collected from Instagram and has an imbalanced label distribution. The classification performance of the data-level balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. It was observed that the CBoW+CNN model based on the TextGAN data-level balancing method, as well as the Skip-gram+CNN model, achieved the best classification performance with a macro-averaged F1 score of 0.972.
Keywords : Data balancing, Social media, Machine learning, Deep learning, Natural language processing, Hate speech
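The abstract describes applying a data-level balancing method to the training data before fitting a classifier, then evaluating on the original (imbalanced) distribution. The following is a minimal sketch of that workflow using SMOTE, one of the eight methods named above, via scikit-learn and imbalanced-learn; the synthetic features, class weights, and logistic-regression classifier are illustrative assumptions and do not reproduce the ATC dataset, the CBoW/Skip-gram embeddings, or the CNN models used in the paper.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for vectorized comments
# (e.g. TF-IDF or CBoW features); 1 = hate speech, 0 = neutral.
X, y = make_classification(
    n_samples=2000, n_features=50, weights=[0.9, 0.1], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
print("class counts before balancing:", Counter(y_train))

# Oversample only the training split so the test set keeps the
# original imbalanced distribution for a realistic evaluation.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("class counts after balancing:", Counter(y_bal))

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("macro-averaged F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```

Any of the other balancing methods (e.g. random oversampling, NearMiss, Tomek Links) could be swapped in at the `fit_resample` step; the key point illustrated is that balancing is applied to the training split only, and performance is reported with the macro-averaged F1 score, as in the study.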