- Pamukkale Üniversitesi Mühendislik Bilimleri Dergisi
- Volume:30 Issue:5
- Performance comparison of data balancing techniques on hate speech detection in Turkish
Authors : Habibe Karayiğit, Ali Akdağli, Çiğdem İnan Acı
Pages : 610-621
Publication Date : 2024-10-30
Article Type : Research Paper
Abstract : Increasing hate speech on social media platforms causes psychological distress and other deep, negative effects. Automatic language classification models are needed to detect hate speech. Imbalanced datasets, in which one class is represented far more frequently than the other, are a common problem when training and testing language models for hate speech detection. When the dataset is imbalanced, the classifier may be biased towards the majority class and may not perform well on the minority class, which can lead to incorrect or unreliable classification results. To solve this problem, data-level balancing methods such as oversampling or undersampling are used to balance the class distribution before classifying the dataset. This study aims to find a successful classification model combination that detects hate speech by using data-level balancing methods. To this end, a comprehensive study was carried out by applying eight data-level balancing methods (random oversampling, Synthetic Minority Oversampling Technique (SMOTE), K-means SMOTE, Localized Random Affine Shadow Sample (LoRAS), Text-based Generative Adversarial Network (TextGAN), NearMiss, Tomek Links, and clustering-based) to the Abusive Turkish Comments (ATC) dataset, which was collected from Instagram and has an imbalanced label distribution. The classification performance of the data-level balancing methods was evaluated with Basic Machine Learning (BML) and Convolutional Neural Network (CNN) methods. It was observed that the CBoW+CNN model based on the TextGAN data-level balancing method, as well as the Skip-gram+CNN model, achieved the best classification performance with a macro-averaged F1 score of 0.972.
Keywords : Data balancing, Social media, Machine learning, Deep learning, Natural language processing, Hate speech
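The abstract describes applying a data-level balancing method to the training data before fitting a classifier, then evaluating on the original (imbalanced) distribution. The following is a minimal sketch of that workflow using SMOTE, one of the eight methods named above, via scikit-learn and imbalanced-learn; the synthetic features, class weights, and logistic-regression classifier are illustrative assumptions and do not reproduce the ATC dataset, the CBoW/Skip-gram embeddings, or the CNN models used in the paper.

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data standing in for vectorized comments
# (e.g. TF-IDF or CBoW features); 1 = hate speech, 0 = neutral.
X, y = make_classification(
    n_samples=2000, n_features=50, weights=[0.9, 0.1], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42
)
print("class counts before balancing:", Counter(y_train))

# Oversample only the training split so the test set keeps the
# original imbalanced distribution for a realistic evaluation.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("class counts after balancing:", Counter(y_bal))

clf = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
print("macro-averaged F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```

Any of the other balancing methods (e.g. random oversampling, NearMiss, Tomek Links) could be swapped in at the `fit_resample` step; the key point illustrated is that balancing is applied to the training split only, and performance is reported with the macro-averaged F1 score, as in the study.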