Exploring bigram character features for Arabic text clustering

Turkish Journal of Electrical Engineering and Computer Science
Volume:27 Issue:4

Authors : Dia Eddin ABUZEINA

Pages : 3165-3179

Article Type : Research Paper

Abstract : The vector space model (VSM) is an algebraic model that is widely used for data representation in text mining applications. However, the VSM poses a critical challenge, as it requires a high-dimensional feature space. Therefore, many feature selection techniques, such as employing roots or stems (i.e. words without infixes, prefixes, and/or suffixes) instead of complete word forms, have been proposed to tackle this challenge. Recently, the literature has shown that a more basic unit can be used to represent textual features: the two-neighboring-character form, which we call the microword. A microword is two consecutive characters, also known as the bigram character feature. To evaluate this feature type, we measure the accuracy of Arabic text clustering using two feature types: the complete word form and the microword form. In the experiment, principal component analysis (PCA) is used to reduce the feature vector dimensions, while the k-means algorithm is used for clustering. The testing set includes 250 documents in five categories. The entire corpus contains 54,472 words, whereas the vocabulary contains 13,356 unique words. The experimental results show that the complete word form achieves an accuracy of 97.2%, while the two-character form achieves 96.8%. In conclusion, the accuracies are almost the same; however, the two-character form uses a smaller vocabulary as well as fewer PCA subspaces. These experiments may indicate that the bigram character feature deserves consideration in future text processing and natural language processing applications.
Keywords : Arabic, text, clustering, features, dimensionality reduction, k-means, principal component analysis, vector space model
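
The two feature types compared in the abstract can be illustrated with a minimal Python sketch. It covers only the feature extraction step, not the PCA reduction or the k-means clustering, and it is not the author's implementation. The function names, the whitespace tokenization, the choice to take bigrams within each word rather than across word boundaries, and the two toy documents are all assumptions made for illustration.

```python
from collections import Counter

def char_bigrams(word):
    """Return the consecutive two-character units ("microwords") of one word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def word_features(text):
    """Count complete word forms in a whitespace-tokenized text."""
    return Counter(text.split())

def bigram_features(text):
    """Count character bigrams taken within each word (an assumption;
    the abstract does not say whether bigrams cross word boundaries)."""
    counts = Counter()
    for word in text.split():
        counts.update(char_bigrams(word))
    return counts

def to_vectors(feature_counts):
    """Map a list of Counters onto one shared, sorted vocabulary."""
    vocab = sorted(set().union(*feature_counts))
    return vocab, [[c[f] for f in vocab] for c in feature_counts]

# Toy documents (hypothetical, not taken from the paper's corpus).
docs = ["كتب الكاتب كتابا", "درس الطالب الدرس"]
word_vocab, word_vecs = to_vectors([word_features(d) for d in docs])
bigram_vocab, bigram_vecs = to_vectors([bigram_features(d) for d in docs])
```

Each row of `word_vecs` or `bigram_vecs` is a count vector in the vector space model; in a pipeline like the one the abstract describes, such vectors would then be reduced with PCA and clustered with k-means. On a large corpus the bigram vocabulary is bounded by the number of possible character pairs, which is consistent with the smaller vocabulary the abstract reports for the two-character form.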
