Text Categorization

The number of documents is increasing rapidly, so organizing them in digitized form makes text categorization a challenging issue. Text categorization is an information retrieval technique in which documents are grouped into different classes. A major issue for text categorization is its large number of features. Most of these features are noisy, irrelevant or redundant, and may mislead the classifier. Hence, it is important to reduce the dimensionality of the data to a smaller subset of features that provides the most information gain. Feature selection techniques reduce the dimensionality of the feature space and can also improve overall accuracy and performance. Hence, to overcome the issues of text categorization, feature selection is…
Finally, document classification is performed on the core features using Naive Bayes and KNN classifiers. Experiments are carried out on three UCI datasets: Reuters 21578, Classic 04 and Newsgroup 20. The results show that the proposed model achieves better accuracy and performance.
Keywords— Introduction, Document Preprocessing, Information Gain, Rough Set, Classifiers
I. INTRODUCTION
In the field of data mining, it has been observed that data grow rapidly. With this rapid growth and the availability of an increasing number of electronic documents, classification becomes a key task [1]. Document preprocessing is an important step, and feature selection is a common preprocessing problem in machine learning, data mining and pattern recognition [1][2][3]. Text categorization has long been a hot topic due to the explosive growth of available digital documents. Owing to major developments in information acquisition and storage, tens, hundreds and even thousands of features are acquired and stored in real-world databases. Storing and processing all attributes, relevant or irrelevant, becomes computationally very expensive and impractical [2][3]. A major…
Many machine learning methods have been applied to the text categorization task, such as KNN, Naive Bayes, C4.5 and SVM. Several dimension reduction techniques such as PCA, GA and IG have also been used; still, the time complexity and effectiveness of text categorization can be improved. Hence, in this work we propose a multistage approach: document preprocessing, feature selection and a reduction technique, which together reduce the high dimensionality of the feature space. The approach removes redundant and irrelevant attributes, thereby decreasing the computational complexity of the machine learning process and increasing classification performance. In the first stage, documents are preprocessed in several steps. In the second stage, information gain is used to rank the importance of the features. In the third stage, a rough set approach is used to reduce the attributes. Finally, to evaluate the effectiveness of the dimension reduction methods, experiments are conducted on the Reuters-21578, Classic 04 and Newsgroup 20 dataset collections. KNN and Naive Bayes classifiers are used to measure overall accuracy and performance. The results show that the proposed model is able to achieve high categorization effectiveness as measured by precision, recall and…
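
The second stage described above, ranking features by information gain, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each preprocessed document is reduced to a set of tokens and scores the binary feature "term occurs in the document". The function names (entropy, information_gain, rank_terms) and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    IG(t) = H(C) - [P(t) * H(C | t) + P(not t) * H(C | not t)]
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (term, score) pairs, highest information gain first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    vocabulary = set().union(*docs)
    scores = {t: information_gain(docs, labels, t) for t in vocabulary}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus, made up for illustration: each document is a set of tokens.
docs = [
    {"wheat", "price", "export"},
    {"wheat", "harvest", "price"},
    {"bank", "rate", "price"},
    {"bank", "loan", "rate"},
]
labels = ["grain", "grain", "money", "money"]

ranking = rank_terms(docs, labels)
for term, score in ranking:
    print(f"{term:8s} {score:.3f}")
```

In this toy corpus, terms such as "wheat" and "bank" separate the two classes perfectly and rank first, while "price" appears in both classes and scores lower. A later stage, such as the rough set reduction mentioned above, would then operate only on the top-ranked terms.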
