Feature selection methods for text classification
Seema Narwariya
Department of CSE
Maulana Azad National Institute of Technology
Bhopal, Madhya Pradesh
seema.narwariya@gmail.com

Saritha Khethawat
Asst. Professor, Department of CSE
Maulana Azad National Institute of Technology
Bhopal, Madhya Pradesh
sarithakishan@gmail.com

Abstract: Feature selection plays a vital role in text classification, particularly for high-dimensional datasets. Feature selection methods help reduce computation time, improve prediction performance, and give a better understanding of the data. The purpose of feature selection is to select a subset of input features by eliminating those with little or no predictive information. In this paper various feature selection techniques …

Figure: Generic scheme of filter methods.

2.1.1 Chi-square approach

The chi-square statistic measures the relationship between a feature (f) and a category (c). More precisely, the χ² statistic measures the lack of independence between feature and category, and it can be compared to the χ² distribution with one degree of freedom. In statistics, the χ² test is applied to test the independence of two events, where two events X and Y are defined to be independent if P(XY) = P(X)P(Y). In feature selection, the two events are the occurrence of the feature and the occurrence of the category. Features are ranked with respect to the following quantity:

$$\chi^2(D, f, c) = \sum_{d_f \in \{0,1\}} \sum_{d_c \in \{0,1\}} \frac{\left(N_{d_f d_c} - E_{d_f d_c}\right)^2}{E_{d_f d_c}}$$

where $d_f = 1$ if the document contains feature f and $d_f = 0$ if it does not, and $d_c = 1$ if the document is in category c and $d_c = 0$ if it is not. N is the observed frequency in the document collection D and E is the expected frequency. For example, $E_{11}$ is the expected frequency of f and c occurring together in a document, assuming that feature and category are independent.

χ² is a measure of how much the expected counts E and the observed counts N deviate from each other. A high value of χ² indicates that the hypothesis of independence should be rejected. If the two events are dependent, then the occurrence of the feature makes the occurrence of the category more likely (or less likely), so the feature should be helpful for classification. This is the rationale of χ² feature selection.
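The χ² score described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `chi_square` and `rank_features`, the feature names, and all counts are hypothetical. It assumes each feature is summarised by a 2x2 table of document counts (feature present or absent, document in or out of the category) and that every expected count is non-zero.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of one feature for one category.

    n11: documents that contain the feature and are in the category
    n10: documents that contain the feature and are not in the category
    n01: documents that lack the feature and are in the category
    n00: documents that lack the feature and are not in the category
    """
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    total = n11 + n10 + n01 + n00

    # Marginal counts: feature present/absent and category in/out.
    feature_totals = {1: n11 + n10, 0: n01 + n00}
    category_totals = {1: n11 + n01, 0: n10 + n00}

    score = 0.0
    for (d_f, d_c), n_obs in observed.items():
        # Expected count if feature and category were independent.
        expected = feature_totals[d_f] * category_totals[d_c] / total
        if expected == 0:
            raise ValueError("expected count is zero; score is undefined")
        score += (n_obs - expected) ** 2 / expected
    return score


def rank_features(tables):
    """Sort features by decreasing chi-square score.

    tables maps a feature name to its (n11, n10, n01, n00) counts.
    """
    scores = {name: chi_square(*counts) for name, counts in tables.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the ranking.
    tables = {
        "export": (30, 10, 10, 50),  # strongly associated with the category
        "the": (10, 20, 30, 60),     # counts match independence exactly
    }
    for feature, score in rank_features(tables):
        print(f"{feature}: {score:.2f}")
```

With these made-up counts, the first feature scores far above the second, whose observed counts equal the expected counts and therefore give a score of zero. In practice the top-ranked features would be kept and the rest discarded; how many to keep is a choice the text above does not fix.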

