Importance Of Text Mining

Text mining is the process of extracting high-quality information from unstructured or semi-structured data. Here, high quality refers to a combination of relevance and novelty. Figure 2 shows the main steps of the text mining process.
Figure 2: Text mining process flow
Data Gathering
Text mining deals with unstructured or semi-structured data. The text may come from a file, a single document, or a collection of documents, whether online or offline. It may take the form of user commands, web pages, documents, etc. The data, i.e. the document or collection of documents, must be in unstructured or semi-structured form.
Text Preprocessing
Text preprocessing is an important task in text mining, information retrieval (IR), and natural language processing (NLP). Preprocessing is done for two main reasons:
1. To reduce the file size of the text documents: stop words account for 20-30% of the total word count in a given document, and stemming may reduce the index size by up to 50%.
2. To improve the efficiency and effectiveness of the text mining system: stop words carry little meaning on their own, so they are not useful for mining the text, and stemming is used to match related words in a given document.
The important preprocessing steps in text mining are tokenization, stop word removal, and stemming.
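The three steps named above can be chained into a small pipeline. The following is a minimal sketch, not a production implementation: the stop-word list is a tiny illustrative sample, and `naive_stem` is a hypothetical suffix-stripping helper standing in for a real stemming algorithm such as Porter's. All function names are assumptions introduced for this example.

```python
import re

# Illustrative stop-word list; real systems use much larger lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, stop word removal, and stemming in that order."""
    tokens = remove_stop_words(tokenize(text))
    return [naive_stem(t) for t in tokens]


if __name__ == "__main__":
    print(preprocess("The connected systems are connecting documents"))
```

In this sketch, "connected" and "connecting" both reduce to the same stem, which is how stemming lets related words in a document be matched to one another.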
Tokenization
Tokenization is the process of breaking a stream of text into words, phrases, symbols, and other meaningful elements called tokens. The main objective of tokenization is to identify the words in a sentence. Tokenization mostly happens at the word level, but it is sometimes difficult to define what is meant by a "word". A tokenizer commonly relies on simple heuristics, for example:
• All contiguous strings of alphabetic characters are part of one token; the same applies to numbers.
• Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.
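The two heuristics above map directly onto a regular expression. The sketch below assumes ASCII letters and digits only; `simple_tokenize` and `TOKEN_PATTERN` are names introduced for this example.

```python
import re

# One token per run of letters or run of digits. Everything else
# (whitespace, punctuation) acts as a separator and is discarded.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+")


def simple_tokenize(text):
    """Return the letter runs and digit runs found in text, in order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_tokenize("Text mining, in 2 steps:\nsplit and count."))
    print(simple_tokenize("don't"))
```

With these rules, "don't" is split into the two tokens "don" and "t", which illustrates why it is hard to pin down what counts as a "word".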
Challenges in Tokenization
The main challenge of tokenization is that it depends on the type of language. Languages such as English and French are referred to as space-delimited, because words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, because words do not have clear boundaries. Tokenizing sentences in unsegmented languages requires additional lexical and morphological information. Figure 4 shows an example of tokenization.
Figure 4: Tokenization
Various tools are available for tokenizing a document. Some of the open source tokenization tools are listed as
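To illustrate why unsegmented text needs lexical information, the sketch below segments a string that has no spaces by greedy forward maximum matching against a word list. This is not one of the tools referred to in the text; it is a simple, well-known dictionary-based technique chosen for illustration. The function name `max_match` and the tiny lexicon are assumptions, and English with the spaces removed stands in for a genuinely unsegmented language.

```python
def max_match(text, lexicon):
    """Segment text greedily, taking the longest lexicon entry at each position."""
    longest = max(map(len, lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            # A single character is accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Hypothetical lexicon for the example.
    lexicon = {"text", "mining", "min", "in", "ing"}
    print(max_match("textmining", lexicon))
```

Without the lexicon there is nothing in the character stream that marks where "text" ends and "mining" begins, and a greedy match can still choose wrongly when several dictionary words overlap, which is why morphological information is also needed in practice.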
