In this chapter we have outlined text mining, the text mining process, its applications, and the challenges and issues it faces. Mining text in different languages remains a major problem for this field, since text mining tools and approaches should be able to work with many languages and with multilingual documents. Integrating domain knowledge into the text mining engine would increase its efficiency, especially within the disciplines of information retrieval, information extraction and natural language processing.
Removing the most frequent words has little effect on document scores. Frequency-based stop word removal is straightforward: remove the 200-300 words with the highest collection frequencies. In [WS92], Wilbur takes into account that stop words may be the most frequently occurring words. However, defining a list of stop words by frequency alone is not ideal for our purposes: words that carry little meaning from a linguistic point of view may be removed whether their frequency in the collection is high or low.
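The frequency-based approach described above can be sketched as follows. This is a minimal illustration, not the method of [WS92]: the function names, the whitespace tokenization and the toy collection are assumptions made for the example, and `top_n` is set to 2 rather than 200-300 because the collection is tiny.

```python
from collections import Counter


def frequency_stop_words(documents, top_n):
    """Return the top_n terms with the highest collection frequency."""
    counts = Counter()
    for doc in documents:
        # Assumption: lowercase whitespace tokenization is good enough here.
        counts.update(doc.lower().split())
    return {term for term, _ in counts.most_common(top_n)}


def remove_stop_words(document, stop_words):
    """Drop every token of the document that is in the stop word set."""
    return [t for t in document.lower().split() if t not in stop_words]


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat met on the road",
]
stop_words = frequency_stop_words(docs, top_n=2)
filtered = remove_stop_words(docs[0], stop_words)
print(stop_words)
print(filtered)
```

In this toy collection the content word "cat" lands in the stop list purely because it is frequent, which illustrates why a frequency cut-off alone may not match a linguistically motivated list.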
When LSA is used to compute sentence similarity, a vector for each sentence is formed in the reduced dimension space; similarity is then measured by computing the similarity of these two vectors.
I. Because of the computational limits of SVD, the dimension size of the word-by-context matrix is limited to several hundred. As the input sentences may come from an unconstrained domain (and thus not be represented in the contexts), some important words from the input sentences may not be included in the LSA dimension space.
II.
Fig. 2 Block diagram of citation recommendation system
1. Tokenization
The user supplies a query, textual data from which keywords are to be extracted. The query is then tokenized, i.e. the stream of text is broken into words, phrases, symbols or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text mining.
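The tokenization step can be sketched with a regular expression. This is only an illustrative sketch: the pattern, the name `tokenize` and the sample query are assumptions, and the system's actual tokenizer is not specified in the text.

```python
import re

# Assumption: a token is either a word (optionally with internal hyphens or
# apostrophes, e.g. "state-of-the-art", "isn't") or a single punctuation symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Break a stream of text into word and symbol tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Text mining isn't easy: state-of-the-art tools help!")
print(tokens)
```

The resulting list can be passed directly to later stages such as stop word removal or keyword extraction.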
Cursiveness means that joined characters take different shapes depending on the sequence in which they are joined. This cursive nature of Urdu text makes the correct detection and recognition of Urdu words difficult and challenging for image processing tasks. Moreover, diacritics such as zer, zabar, paish and madd are also used in Urdu text. A ligature can include characters that join on both sides to their neighbors as well as characters that join on only one side. The shape of a letter in Urdu text depends on its neighboring letters and its position.
However, the latter is used as a verb denoting that something completes something else or makes it better. This shows that two words pronounced alike can have vastly different meanings and properties. The other category, the homograph, describes words and phrases that are similar in structure but different in meaning. It rarely appears in English, because every English word is spelled from the 26 letters and each letter has its own form. Homograph ambiguity appears only in languages that are not based on an alphabet, such as Chinese.
The high-level language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages, visual representations or other expressive forms so that it is easy to understand and directly usable by humans.
Handling noisy or incomplete data: Data stored in a database may contain noise, and as a result the accuracy of the discovered patterns can be poor. Data cleaning, data analysis and outlier mining methods can handle such noise.
Pattern evaluation (the interestingness problem): A data mining system can uncover thousands of patterns.
In this context, the selection of characteristics and the influence of domain knowledge and domain-specific procedures play an important role. Therefore, the known data mining algorithms usually need to be adapted to text data. To achieve this, one frequently relies on the experience and results of research in information retrieval, natural language processing and information extraction. In all of these areas, data mining methods and statistics are also applied to handle their specific tasks:
Information Retrieval (IR): Information retrieval is the finding of documents that contain answers to questions, not the finding of the answers themselves. To achieve this goal, statistical measures and methods are used to process the text data automatically and compare it with the given question.
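One common statistical measure for comparing documents with a question is TF-IDF weighting combined with cosine similarity. The sketch below is an assumption chosen for illustration, since the text does not name a specific measure; the function `rank_documents`, the smoothed IDF formula and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def rank_documents(query, documents):
    """Rank documents by TF-IDF cosine similarity to the query."""
    token_lists = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    # Assumption: log-scaled IDF with +1 so common terms keep some weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

    def weigh(tokens):
        # Terms that never occur in the collection carry no weight.
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
            sum(w * w for w in v.values())
        )
        return dot / norm if norm else 0.0

    query_vector = weigh(query.lower().split())
    scored = [
        (index, cosine(query_vector, weigh(tokens)))
        for index, tokens in enumerate(token_lists)
    ]
    # Highest similarity first; ties keep their original document order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


documents = [
    "stop word removal reduces index size",
    "latent semantic analysis compares sentence vectors",
    "tokenization splits text into words",
]
ranking = rank_documents("sentence vectors", documents)
print(ranking)
```

Note that the function returns ranked documents, not answers: in line with the definition above, it only identifies which document is most likely to contain the answer.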
Its goal is to segment the pixels of the document image into just two classes, regardless of the enormous number of possible text typefaces and the various types of degradation, which makes it an ambitious process. Therefore, document image binarization is of great importance in the document image analysis and recognition pipeline, since it affects the further stages of the recognition
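As a concrete instance of splitting pixels into two classes, the sketch below applies Otsu's global thresholding to a tiny grayscale image. Otsu's method is chosen here only for illustration and is not named in the text; the function names, the 0-255 grey-level range and the convention that dark pixels are ink (label 1) are assumptions.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that maximises between-class variance."""
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        sum_dark += level * histogram[level]
        if weight_dark == 0:
            continue  # no pixels in the dark class yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


def binarize(image):
    """Label each pixel 1 (ink) or 0 (paper), assuming dark text on light paper."""
    threshold = otsu_threshold([p for row in image for p in row])
    return threshold, [[1 if p <= threshold else 0 for p in row] for row in image]


image = [
    [200, 210, 40],
    [35, 220, 205],
    [30, 45, 215],
]
threshold, binary = binarize(image)
print(threshold)
print(binary)
```

A single global threshold of this kind tends to struggle with the uneven degradation mentioned above, which is one reason binarization remains a demanding step.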
Segmentation of a text document into lines, words and characters is considered the crucial stage in Optical Character Recognition. The output of the segmentation phase affects the overall recognition rate of the system. Segmentation is a big challenge in Sindhi OCR due to the cursive nature of Sindhi. Arabic text segmentation methods can be classified into two approaches: the Analytical Approach and the Holistic or Segmentation-Free Approach.