
2.2.1 Tokenisation
In lexical analysis, tokenisation is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens then becomes the input for further processing such as parsing or text mining.
In order to extract keywords, a number of preprocessing steps must be carried out. A piece of text is essentially just a string of characters, and this string must first be broken up into words before those words can be considered as candidate keywords. As a first step in processing a document, it must therefore be determined what the processing tokens are. One of the simplest approaches to tokenisation defines word symbols and inter-word symbols. For instance, an inter-word symbol may be a space or a punctuation mark; every maximal run of word symbols between inter-word symbols then forms a token.
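As a minimal sketch of this simple approach, the Python fragment below treats every run of non-alphanumeric characters as an inter-word symbol; the function name and the exact choice of symbol sets are illustrative assumptions rather than the implementation used in this work.

    import re

    def tokenise(text):
        """Break a string into tokens: runs of word symbols (letters and
        digits) delimited by inter-word symbols (everything else)."""
        # Lower-case first, then split on runs of inter-word symbols.
        return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

    # tokenise("Tokenisation, in lexical analysis!") yields
    # ['tokenisation', 'in', 'lexical', 'analysis']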
Words might carry little meaning from a frequency (or information-theoretic) point of view, or alternatively from a linguistic point of view. Words that occur in many of the documents in the collection carry little meaning from a frequency point of view: relative to other words in a document, they have little use as keywords. When scoring words to determine the best candidate keywords, such stop words rank very low precisely because they are so frequent, so removing the very frequent words hardly affects the document scores. Stop word removal on the basis of frequency can therefore be done simply by removing the 200-300 words with the highest collection frequencies. In [WS92] Wilbur takes into consideration that stop words may be the most frequently occurring words. Defining a list of stop words based on frequency alone is, however, not ideal for our purposes. Words that carry little meaning from a linguistic point of view should be removed whether their frequency in the collection is high or low; in fact, they should especially be removed when their frequency is low, because low-frequency words affect document scores the most. Removing stop words for linguistic reasons can be done with a stop list that enumerates all words with little meaning, such as “the”, “it” and “a”. This approach is the most important one for our purposes, and many such stop word lists are available today.
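To make the two strategies concrete, the sketch below builds a frequency-based stop list from the top-n collection frequencies and applies a fixed linguistic stop list; the function names and the cut-off of 250 are illustrative assumptions.

    from collections import Counter

    def frequency_stop_list(tokenised_docs, n=250):
        """Return the n words with the highest collection frequency,
        i.e. a purely frequency-based stop list."""
        counts = Counter(tok for doc in tokenised_docs for tok in doc)
        return {word for word, _ in counts.most_common(n)}

    def remove_stop_words(tokens, stop_list):
        """Drop every token that occurs in the stop list."""
        return [tok for tok in tokens if tok not in stop_list]

    # A tiny linguistic stop list; published lists contain hundreds of words.
    LINGUISTIC_STOP_LIST = {"the", "it", "a", "an", "and", "of"}
    # remove_stop_words(["the", "computer", "and", "it"], LINGUISTIC_STOP_LIST)
    # yields ["computer"]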
For example, the words “computer”, “compute” and “computation” all conflate to the stem comput. Stemmers were already being developed in the 1960s, when the first retrieval systems were implemented. The two most commonly used stemmers were developed by Lovins [Lov68] and by Porter [Por80]; many variations of these algorithms exist, but these two remain the basis. A report in [Har91] suggests that stemming does not have a major impact on the overall retrieval system, but [KP96] find that stemming aids the recall of documents in an IR system. In [KLJJ04] it is shown that stemming is particularly useful when used with short queries and short documents. Bearing in mind that we will be basing our IR system on a database of tweets, which are short documents typically retrieved with short queries, stemming will be used. Stemming also reduces the risk of producing keywords that are very similar. In addition to a stemming algorithm, a dictionary lookup can be used; in [CCB94] dictionary lookup is combined with stemming. Because we will be generating topic-specific keywords, however, the need for a dictionary lookup is alleviated; in our case, adding one would only introduce unnecessary overhead.
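As an illustration, NLTK's off-the-shelf implementation of the Porter stemmer [Por80] reproduces the conflation above; this assumes the third-party nltk package is available and is not the stemmer integration of the system itself.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["computer", "compute", "computation"]:
        # Each of the three words is reduced to the common stem "comput".
        print(word, "->", stemmer.stem(word))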