Natural Language Processing: Data Preprocessing

A. Data preprocessing
Text mining is the process of seeking or extracting useful information from textual data. Our data is preprocessed with the help of Natural Language Processing (NLP), an area of research and application that explores how computers can be used to understand and manipulate natural language text.

Fig.2 Block diagram of citation recommendation system

1. Tokenization

The user supplies a query, a piece of text from which keywords are to be extracted. The query is first tokenized: tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing such as parsing or text …
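
The tokenization step can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer; the pattern, the function name `tokenize`, and the sample query are hypothetical and are not taken from the system described here, which may well use a full NLP toolkit instead.

```python
import re

# Hypothetical pattern: a run of word characters (optionally joined by
# internal hyphens or apostrophes), or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(query: str) -> list:
    """Break a query string into word and symbol tokens."""
    return TOKEN_PATTERN.findall(query)


tokens = tokenize("Citation recommendation, using keyword-keyword networks.")
print(tokens)
# ['Citation', 'recommendation', ',', 'using', 'keyword-keyword', 'networks', '.']
```

Keeping punctuation as separate tokens lets a later step decide whether to discard it; a tokenizer that only needs keywords could drop the symbol branch of the pattern.
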
Mapping query with keyword communities

This work recommends citations for a given search query, which requires mapping the query to the keyword clusters formed from the keyword-keyword network.

1. Formation of keyword network

Keywords are extracted from the publication database. Some records already carry keywords, which can be used directly to construct the keyword-keyword network. The keyword network is an undirected, weighted graph in which each node corresponds to a keyword. Two nodes are connected by an edge if there is at least one article that contains both keywords.
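
The construction can be sketched as follows. This is an illustration under stated assumptions: each article is represented as a list of its keywords, and the edge weight is taken to be the number of articles in which both keywords appear. The text above does not define the weight, so that choice, the function name `build_keyword_network`, and the sample data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_keyword_network(articles):
    """Return {(kw_a, kw_b): weight} for an undirected keyword graph.

    Assumption: weight = number of articles containing both keywords.
    """
    edge_weights = Counter()
    for keywords in articles:
        # Sort so that (a, b) and (b, a) map to the same undirected edge;
        # set() ignores a keyword repeated within one article.
        for pair in combinations(sorted(set(keywords)), 2):
            edge_weights[pair] += 1
    return dict(edge_weights)


# Hypothetical sample data: one keyword list per article.
articles = [
    ["text mining", "nlp", "tokenization"],
    ["nlp", "citation recommendation"],
    ["text mining", "nlp"],
]
network = build_keyword_network(articles)
print(network[("nlp", "text mining")])  # 2
```

Storing each edge once under a sorted key keeps the graph undirected without a separate adjacency structure.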

2. Formation of keyword communities

After the keyword network is constructed, the next step is to cluster the keywords using the Louvain community detection algorithm, a well-known algorithm for finding communities in a network. The algorithm uses greedy optimization, performed in two phases: first, the method looks for small communities by optimizing modularity locally; second, it aggregates the nodes belonging to the same community and builds a new network whose nodes are those communities. The input query is then mapped to the keyword communities, and the constituent keywords of the matching cluster are passed to the next step of the …
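
The local optimization phase can be sketched in plain Python. This is a simplified illustration, not a full Louvain implementation: it covers only the first phase (greedy local moves that increase modularity) and omits the aggregation phase. The edge format `{(kw_a, kw_b): weight}`, the function name `local_moving`, and the toy graph are assumptions made for the example.

```python
from collections import defaultdict


def local_moving(edges):
    """First Louvain phase only: greedy local modularity optimization.

    edges: {(u, v): weight} for an undirected graph without self-loops.
    Returns {node: community_label}.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = w
        adj[v][u] = w
    degree = {n: sum(nbrs.values()) for n, nbrs in adj.items()}
    two_m = sum(degree.values())           # twice the total edge weight
    community = {n: n for n in adj}        # start: one community per node
    comm_degree = dict(degree)             # total degree of each community

    improved = True
    while improved:
        improved = False
        for node in sorted(adj):
            old = community[node]
            comm_degree[old] -= degree[node]   # take the node out
            # Edge weight from this node to each neighbouring community.
            links = defaultdict(float)
            for nbr, w in adj[node].items():
                links[community[nbr]] += w
            # Gain (up to a constant factor) of joining a community:
            # k_in - total_degree(community) * k_node / 2m
            best = old
            best_gain = links.get(old, 0.0) - comm_degree[old] * degree[node] / two_m
            for comm, k_in in sorted(links.items()):
                gain = k_in - comm_degree[comm] * degree[node] / two_m
                if gain > best_gain:
                    best, best_gain = comm, gain
            community[node] = best
            comm_degree[best] += degree[node]
            if best != old:
                improved = True
    return community


# Hypothetical toy network: two keyword triangles joined by one edge.
edges = {
    ("clustering", "community"): 1, ("clustering", "graph"): 1,
    ("community", "graph"): 1, ("graph", "nlp"): 1,
    ("nlp", "parsing"): 1, ("nlp", "tokenization"): 1,
    ("parsing", "tokenization"): 1,
}
community = local_moving(edges)
groups = defaultdict(list)
for keyword, label in sorted(community.items()):
    groups[label].append(keyword)
print(sorted(groups.values()))
```

On this toy graph the sketch separates the two triangles into two communities. A complete implementation would then collapse each community into a single node and repeat both phases until modularity stops improving.
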
They are time-homogeneous: the probability of jumping from one node to another is constant over time. If one vertex is visited frequently by the walk, then all its neighbors are also likely to be visited; this is called the smoothing process [15]. In this way, the top-ranked, prestigious articles are surfaced. Diversity, however, is achieved through a vertex-reinforced random walk with restart, a time-variant process that takes into account both prestige and diversity. Its transition probabilities at each time step are influenced by the number of times each state has been visited and by an a priori likelihood matrix, which is real, symmetric and …
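
The reinforcement idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the formulation used in this work: the prior matrix `p0`, the restart distribution, the `damping` value, the function name `vrrw_with_restart`, and the three-article graph are all hypothetical.

```python
import random


def vrrw_with_restart(p0, restart, damping=0.85, steps=5000, seed=0):
    """Simulate one vertex-reinforced random walk with restart.

    p0[u][v]   : a priori likelihood of moving from u to v (assumed input)
    restart[v] : probability of landing on v after a restart
    Returns the share of visits each vertex received.
    """
    rng = random.Random(seed)
    nodes = sorted(p0)
    visits = {v: 1 for v in nodes}     # every vertex starts with one visit
    current = nodes[0]
    for _ in range(steps):
        if rng.random() > damping:
            # Restart: jump according to the fixed restart distribution.
            weights = [restart[v] for v in nodes]
        else:
            # Reinforced step: prior likelihood scaled by visit counts,
            # so transition probabilities change as the walk proceeds.
            weights = [p0[current].get(v, 0.0) * visits[v] for v in nodes]
        current = rng.choices(nodes, weights=weights, k=1)[0]
        visits[current] += 1
    total = sum(visits.values())
    return {v: visits[v] / total for v in nodes}


# Hypothetical symmetric prior over three articles.
p0 = {
    "A": {"B": 0.6, "C": 0.4},
    "B": {"A": 0.6, "C": 0.2},
    "C": {"A": 0.4, "B": 0.2},
}
restart = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
scores = vrrw_with_restart(p0, restart)
print(sorted(scores, key=scores.get, reverse=True))
```

Because visit counts feed back into the step weights, a frequently visited vertex attracts further visits at the expense of its neighbors. A time-homogeneous walk would use the unscaled `p0` at every step.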
