Improving Holy Quran Search Using Thesaurus and Light Stemming

Abstract: Information Retrieval (IR) is the discipline that deals with the retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured (e.g., a sentence or even another document) or structured (e.g., a Boolean expression). The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both in internal corporate document collections and in the immense and growing number of document sources on the Internet. This paper proposes an intelligent text searching technique for information retrieval using vector
It is written in an early form of classical Arabic known as “Quranic” Arabic. The Quran mixes narrative, exhortation, and legal prescription. The suras (chapters) frequently combine all these modes, not always in ways that seem obvious to the reader. The Quran is partly rhymed, partly prose [8]. Traditionally, the Arabic grammarians consider the Quran to be a genre unique unto itself, neither poetry (defined as speech with metre and rhyme) nor prose (defined as normal speech or rhymed but non-metrical speech, saj' [سجع]). The Quran is a unique book that continues to reward study. Considerable effort has gone into processing the Holy Quran, some of it for teaching and educational purposes and some to discover more about the book.

2 Statement of the Problem

The standard information retrieval (IR) scenario is shown in Figure-1. The user has an information need and types a query that describes it. The IR system retrieves a set of documents from a document collection that it believes to be relevant, and ranks them according to their likelihood of being relevant. More precisely, an information retrieval system takes as input a set of documents and a user query, and gives as output a ranked list of relevant documents.

Figure-1 IR
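The input/output contract described above (documents and a query in, a ranked list out) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `cosine` and `rank` are hypothetical, tokens are split on whitespace, and raw term counts stand in for whatever term weights the system actually uses.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(documents, query):
    """Return (doc_id, score) pairs with score > 0, best match first.

    Assumption: raw term counts are used as vector components; this is a
    stand-in, not the weighting scheme of the paper.
    """
    query_vec = Counter(query.split())
    scored = []
    for doc_id, text in documents.items():
        score = cosine(Counter(text.split()), query_vec)
        if score > 0:
            scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection (hypothetical English stand-ins for verses).
docs = {
    "d1": "light of the heavens",
    "d2": "the earth",
    "d3": "mercy",
}
ranking = rank(docs, "the earth")
```

Here `ranking` lists `d2` before `d1`, and `d3` is omitted because it shares no term with the query.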
It is made of the following automated steps:
1. Remove the separator symbols between verses, such as the symbols (۝۞۩), which indicate the end of each verse.
2. Remove Arabic diacritics (short vowels) in addition to some Quranic symbols used for recitation, such as (ۜ ۚ ۙ ۘ ۗ ۖ ).
Although most of the Quran text contains diacritics, we have added an option that lets the user choose whether to search with or without diacritics.
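The two cleaning steps above, together with the diacritics option, might look roughly like the following sketch. The function name `normalize` and the `keep_diacritics` flag are hypothetical, and the Unicode ranges are an assumption about which code points the steps cover: U+06DD, U+06DE and U+06E9 for the separator symbols, U+064B to U+0652 for the short-vowel diacritics, and U+06D6 to U+06DC for the recitation signs shown in step 2.

```python
import re

# Step 1: verse separator symbols (assumed: U+06DD, U+06DE, U+06E9).
VERSE_MARKS = re.compile("[\u06DD\u06DE\u06E9]")

# Step 2: short-vowel diacritics (U+064B-U+0652) and the small high
# recitation signs (U+06D6-U+06DC). The exact set is an assumption.
DIACRITICS = re.compile("[\u064B-\u0652\u06D6-\u06DC]")


def normalize(text, keep_diacritics=False):
    """Strip verse separators and, unless asked otherwise, diacritics."""
    text = VERSE_MARKS.sub(" ", text)
    if not keep_diacritics:
        text = DIACRITICS.sub("", text)
    # Collapse the whitespace left behind by removed separators.
    return " ".join(text.split())


# "bismi" with kasra and sukun marks, followed by an end-of-verse sign.
sample = "\u0628\u0650\u0633\u0652\u0645\u0650 \u06DD"
plain = normalize(sample)                        # bare consonants only
marked = normalize(sample, keep_diacritics=True) # diacritics preserved
```

Applying the same function to both the stored text and the user's query keeps the two comparable, whichever setting the user picks.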

4.2 Building the Inverted files

In this paper we adopted the Vector Space Model, which is based on term weights rather than on raw frequency alone. Most IR systems use an inverted file (see Figure-3) to represent the texts in the collection.

Figure-3 Example of an inverted file
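As a rough illustration of the data structure, an inverted file can be held as a mapping from each term to the documents that contain it. The sketch below is a minimal version under stated assumptions: `build_inverted_file` is a hypothetical name, tokens are split on whitespace, and each posting stores a raw term frequency rather than the weights the system computes.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each term to a {doc_id: term frequency} posting dictionary.

    Assumption: postings hold raw counts; a weighted scheme would replace
    them in a later pass over the index.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Toy collection keyed by hypothetical document (verse) numbers.
collection = {1: "a b a", 2: "b c"}
inverted = build_inverted_file(collection)
```

Looking up a query term then costs one dictionary access, and only the documents listed under that term need to be scored.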

The phase in which we create the inverted file is called the Pre-Processing phase. Several calculations are needed to fill the inverted file with the correct values, using a set of parameters that includes:
• n Number of Documents
Number of all documents in the collection. We have used term frequency as the

