Line Segmentation In English


Challenges and Line Segmentation in Sindhi OCR

Shanky Goel Department of Computer Science Punjabi University, Patiala, India

Dr. Gurpreet Singh Lehal Department of Computer Science Punjabi University, Patiala, India

ABSTRACT: Arabic-script-based OCR systems lag far behind Latin-script OCR systems in accuracy. OCRs developed for many …

Therefore, an OCR built for Arabic or Urdu will not meet all the needs of Sindhi (Fig. 1). Shaikh et al. [1] presented only character segmentation of Sindhi sub-words, by calculating the height profile vector of thinned primary strokes. Nizamani and Janjua [3] proposed a recognition system for isolated Sindhi characters written in a drawing panel using the specific font “MB Lateefi”. Hakro et al. [5] presented the recognition issues in Sindhi OCR. A Sindhi OCR application is needed that can convert printed Sindhi books into editable computer text files. It would help to strengthen the language and extend its life. At the same time, it would also increase the richness of the literature of Sindhi …

Challenges in Sindhi OCR
Sindhi script poses additional challenges because of the complexities associated with the script. Cursiveness and context sensitivity are the two major problems in the development of Arabic-script-based OCRs. Developing a recognition system for a language that is both cursive and has a large character set, such as Sindhi, is an especially challenging job. The main challenges are:
Writing system:
Sindhi words are written from right to left, while numerals are written from left to right, so the script is bidirectional. This poses a challenge for Sindhi OCR because, at recognition time, if a number appears between characters, the output writing direction must be reversed [5]. In the example shown below (Fig. 4), the sentence flows from right to left, while the date set inside it (٦٢/٠١/٦١٠٦), written in Sindhi numerals, runs from left to right.
Fig. 4: Bi-directional writing
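The direction reversal can be illustrated with a minimal Python sketch. It assumes a hypothetical recognizer that emits symbols in the order it meets them while scanning a line from right to left, so every number arrives with its symbols in reverse. The function name `visual_rtl_to_logical` and the separator set are illustrative assumptions, not part of any system described in the paper; a production system would more likely follow the full Unicode Bidirectional Algorithm.

```python
import unicodedata


def is_digit(ch: str) -> bool:
    # Unicode category "Nd" covers ASCII digits and Arabic-Indic digits alike.
    return unicodedata.category(ch) == "Nd"


def visual_rtl_to_logical(symbols, separators="/.-:"):
    """Reverse each numeric run in a right-to-left scan of one text line.

    `symbols` is an iterable of single characters in scan order (assumed
    right to left). Letters are already in reading order; digit runs,
    including separators between digits, are flipped back.
    """
    out, run = [], []

    def flush():
        # Separators left dangling at the end of a run are not part of the number.
        tail = []
        while run and run[-1] in separators:
            tail.append(run.pop())
        out.extend(reversed(run))
        out.extend(reversed(tail))
        run.clear()

    for ch in symbols:
        if is_digit(ch) or (run and ch in separators):
            run.append(ch)
        else:
            flush()
            out.append(ch)
    flush()
    return "".join(out)
```

For readability, a quick check can use ASCII stand-ins for the letters: `visual_rtl_to_logical("ab21/01cd")` returns `"ab10/12cd"`, leaving the letters in scan order and flipping only the date.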
Segmentation challenge:
Segmentation is the most challenging step in Arabic-script-based OCR systems. Segmenting a text document into lines, words and characters is considered the crucial stage of Optical Character Recognition, because the output of the segmentation phase affects the overall recognition rate of the system. Segmentation is a big challenge in Sindhi OCR because of the cursive nature of Sindhi. Arabic text segmentation methods can be classified into two approaches: the Analytical Approach and the Holistic Approach or Segmentation-Free
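As an illustration of the line-level step only, the sketch below uses a horizontal projection profile, a common baseline for splitting a binarized page into text lines. This excerpt does not state which method the authors use, so the function `segment_lines`, its `min_ink` threshold and the list-of-lists image format are all assumptions made for the example.

```python
def segment_lines(image, min_ink=1):
    """Split a binarized page into text lines by horizontal projection.

    `image` is a list of pixel rows, each a list of 0 (background) or
    1 (ink). Returns (top, bottom) row ranges, with bottom exclusive.
    """
    # Projection profile: the amount of ink in every pixel row.
    profile = [sum(row) for row in image]

    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # first inked row of a line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # page ends inside a line
        lines.append((start, len(profile)))
    return lines
```

On a toy page whose row sums are `[0, 2, 1, 0, 1]`, the function returns `[(1, 3), (4, 5)]`. In a script with many dots above and below the base characters, such a plain profile may cut a row of dots off as a thin strip of its own, so a real system would probably need a further merging step.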
