Guide to OCR for Indic Scripts - Document Recognition and Retrieval

by: Venu Govindaraju, Srirangaraj Ranga Setlur

Springer-Verlag, 2009

ISBN: 9781848003309, 325 pages

Format: PDF

Copy protection: Watermark

Windows PC, Mac OS X; suitable for all DRM-capable e-readers, Apple iPad, Android tablet PCs

Price: 149.79 EUR


More about the contents

Foreword
Preface
1 Part I: Recognition of Indic Scripts
2 Part II: Retrieval of Indic Documents
3 Target Audience
Acknowledgments
Contents
Contributors

Part I Recognition of Indic Scripts

Building Data Sets for Indian Language OCR Research
1 Introduction
2 Datasets
2.1 Image Corpus
2.1.1 Digitization
2.1.2 Processing and Storage
2.2 Text Corpus
2.3 Annotated Data Sets
3 Annotation
3.1 Hierarchical Annotation
3.1.1 Different Levels of Annotation
3.1.2 Methods of Annotation
3.2 Annotation Process
3.2.1 Segmentation
3.2.2 Components Labeling
3.2.3 Annotation Tools
4 Representation and Access
4.1 Sources of Metainformation
4.2 Recognizer-Specific Metainformation
4.3 Digitization Meta Information
4.4 Annotation Data
4.4.1 Page Structure Information
4.4.2 Text Block Structure Information
4.4.3 Akshara Structure Information
4.5 Representation Issues
4.5.1 Complex Layout
4.5.2 Indian Language Script Issues
4.6 Data Access
5 Implementation and Execution
5.1 Organization of Tasks
5.2 Status of the Data Sets
6 Conclusions
References

On OCR of Major Indian Scripts: Bangla and Devanagari
1 Introduction
2 Basic OCR System
2.1 Group and Individual Character Classifiers
3 Quantification of Errors
4 Post-recognition Error Correction
4.1 Forward-Backward Error Correction Scheme
5 Discussion
References

A Complete Machine-Printed Gurmukhi OCR System
1 Introduction
2 Characteristics of Gurmukhi Script
2.1 Character Set
2.2 Connectivity of Symbols
2.3 Word Partitioning into Zones
2.4 Frequently Touching Characters
2.5 Broken Characters and Headlines
2.6 Similarity of Group of Symbols
3 System Overview
4 Digitization and Pre-processing
5 Splitting Text into Horizontal Text Strips
6 Word Segmentation
7 Sub-division of Strips into Smaller Units
8 Repairing the Word Shape
9 Thinning
10 Repairing Broken Characters
11 Character Segmentation
11.1 Touching Characters
12 Recognition Stage
12.1 Feature Extraction
12.2 Classification
12.2.1 Design of the Binary Tree Classifier
12.3 Merging Sub-symbols
13 Post-Processing
13.1 Check for the Existence of a Word in the Corpus
13.2 Perform Holistic Recognition of a Word
14 Experimental Results
15 Conclusion
References

Progress in Gujarati Document Processing and Character Recognition
1 Introduction
2 Gujarati Script: OCR Perspective
3 Segmentation
4 Zone Boundary Identification
4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners
4.2 Dynamic Programming Approach
5 Extracting Recognizable Units
6 Recognition
6.1 Feature Extraction
6.1.1 Fringe Map
6.1.2 Discrete Cosine Transform
6.1.3 Wavelet Transform
6.1.4 Zone Information
6.1.5 Aspect Ratio
6.2 Classification
6.2.1 Nearest Neighbor Classifier
6.2.2 Artificial Neural Networks [25, 26]
6.2.3 Multi-layer Perceptron (MLP) [25]
6.2.4 Radial Basis Functions (RBF) networks
6.2.5 General Regression Neural Network (GRNN)
6.3 Experimental Setup and Results
7 Text Generation
8 Post-processing
9 Conclusion
References

Design of a Bilingual Kannada-English OCR
1 Introduction
2 Kannada Script
3 Segmentation
3.1 Line Segmentation Based on Connected Components
3.2 Word and Character Segmentation
4 Script Recognition
4.1 Gabor and DCT-Based Identification
4.2 Results of Script Identification
5 Component Classification
5.1 Introduction
5.2 Graph Representations for Components
5.3 Distance Measures
5.4 Classification Strategy
5.5 Training
5.6 Prediction
5.7 Experiments, Results and Discussion
5.7.1 Data Sets
5.7.2 Features for SVM Classifiers
5.7.3 Pre-processing
5.7.4 Results and Discussions
6 Conclusion
References

Recognition of Malayalam Documents
1 Introduction
1.1 The Malayalam Language
1.1.1 Origin
1.1.2 Literary Culture
1.1.3 Word and Sentence Formation
1.2 The Malayalam Script
1.2.1 Script Revision
1.3 Evolution of Printing and Publication
1.4 Challenges in Malayalam Recognition
2 Character Recognition
2.1 Overview of the Approach
2.2 Design Guidelines
2.3 Features for Component Classification
2.4 Classifier Design
2.5 Beyond Recognition of Isolated Symbols
3 Recognition of Online Handwriting
3.1 Stroke Recognition
3.1.1 Dealing with Similar Strokes
3.2 Word Recognizer
4 Experimental Results
4.1 Overview of the Data Set
4.2 Classifier and Feature Comparisons
4.3 Recognition of Online Handwriting
5 Conclusions
References

A Complete OCR System for Tamil Magazine Documents
1 Introduction and Background
1.1 Preprocessing
1.1.1 Skew Estimation
1.1.2 Binarization
1.2 Page Segmentation and Classification
1.2.1 Page Segmentation
1.2.2 Block Classification
1.3 Optical Character Recognition (OCR)
1.3.1 Character Segmentation
1.3.2 Character Recognition
1.4 Logical Structure
1.4.1 Document Models
2 Preprocessing
2.1 Image Size Reduction
2.2 Skew Correction
2.2.1 Text Recognition
2.2.2 Skew Estimation
2.3 Binarization
2.4 Noise Removal
3 Segmentation and Classification
3.1 Page Segmentation
3.2 Classification of the Blocks
4 Optical Character Recognition
4.1 Line, Word, and Character Segmentation
4.2 Recognition of Characters
5 Reconstruction of the Document Image
5.1 Logical Structure Derivation
5.2 Reconstruction into HTML Format
6 Results and Conclusions
6.1 Results
6.2 Conclusions
References

Experiments on Urdu Text Recognition
1 Introduction
2 Urdu Language Resources
3 Prior Work in Urdu Recognition Systems
4 Prior Work in Urdu Document Preprocessing
5 Experiments
References

The BBN Byblos Hindi OCR System
1 Introduction
1.1 Background
1.2 Review of Basic OCR System
1.3 Model Training and Recognition
2 Data
2.1 Hindi Character Set
2.2 Corpus
3 Experimental Results
3.1 Model Configuration
3.2 Recognition Performance
4 Conclusions
References

Generalization of Hindi OCR Using Adaptive Segmentation and Font Files
1 Introduction
1.1 Challenges of Segmentation
1.2 Feature Extraction and Classification
2 Base Devanagari OCR System
2.1 Background
2.2 System Design
2.3 Character Segmentation
2.3.1 Devanagari Script Overview
2.3.2 Hindi Character Segmentation
2.4 Feature Extraction
2.5 Classification
2.5.1 Template Matching
2.5.2 Generalized Hausdorff Image Comparison (GHIC)
2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance
2.5.4 Hierarchical Classification
2.6 Devanagari OCR Evaluation
2.7 Additional Challenges
3 Font-Based Intelligent Character Segmentation
3.1 Benefits and Font Models
3.2 Training Using Font Files
3.3 Segmentation and Recognition
4 Experiments
4.1 Data Sets
4.2 Protocols for Evaluation
4.3 Character Segmentation
4.4 Feature Extraction
4.5 Recognition Results
5 Conclusion and Future Work
References

Online Handwriting Recognition for Indic Scripts
1 Introduction
2 The Structure of Indic Scripts
3 Challenges for Online HWR
3.1 Large Alphabet Size
3.2 Two-Dimensional Structure
3.3 Inter-class Similarity
3.4 Issues with Writing Styles
3.5 Language-Specific and Regional Differences in Usage
4 Recognition of Isolated Characters
4.1 Strategies
4.2 Preprocessing
4.3 Features
4.4 Classification
5 Word Recognition
5.1 Preprocessing
5.2 Analytic Approaches Based on Explicit Segmentation
5.3 Analytic Approaches Based on Implicit Segmentation
5.4 Holistic Approaches
5.5 Language Models
6 Applications
7 Resources
7.1 Data Set Standards
7.2 Tools
7.3 Data Sets
8 Summary
References

Part II Retrieval of Indic Documents

Enhancing Access to Primary Cultural Heritage Materials of India
1 Introduction
2 Linguistic Tools
3 Image-Processing Tools

Digital Image Enhancement of Indic Historical Manuscripts
1 Introduction
2 Image Enhancement
2.1 Background Normalization
2.1.1 Background Normalization Using a Piece-Wise Linear Model
2.1.2 Background Normalization Using a Nonlinear Model
2.2 Image Normalization
2.3 Background Normalization for Color Images
2.4 Color Document Image Enhancement
3 Experiments
4 Extract Text Lines from Images
4.1 ALCM Method
4.1.1 ALCM Transform
4.1.2 Locations of Possible Text Lines
4.1.3 Extraction of Text
5 Conclusion
References

GFG-Based Compression and Retrieval of Document Images in Indian Scripts
1 Introduction
2 Geometric Feature Graph (GFG) of a Word Image
2.1 GFG Extraction
2.2 Converting the GFG to a String Representation
2.3 Reconstruction of Word Images Using GFG
2.4 GFG Compression
3 GFG-Based Indexing
4 Latent Semantic Indexing Using GFG
4.1 Results of Using LSA and PLSA
5 Ontology-Based Access with GFG
5.1 Concept-Driven Document Image Retrieval
5.2 Results
6 Conclusion
References

Word Spotting for Indic Documents to Facilitate Retrieval
1 Introduction
2 Related Work
3 Proposed Methodologies
3.1 Recognition-Based Keyword Spotting
3.1.1 Performance
3.2 Recognition-Free Keyword Spotting
3.2.1 Performance
4 Conclusion
References

Indian Language Information Retrieval
1 Introduction
1.1 Background
2 Overview of Indian Language IR
2.1 Information Sources
2.2 Research Efforts
2.2.1 Text Retrieval
2.2.2 Information Extraction
2.2.3 Question Answering
2.2.4 Topic Detection and Tracking
2.2.5 Indian Language Subtrack at CLEF 2007
3 The CLIA Project
3.1 The Forum for Information Retrieval Evaluation (FIRE)
4 Conclusion
References

Colour Plates
Index