Table of contents of Guide to OCR for Indic Scripts - Document Recognition and Retrieval, published by Springer-Verlag

Guide to OCR for Indic Scripts - Document Recognition and Retrieval

by: Venu Govindaraju, Srirangaraj Ranga Setlur

Springer-Verlag, 2009

ISBN: 9781848003309, 325 pages

Format: PDF

Copy protection: Watermark

Price: 149.79 EUR

Foreword: 4
Preface: 6
1 Part I: Recognition of Indic Scripts: 9
2 Part II: Retrieval of Indic Documents: 11
3 Target Audience: 11
Acknowledgments: 13
Contents: 14
Contributors: 16
Part I Recognition of Indic Scripts: 19
Building Data Sets for Indian Language OCR Research: 20
1 Introduction: 20
2 Datasets: 21
2.1 Image Corpus: 21
2.1.1 Digitization: 22
2.1.2 Processing and Storage: 22
2.2 Text Corpus: 23
2.3 Annotated Data Sets: 23
3 Annotation: 24
3.1 Hierarchical Annotation: 26
3.1.1 Different Levels of Annotation: 26
3.1.2 Methods of Annotation: 27
3.2 Annotation Process: 28
3.2.1 Segmentation: 28
3.2.2 Components Labeling: 29
3.2.3 Annotation Tools: 31
4 Representation and Access: 32
4.1 Sources of Metainformation: 33
4.2 Recognizer-Specific Metainformation: 34
4.3 Digitization Meta Information: 34
4.4 Annotation Data: 35
4.4.1 Page Structure Information: 36
4.4.2 Text Block Structure Information: 36
4.4.3 Akshara Structure Information: 37
4.5 Representation Issues: 37
4.5.1 Complex Layout: 37
4.5.2 Indian Language Script Issues: 37
4.6 Data Access: 38
5 Implementation and Execution: 39
5.1 Organization of Tasks: 39
5.2 Status of the Data Sets: 40
6 Conclusions: 40
References: 41
On OCR of Major Indian Scripts: Bangla and Devanagari: 43
1 Introduction: 43
2 Basic OCR System: 45
2.1 Group and Individual Character Classifiers: 48
3 Quantification of Errors: 50
4 Post-recognition Error Correction: 52
4.1 Forward–Backward Error Correction Scheme: 53
5 Discussion: 57
References: 57
A Complete Machine-Printed Gurmukhi OCR System: 59
1 Introduction: 59
2 Characteristics of Gurmukhi Script: 60
2.1 Character Set: 60
2.2 Connectivity of Symbols: 60
2.3 Word Partitioning into Zones: 61
2.4 Frequently Touching Characters: 62
2.5 Broken Characters and Headlines: 62
2.6 Similarity of Group of Symbols: 62
3 System Overview: 62
4 Digitization and Pre-processing: 62
5 Splitting Text into Horizontal Text Strips: 64
6 Word Segmentation: 67
7 Sub-division of Strips into Smaller Units: 68
8 Repairing the Word Shape: 69
9 Thinning: 70
10 Repairing Broken Characters: 72
11 Character Segmentation: 74
11.1 Touching Characters: 77
12 Recognition Stage: 78
12.1 Feature Extraction: 78
12.2 Classification: 80
12.2.1 Design of the Binary Tree Classifier: 81
12.3 Merging Sub-symbols: 81
13 Post-Processing: 84
13.1 Check for the Existence of a Word in the Corpus: 84
13.2 Perform Holistic Recognition of a Word: 84
14 Experimental Results: 85
15 Conclusion: 86
References: 87
Progress in Gujarati Document Processing and Character Recognition: 88
1 Introduction: 88
2 Gujarati Script: OCR Perspective: 89
3 Segmentation: 91
4 Zone Boundary Identification: 92
4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners: 93
4.2 Dynamic Programming Approach: 95
5 Extracting Recognizable Units: 98
6 Recognition: 98
6.1 Feature Extraction: 99
6.1.1 Fringe Map: 100
6.1.2 Discrete Cosine Transform: 100
6.1.3 Wavelet Transform: 101
6.1.4 Zone Information: 102
6.1.5 Aspect Ratio: 102
6.2 Classification: 102
6.2.1 Nearest Neighbor Classifier: 102
6.2.2 Artificial Neural Networks [25, 26]: 103
6.2.3 Multi-layer Perceptron (MLP) [25]: 103
6.2.4 Radial Basis Functions (RBF) networks: 103
6.2.5 General Regression Neural Network (GRNN): 104
6.3 Experimental Setup and Results: 106
7 Text Generation: 107
8 Post-processing: 108
9 Conclusion: 108
References: 109
Design of a Bilingual Kannada–English OCR: 111
1 Introduction: 111
2 Kannada Script: 112
3 Segmentation: 112
3.1 Line Segmentation Based on Connected Components: 114
3.2 Word and Character Segmentation: 115
4 Script Recognition: 115
4.1 Gabor and DCT-Based Identification: 116
4.2 Results of Script Identification: 117
5 Component Classification: 119
5.1 Introduction: 119
5.2 Graph Representations for Components: 120
5.3 Distance Measures: 122
5.4 Classification Strategy: 123
5.5 Training: 123
5.6 Prediction: 124
5.7 Experiments, Results and Discussion: 124
5.7.1 Data Sets: 124
5.7.2 Features for SVM Classifiers: 126
5.7.3 Pre-processing: 128
5.7.4 Results and Discussions: 128
6 Conclusion: 137
References: 137
Recognition of Malayalam Documents: 139
1 Introduction: 139
1.1 The Malayalam Language: 140
1.1.1 Origin: 140
1.1.2 Literary Culture: 140
1.1.3 Word and Sentence Formation: 141
1.2 The Malayalam Script: 141
1.2.1 Script Revision: 143
1.3 Evolution of Printing and Publication: 144
1.4 Challenges in Malayalam Recognition: 145
2 Character Recognition: 146
2.1 Overview of the Approach: 146
2.2 Design Guidelines: 147
2.3 Features for Component Classification: 148
2.4 Classifier Design: 148
2.5 Beyond Recognition of Isolated Symbols: 150
3 Recognition of Online Handwriting: 151
3.1 Stroke Recognition: 152
3.1.1 Dealing with Similar Strokes: 153
3.2 Word Recognizer: 154
4 Experimental Results: 154
4.1 Overview of the Data Set: 154
4.2 Classifier and Feature Comparisons: 155
4.3 Recognition of Online Handwriting: 157
5 Conclusions: 158
References: 159
A Complete OCR System for Tamil Magazine Documents: 161
1 Introduction and Background: 161
1.1 Preprocessing: 162
1.1.1 Skew Estimation: 163
1.1.2 Binarization: 163
1.2 Page Segmentation and Classification: 163
1.2.1 Page Segmentation: 163
1.2.2 Block Classification: 164
1.3 Optical Character Recognition (OCR): 164
1.3.1 Character Segmentation: 164
1.3.2 Character Recognition: 165
1.4 Logical Structure: 165
1.4.1 Document Models: 166
2 Preprocessing: 166
2.1 Image Size Reduction: 166
2.2 Skew Correction: 167
2.2.1 Text Recognition: 167
2.2.2 Skew Estimation: 168
2.3 Binarization: 168
2.4 Noise Removal: 168
3 Segmentation and Classification: 168
3.1 Page Segmentation: 169
3.2 Classification of the Blocks: 169
4 Optical Character Recognition: 170
4.1 Line, Word, and Character Segmentation: 170
4.2 Recognition of Characters: 171
5 Reconstruction of the Document Image: 171
5.1 Logical Structure Derivation: 171
5.2 Reconstruction into HTML Format: 172
6 Results and Conclusions: 172
6.1 Results: 173
6.2 Conclusions: 174
References: 175
Experiments on Urdu Text Recognition: 177
1 Introduction: 177
2 Urdu Language Resources: 180
3 Prior Work in Urdu Recognition Systems: 181
4 Prior Work in Urdu Document Preprocessing: 182
5 Experiments: 183
References: 184
The BBN Byblos Hindi OCR System: 186
1 Introduction: 186
1.1 Background: 186
1.2 Review of Basic OCR System: 187
1.3 Model Training and Recognition: 188
2 Data: 189
2.1 Hindi Character Set: 189
2.2 Corpus: 191
3 Experimental Results: 191
3.1 Model Configuration: 191
3.2 Recognition Performance: 192
4 Conclusions: 192
References: 193
Generalization of Hindi OCR Using Adaptive Segmentation and Font Files: 194
1 Introduction: 194
1.1 Challenges of Segmentation: 195
1.2 Feature Extraction and Classification: 196
2 Base Devanagari OCR System: 197
2.1 Background: 197
2.2 System Design: 198
2.3 Character Segmentation: 200
2.3.1 Devanagari Script Overview: 200
2.3.2 Hindi Character Segmentation: 200
2.4 Feature Extraction: 206
2.5 Classification: 208
2.5.1 Template Matching: 208
2.5.2 Generalized Hausdorff Image Comparison (GHIC): 208
2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance: 209
2.5.4 Hierarchical Classification: 209
2.6 Devanagari OCR Evaluation: 210
2.7 Additional Challenges: 210
3 Font-Based Intelligent Character Segmentation: 212
3.1 Benefits and Font Models: 212
3.2 Training Using Font Files: 214
3.3 Segmentation and Recognition: 214
4 Experiments: 215
4.1 Data Sets: 216
4.2 Protocols for Evaluation: 217
4.3 Character Segmentation: 217
4.4 Feature Extraction: 217
4.5 Recognition Results: 218
5 Conclusion and Future Work: 218
References: 219
Online Handwriting Recognition for Indic Scripts: 221
1 Introduction: 221
2 The Structure of Indic Scripts: 222
3 Challenges for Online HWR: 224
3.1 Large Alphabet Size: 224
3.2 Two-Dimensional Structure: 225
3.3 Inter-class Similarity: 225
3.4 Issues with Writing Styles: 226
3.5 Language-Specific and Regional Differences in Usage: 227
4 Recognition of Isolated Characters: 228
4.1 Strategies: 229
4.2 Preprocessing: 230
4.3 Features: 230
4.4 Classification: 231
5 Word Recognition: 234
5.1 Preprocessing: 235
5.2 Analytic Approaches Based on Explicit Segmentation: 235
5.3 Analytic Approaches Based on Implicit Segmentation: 236
5.4 Holistic Approaches: 237
5.5 Language Models: 238
6 Applications: 238
7 Resources: 240
7.1 Data Set Standards: 241
7.2 Tools: 241
7.3 Data Sets: 242
8 Summary: 242
References: 243
Part II Retrieval of Indic Documents: 247
Enhancing Access to Primary Cultural Heritage Materials of India: 248
1 Introduction: 248
2 Linguistic Tools: 251
3 Image-Processing Tools: 256
Digital Image Enhancement of Indic Historical Manuscripts: 259
1 Introduction: 259
2 Image Enhancement: 261
2.1 Background Normalization: 261
2.1.1 Background Normalization Using a Piece-Wise Linear Model: 262
2.1.2 Background Normalization Using a Nonlinear Model: 264
2.2 Image Normalization: 266
2.3 Background Normalization for Color Images: 267
2.4 Color Document Image Enhancement: 268
3 Experiments: 269
4 Extract Text Lines from Images: 270
4.1 ALCM Method: 272
4.1.1 ALCM Transform: 272
4.1.2 Locations of Possible Text Lines: 274
4.1.3 Extraction of Text: 275
5 Conclusion: 276
References: 276
GFG-Based Compression and Retrieval of Document Images in Indian Scripts: 278
1 Introduction: 278
2 Geometric Feature Graph (GFG) of a Word Image: 280
2.1 GFG Extraction: 281
2.2 Converting the GFG to a String Representation: 282
2.3 Reconstruction of Word Images Using GFG: 283
2.4 GFG Compression: 284
3 GFG-Based Indexing: 285
4 Latent Semantic Indexing Using GFG: 285
4.1 Results of Using LSA and PLSA: 287
5 Ontology-Based Access with GFG: 290
5.1 Concept-Driven Document Image Retrieval: 290
5.2 Results: 291
6 Conclusion: 292
References: 293
Word Spotting for Indic Documents to Facilitate Retrieval: 294
1 Introduction: 294
2 Related Work: 296
3 Proposed Methodologies: 297
3.1 Recognition-Based Keyword Spotting: 297
3.1.1 Performance: 302
3.2 Recognition-Free Keyword Spotting: 303
3.2.1 Performance: 307
4 Conclusion: 307
References: 308
Indian Language Information Retrieval: 309
1 Introduction: 309
1.1 Background: 311
2 Overview of Indian Language IR: 311
2.1 Information Sources: 311
2.2 Research Efforts: 312
2.2.1 Text Retrieval: 313
2.2.2 Information Extraction: 316
2.2.3 Question Answering: 317
2.2.4 Topic Detection and Tracking: 317
2.2.5 Indian Language Subtrack at CLEF 2007: 318
3 The CLIA Project: 319
3.1 The Forum for Information Retrieval Evaluation (FIRE): 320
4 Conclusion: 320
References: 321
Colour Plates: 323
Index: 329


All prices include statutory VAT.
