Foreword
Preface
1 Part I: Recognition of Indic Scripts
2 Part II: Retrieval of Indic Documents
3 Target Audience
Acknowledgments
Contents
Contributors
Part I Recognition of Indic Scripts
Building Data Sets for Indian Language OCR Research
1 Introduction
2 Datasets
2.1 Image Corpus
2.1.1 Digitization
2.1.2 Processing and Storage
2.2 Text Corpus
2.3 Annotated Data Sets
3 Annotation
3.1 Hierarchical Annotation
3.1.1 Different Levels of Annotation
3.1.2 Methods of Annotation
3.2 Annotation Process
3.2.1 Segmentation
3.2.2 Components Labeling
3.2.3 Annotation Tools
4 Representation and Access
4.1 Sources of Metainformation
4.2 Recognizer-Specific Metainformation
4.3 Digitization Meta Information
4.4 Annotation Data
4.4.1 Page Structure Information
4.4.2 Text Block Structure Information
4.4.3 Akshara Structure Information
4.5 Representation Issues
4.5.1 Complex Layout
4.5.2 Indian Language Script Issues
4.6 Data Access
5 Implementation and Execution
5.1 Organization of Tasks
5.2 Status of the Data Sets
6 Conclusions
References
On OCR of Major Indian Scripts: Bangla and Devanagari
1 Introduction
2 Basic OCR System
2.1 Group and Individual Character Classifiers
3 Quantification of Errors
4 Post-recognition Error Correction
4.1 Forward-Backward Error Correction Scheme
5 Discussion
References
A Complete Machine-Printed Gurmukhi OCR System
1 Introduction
2 Characteristics of Gurmukhi Script
2.1 Character Set
2.2 Connectivity of Symbols
2.3 Word Partitioning into Zones
2.4 Frequently Touching Characters
2.5 Broken Characters and Headlines
2.6 Similarity of Group of Symbols
3 System Overview
4 Digitization and Pre-processing
5 Splitting Text into Horizontal Text Strips
6 Word Segmentation
7 Sub-division of Strips into Smaller Units
8 Repairing the Word Shape
9 Thinning
10 Repairing Broken Characters
11 Character Segmentation
11.1 Touching Characters
12 Recognition Stage
12.1 Feature Extraction
12.2 Classification
12.2.1 Design of the Binary Tree Classifier
12.3 Merging Sub-symbols
13 Post-Processing
13.1 Check for the Existence of a Word in the Corpus
13.2 Perform Holistic Recognition of a Word
14 Experimental Results
15 Conclusion
References
Progress in Gujarati Document Processing and Character Recognition
1 Introduction
2 Gujarati Script: OCR Perspective
3 Segmentation
4 Zone Boundary Identification
4.1 Using Slopes of the Imaginary Lines Joining Top Left (Bottom Right) Corners
4.2 Dynamic Programming Approach
5 Extracting Recognizable Units
6 Recognition
6.1 Feature Extraction
6.1.1 Fringe Map
6.1.2 Discrete Cosine Transform
6.1.3 Wavelet Transform
6.1.4 Zone Information
6.1.5 Aspect Ratio
6.2 Classification
6.2.1 Nearest Neighbor Classifier
6.2.2 Artificial Neural Networks [25, 26]
6.2.3 Multi-layer Perceptron (MLP) [25]
6.2.4 Radial Basis Functions (RBF) Networks
6.2.5 General Regression Neural Network (GRNN)
6.3 Experimental Setup and Results
7 Text Generation
8 Post-processing
9 Conclusion
References
Design of a Bilingual Kannada-English OCR
1 Introduction
2 Kannada Script
3 Segmentation
3.1 Line Segmentation Based on Connected Components
3.2 Word and Character Segmentation
4 Script Recognition
4.1 Gabor and DCT-Based Identification
4.2 Results of Script Identification
5 Component Classification
5.1 Introduction
5.2 Graph Representations for Components
5.3 Distance Measures
5.4 Classification Strategy
5.5 Training
5.6 Prediction
5.7 Experiments, Results and Discussion
5.7.1 Data Sets
5.7.2 Features for SVM Classifiers
5.7.3 Pre-processing
5.7.4 Results and Discussions
6 Conclusion
References
Recognition of Malayalam Documents
1 Introduction
1.1 The Malayalam Language
1.1.1 Origin
1.1.2 Literary Culture
1.1.3 Word and Sentence Formation
1.2 The Malayalam Script
1.2.1 Script Revision
1.3 Evolution of Printing and Publication
1.4 Challenges in Malayalam Recognition
2 Character Recognition
2.1 Overview of the Approach
2.2 Design Guidelines
2.3 Features for Component Classification
2.4 Classifier Design
2.5 Beyond Recognition of Isolated Symbols
3 Recognition of Online Handwriting
3.1 Stroke Recognition
3.1.1 Dealing with Similar Strokes
3.2 Word Recognizer
4 Experimental Results
4.1 Overview of the Data Set
4.2 Classifier and Feature Comparisons
4.3 Recognition of Online Handwriting
5 Conclusions
References
A Complete OCR System for Tamil Magazine Documents
1 Introduction and Background
1.1 Preprocessing
1.1.1 Skew Estimation
1.1.2 Binarization
1.2 Page Segmentation and Classification
1.2.1 Page Segmentation
1.2.2 Block Classification
1.3 Optical Character Recognition (OCR)
1.3.1 Character Segmentation
1.3.2 Character Recognition
1.4 Logical Structure
1.4.1 Document Models
2 Preprocessing
2.1 Image Size Reduction
2.2 Skew Correction
2.2.1 Text Recognition
2.2.2 Skew Estimation
2.3 Binarization
2.4 Noise Removal
3 Segmentation and Classification
3.1 Page Segmentation
3.2 Classification of the Blocks
4 Optical Character Recognition
4.1 Line, Word, and Character Segmentation
4.2 Recognition of Characters
5 Reconstruction of the Document Image
5.1 Logical Structure Derivation
5.2 Reconstruction into HTML Format
6 Results and Conclusions
6.1 Results
6.2 Conclusions
References
Experiments on Urdu Text Recognition
1 Introduction
2 Urdu Language Resources
3 Prior Work in Urdu Recognition Systems
4 Prior Work in Urdu Document Preprocessing
5 Experiments
References
The BBN Byblos Hindi OCR System
1 Introduction
1.1 Background
1.2 Review of Basic OCR System
1.3 Model Training and Recognition
2 Data
2.1 Hindi Character Set
2.2 Corpus
3 Experimental Results
3.1 Model Configuration
3.2 Recognition Performance
4 Conclusions
References
Generalization of Hindi OCR Using Adaptive Segmentation and Font Files
1 Introduction
1.1 Challenges of Segmentation
1.2 Feature Extraction and Classification
2 Base Devanagari OCR System
2.1 Background
2.2 System Design
2.3 Character Segmentation
2.3.1 Devanagari Script Overview
2.3.2 Hindi Character Segmentation
2.4 Feature Extraction
2.5 Classification
2.5.1 Template Matching
2.5.2 Generalized Hausdorff Image Comparison (GHIC)
2.5.3 Nearest Neighbor Classifier and Weighted Euclidean Distance
2.5.4 Hierarchical Classification
2.6 Devanagari OCR Evaluation
2.7 Additional Challenges
3 Font-Based Intelligent Character Segmentation
3.1 Benefits and Font Models
3.2 Training Using Font Files
3.3 Segmentation and Recognition
4 Experiments
4.1 Data Sets
4.2 Protocols for Evaluation
4.3 Character Segmentation
4.4 Feature Extraction
4.5 Recognition Results
5 Conclusion and Future Work
References
Online Handwriting Recognition for Indic Scripts
1 Introduction
2 The Structure of Indic Scripts
3 Challenges for Online HWR
3.1 Large Alphabet Size
3.2 Two-Dimensional Structure
3.3 Inter-class Similarity
3.4 Issues with Writing Styles
3.5 Language-Specific and Regional Differences in Usage
4 Recognition of Isolated Characters
4.1 Strategies
4.2 Preprocessing
4.3 Features
4.4 Classification
5 Word Recognition
5.1 Preprocessing
5.2 Analytic Approaches Based on Explicit Segmentation
5.3 Analytic Approaches Based on Implicit Segmentation
5.4 Holistic Approaches
5.5 Language Models
6 Applications
7 Resources
7.1 Data Set Standards
7.2 Tools
7.3 Data Sets
8 Summary
References
Part II Retrieval of Indic Documents
Enhancing Access to Primary Cultural Heritage Materials of India
1 Introduction
2 Linguistic Tools
3 Image-Processing Tools
Digital Image Enhancement of Indic Historical Manuscripts
1 Introduction
2 Image Enhancement
2.1 Background Normalization
2.1.1 Background Normalization Using a Piece-Wise Linear Model
2.1.2 Background Normalization Using a Nonlinear Model
2.2 Image Normalization
2.3 Background Normalization for Color Images
2.4 Color Document Image Enhancement
3 Experiments
4 Extract Text Lines from Images
4.1 ALCM Method
4.1.1 ALCM Transform
4.1.2 Locations of Possible Text Lines
4.1.3 Extraction of Text
5 Conclusion
References
GFG-Based Compression and Retrieval of Document Images in Indian Scripts
1 Introduction
2 Geometric Feature Graph (GFG) of a Word Image
2.1 GFG Extraction
2.2 Converting the GFG to a String Representation
2.3 Reconstruction of Word Images Using GFG
2.4 GFG Compression
3 GFG-Based Indexing
4 Latent Semantic Indexing Using GFG
4.1 Results of Using LSA and PLSA
5 Ontology-Based Access with GFG
5.1 Concept-Driven Document Image Retrieval
5.2 Results
6 Conclusion
References
Word Spotting for Indic Documents to Facilitate Retrieval
1 Introduction
2 Related Work
3 Proposed Methodologies
3.1 Recognition-Based Keyword Spotting
3.1.1 Performance
3.2 Recognition-Free Keyword Spotting
3.2.1 Performance
4 Conclusion
References
Indian Language Information Retrieval
1 Introduction
1.1 Background
2 Overview of Indian Language IR
2.1 Information Sources
2.2 Research Efforts
2.2.1 Text Retrieval
2.2.2 Information Extraction
2.2.3 Question Answering
2.2.4 Topic Detection and Tracking
2.2.5 Indian Language Subtrack at CLEF 2007
3 The CLIA Project
3.1 The Forum for Information Retrieval Evaluation (FIRE)
4 Conclusion
References
Colour Plates
Index