dehyphenate OCR result

olive · 2015-02-10 19:25:49

I use tesseract to convert scanned documents to plain text. It works pretty well, but there are a few flaws in the result that should, in theory, not be too difficult to fix. I have already written a small Python script to replace ligatures with the corresponding sequences of letters. Another problem is words hyphenated at line ends. Is there a tool that can dehyphenate words and rewrap paragraphs? It does not look too difficult in theory, but I have found no tool that does it.

olive · 2015-02-11 14:43:02

I am answering my own post. Having found no solution, I have written a Python script for this. My script limits the character set to what is available in latin9, to avoid ligatures, curly apostrophes, etc., which are not very suitable for plain text files. Any character outside the latin9 range is approximated by an ASCII character.
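
A rough sketch of the latin9 restriction described above, for illustration only; it is not the script from the links in this post. The function name `to_latin9`, the `PUNCT` table, and the `?` fallback are assumptions made for the example. It keeps any character that encodes as ISO 8859-15, maps a few common punctuation marks by hand, and otherwise falls back to the ASCII part of the character's Unicode compatibility decomposition.

```python
import unicodedata

# Hypothetical hand-written table for punctuation that has no
# Unicode decomposition (curly quotes, dashes).
PUNCT = {
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
}

def to_latin9(text):
    out = []
    for ch in text:
        try:
            # Keep the character if latin9 (ISO 8859-15) can represent it.
            ch.encode("iso8859_15")
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        if ch in PUNCT:
            out.append(PUNCT[ch])
            continue
        # Compatibility decomposition splits ligatures into letters
        # and separates accents from base letters.
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        # Assumed fallback: "?" when nothing ASCII survives.
        out.append(ascii_part or "?")
    return "".join(out)
```

For example, `to_latin9("\ufb01nd")` (a word starting with the "fi" ligature) gives `find`, while accented letters that exist in latin9 are left alone.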

My script dehyphenates words and rewraps paragraphs. The only problem is with a hard hyphen that appears at the end of a line, like:

a family-
owned cafe
->
a familyowned cafe.

While this is a weakness, it is uncommon enough not to be a major problem. With this, the output of tesseract is much easier to edit in a word processor. I put my script here if anyone is interested: ~~http://pastie.org/9939574~~
Edited: new version with bugfixes: http://pastie.org/9941695
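
For readers who cannot reach the pastie links, the dehyphenate-and-rewrap idea can be sketched roughly as follows. This is not the script linked above, only a minimal illustration; the function name `dehyphenate` and the `width` parameter are made up for the example. It assumes paragraphs are separated by blank lines and joins any word split by a hyphen at the end of a line, so it has the same hard-hyphen weakness shown above.

```python
import re
import textwrap

def dehyphenate(text, width=72):
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        # Join a word split by a hyphen at a line end:
        # "exam-\nple" becomes "example".
        joined = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Remaining line breaks inside the paragraph become spaces.
        joined = " ".join(joined.split())
        # Rewrap the paragraph to the requested width.
        result.append(textwrap.fill(joined, width=width))
    return "\n\n".join(result)
```

Calling `dehyphenate("a family-\nowned cafe")` returns `a familyowned cafe`, which reproduces the weakness. Telling a hard hyphen from a soft one would need something extra, such as a word list, which this sketch does not attempt.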

Last edited by olive (2015-02-12 11:15:44)

Wu · 2015-02-11 14:50:53

Thank you for sharing your work.

Arch Linux
