#1 2015-02-10 19:25:49

olive
Member
From: Belgium
Registered: 2008-06-22
Posts: 1,490

dehyphenate OCR result

I use tesseract to convert scanned documents to plain text. It works pretty well, but there are a few flaws in the result that should, in theory, not be so difficult to fix. I have already made a small Python script to replace ligatures with the corresponding sequences of letters. Another problem is words hyphenated at line ends. Is there a tool that can dehyphenate words and rewrap paragraphs? It does not look too difficult in theory, but I have found no tool that does it.
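
For reference, a ligature replacement of this kind can be sketched in a few lines of Python. This is only a minimal illustration, not the actual script; the table `LIGATURES` and the function name `expand_ligatures` are made up for the example, and the table covers only the common Latin ligature code points U+FB00 to U+FB06.

```python
# Hypothetical mapping from ligature code points to plain letter sequences.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature, approximated as "st"
    "\ufb06": "st",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Return text with each known ligature replaced by its letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("\ufb01nd the \ufb02ow"))
```

Characters that are not in the table pass through unchanged, so this can be run safely on text that contains no ligatures.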

#2 2015-02-11 14:43:02

olive
Member
From: Belgium
Registered: 2008-06-22
Posts: 1,490

Re: dehyphenate OCR result

I am answering my own post. Having found no solution, I wrote a Python script for this. My script limits the character set to the characters available in latin9, to avoid ligatures, curly apostrophes, etc., which are not very suitable for plain text files. Any character outside the latin9 range is approximated by an ASCII character.
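
The idea of restricting text to latin9 can be sketched roughly as follows. This is an assumption about one way to do it, not a copy of the script: the names `PUNCT` and `to_latin9`, the small punctuation table, and the `?` fallback are all made up for the example. Characters that latin9 (ISO 8859-15) can encode are kept; others are approximated through the punctuation table or through Unicode compatibility decomposition.

```python
import unicodedata

# Hypothetical ASCII stand-ins for common typographic punctuation.
PUNCT = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u2026": "...",                # ellipsis
}


def to_latin9(text, fallback="?"):
    """Keep latin9 characters, approximate everything else in ASCII."""
    out = []
    for ch in text:
        try:
            ch.encode("iso8859-15")   # encodable in latin9: keep as-is
            out.append(ch)
            continue
        except UnicodeEncodeError:
            pass
        if ch in PUNCT:
            out.append(PUNCT[ch])
            continue
        # Decompose (e.g. a ligature into letters, or a letter plus
        # combining accent) and keep only the ASCII part.
        decomposed = unicodedata.normalize("NFKD", ch)
        ascii_part = decomposed.encode("ascii", "ignore").decode("ascii")
        out.append(ascii_part or fallback)
    return "".join(out)
```

With this approach, accented letters that exist in latin9 survive untouched, while a character with no reasonable approximation ends up as the fallback character.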

My script dehyphenates words and rewraps paragraphs. The only problem is a hard hyphen that falls at the end of a line, for example:

a family-
owned cafe
->
a familyowned cafe.


While this is a weakness, it is uncommon enough not to be a major problem. With this, the output of tesseract is much easier to edit in a word processor. I have put my script here in case anyone is interested: http://pastie.org/9939574
Edit: new version with bug fixes: http://pastie.org/9941695
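
For readers who cannot reach the links, the dehyphenate-and-rewrap step described above can be sketched as below. This is a simplified illustration and not the linked script: the function name `dehyphenate`, the default width of 72, and the assumption that paragraphs are separated by blank lines are all choices made for the example. It has the same weakness with hard hyphens at the end of a line.

```python
import re
import textwrap


def dehyphenate(text, width=72):
    """Join words split by a line-end hyphen, then rewrap each paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    result = []
    for para in paragraphs:
        # "docu-\nment" -> "document"; a hard hyphen is lost the same way.
        joined = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", para)
        # Collapse the remaining line breaks and rewrap to the given width.
        result.append(textwrap.fill(" ".join(joined.split()), width=width))
    return "\n\n".join(result)


if __name__ == "__main__":
    print(dehyphenate("This is a scanned docu-\nment with short\nlines."))
```

A dictionary lookup on the joined word could help decide whether to keep the hyphen, but that is beyond this sketch.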

Last edited by olive (2015-02-12 11:15:44)

#3 2015-02-11 14:50:53

Wu
Member
Registered: 2009-12-27
Posts: 80

Re: dehyphenate OCR result

Thank you for sharing your work :)
