More OCR questions?
I'm continuing to mess about with this.
What I'm trying to determine is the best set of "common" values to use in the settings, meaning the ones that give me the highest percentage of accurate results.
The image I have to detect text on uses two sizes of font. One small, one large. The font in question is derived from the Windows Tahoma font, and then rendered onto a canvas, and coloured with some aliasing/dithering applied to smooth it. So it's a font the TC OCR engine can handle out of the box. Pretty much. It will never be perfect, I know that. But in smaller manual trials, I did manage to get it around 90% accurate on the larger text, using the default available fonts.
But I wanted to see which font size gave me the best results, as OCR can be pretty slow. Reducing it to a smaller set of fonts to attempt to match against gives a pretty considerable performance gain.
So, I set up a looped test. It went through all my stored images (40 in all - 22 large text and 18 small) and ran through the full image set using only a single font size.
I got the best results on the large text using sizes 14, 16 & 24. On the small text, oddly enough, I got the best results with font size 30. But whatever, it produced the best results so I'll roll with that.
So then I re-ran the whole image set. But this time I gave it the four best font sizes from the previous single font trials. So it got 14, 16, 24 & 30 to work with.
Everything else the same. Same images. Same OCR options. The ONLY thing that changed was the font sizes available to the OCR engine as it ran through.
On all these runs, I'm using greyscale binarization. It loops through, increasing the binarization value by 25 on each loop. So I get a full set of results for all 40 images at each binarization step. As I say, this has not changed through any of these test runs.
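For anyone who wants to reproduce the setup, the test loop is shaped roughly like the sketch below. This is a minimal Python outline, not my actual script. `recognize` is a hypothetical stand-in for whatever call runs the OCR engine (it is not a real TestComplete function), and the 0 to 255 range for the binarization value is an assumption; only the step of 25 is from my runs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (image_path, font_sizes, binarization) -> recognized text.
Recognizer = Callable[[str, Tuple[int, ...], int], str]


def run_sweep(
    images: Dict[str, str],              # image path -> expected text
    font_sets: List[Tuple[int, ...]],    # e.g. [(30,), (14, 16, 24, 30)]
    recognize: Recognizer,
    step: int = 25,                      # binarization increment per loop
    max_value: int = 255,                # assumed upper bound
) -> Dict[Tuple[Tuple[int, ...], int], int]:
    """Count exact matches for every (font set, binarization value) pair."""
    results = {}
    for sizes in font_sets:
        for value in range(0, max_value + 1, step):
            hits = sum(
                1
                for path, expected in images.items()
                if recognize(path, sizes, value).strip() == expected
            )
            results[(sizes, value)] = hits
    return results
```

Passing the OCR call in as a parameter keeps everything else identical between runs, so the only thing that differs from one entry to the next is the font set and the binarization value.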
And yet, when I switch from single font, to four best fonts (from the single font runs), my results change completely?!?!?
Some of the large text results are better. Some are worse. The small text results are pretty much universally worse?!?!?
How is this possible?
If I matched 5 out of 18 with ONLY font size 30 available, why does this drop to only 2 out of 18 when it has font sizes 14/16/24/30 available? Surely the 5 at font size 30 should still produce matches?
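To pin down exactly which images stop matching, the two runs can be compared image by image instead of just by totals. A small Python sketch of that comparison follows; the image names in the usage example are made up for illustration and are not my real results.

```python
from typing import Dict, Iterable, List


def compare_runs(single: Iterable[str], combined: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the images matched in a single-font run with a multi-font run."""
    single_set, combined_set = set(single), set(combined)
    return {
        "kept": sorted(single_set & combined_set),    # matched in both runs
        "lost": sorted(single_set - combined_set),    # matched only with the single font
        "gained": sorted(combined_set - single_set),  # matched only with the combined fonts
    }


# Hypothetical image names, for illustration only.
report = compare_runs(["small_01", "small_02", "small_03"], ["small_02", "small_04"])
print(report)
```

The "lost" list is the interesting one: those are the images where having more font sizes available changed an answer that was previously correct.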
I'll do two separate checks if I have to (one per text size, each with its optimal single font setting), but this will come with a performance hit so I'd rather not.
Any ideas why it behaves this way?
I still have more (much more!) settings to play with. Both with the image files themselves and the OCR options applied when scanning them. But I'm already seeing unexplained inconsistency just from adding font sizes ....