A10: Preprocessing Text

In Activity 1, we digitized a hand-drawn plot by estimating pixel locations and using ratio and proportion to compute real-world values. In this activity, we digitize handwritten text. Additional considerations with the given image (Figure 1) are its tilt and its table lines, since the document is a form. We also attempt to recognize typewritten text.


Figure 1. Scanned document with typewritten and handwritten characters.

The chosen ROI is shown in Figure 2.

Figure 2. Region of interest with typewritten and handwritten characters with table lines.

As in most activities, we first import the image into Scilab as grayscale using grayscale_imread. Then we attempt to eliminate the tilt, and this is where the table lines are useful. We saw in Activity 6 that rotating a pattern results in the same rotation of its Fourier transform. From the (log) Fourier transform in Figure 3, we find that the image has been tilted by 0.9°. We can now use the mogrify command to fix this (Figure 4):
tilt = string(0.9);                        // tilt angle measured from the FFT
imgmog = mogrify(img, ['-rotate', tilt]);  // counter-rotate to deskew the scan
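The original deskewing is done with Scilab's mogrify wrapper around ImageMagick. As an illustration only, a rough Python/SciPy equivalent might look like the sketch below; the function name deskew and the use of scipy.ndimage.rotate are my own choices, not part of the activity.

```python
import numpy as np
from scipy import ndimage

def deskew(img, tilt_deg):
    # Counter-rotate by the tilt measured from the FFT.
    # reshape=False keeps the original image dimensions,
    # similar to what mogrify does here.
    return ndimage.rotate(img, -tilt_deg, reshape=False, order=1)

img = np.eye(64)          # stand-in for the scanned grayscale ROI
fixed = deskew(img, 0.9)  # undo the 0.9° tilt found from the spectrum
```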



Figure 3. FFT of ROI.


Figure 4. ROI with tilt removed.


Figure 5. FFT of ROI with tilt removed.

We then take the FFT of the rotated image (Figure 5) so we can design a filter (Figure 6) to eliminate the table lines, as in Activity 7; the result is shown in Figure 7. We then binarize using im2bw (Figure 8) and thin the letters or objects down to one pixel in width (Figure 9) using the thin command demonstrated in Activity 9.
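The idea behind the filter is that horizontal lines are constant along x, so their energy sits on the fx = 0 column of the 2-D spectrum; zeroing a narrow band of that column (while sparing the lowest frequencies, which carry the overall brightness) suppresses the lines. A minimal NumPy sketch of this kind of mask, with made-up parameter names (halfwidth, keep), assuming the spectrum is fftshifted to the center:

```python
import numpy as np

def remove_horizontal_lines(img, halfwidth=1, keep=2):
    # Horizontal table lines concentrate energy on the fx = 0 column
    # of the shifted 2-D FFT; zero a narrow band of that column,
    # keeping the lowest fy frequencies around DC.
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    mask = np.ones((h, w))
    mask[:, cx - halfwidth:cx + halfwidth + 1] = 0        # kill the column
    mask[cy - keep:cy + keep + 1,
         cx - halfwidth:cx + halfwidth + 1] = 1           # spare low freqs
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

# demo: periodic horizontal lines every 8 rows
lines = np.zeros((64, 64))
lines[::8, :] = 1.0
cleaned = remove_horizontal_lines(lines)
```

After filtering, the row-to-row brightness variation caused by the lines is largely gone, while content that varies along x is untouched.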


Figure 6. Filter mask for removing horizontal table lines.


Figure 7. ROI with horizontal table lines partially removed.


Figure 8. Binarized ROI.


Figure 9. ROI with objects thinned down to 1 pixel.

We then label each extracted feature or letter using bwlabel (also used in Activity 9) so we can display the letters one by one and check the quality of the preprocessing done.
for i = 1:max(imglabel)
    letter = 1*(imglabel == i);   // binary mask of the i-th labeled blob
    imwrite(letter, path + fname + '-letter-' + string(i) + '.bmp');
end
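The same connected-component labeling can be sketched in Python with scipy.ndimage.label, which plays the role of bwlabel here (the toy blobs below are my own example, not the activity's data):

```python
import numpy as np
from scipy import ndimage

# Toy binary image with two separate blobs standing in for letters.
binary = np.zeros((8, 8), dtype=int)
binary[1:3, 1:3] = 1   # first blob, 2x2
binary[5:7, 4:7] = 1   # second blob, 2x3

# label() assigns 1..n to connected components (4-connectivity by default),
# like bwlabel; each letter is then extracted as its own binary mask.
labels, n = ndimage.label(binary)
letters = [(labels == i).astype(int) for i in range(1, n + 1)]
```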

Figure 10 shows the letters detected. As expected, typewritten letters are easier to detect than handwritten ones: only letters from the typewritten word were detected.



Figure 10. Features detected (clockwise): "DE", "SCR", "I", and "P".

We now attempt to recognize the occurrences of "DESCRIPTION" at different areas of the original image in Figure 1. This may be done simply using the technique from Activity 5: taking the correlation of the image with the pattern "DESCRIPTION" using Scilab's imcorrcoef command. Figure 11 demonstrates that the technique works, since all occurrences of "DESCRIPTION" were detected (represented by the fine dots or peaks that are not found anywhere else).
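The template-matching idea can be sketched with a plain cross-correlation in Python; scipy.signal.correlate2d stands in for imcorrcoef here, and mean-subtracting both image and template approximates the normalized correlation it computes. The embedded cross pattern and its location are my own toy example.

```python
import numpy as np
from scipy.signal import correlate2d

# Embed a small template at a known spot, then look for the
# correlation peak that marks where the pattern occurs.
img = np.zeros((16, 16))
tmpl = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]], dtype=float)
img[4:7, 9:12] = tmpl   # pattern placed at rows 4-6, cols 9-11

# Mean-subtract so flat regions correlate to ~0, then correlate.
c = correlate2d(img - img.mean(), tmpl - tmpl.mean(), mode='same')
peak = np.unravel_index(np.argmax(c), c.shape)  # (row, col) of best match
```

The peak lands at the center of the embedded pattern, which is exactly how the fine dots in Figure 11 mark each occurrence of "DESCRIPTION".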


Figure 11. Image (left) and correlation with "DESCRIPTION" (right).

I give myself a grade of 8 because I was only able to extract typewritten letters from the document and not the handwritten characters. I was, however, able to demonstrate pattern recognition using correlations.

I would like to thank Ms. Kaye Vergel and Mr. Luis Buño III for useful discussions.
