Question

I have 55 000 image files (in both JPG and TIFF format) which are pictures from a book.

The structure of each page is this:

some text

--- (horizontal line) ---

a number

some text

--- (horizontal line) ---

another number

some text

There can be from zero to four horizontal lines on any given page.

I need to find what the number is, just below the horizontal line.

BUT, numbers strictly follow each other, starting at one on page one, so in order to find the number, I don't need to read it: I could just detect the presence of horizontal lines, which should be both easier and safer than trying to OCR the page to detect the numbers.

The algorithm would be, basically:

for each image
  count horizontal lines
  print image name, number of horizontal lines
  next image

The question is: what would be the best image library/language to do the "count horizontal lines" part?


Solution

Probably the easiest way to detect your lines is using the Hough transform in OpenCV (which has wrappers for many languages).

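The OpenCV Hough transform detects the straight lines in the image. The standard version (HoughLines) returns each line as a distance and an angle, while the probabilistic version (HoughLinesP) returns the start/stop coordinates of each segment. You should only keep the lines whose angle is close to horizontal and whose length is adequate.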
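As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes dark lines on light paper and uses OpenCV's probabilistic transform (cv2.HoughLinesP). The function names, the Otsu binarization step and every threshold (a third of the page width, 2 degrees, a 10-pixel row gap) are my own assumptions rather than anything from the question, and they would need tuning on real scans.

```python
import math


def group_horizontal_segments(segments, min_length, max_angle_deg=2.0, row_gap=10):
    """Count distinct horizontal rules among (x1, y1, x2, y2) segments.

    Hypothetical helper: the Hough transform often reports one thick rule
    as several segments, so segments whose rows are within `row_gap`
    pixels of each other are counted as a single line.
    """
    rows = []
    for x1, y1, x2, y2 in segments:
        length = math.hypot(x2 - x1, y2 - y1)
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))
        angle = min(angle, 180.0 - angle)  # fold so that 0 means horizontal
        if length >= min_length and angle <= max_angle_deg:
            rows.append((y1 + y2) / 2.0)

    rows.sort()
    count = 0
    last = None
    for y in rows:
        if last is None or y - last > row_gap:
            count += 1  # far enough from the previous segment: a new rule
        last = y
    return count


def count_horizontal_lines(path):
    """Count horizontal rules in one page image (thresholds are guesses)."""
    import cv2  # imported lazily so the helper above works without OpenCV

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise IOError("could not read " + path)

    # Invert while binarizing: dark ink becomes white, which is what
    # HoughLinesP treats as foreground.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    min_length = gray.shape[1] // 3  # assumed: a rule spans >= 1/3 of the page
    segments = cv2.HoughLinesP(binary, 1, math.pi / 180, threshold=100,
                               minLineLength=min_length, maxLineGap=10)
    if segments is None:  # no segments found
        return 0
    return group_horizontal_segments(segments.reshape(-1, 4).tolist(),
                                     min_length)
```

To get the listing described in the question, call count_horizontal_lines() on each file name in a loop and print the name next to the result. Dense rows of text can also produce long, nearly horizontal segments, so it is worth checking the counts by eye on a handful of pages before running all 55 000.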

O'Reilly's Learning OpenCV explains in detail the function's input and output (p.156).

OTHER TIPS

If you have good contrast, try running connected components and analyzing the result. It can be an alternative to finding lines through Hough, and it covers the cases where your structural elements are a bit curved or where a line algorithm picks up lines you don’t want.

Connected components is a very fast, two-pass raster-scan algorithm that gives you a mask in which each connected element is marked with its own label and accounted for. You can discard anything too short (judged by its aspect ratio). Overall, this can be more general and faster, but probably a bit more involved, than running the Hough transform. The Hough transform, on the other hand, is more tolerant of contrast artifacts and even of accidental gaps in lines. OpenCV has the function findContours(), which finds the components for you.
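A minimal sketch of the connected-components route, again in Python. Instead of findContours() it uses cv2.connectedComponentsWithStats, which returns a bounding box per component directly. The function names and the cut-offs (a width of at least 30% of the page and an aspect ratio of at least 20) are assumptions for illustration only.

```python
def count_rule_like_boxes(boxes, page_width, min_width_frac=0.3, min_aspect=20.0):
    """Count (width, height) bounding boxes that look like horizontal rules.

    Hypothetical helper: a box qualifies when it is wide relative to the
    page and much wider than it is tall.
    """
    count = 0
    for width, height in boxes:
        wide_enough = width >= min_width_frac * page_width
        flat_enough = width / max(height, 1) >= min_aspect
        if wide_enough and flat_enough:
            count += 1
    return count


def count_lines_by_components(path):
    """Count rule-like connected components in one page image."""
    import cv2  # imported lazily so the helper above works without OpenCV

    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise IOError("could not read " + path)

    # Dark ink on light paper becomes white foreground on black.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    num_labels, _labels, stats, _centroids = cv2.connectedComponentsWithStats(
        binary, connectivity=8)

    # Label 0 is the background, so start at 1.
    boxes = [(int(stats[i, cv2.CC_STAT_WIDTH]), int(stats[i, cv2.CC_STAT_HEIGHT]))
             for i in range(1, num_labels)]
    return count_rule_like_boxes(boxes, gray.shape[1])
```

As the answer notes, this approach is sensitive to gaps: a rule that is broken in the scan splits into several components, each of which may fall below the width cut-off.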

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow