Question

I would like to use tesseract OCR to get text on a rectangular area of the screen using c# on Visual Studio C#.

First, what are required to get tesseract working in Visual Studio C#? I am new to use Visual Studio and setting up wrappers. After hours of searching on Google, I found I would need: a wrapper(charlesw), a language pack from official site. Do I also need to install windows tesseract-ocr?

I have followed the steps on charlesw's GitHu to setup the wrapper in my project. Yet, I am still not sure how to use the functions.

I assume this is how to declare an OCR engine: TesseractEngine engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default);

To analyze the rectangular region on the screen, I could capture the screen of certain region, and then save it in .bmp or .tif. Next, to use the engine to analyze the image. engine.[unkwonapi](imagepath); //what is the api name going to be? I tried to look it up [here][2]. Or, some people said it could be done by using tesseract's api, where we can input the coordination of the rectangular region.

Était-ce utile?

La solution

The wrapper bundles Tesseract DLL (as libtesseract302.dll). You do not need to install windows tesseract-ocr; as a matter of fact, you should not, as it can interfere with the wrapper.

You can use either of the following to specify a region of interest on the image:

engine.Process(Bitmap image, Rect region, PageSegMode? pageSegMode = null)

or

engine.Process(Pix image, Rect region, PageSegMode? pageSegMode = null)

Autres conseils

Here was my process. I first had to rasterize PDFs (which may not be your requirement)

1.) Install Ghostcript 9.26 from here later versions don't work with the next step

2.) Install Ghostscript.NET NuGet Install-Package Ghostscript.NET -Version 1.2.1

3.) Install Tesseract NuGet Install-Package Tesseract -Version 3.3.0

Here is my PDF rasterization routine, using Ghostscript.NET

public static List<MemoryStream> GetPdfImages(FileInfo pdfFile, DirectoryInfo workingDir, string fileNamingToken, TextWriter _logger)
{
    int desired_x_dpi = 150;
    int desired_y_dpi = 150;

    string inputPdfPath = pdfFile.FullName;
    var streams = new List<MemoryStream>();
    using (var rasterizer = new GhostscriptRasterizer())
    {
    GhostscriptVersionInfo gsVersionInfo = GhostscriptVersionInfo.GetLastInstalledVersion(GhostscriptLicense.GPL | GhostscriptLicense.AFPL, GhostscriptLicense.GPL);

    try
    {
        rasterizer.Open(inputPdfPath, gsVersionInfo, true);
    }
    catch (Ghostscript.NET.GhostscriptAPICallException exc)
    {
        _logger.WriteLine("There is an issue with this version of Ghostscript or how Ghostscript was installed. As of Winter 2020, GS 9.26 will work the best with Ghostscript.NET");
    }
    for (var pageNumber = 1; pageNumber <= rasterizer.PageCount; pageNumber++)
    {

        var memoryStrm = new MemoryStream();

        var img = rasterizer.GetPage(desired_x_dpi, desired_y_dpi, pageNumber);
            //save to a memory stream to be returned
        img.Save(memoryStrm, System.Drawing.Imaging.ImageFormat.Tiff);
            //or save to the file system to see how well it's working
        img.Save($"{workingDir.FullName}\\{fileNamingToken}_{pageNumber}.TIF");
        _logger.WriteLine($"Image Dimensions: {img.Width} x {img.Height}");
        streams.Add(memoryStrm);
    }
    }
    return streams;
}

Once I've created a list of memorystreams, I choose to loop through them and OCR a rectangle out of them with Tesseract. If you have a lot of files to process, you shouldn't be invoking the engine over and over again .. you'd keep it around somewhere else

var _engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default, "letters");

var topHalfPageRect = Rect.FromCoords(1, 1, 1275, 825);//at 150 DPI, get top of 8.5x11 page

for(int i =0;i< _streams.Count;i++)
{
   var imgStm = _streams[i];//my list of memorystreams created by Ghostcript 9.26

   imgStm.Position = 0;//set memorystream playhead back to start

    using (var imageWithText = Pix.LoadTiffFromMemory(imgStm.ToArray()))
    {
      using (var page = _engine.Process(imageWithText, topHalfPageRect , PageSegMode.SparseText))
      {
        var text = page.GetText();
        var processedText = text.Replace("\n", "").Trim();
        Console.WriteLine(processedText);

        if (MyRegexPatterns.Pattern1.IsMatch(processedText))
        {
          Console.WriteLine("*** FOUND IT!! ***");
        }
       }
    
   }

   imgStm.Dispose();//but not matter what, disppose of the stream now
}
Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top