How would one use lynx.exe on Windows to convert an HTML string (with CSS and malformed HTML) to plain text?

StackOverflow https://stackoverflow.com/questions/13511475

  •  01-12-2021

Question

I have tried using regular expressions for this task (and have been advised against it, here), so instead I tried the HTMLAgilityPack in this manner. However, the resulting text is very poor: HTML lists (<ol><li></ol>) are completely lost and just end up as clumped-together paragraphs.

In this question I saw that Lynx (compiled for Windows) was recommended as a good alternative, but I am having trouble getting it working. How would one use lynx.exe to convert HTML (stored in a .NET string) to a presentable plain-text string with line breaks, etc.?

The only way I can think of is to write the HTML to a file, use .NET's System.Diagnostics.Process to call lynx.exe -dump, and read the resulting file. This seems very clumsy.

Is there a better way of doing it? What would the exact lynx.exe command line be for such a task?

The Lynx implementation I am using is this one:

http://invisible-island.net/datafiles/release/lynx-cs-setup.exe

Edit: I made some progress. This is the command line I've been using:

lynx.exe -dump "d:\test.html" >d:\output.txt

It sort of works, but if I open the resulting file in Notepad it's all on one line (because Lynx uses only line feed characters for new lines, whereas Notepad needs carriage return + line feed pairs to render line breaks properly).

Also, it inserts too many line feeds: after </li> and <br /> tags it emits two line feeds:

   Hello, this is a normal line of text.


   Next an ordered list:


   1.       The

   2.       Quick

   3.       Brown Fox

   4.       Jumped

I can work around this by replacing two consecutive LFs with a single LF, but I'm still after a C# wrapper for all this.
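The two output problems described above (LF-only line endings and doubled line feeds) can be handled in one post-processing step. The following is a minimal sketch, not part of the original post: the class name LynxText and the method name Normalize are made up for illustration. It collapses every run of consecutive line breaks to a single one and then converts to CRLF so the text displays properly in Notepad.

```csharp
using System;

// Hypothetical helper (names are illustrative, not from the original post).
public static class LynxText
{
    // Collapses runs of consecutive line breaks to a single line break
    // and returns the text with Windows CRLF line endings.
    public static string Normalize(string lynxOutput)
    {
        // Work in LF only first, so mixed line endings are handled uniformly.
        string text = lynxOutput.Replace("\r\n", "\n");

        // Repeat until stable: a single Replace pass turns "\n\n\n" into
        // "\n\n", so one pass is not enough for longer runs.
        while (text.Contains("\n\n"))
        {
            text = text.Replace("\n\n", "\n");
        }

        // Convert to CRLF for Windows tools such as Notepad.
        return text.Replace("\n", "\r\n");
    }
}
```

Usage would be along the lines of string clean = LynxText.Normalize(rawLynxOutput);. Note that this removes all blank lines, including intentional paragraph breaks. If those should survive, one option is to collapse only runs of three or more line feeds down to two.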

Edit 2 - My final solution based on Christian's answer:

Function ConvertHtmlToPlainText(ByVal HtmlString As String) As String

    '#### Define FileBuffer Path
    Dim HtmlBuffer As String = WorkingRoot & "HtmlBuffer.html"

    '#### Delete any old buffer files
    Try
        If File.Exists(HtmlBuffer) = True Then
            File.Delete(HtmlBuffer)
        End If
    Catch ex As Exception
        Return "Error: Deleting old buffer file: " & ex.Message
    End Try

    '#### Write the HTML to the buffer file
    Try
        File.WriteAllText(HtmlBuffer, HtmlString)
    Catch ex As Exception
        Return "Error: Writing new buffer file: " & ex.Message
    End Try

    '#### Check the file was written OK
    If File.Exists(HtmlBuffer) = False Then
        Return "Error: HTML Buffer file was not written successfully."
    End If

    '#### Read the buffer file with Lynx and capture plain text output
    Try
        Dim p = New Process()
        '#### Quote the path so a WorkingRoot containing spaces still works
        p.StartInfo = New ProcessStartInfo(LynxPath, "-dump -width 1000 """ & HtmlBuffer & """")
        p.StartInfo.WorkingDirectory = WorkingRoot
        p.StartInfo.UseShellExecute = False
        p.StartInfo.RedirectStandardOutput = True
        p.StartInfo.RedirectStandardError = True
        p.StartInfo.WindowStyle = ProcessWindowStyle.Hidden
        p.StartInfo.CreateNoWindow = True
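        p.Start()

        '#### Grab the text rendered by Lynx. Read the output before waiting
        '#### for exit: waiting first can deadlock once the output fills the
        '#### redirected pipe's buffer.
        Dim text As String = p.StandardOutput.ReadToEnd()
        p.WaitForExit()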
        Return text.Replace(vbLf & vbLf, vbLf)

    Catch ex As Exception
        Return "Error: Error running LYNX to parse the buffer: " & ex.Message
    End Try
End Function

Solution

With this you can invoke Lynx and capture its output from the redirected StandardOutput into a string, without writing the output to a file first.

using System;
using System.Diagnostics;

namespace Lynx.Dumper
{
  public class Dampler
  {
      public void fdksfjh()
      {
          var url = "http://www.google.com";

          var p = new Process();

          p.StartInfo = new ProcessStartInfo("c:/tools/lynx_w32/lynx.exe", "-dump -nolist " + url)
          {
              WorkingDirectory = "c:/tools/lynx_w32/",
              UseShellExecute = false,
              RedirectStandardOutput = true,
              RedirectStandardError = true,
              WindowStyle = ProcessWindowStyle.Hidden,
              CreateNoWindow = true
          };

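          p.Start();

          // Grab the text rendered by Lynx. Read before waiting for exit:
          // waiting first can deadlock once the output fills the pipe buffer.
          var text = p.StandardOutput.ReadToEnd();
          p.WaitForExit();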

          Console.WriteLine(text);
      }
  }
}
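The code above dumps a URL, whereas the question starts from HTML held in a string. The following sketch adapts the same approach to that case. It is only an illustration under stated assumptions: the names LynxRunner, BuildArguments, HtmlToText and the lynxExePath parameter are hypothetical, and the flags (-dump, -nolist, -width 1000) are the ones already used in this thread. It still writes the HTML to a temporary input file, because it does not assume that this Lynx build can read HTML from standard input; only the output is captured through the pipe.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical wrapper (names are illustrative, not from the original post).
public static class LynxRunner
{
    // Builds the Lynx argument string; the path is quoted so that
    // directories containing spaces survive command-line parsing.
    public static string BuildArguments(string htmlPath, int width)
    {
        return "-dump -nolist -width " + width + " \"" + htmlPath + "\"";
    }

    // Renders an HTML string to plain text by running lynx.exe on a temp file.
    // lynxExePath is assumed to point at a working lynx.exe.
    public static string HtmlToText(string lynxExePath, string html)
    {
        // A unique file name avoids collisions between concurrent calls.
        string tempPath = Path.Combine(
            Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".html");
        File.WriteAllText(tempPath, html);

        try
        {
            var startInfo = new ProcessStartInfo(
                lynxExePath, BuildArguments(tempPath, 1000))
            {
                UseShellExecute = false,        // required for redirection
                RedirectStandardOutput = true,
                CreateNoWindow = true
            };

            using (var process = Process.Start(startInfo))
            {
                // Read first, then wait, so a large dump cannot fill the
                // pipe buffer and block Lynx while we block on WaitForExit.
                string text = process.StandardOutput.ReadToEnd();
                process.WaitForExit();
                return text;
            }
        }
        finally
        {
            // Clean up the temp file even if Lynx fails to start.
            File.Delete(tempPath);
        }
    }
}
```

A call might look like string text = LynxRunner.HtmlToText(@"c:\tools\lynx_w32\lynx.exe", html);. Standard error is deliberately not redirected here: redirecting it without ever reading it can block the child process in the same way as an unread standard output pipe.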
License: CC-BY-SA with attribution
Not affiliated with StackOverflow