Question

I use Prolog's sgml library to extract information from a web page. I load the whole page with this call:

load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])

The system loads the page, but I get some warnings, for instance:

WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
WARNING:SGML2PL(sgml): entity "amp" does not exist

How can I eliminate these warnings?

Solution

I use this:

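:- use_module(library(sgml)).

% Parse an HTML file (or a stream(Stream) term) into a single DOM term P,
% using the HTML DTD and keeping the parser quiet about syntax errors.
get_html_file(FileOrStream, P) :-
        dtd(html, DTD),
        load_structure(FileOrStream, [P],
                       [ dtd(DTD),
                         dialect(sgml),
                         shorttag(false),
                         syntax_errors(quiet),
                         max_errors(-1)
                       ]).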

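The option syntax_errors(quiet) should do it. Passing the HTML DTD via dtd(DTD) should also help, since it tells the parser about entities such as "amp" and about empty elements such as "img" and "br".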
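
As a rough sketch of the same idea, recent SWI-Prolog versions also provide load_html/3 in library(sgml), which as far as I know applies the HTML DTD, max_errors(-1) and syntax_errors(quiet) by default. The helper name html_string_dom/2 is made up for this example, and reading from a string via open_string/2 is only there to keep the snippet self-contained:

```prolog
:- use_module(library(sgml)).

% html_string_dom(+String, -DOM)
% Hypothetical helper: parse HTML text held in a string into a DOM list.
% load_html/3 is assumed to add dtd(DTD), max_errors(-1) and
% syntax_errors(quiet) to the options before calling load_structure/3.
html_string_dom(String, DOM) :-
    setup_call_cleanup(
        open_string(String, In),            % read the string as a stream
        load_html(stream(In), DOM, []),     % no extra options needed
        close(In)).
```

It can be tried at the toplevel with markup similar to what triggered the warnings in the question:

```prolog
?- html_string_dom("<p>a &amp; b<br><img src='x.png'></p>", DOM).
```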

I recall having a hard time parsing old pages with errors. Error handling can be complicated; a more tolerant "tag soup" style tool could help get the work done.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow