I use the sgml library of Prolog to extract information from a web page. I use this call to load the whole page:

load_structure('file.html', List, [dialect(sgml), shorttag(false), max_errors(-1)])

The system loads the page, but I get some warnings, for instance:

WARNING:SGML2PL(sgml): inserted omitted end-tag for "img"
WARNING:SGML2PL(sgml): inserted omitted end-tag for "br"
WARNING:SGML2PL(sgml): entity "amp" does not exist

How can I eliminate these warnings?


Solution

I use this syntax:

get_html_file(FileOrStream, P) :-
        dtd(html, DTD),
        load_structure(FileOrStream, [P],
                       [ dtd(DTD),
                         dialect(sgml),
                         shorttag(false),
                         syntax_errors(quiet),
                         max_errors(-1)
                       ]).

The option syntax_errors(quiet) should do it.
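
To try the same options without a file on disk, here is a minimal sketch that parses HTML from a string. The predicate name parse_html_quietly/2 is made up for this illustration, and it assumes a SWI-Prolog version that provides open_string/2. As far as I can tell, loading the HTML DTD is also what tells the parser that img and br are empty elements and that the amp entity exists, so it may remove some of these warnings at the source rather than just hiding them.

```prolog
:- use_module(library(sgml)).

% parse_html_quietly(+Text, -DOM)
% Hypothetical helper: parse HTML text held in a string, using the
% HTML DTD and suppressing syntax warnings.
parse_html_quietly(Text, DOM) :-
        dtd(html, DTD),
        setup_call_cleanup(
            open_string(Text, In),
            load_structure(stream(In), DOM,
                           [ dtd(DTD),
                             dialect(sgml),
                             shorttag(false),
                             syntax_errors(quiet),
                             max_errors(-1)
                           ]),
            close(In)).
```

A call such as parse_html_quietly("<p>a &amp; b<br>c</p>", DOM) should bind DOM to a list of element/3 terms without printing the warnings shown in the question.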

I recall having a hard time parsing old pages with errors. Error handling can be complicated; a more tolerant tool such as TagSoup could help in getting the work done.

Licensed under: CC-BY-SA with attribution