Question

I have a folder of HTML files that begin with the DOCTYPE declaration shown below. I need to remove it so that a not-very-good parser can load the files as XML.

I've been trying to use perl to do the substitution in place, but the files are unchanged after I run it and I can't figure out why. Can anyone identify the flags or pattern I need in order to remove the DOCTYPE declaration here?

Here's an example file I'd like to manipulate.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
  <title></title>
</head>
  <body>
  </body>
</html>

Here's the perl one-liner I'm trying to use, which looks for the angle brackets, the exclamation mark, and everything before the close angle bracket. It incorporates perl substitution flags which other postings suggest should work for a multiline match - m for multiline, s for allowing newlines to be matched by regex. I'm then replacing the match with the empty string.

perl -i -e 's/<![^>]+>//gsm' `find . -name '*.html'`

I can't figure out why, but the DOCTYPE is not removed from the file after running this command. Does anyone else know why?


Solution

What you need is the -0777 switch which will cause the entire file to be read into a single string. If this is not used, the files will be read in line-by-line mode, and you can never match a multi-line statement that way.

Also, as Andomar points out, you are missing the -p switch, but I assume you figured that out.

The modifiers on the regex won't matter in this case, except the /g modifier. /m only affects ^ and $, and /s causes wildcard . to also match newlines. None of this applies to your regex.

So basically, you want something like:

perl -0777 -pi -e 's/<![^>]+>//g' ...

Side note:

HTML should ideally be handled with a parser, so I spent a few minutes on HTML::Parser, which can strip declarations if you set the declaration handler to an empty string. Something like this seems to print OK for a single file:

perl -MHTML::Parser -we '
    $p = HTML::Parser->new(default_h => [sub {print @_},"text"] );
    $p->handler(declaration => "");
    $p->parse_file(shift) or die $!; ' yourfile.html

It felt like overkill, so I did not try to combine it with the -pi in-place edit switches, but it is (probably) easily implemented in a script.

OTHER TIPS

First, you seem to be missing the -p switch, which wraps your code in a loop that reads the input line by line and prints each line. Without -p (or -n), perl never reads the files, so -i has nothing to rewrite.

Second, since -pi processes the input line-by-line, it can't replace a regex that spans more than one line.

You could write a Perl script instead. This script should run your regex on the entire content of all files passed on the command line:

use IO::All;

foreach my $file (@ARGV) {
    my $content = io($file)->slurp;
    $content =~ s/<![^>]+>//g;
    $content > io($file);
}

The command cpan IO::All should install the IO::All module, if it is not present on your system.

P.S. The m and s options only affect ., ^ and $. I think you can omit them.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow