Regex match spaces in html attribute

https://stackoverflow.com/questions/862353

21-08-2019

Question

I have a bunch of html with lines like this:

<a href="#" rel="this is a test">

I need to replace the spaces in the rel-attribute with underscores, but I'm sort of a regex-noob!

I'm using TextMate.

Can anyone help me?

/Jakob

Solution

I don't think you can do this properly with a single search and replace, though I wonder why you need to do it in one go.

I can think of a really poor way of doing it. I don't recommend it, but here goes:

You could sort of do it with the regex below. However, you would have to repeat the capture group in the search, and the matching reference followed by a _ in the replacement, once for every space that could appear in the rel value. I suspect that requirement rules this solution out.

Search:

{\<a *href\=\"[^\"]*" *rel\=\"}{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*{([^ ]*|[^\"]*)}( |\")*

Replace:

\1\2_\3_\4_\5_\6_\7_\8_

This approach has two downsides: TextMate may limit the number of captures you can use, and you'll end up with a large number of underscores at the end of each line.

With your current test, with the regex above, you would end up with:

<a href="#" rel="this_is_a_test">____

PS: This regex uses the syntax of the Visual Studio search/replace box. You'll probably need to change some characters to make it work in TextMate.

{} => capturing group

() => grouping

[^A] => any character except A

( |\")* => zero or more spaces or double quotes

\1 => the first capture

OTHER TIPS

Suppose you already received the value of rel:

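// Look the element up once; id is assumed to hold the element's id.
var el = document.getElementById(id);
var value = el.getAttribute("rel");
// getAttribute already returns a string, so replace() can be called directly.
el.setAttribute("rel", value.replace(/\s/g, "_"));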

Regexes are fundamentally bad at parsing HTML (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?" for why). What you need is an HTML parser. See "Can you provide an example of parsing HTML with your favorite parser?" for examples using a variety of parsers.
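As a rough illustration of the parser route, here is a minimal Python sketch built on the standard library's html.parser module. The names RelRewriter and rewrite_rel are made up for this example. The sketch re-serializes each tag instead of preserving the original formatting exactly (attribute values always come out in double quotes, for instance), so treat it as a starting point and not a finished tool.

```python
import re
from html import escape
from html.parser import HTMLParser


class RelRewriter(HTMLParser):
    """Re-emit a document, replacing whitespace in rel values with underscores.

    Hypothetical helper for illustration; tag formatting is normalized.
    """

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _emit_tag(self, tag, attrs, closing=">"):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # bare attribute, e.g. <input disabled>
                parts.append(name)
                continue
            if name == "rel":
                value = re.sub(r"\s", "_", value)
            # Attribute values arrive unescaped, so escape them again.
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closing)

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, closing=" />")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_rel(html_text):
    """Return html_text with whitespace in every rel attribute replaced by _."""
    parser = RelRewriter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(rewrite_rel('<a href="#" rel="this is a test">link</a>'))
# <a href="#" rel="this_is_a_test">link</a>
```

Because the parser hands over attributes as name/value pairs, only the rel value is touched, no matter where it sits in the tag or what other attributes surround it.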

I have to get on board the "you're using the wrong tool for the job" train here. You have TextMate, so that means OS X, which means you have sed, awk, ruby and perl, all of which can do this much better and more easily.

Learning how to use one of these tools for text manipulation will pay off many times over in the future. Here is a URL that will ease you into sed: http://www.grymoire.com/Unix/Sed.html

Find: (rel="[^\s"]*)\s([^"]*")

Replace: \1_\2

This replaces only the first whitespace character in each rel value, so click "Replace All" repeatedly until nothing more is replaced. It's not pretty, but it is easy to understand and should work in any editor that supports regex search and replace.

Change rel in the find pattern if you need to clean other attributes.
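For anyone who would rather script the repeated "Replace All" than click through it, the same idea can be sketched in Python. The function name replace_until_stable is invented for this example, and the pattern is the Find expression above carried over unchanged.

```python
import re

# Same pattern as the Find field above: it matches the first whitespace
# character inside a rel value.
FIRST_SPACE_IN_REL = re.compile(r'(rel="[^\s"]*)\s([^"]*")')


def replace_until_stable(text):
    """Repeat the one-space replacement until a pass changes nothing."""
    while True:
        # subn returns the new text and the number of replacements made.
        text, count = FIRST_SPACE_IN_REL.subn(r"\1_\2", text)
        if count == 0:
            return text


print(replace_until_stable('<a href="#" rel="this is a test">'))
# <a href="#" rel="this_is_a_test">
```

Each pass removes one whitespace character per rel value, so the loop is guaranteed to stop after as many passes as the longest value has spaces, plus one.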

If you're using TextMate, then you're on a Mac, and therefore have Python.

Try this:

#!/usr/bin/env python3

import re

# Match one complete rel="..." attribute so that only its value is rewritten.
p_rel = re.compile(r'rel="[^"]*"')

with open('test.html') as infile:
    for line in infile:
        # Replace the spaces inside each rel attribute; leave the rest of the line alone.
        fixed = p_rel.sub(lambda m: m.group(0).replace(' ', '_'), line)
        print(fixed, end='')

Sample output:

 $ cat test.html
testing, testing, 1, 2, 3
<a href="#" rel="this is a test">
<unrelated line>
Stuff
<a href="#" rel="this is not a test">
<a href="#" rel="this is not a test" rel="this is invalid syntax (two rels)">
aoseuaoeua

 $ ./test.py
testing, testing, 1, 2, 3
<a href="#" rel="this_is_a_test">
<unrelated line>
Stuff
<a href="#" rel="this_is_not_a_test">
<a href="#" rel="this_is_not_a_test" rel="this_is_invalid_syntax_(two_rels)">
aoseuaoeua

Licensed under: CC-BY-SA with attribution
