Question

What regex would match a nested table that has identifiable text in one of its cells? I've tried, but failed, to come up with a regular expression that extracts the specific table I want without also grabbing the start and end of both tables in the example below. Here is something to get started: "<table>.*?</table>"

<table>
    <tr>
        <td>
            <table>
                <tr><td>Code1</td></tr>
                <tr><td>some data</td></tr>
                <tr><td>etc ...</td></tr>
            </table>
        </td>
    </tr>
    <tr>
        <td>
            <table>
                <tr><td>Code2</td></tr>
                <tr><td>some data</td></tr>
                <tr><td>etc ...</td></tr>
            </table>
        </td>
    </tr>
</table>

Say I want to extract the table containing "Code2". What regex will match specifically and only that table?

Solution

The following regex will find your table:

(?ms)<table>((?!<table>).)*<td>Code2</td>.*?</table>

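With (?ms) you turn on multiline mode (m), which makes ^ and $ match at line boundaries, and "dot matches newlines, too" (s). Only the s flag matters here, since the pattern contains no anchors. The negative lookahead (?!<table>) is checked before every character consumed between the opening <table> and <td>Code2</td>, so no second table can start inside that part of the match. The trailing .*?</table> then stops at the first closing tag, which is the end of the inner table.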
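To see the pattern in action, here is a minimal sketch in Python, which is an assumption on my part since the question does not name a language. It runs the regex against the sample from the question; the variable names html, pattern and matched_table are only illustrative.

```python
import re

# Sample markup from the question: two tables nested inside an outer one.
html = """<table>
    <tr>
        <td>
            <table>
                <tr><td>Code1</td></tr>
                <tr><td>some data</td></tr>
                <tr><td>etc ...</td></tr>
            </table>
        </td>
    </tr>
    <tr>
        <td>
            <table>
                <tr><td>Code2</td></tr>
                <tr><td>some data</td></tr>
                <tr><td>etc ...</td></tr>
            </table>
        </td>
    </tr>
</table>"""

# Same pattern as above. The lookahead rejects any <table> between the
# opening tag and the Code2 cell, so the match has to start at the
# innermost table that contains that cell.
pattern = re.compile(r"(?ms)<table>((?!<table>).)*<td>Code2</td>.*?</table>")

match = pattern.search(html)
matched_table = match.group(0) if match else None
print(matched_table)
```

This only works as long as the tags look exactly like the ones in the sample. A <table> with attributes, or another table nested after the Code2 cell, would need a different pattern.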

OTHER TIPS

I wouldn't use a regexp on this, since HTML isn't a regular language, and there is no end of edge cases to trip you up. You're better off using an HTML parser. Whichever language or platform you're using, there'll be one available.
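As a rough illustration of the parser route, here is a sketch using Python's standard-library html.parser module. The choice of Python, the TableFinder class and its attribute names are all assumptions made for this example, not something from the question. It keeps a stack of open tables, so each cell's text is credited to the innermost table that contains it.

```python
from html.parser import HTMLParser

# Sample markup from the question, condensed.
html = """<table>
  <tr><td>
    <table>
      <tr><td>Code1</td></tr>
      <tr><td>some data</td></tr>
      <tr><td>etc ...</td></tr>
    </table>
  </td></tr>
  <tr><td>
    <table>
      <tr><td>Code2</td></tr>
      <tr><td>some data</td></tr>
      <tr><td>etc ...</td></tr>
    </table>
  </td></tr>
</table>"""


class TableFinder(HTMLParser):
    """Hypothetical helper: records the cell texts of every table."""

    def __init__(self):
        super().__init__()
        self.stack = []       # tables currently open, innermost last
        self.tables = []      # finished tables, in the order they close
        self.in_cell = False  # True while directly inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([])
            self.in_cell = False
        elif tag == "td" and self.stack:
            self.stack[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "table" and self.stack:
            self.tables.append(self.stack.pop())
            self.in_cell = False

    def handle_data(self, data):
        # Text goes to the current cell of the innermost open table.
        if self.in_cell and self.stack:
            self.stack[-1][-1] += data


finder = TableFinder()
finder.feed(html)
finder.close()

# First table that has a cell whose text is exactly "Code2".
code2_cells = next(
    ([cell.strip() for cell in table] for table in finder.tables
     if any(cell.strip() == "Code2" for cell in table)),
    None,
)
print(code2_cells)  # ['Code2', 'some data', 'etc ...']
```

Unlike the regex, this keeps working when the tags carry attributes or the whitespace changes. A third-party parser would likely need less code, but the standard library is enough to show the idea.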

Don't use a regex. Use an HTML parser!

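However, in Perl (assuming you don't have nested tables, and only one table in the input, because .* is greedy and matches from the first <table> to the last </table>):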

$xml =~ /<table>.*<td>Code2<\/td>.*<\/table>/s;
Licensed under: CC-BY-SA with attribution