Question

I have to convert spreadsheet data (name, image name, & bio) to HTML, so I use a RegEx find/replace with variables in Dreamweaver (DW), which is easy enough. The issue is that one column contains a bio that is HTML (paragraphs and italics mainly), and the RegEx ignores that "row" for reasons beyond my researching capabilities.

I don't want to strip then manually add the HTML again, so show me the way!

TL;DR: Is there a way to paste HTML as a RegEx variable?


Here's some example table data I quickly paste/format from Excel to DW:

<tr>
  <td>James Brian Hellwig</td>
  <td>James_Brian_Hellwig</td>
  <td><p>Lorem ipsum dolor sit amet, <em>consectetur adipisicing</em> elit. Sunt, ut iste tempore laborum aperiam nostrum obcaecati neque natus adipisci fugit. </p>
  <p>Dolores, eligendi animi ea totam nobis cumque ullam eveniet accusamus!</p></td>
</tr>
<tr>
  <td>Jiminy Cricket</td>
  <td>Jiminy_Cricket</td>
  <td><p>Lorem ipsum dolor sit amet, <em>consectetur adipisicing</em> elit. Sunt, ut iste tempore laborum aperiam nostrum obcaecati neque natus adipisci fugit. </p>
  <p>Dolores, eligendi animi ea totam nobis cumque ullam eveniet accusamus!</p></td>
</tr>

Here's the "Find" RegEx:

<tr>
  <td>([^<]*)</td>
  <td>([^<]*)</td>
  <td>([^<]*)</td>
</tr>

Here's the "Replace" string:

<div>
  <img class="floatleft" src="$2.jpg" alt="$1" />
  <h2 class="name">$1</h2>
  $3
</div>

I will either mouth-kiss or buy a beer for the first person to answer this. Your choice.


Solution

Your problem is that [^<]* matches any run of characters that does not contain an opening angle bracket. That's a good idea in general, so you don't accidentally match across tag boundaries, but in this case it's unfortunate because there's a <p> tag right after the third <td>, so that cell can never match.
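
To see the failure in isolation, here is a minimal sketch in Python rather than Dreamweaver (an assumption made only so the snippet is runnable; the pattern itself is the one from the question). The name `cell` is hypothetical.

```python
import re

# The cell pattern from the question: the content may not contain "<".
cell = re.compile(r"<td>([^<]*)</td>")

plain = cell.fullmatch("<td>Jiminy Cricket</td>")
with_markup = cell.fullmatch("<td><p>Bio</p></td>")

print(plain.group(1))   # the plain-text cell is captured
print(with_markup)      # None: [^<]* stops at the "<" of <p>, so </td> cannot follow
```

The first two columns match because they hold plain text; the bio column fails, and with it the whole row.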

Therefore, I propose a different solution. Allow other tags, just not <td> tags within a <td> tag:

<tr>
  <td>((?:(?!</?td)[\s\S])*)</td>
  <td>((?:(?!</?td)[\s\S])*)</td>
  <td>((?:(?!</?td)[\s\S])*)</td>
</tr>

Explanation:

(?:         # Start non-capturing group that matches...
 (?!</?td)  # (unless we're at the start of a <td> or </td> tag)
 [\s\S]     # ... any character (whitespace or non-whitespace).
)*          # Repeat as needed
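
As a rough check of the pattern above, the following Python sketch runs the find/replace on one row of the sample data. Two things are assumptions and not part of the answer: the literal line breaks between tags are written as `\s*`, and the replacement uses Python's `\g<1>` syntax where Dreamweaver uses `$1`. The names `ROW`, `TEMPLATE`, and `table` are made up for the example.

```python
import re

# One cell: any characters, as long as no <td> or </td> tag starts here.
CELL = r"<td>((?:(?!</?td)[\s\S])*)</td>"

# Three cells inside a row, with optional whitespace between the tags.
ROW = re.compile(r"<tr>\s*" + r"\s*".join([CELL] * 3) + r"\s*</tr>")

# Python spelling of the question's replacement ($1 -> \g<1>, and so on).
TEMPLATE = "\n".join([
    r'<div>',
    r'  <img class="floatleft" src="\g<2>.jpg" alt="\g<1>" />',
    r'  <h2 class="name">\g<1></h2>',
    r'  \g<3>',
    r'</div>',
])

table = """<tr>
  <td>Jiminy Cricket</td>
  <td>Jiminy_Cricket</td>
  <td><p>Lorem <em>ipsum</em>.</p>
  <p>Dolores!</p></td>
</tr>"""

html = ROW.sub(TEMPLATE, table)
print(html)
```

The bio's <p> and <em> tags pass through untouched in the third group, while the negative lookahead keeps each group from running past its own </td>.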

OTHER TIPS

You can use

<tr>
  <td>.*?</td>
  <td>.*?</td>
  <td>.*?</td>
</tr>

Explanation: . (dot) matches any character except a newline. If you need to match across multiple lines, you can use [\s\S] instead, as the accepted answer suggests.

* makes it look for 0 or more of the . (dot). ? makes that reluctant (lazy), meaning we grab as few characters as we possibly can while still matching the closing </td> tag.

Since there is whitespace between your TR and TD tags, we must include that in our regex. Sorry, I should have caught this sooner! Also, we can't put spaces in our regex unless we are searching for a space, which is why regexes look like a long chain of complicated characters. Here is what it should look like:

<tr>\s*<td>.*?</td>\s*<td>.*?</td>\s*<td>.*?</td>\s*</tr>

As you can see, I used \s which means a whitespace character, followed by a * which means 0 or more times.
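
Here is a small Python sketch of the one-line pattern, under a few assumptions: capture parentheses are added around each `.*?` so the cells can be reused in the replacement, and the sample row is a shortened, hypothetical version of the question's data. It shows that the dot version does not match a bio that spans two lines, while the `[\s\S]` version does.

```python
import re

row = "<tr>\n  <td>A</td>\n  <td>B</td>\n  <td><p>one</p>\n  <p>two</p></td>\n</tr>"

# The pattern from above, with a capture group added around each lazy match.
dot = r"<tr>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*<td>(.*?)</td>\s*</tr>"

# Same pattern, but the "any character" part may also cross line breaks.
any_char = dot.replace(".*?", r"[\s\S]*?")

dot_match = re.search(dot, row)
any_match = re.search(any_char, row)

print(dot_match)           # None: "." cannot cross the newline inside the bio
print(any_match.groups())  # the three cell contents
```

In Python the `re.DOTALL` flag would have the same effect as swapping in `[\s\S]`; whether a comparable option is available in Dreamweaver's find dialog is not something the thread confirms.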

Since you have the same pattern repeating 3 times, you can actually use the following notation for repetition:

<tr>\s*(<td>.*?</td>\s*){3}</tr>

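Repetition notation is great for matching, but keep in mind that a repeated group only remembers its last iteration, so it won't give you a separate $1, $2, and $3 for the replacement. Let's say, for example, that you not only want to match rows with exactly 3 TDs, but also rows that have anywhere from 1 to 4 TDs. You would use: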

<tr>\s*(<td>.*?</td>\s*){1,4}</tr>
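
The behaviour of a repeated group can be checked with a short Python sketch (the one-line sample row and the variable names are assumptions for illustration). It also shows one possible way to still get at every cell: match the row with the repetition, then pull the cells out in a second pass.

```python
import re

row = "<tr><td>A</td><td>B</td><td>C</td></tr>"

# Exactly three cells; the single group is overwritten on every repetition.
three = re.search(r"<tr>\s*(<td>.*?</td>\s*){3}</tr>", row)
print(three.group(1))  # only the last cell survives in the group

# Between one and four cells also matches this row.
flexible = re.search(r"<tr>\s*(<td>.*?</td>\s*){1,4}</tr>", row)

# Second pass: collect the content of every cell in the matched row.
cells = re.findall(r"<td>(.*?)</td>", flexible.group(0))
print(cells)
```

So the `{3}` form is handy for finding rows, while the spelled-out three-group pattern is the one to use when each cell has to land in its own replacement variable.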

FYI, a co-worker just found a great alternative to using RegEx in the example above: using Dreamweaver XSLT files to dynamically add XML data to the HTML. We simply use an XML-mapped spreadsheet to export the XML and voilà... content updated.

Once the spreadsheet's schema is set and the XSL file is formatted with the appropriate HTML "repeating regions", it's smooth sailing.


Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow