Question

I have the following source code from the Wikipedia page of a list of Games. I need to grab the name of the game from the source, which is located within the title attribute, as follows:

<td><i><a href="/wiki/007:_Quantum_of_Solace" title="007: Quantum of Solace">007: Quantum of Solace</a></i><sup id="cite_ref-4" class="reference"><a href="#cite_note-4"><span>[</span>4<span>]</span></a></sup></td>

As you can see above, the title attribute contains a string. I need to use GREP to search every line for that attribute and remove everything except:

title="Game name"

I have the following (in TextWrangler) which returns every single occurrence:

title="(.*)"

How can I now make it remove everything surrounding that, while keeping either the string alone or title="string"?

No correct solution

OTHER TIPS

I use a multi-step method to process these kinds of files.

  1. First, make sure there is only one HTML tag per line. GREP works line by line, so this minimises the need for complicated patterns. I usually replace every > with >\n

  2. Then develop a pattern for each occurrence of the item you want, in this case 'title=".*?"'. Put it in parentheses () to capture it, then add some filler around it so the pattern matches the whole line: .*?(title=".*?").*

  3. Replace everything that matches .*?(title=".*?").* with \1
  4. Finally, make smart use of TextWrangler's Process Lines Containing command to filter out any remaining rubbish.
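
The steps above can also be tried outside TextWrangler. The following is a minimal Python sketch, assuming Python's re module behaves closely enough to TextWrangler's grep for these patterns. The function name extract_titles and the sample row are illustrative and not part of the original answer.

```python
import re

def extract_titles(html):
    # Step 1: put one tag per line by breaking after every '>'.
    lines = html.replace(">", ">\n").splitlines()
    titles = []
    for line in lines:
        # Steps 2-3: match the whole line, keep only the lazy title="..." group.
        match = re.search(r'.*?(title=".*?").*', line)
        # Step 4: lines without a title attribute are simply skipped,
        # which stands in for the "Process Lines Containing" filter.
        if match:
            titles.append(match.group(1))
    return titles

# Shortened version of the table cell from the question.
row = ('<td><i><a href="/wiki/007:_Quantum_of_Solace" '
       'title="007: Quantum of Solace">007: Quantum of Solace</a></i></td>')
print(extract_titles(row))
```

A title that itself contains a > character would be split in step 1, so this sketch only suits simple rows like the one in the question.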

Notes

The \1 refers to the text matched by the first pair of parentheses (). You can also reorder text by using several groups, for example searching for (.*?), (.*) and replacing with \2, \1 to swap two columns.
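
As a small illustration of reordering with numbered groups, here is a Python sketch. It assumes re.sub's \1 and \2 replacement syntax as a stand-in for TextWrangler's, and the sample text is made up.

```python
import re

# Two comma-separated columns; the sample value is hypothetical.
line = "Solace, Quantum"

# Group 1 is everything before the first ", ", group 2 is the rest.
swapped = re.sub(r"(.*?), (.*)", r"\2, \1", line)
print(swapped)  # Quantum, Solace
```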

Learn how to write lazy regular expressions. The ? in these patterns is very powerful. Placed after a quantifier such as *, it makes the match stop at the first place where the next part of the pattern occurs, rather than at the last place on the line where it occurs.
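
The difference between greedy and lazy matching is easiest to see side by side. This is a minimal Python sketch on a made-up line with two quoted attributes after the title; the variable names are illustrative.

```python
import re

# Hypothetical line with more quoted attributes after the title.
line = '<a href="/wiki/Foo" title="Foo">Foo</a><sup class="reference">1</sup>'

# Greedy: .* runs to the LAST double quote on the line.
greedy = re.search(r'title="(.*)"', line).group(1)

# Lazy: .*? stops at the FIRST double quote after title=".
lazy = re.search(r'title="(.*?)"', line).group(1)

print(greedy)  # Foo">Foo</a><sup class="reference
print(lazy)    # Foo
```

This is why the pattern title="(.*)" from the question picks up far more than the game name on lines that contain further attributes.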

I've figured this problem out; it was quite simple. Instead of retrieving the content of the title attribute, I retrieve the page name from the link.

To make sure I only hit the lines that contain the content, I search the source with the following pattern:

(.*)/wiki/(.*)" Returning \2

After that, I simply remove any cases where there is HTML code:

<(.*) Returning ''

Finally, I'll remove the remaining content after the page name:

"(.*) Returning ''

A bit of cleaning up the spacing and I have a list for all game names.
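
For reference, the three replacements above can be chained in a short Python sketch. It assumes re.sub applied one line at a time approximates TextWrangler's Replace All, and the function name page_name is illustrative.

```python
import re

def page_name(line):
    # 1. Keep what follows /wiki/ up to the last double quote on the line.
    line = re.sub(r'(.*)/wiki/(.*)"', r'\2', line)
    # 2. Drop everything from the first remaining '<' onwards.
    line = re.sub(r'<(.*)', '', line)
    # 3. Drop everything from the first remaining '"' onwards.
    line = re.sub(r'"(.*)', '', line)
    return line.strip()

# The table cell from the question.
cell = ('<td><i><a href="/wiki/007:_Quantum_of_Solace" '
        'title="007: Quantum of Solace">007: Quantum of Solace</a></i>'
        '<sup id="cite_ref-4" class="reference"><a href="#cite_note-4">'
        '<span>[</span>4<span>]</span></a></sup></td>')
print(page_name(cell))  # 007:_Quantum_of_Solace
```

The result is the page name, so underscores still stand in for spaces. Lines without a /wiki/ link that start with a tag come back empty and can be discarded during the final clean-up.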

Licensed under: CC-BY-SA with attribution