Question

I have this file:

    <table>
    <tr>
    <td WIDTH="49%">
    <p><a href="...1.htm"> cell to remove</a></p></td>
    <td WIDTH="51%"> some text </td>
    </tr>

I need this as the result:

    <table>
    <tr>
    <td> 
    </td>
    <td WIDTH="51%"> some text </td>
    </tr>

I am trying to read the file containing this HTML and replace the first td tag with an empty one:

   ret = open('rec1.txt').read()
   re.sub('<td[^/td>]+>','<td> </td>',ret, 1)
   final= open('rec2.txt', 'w')
   final.write(ret)
   final.close()

As you can see, I am new to Python, and something is wrong: when I read rec2.txt, it contains exactly the same text as the original file.

Thanks.


Solution

Using regex to parse HTML is a very bad practice (see @Lutz Horn's link in the comment).

Use an HTML parser instead, such as BeautifulSoup. From its documentation:

"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

For example, here's how you can empty the first td cell and drop its attributes using BeautifulSoup:

from bs4 import BeautifulSoup


data = """
<table>
    <tr>
        <td WIDTH="49%">
            <p><a href="...1.htm"> cell to remove</a></p>
        </td>
        <td WIDTH="51%">
            some text
        </td>
    </tr>
</table>"""

soup = BeautifulSoup(data, 'html.parser')
cell = soup.table.tr.td
cell.string = ''
cell.attrs = {}

print(soup.prettify(formatter='html'))

prints:

<table>
 <tr>
  <td>
  </td>
  <td width="51%">
   some text
  </td>
 </tr>
</table>
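
The question works with files rather than an inline string, so the same approach can be wrapped in a small function. This is only a sketch: blank_first_cell is a hypothetical helper name, and it assumes the cell to empty is the first td in the document. Note that html.parser lowercases attribute names, as the width attribute in the output above shows.

```python
from bs4 import BeautifulSoup


def blank_first_cell(html):
    """Return html with the first <td> emptied (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    cell = soup.find('td')   # first <td> in document order, or None
    if cell is not None:
        cell.clear()         # remove everything inside the cell
        cell.attrs = {}      # remove WIDTH and any other attributes
    return str(soup)


def rewrite_file(src_path, dst_path):
    """Read src_path, empty its first cell, and write the result to dst_path."""
    with open(src_path) as src:
        result = blank_first_cell(src.read())
    with open(dst_path, 'w') as dst:
        dst.write(result)
```

With the file names from the question, the call would be rewrite_file('rec1.txt', 'rec2.txt').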

Hope that helps.

Other tips

Using regex to parse HTML is a very bad practice. If you are actually trying to modify HTML, use an HTML parser.
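
To make the risk concrete, here is a minimal sketch of a non-greedy td pattern applied to a cell that contains a nested table. The nested string is a made-up input, not taken from the question.

```python
import re

# Hypothetical input: the first cell contains a nested table.
nested = (
    '<table><tr>'
    '<td><table><tr><td>inner</td></tr></table> outer</td>'
    '<td>some text</td>'
    '</tr></table>'
)

# The non-greedy match stops at the first "/td>", which closes the
# inner cell rather than the outer one.
result = re.sub('<td.*?/td>', '<td> </td>', nested, count=1, flags=re.DOTALL)
print(result)
```

The match ends at the inner cell's closing tag, so the leftover " outer</td>" and a stray "</tr></table>" stay in the output and the markup is no longer balanced. A parser tracks nesting and does not have this problem.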

If the question is academic, or you are only trying to make the limited transformation you describe in the question, here is a regex program that will do it:

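#!/usr/bin/python
import re

# Read the whole file into one string.
with open('rec1.txt') as src:
    ret = src.read()

# Replace only the first <td ...>...</td> cell.
ret = re.sub('<td.*?/td>', '<td> </td>', ret, count=1, flags=re.DOTALL)

# Write the string returned by re.sub(), not the original text.
with open('rec2.txt', 'w') as final:
    final.write(ret)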

Notes:

  • In the question's pattern, [^/td>] is a negated character class: it matches any single character other than /, t, d, or >, not the string /td. Note instead how I used .*? to match an arbitrary string up to the next /td>.
  • The final, optional argument to re.sub() is the flags argument. re.DOTALL allows . to match newlines.
  • The ? makes the search non-greedy, so it consumes only one cell.
  • re.sub() returns the resulting string; it does not modify the string in place, so the result must be assigned back to ret.
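
The notes above can be checked directly with a few small searches. This is a sketch on a shortened, made-up version of the question's markup; the html string and the variable names are illustrative only.

```python
import re

# Sample modelled on the question: the first cell spans two lines.
html = ('<td WIDTH="49%">\n'
        '<p>cell to remove</p></td>\n'
        '<td WIDTH="51%"> some text </td>')

# re.sub() returns a new string; html itself is left unchanged.
result = re.sub('<td.*?/td>', '<td> </td>', html, count=1, flags=re.DOTALL)

# The negated class [^/td>] only covers the opening tag here.
opening_tag = re.search('<td[^/td>]+>', html).group(0)

# Greedy .* runs to the last "/td>"; non-greedy .*? stops at the first.
greedy = re.search('<td.*/td>', html, flags=re.DOTALL).group(0)
lazy = re.search('<td.*?/td>', html, flags=re.DOTALL).group(0)

# Without re.DOTALL, . cannot cross a newline, so the first match is
# the second cell, which sits on a single line.
no_dotall = re.search('<td.*?/td>', html).group(0)

print(opening_tag)
print(lazy)
print(no_dotall)
```

The last case is worth noticing: without re.DOTALL the pattern skips the first cell, because that cell spans two lines, and matches the second cell instead.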
Licensed under: CC-BY-SA with attribution