Question
Sorry about the confusing title. I am trying to figure out a simple Regex problem, but cannot figure out what the solution is.
I have an HTML snippet from a larger HTML document.
<td class="grade">100.0</td>
<td class="teacher">Mathias, Jordan</td>
Another regex separates the two and gives them those class names. I use a positive look-ahead to check for a . or a , (period or comma) and assign the class grade or teacher, respectively.
The problem comes later, when I want to check whether the content between these tags is blank.
<td class="grade"></td>
I would like to use a positive look-behind to check whether the class is either grade or teacher (grade|teacher). In addition, I would like to check that there is truly nothing between the >< (the point where the opening and closing tags meet).
So far, this is what I have: (?<=.*(teacher|grade)*.+>?)[^.](?=</td>)
NOTE: This is in Python
Solution
Instead of pre-processing your HTML with regular expressions, trust BeautifulSoup to parse it and use regular expression searches on the parsed tree:
soup.find_all('td', text=re.compile(','))
finds all <td> elements whose direct text contains a comma.
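To make this concrete, here is a minimal sketch of how the same approach could cover both parts of the question: finding the cells that contain a comma, and finding the cells that are blank. The sample markup, the variable names, and the use of the classes grade and teacher as a filter are assumptions based on the snippet in the question, not part of the original answer. It also assumes Beautiful Soup 4.4.0 or later, where the text argument is available under the name string.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the snippet in the question.
html = """
<table><tr>
<td class="grade">100.0</td>
<td class="teacher">Mathias, Jordan</td>
<td class="grade"></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Cells whose direct text contains a comma.
# `string=` is the newer spelling of the `text=` argument used above.
comma_cells = soup.find_all("td", string=re.compile(","))

# Cells with class "grade" or "teacher" that hold no text at all.
# get_text(strip=True) returns an empty string for a blank cell.
empty_cells = [
    td
    for td in soup.find_all("td", class_=["grade", "teacher"])
    if not td.get_text(strip=True)
]

for td in comma_cells:
    print("comma:", td.get_text())
for td in empty_cells:
    print("empty:", td["class"])
```

Because the parser has already separated tags, attributes, and text, neither check needs a look-behind. That matters here, since Python's re module only accepts fixed-width look-behind patterns and would reject the expression in the question.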