Python 3.2 remove newline from HTML code with urllib

https://stackoverflow.com/questions/21691679

09-10-2022

Question

I used urllib to read a page's HTML into a string. I want to search that string, but the search fails because of how the HTML is formatted. Is there a way to "unformat" the string? I don't need to strip out the HTML tags; I just need to remove all the newlines. Here is my code:

import urllib.request

url = "http://www.internetworldstats.com/emarketing.htm"
request = urllib.request.Request(url)
response = urllib.request.urlopen(request)
Whole=(response.read().decode('ISO-8859-1'))

Whole.strip('/n')
print(Whole[11631:12631])
YearPos=Whole.find('December, 1996')
print(YearPos)

The strip call did not work. This is the output I got (the final -1 is the result of find):

    December, 1995</b></font></p>
</td>
<td width="112" bgcolor="#FFFFFF">
<div align="right"><font size="-1" face="Arial" color=
"#000099">16 millions</font></div>
</td>
<td width="120" bgcolor="#FFFFFF">
<div align="right"><font size="-1" face="Arial" color=
"#000099">0.4 %</font></div>
</td>
<td width="120" bgcolor="#FFFFFF">
<p><font size="-1" face="Arial" color="#000099">IDC</font></p>
</td>
</tr>
<tr>
<td width="103" bgcolor="#FFFFFF">
<p><font size="-1" face="Arial" color="#000099">December,
1996</font></p>
</td>
<td width="112" bgcolor="#FFFFFF">
<div align="right"><font size="-1" face="Arial" color=
"#000099">36 millions</font></div>
</td>
<td width="120" bgcolor="#FFFFFF">
<div align="right"><font size="-1" face="Arial" color=
"#000099">0.9 %</font></div>
</td>
<td width="120" bgcolor="#FFFFFF">
<p><font size="-1" face="Arial" color="#000099">IDC</font></p>
</td>
</tr>
<tr>
<td width="103" bgcolor="#FFFFFF">
<p><font size="-1" face="Arial" color="#000099">December,
1997</font></p
-1

Solution

There are a few issues here:

As Vasili mentioned, the newline character should be \n, not /n.
str.strip() does not modify the string in place, because Python strings are immutable. It returns a modified copy, so the call should be Whole = Whole.strip('\n').
str.strip() only removes leading and trailing characters. You want to remove the newline characters in the middle of the string, so use str.replace() instead, e.g. Whole = Whole.replace('\n', ''). Note that in your output the line break falls between "December," and "1996", so replacing it with '' produces "December,1996". Use Whole = Whole.replace('\n', ' ') if you want find('December, 1996') to match.
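
The points above can be checked without downloading anything. The following is a minimal sketch: the `sample` string is a made-up stand-in modelled on the output in the question, and handling `\r\n` is an assumption in case the server sends Windows-style line endings.

```python
# Hypothetical stand-in for the downloaded page (the real text comes from urllib).
sample = '<p><font color="#000099">December,\n1996</font></p>\n'

# strip() returns a new string and only trims the ends,
# so the newline in the middle is still there.
stripped = sample.strip('\n')

# replace() touches every occurrence; substituting a space keeps
# "December," and "1996" separated.
flattened = sample.replace('\r\n', ' ').replace('\n', ' ')

# Alternative: collapse every run of whitespace into a single space.
collapsed = ' '.join(sample.split())

print(stripped.find('December, 1996'))   # -1: the search still fails
print(flattened.find('December, 1996'))  # non-negative index: match found
print(collapsed.find('December, 1996'))  # non-negative index: match found
```

The `split()`/`join()` variant may be the more forgiving choice if the page mixes tabs, repeated spaces, and different line endings, at the cost of also changing whitespace you might want to keep.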

OTHER TIPS

You have written the newline character incorrectly: it is \n, not /n.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow