Question

I have downloaded a web page (charset=iso-8859-1) using curl:

curl "webpage_URL" > site.txt

The encoding of my terminal is utf-8. Here I try to see the encoding of this file:

file -i site.txt
site.txt: regular file
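
(That output suggests a BSD-style file command, where -i merely means "do not classify regular files further". If so, file -I site.txt, or file -i site.txt on GNU systems, should print something like site.txt: text/html; charset=iso-8859-1.)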

Now, the strange thing: if I open the file with nano, I find all the words that are visible in a normal browser. But when I use:

cat site.txt

some words are missing. This made me curious, and after some hours of research I still couldn't figure out why.

In Python, too, it doesn't find all the words:

import re
import subprocess
from bs4 import BeautifulSoup

def function(url):
    # fetch the page with curl and capture its stdout
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    print output
    soup = BeautifulSoup(output)
    return soup.body.find_all(text=re.compile('common_word'))

I also tried to use urllib2, but had no success.

What am I doing wrong?

Solution

In case somebody faces the same problem:

The root of my problem was some carriage return characters (\r) present in the web page. The terminal does print them, but a \r moves the cursor back to the start of the line, so everything printed before the \r on that line is overwritten by whatever follows, and those words appear to be missing.
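
You can reproduce the effect with a one-line sketch (Python 2 here, to match the code above; the strings are made up):

# \r moves the cursor back to column 0, so THIRD overwrites "first"
# and the terminal shows: THIRD second
print "first second\rTHIRD"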

So, in order to see the entire content of the file, these characters should be made visible with cat's -v or -e option:

cat -v site.txt
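
With -v, each carriage return is rendered as ^M instead of being interpreted by the terminal, so the previously overwritten text becomes visible. The made-up line from the sketch above would be printed roughly as:

first second^MTHIRD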

(thanks to MendiuSolves, who suggested using the cat command options)

To solve part of the Python problem, I changed the return value from soup.body.find_all(text=re.compile('common_word')) to soup.find_all(text=re.compile('common_word')), so the whole document is searched instead of only the body.

Obviously, if the word you search for is on a line containing a \r and you print it, you will not see the result. The solution is either to filter out the character or to write the content to a file.
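
Putting both fixes together, a corrected sketch of the function from the question (common_word is still a placeholder) could look like this:

import re
import subprocess
from bs4 import BeautifulSoup

def function(url):
    p = subprocess.Popen(["curl", url], stdout=subprocess.PIPE)
    output, err = p.communicate()
    # filter out the carriage returns so the terminal does not
    # overwrite the beginning of each line when printing
    output = output.replace('\r', '')
    print output
    soup = BeautifulSoup(output)
    # search the whole document, not only soup.body
    return soup.find_all(text=re.compile('common_word'))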

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow