In case somebody else faces the same problem:
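The root of my problem was some carriage return characters (\r) present in the web page. The terminal does not display them as text: a \r moves the cursor back to the start of the line, so whatever is printed next can overwrite what was already there. That wouldn't be a big problem by itself, but the effect is that the whole line containing a \r appears to be skipped.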
So, to see the entire content of the file, these characters should be made visible with cat's -v or -e option (-e also marks the end of each line with $):
cat -v site.txt
(thanks to MendiuSolves, who suggested using the cat command options)
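If you would rather check from Python than from the shell, here is a minimal sketch that mimics what cat -v does for carriage returns. The helper name show_carriage_returns and the sample string are my own, made up for illustration, and only \r is handled (cat -v also covers other control characters).

```python
def show_carriage_returns(text: str) -> str:
    """Replace each carriage return with the visible marker ^M, as cat -v does."""
    return text.replace("\r", "^M")


# Hypothetical sample text; in practice this would be the downloaded page.
sample = "first line\r\nsecond line\n"
print(show_carriage_returns(sample), end="")
```

Printing repr(text) is another quick way to see the \r characters, since repr shows them as escape sequences.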
To solve part of the Python problem, I changed the return value from soup.body.find_all(text=re.compile('common_word')) to soup.find_all(text=re.compile('common_word')).
Clearly, if the word you are searching for is on one of the lines containing a \r and you print that line, you will not see the result. The solution is either to filter out the character or to write the content to a file.
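The filtering option could look roughly like this. It is only a sketch: strip_carriage_returns is a hypothetical helper, and the matches list is invented data standing in for whatever find_all returns.

```python
def strip_carriage_returns(strings):
    """Return a new list with every carriage return removed from each string."""
    return [s.replace("\r", "") for s in strings]


# Invented data standing in for the strings returned by soup.find_all(...).
matches = ["a common_word here\r", "before\rcommon_word after"]

for line in strip_carriage_returns(matches):
    # Without the \r, the terminal no longer jumps back to the line start.
    print(line)
```

Removing the character changes the text slightly (the second example becomes "beforecommon_word after"), so replacing \r with a space or a newline may suit some pages better.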