Question

Hi there,

I'm checking a range of URLs like http://www.example.com/XX/content/1~100000.html to see whether they exist. But the website has its own 404 handler page, so every URL returns 200 even if the page doesn't exist at all. I tried curl on the command line; its output was shown in a screenshot that is no longer available.

I also used HttpURLConnection.setFollowRedirects(false);, but it didn't work.

Is there a way to handle this problem? Thanks in advance!

Was it helpful?

Solution

These are commonly known as Soft 404s. The only way to detect them is by examining the content, as the page headers do not indicate any error.

If you want to build something generic, you could fetch a page that you know for sure does not exist and use it as your reference, then compare every page you crawl against that reference to decide whether it is an error page. You may need a somewhat insensitive (fuzzy) comparison algorithm, since the content of non-existent pages may vary slightly from one URL to the next. Even so, this approach will be error-prone if you are crawling random websites.
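A minimal sketch of the fuzzy-comparison idea in Java, using Jaccard similarity over the word sets of two page bodies. The class name, the 0.6 threshold, and the use of word-set overlap (rather than, say, edit distance) are all assumptions you would tune against the actual site:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Soft404Detector {

    // Jaccard similarity between the word sets of two page bodies:
    // 1.0 means identical vocabularies, 0.0 means no overlap.
    static double similarity(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> intersection = new HashSet<>(wordsA);
        intersection.retainAll(wordsB);
        Set<String> union = new HashSet<>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Treat a page as a soft 404 if its body is sufficiently similar to the
    // reference error page. The 0.6 threshold is a guess to calibrate.
    static boolean isSoft404(String referenceErrorBody, String pageBody) {
        return similarity(referenceErrorBody, pageBody) > 0.6;
    }

    public static void main(String[] args) {
        String ref = "Sorry, the page you requested was not found on this server";
        System.out.println(isSoft404(ref, "Sorry, the page you requested was not found here"));
        System.out.println(isSoft404(ref, "Chapter 1: an introduction to our product line"));
    }
}
```

In practice you would fetch the reference body once (e.g. from a URL with a random, surely non-existent path) and compare each crawled page against it.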

Other tips

You can try to look at the content of the page to identify the error page. There might be some text indicating that it is the error page.
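A simple version of this keyword heuristic in Java. The marker phrases below are assumptions; inspect the site's actual error page and replace them with text that really appears there:

```java
import java.util.List;

public class ErrorPageHeuristic {

    // Phrases that commonly appear on custom error pages. This list is a
    // placeholder -- tailor it to the site you are checking.
    static final List<String> ERROR_MARKERS = List.of(
            "page not found", "404", "does not exist", "no longer available");

    // Returns true if the response body contains any known error phrase.
    static boolean looksLikeErrorPage(String body) {
        String lower = body.toLowerCase();
        return ERROR_MARKERS.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeErrorPage("<h1>Page Not Found</h1>"));   // prints true
        System.out.println(looksLikeErrorPage("<h1>Product catalogue</h1>")); // prints false
    }
}
```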

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow