Question

I'm trying to do very basic normalization, and I realize that, to a large extent, URL normalization is an impossible task.

Regardless, different search engines return the same search results with different schemes, hosts, etc. What are the most basic parts I need to collect, and can you collect more than one part with parse_url to leave only the vital parts of the URL?

Result 1: http://dogs.com
Result 2: http://www.dogs.com

I need to account for these kinds of inconsistencies, which different search engines can generate.
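To make it concrete, this is the kind of loose normalization I have in mind (normalizeUrl is just an illustrative name, and I realize stripping "www." is a lossy heuristic, not a safe transformation):

```php
<?php

// Minimal sketch: collect only host and path from parse_url's pieces.
function normalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url; // not parseable as an absolute URL; leave it alone
    }

    // Lowercase the host and drop a leading "www." for loose matching.
    $host = preg_replace('/^www\./', '', strtolower($parts['host']));

    // Keep the path; drop scheme, query, and fragment entirely.
    $path = rtrim($parts['path'] ?? '', '/');

    return $host . $path;
}

echo normalizeUrl('http://dogs.com'), "\n";     // dogs.com
echo normalizeUrl('http://www.dogs.com'), "\n"; // dogs.com
```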


Answer

Result 1: http://dogs.com
Result 2: http://www.dogs.com

These two aren't the same: one is the bare domain, the other is a subdomain of it. There's no guarantee that they serve the same content.

What you're asking for is basically impossible: any part of the URL is important and changing it may result in a different page.

That said, there's a <link rel="canonical"> tag which declares the normalized (canonical) URL of a page. Only that URL is (somewhat) guaranteed to be correct.
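If you want to go that route, a rough sketch of reading the tag with PHP's DOMDocument might look like this (canonicalUrl is an illustrative name, it assumes allow_url_fopen is enabled, and many pages won't declare a canonical at all):

```php
<?php

// Rough sketch: fetch a page and return the href of its
// <link rel="canonical">, or null if none exists.
function canonicalUrl(string $url): ?string
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return null;
    }

    $doc = new DOMDocument();
    @$doc->loadHTML($html); // malformed HTML triggers warnings; suppress them

    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) === 'canonical') {
            return $link->getAttribute('href'); // may itself be relative
        }
    }
    return null; // the page declares no canonical URL
}
```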

Also, you could just pull the content from pages and compare them. But, again, no guarantees.
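A crude version of that comparison, hedged accordingly (pagesLookIdentical is an illustrative name):

```php
<?php

// Crude sketch: treat two URLs as "the same page" only if their raw
// bodies hash identically. Ads, timestamps, and session markup will
// defeat this, so a mismatch is inconclusive rather than proof that
// the pages actually differ.
function pagesLookIdentical(string $urlA, string $urlB): bool
{
    $a = @file_get_contents($urlA);
    $b = @file_get_contents($urlB);
    if ($a === false || $b === false) {
        return false;
    }
    return hash('sha256', $a) === hash('sha256', $b);
}
```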

Licensed under: CC-BY-SA with attribution