Question

I am building a web application crawler that crawls for HTTP requests (GET, PUT, POST, ...). It is designed for one specific purpose: bug bounty hunting. It enables pentesters to insert exploit payloads into specific parts of the HTTP requests.

Problem

When using the crawler I sometimes run into the problem of crawling a lot of similar requests (e.g. /article/1, /article/2, /article/3, ...). This is a problem because if I know /article/1 is not vulnerable, there is a good chance that /article/2 and /article/3 are not vulnerable either, since they probably run the same code on the back-end (they only fetch a different article from the database). I therefore do not want to crawl them.

Example

Let's say my crawler crawled the URLs below.

https://example.ltd/
https://example.ltd/news/some-news-alias
https://example.ltd/news/another-news-alias
https://example.ltd/contact
https://example.ltd/news/some-other-news-alias
https://example.ltd/news/and-yet-another-one

Then I could assume that all other URLs that match the pattern /news/[alphabet&dash] do not have to be crawled because they probably run the same back-end code.

However, let's say my crawler crawled these URLs.

https://example.ltd/
https://example.ltd/users/sign-up
https://example.ltd/users/sign-in
https://example.ltd/contact
https://example.ltd/users/forgot-password

Then I cannot make the same assumption for other URLs that match the pattern /users/[alphabet&dash], because those pages probably do not run the same back-end code and therefore do have to be crawled.

Question

How can I decide (with as high a correctness rate as possible) which requests are similar to requests I have crawled before?

The request and response data (headers, body, ...) of all the previously crawled requests (in the crawling runtime) are available for analysis to decide if the current request is similar to previously crawled requests.

The solution does not have to work right away but can start working after enough information has been gathered (maybe after about 200 requests of a certain (possible) route have been crawled).


I thought about first detecting possible routes based on URLs and afterwards checking whether the HTML structure/tree of a certain route looks similar across all requests for that route. However, this seems difficult, since HTML structures may vary if there is e.g. a comment section below news articles.
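To make that idea concrete, here is a minimal sketch of what I have in mind, using only the Python standard library; the grouping rule ("everything except the last path segment") and the 0.8 similarity threshold are placeholders I made up, not tested values.

from difflib import SequenceMatcher
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TagSequenceParser(HTMLParser):
    # Collects the sequence of opening tag names, ignoring text and attributes.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    # Reduce an HTML document to its tag structure.
    parser = TagSequenceParser()
    parser.feed(html)
    return parser.tags


def route_key(url):
    # Group URLs by everything except the last path segment,
    # so /news/some-alias and /news/another-alias share the key "/news/*".
    path = urlsplit(url).path.rstrip("/")
    parent, _, _ = path.rpartition("/")
    return parent + "/*"


def structurally_similar(html_a, html_b, threshold=0.8):
    # Compare the tag structures of two responses; the 0.8 threshold is a guess.
    ratio = SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()
    return ratio >= threshold

The worry from above still applies: a page with a long comment section would add extra tags and lower the ratio, so the threshold would need tuning per site.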


Solution

So in generic terms you are looking for a fitness function to determine the probability that a web request will be handled by a code path that has not already been probed, based on the URL and the set of other URLs that have already been discovered (and possibly probed).

A very simple rule of thumb that may be good enough would be simply to count the number of unique child path segments a given path segment has.

E.g., using your example, if https://example.ltd/news/ has hundreds of "children":

https://example.ltd/news/first-child
https://example.ltd/news/second-child
...
https://example.ltd/news/one-hundred-and-thirty-second-child

it is a safe bet that those requests are handled by the same code.

This would work whether the path segments were word-, integer- or ID-based.

Obviously the choice of threshold for how many children is too many would be informed by your existing datasets, and setting it is a trade-off between false positives (unique code paths that get ignored) and false negatives (redundant probing of repeat code paths).

I expect that using this heuristic with a relatively high threshold (20?) would be effective at reducing the number of redundant URLs crawled, since if a URL path has a low number of children, the cost of redundantly crawling all of them is low compared with a URL path with a high number of children.
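As a rough sketch of this heuristic (the function and variable names are mine, and the threshold of 20 is just the guess from above):

from collections import defaultdict
from urllib.parse import urlsplit

CHILD_THRESHOLD = 20          # the guess from above; tune against real crawl data

# parent path -> set of distinct child segments seen so far
children_seen = defaultdict(set)


def should_crawl(url):
    # Skip a URL once its parent path already has "too many" distinct children,
    # on the assumption that they are all handled by the same back-end code.
    path = urlsplit(url).path.rstrip("/")
    parent, _, child = path.rpartition("/")
    if not child:             # e.g. the site root
        return True
    siblings = children_seen[parent]
    crawl = len(siblings) < CHILD_THRESHOLD
    siblings.add(child)
    return crawl

With your example data, /news/ would stop being crawled after the 20th distinct alias, while /users/ (only three children) would always be crawled.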

You could combine this with other obvious, easy-to-implement fitness measures, such as whether the path includes numeric or ID-based segments.
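For instance (the exact patterns are only illustrative), that check could be a handful of regular expressions applied to the last path segment:

import re

# Heuristic patterns for "ID-like" path segments; extend as needed.
ID_LIKE_PATTERNS = [
    re.compile(r"^\d+$"),                                  # integers: /article/42
    re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
               r"[0-9a-f]{4}-[0-9a-f]{12}$", re.I),        # GUIDs
    re.compile(r"^[0-9a-z]{16,}$", re.I),                  # long opaque tokens
]


def looks_like_id(segment):
    return any(p.match(segment) for p in ID_LIKE_PATTERNS)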

Beyond that, I think you would need to use some kind of semantic analysis of path segments (e.g. words like "article" should score highly), but I suspect that such an approach would have a high effort/reward ratio.

UPDATE to address comment:

One scenario where the latter two techniques could be applied would be to address sites where "duplicate" type pages sit as siblings of unique pages according to a site's path organization. E.g.:

When you identify a site with a large number of 'siblings', your fitness function could (see the sketch after this list):

  • increase the 'probability of uniqueness' score for things like 'matches a dictionary word'
  • decrease the 'probability of uniqueness' score if the path segment matches one of a set of regular expressions for ID-like strings (e.g. integers, GUIDs, fixed-length alphanumeric strings) AND a high number of siblings match the same regular expression.
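A sketch of how those two adjustments might look in code (the weights, the sibling cut-off of 10 and the tiny word list are all made-up placeholders; in practice you would load a real dictionary):

import re

ID_LIKE = re.compile(r"^(\d+|[0-9a-f-]{32,36})$", re.I)   # reuse/extend the patterns above

ENGLISH_WORDS = {"news", "article", "contact", "users", "sign", "up", "in",
                 "forgot", "password"}                     # placeholder word list


def uniqueness_score(segment, sibling_segments):
    # Higher score = more likely this path segment is handled by unique code.
    score = 0.5                                            # neutral prior
    if all(word in ENGLISH_WORDS for word in segment.split("-")):
        score += 0.3                                       # dictionary words suggest a hand-made page
    id_like_siblings = sum(1 for s in sibling_segments if ID_LIKE.match(s))
    if ID_LIKE.match(segment) and id_like_siblings > 10:
        score -= 0.4                                       # many ID-like siblings suggest one code path
    return max(0.0, min(1.0, score))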