Question

Does anyone know of a tool or script that will crawl my website and count the number of headings on every page? I would like to know how many pages on my site have more than 4 headings (h1). I have Screaming Frog, but it only counts the first two H1 elements. Any help is appreciated.

Solution 3

I found a tool on CodeCanyon: Scrap(e) Website Analyzer: http://codecanyon.net/item/scrap-website-analyzer/3789481.

As you will see from some of my comments, it needed a small amount of configuration, but it is working well so far.

Thanks BeniBela, I will also look at your solution and report back.

OTHER TIPS

My Xidel can do that, e.g.:

 xidel http://stackoverflow.com/questions/14608312/seo-web-crawling-tool-to-count-number-of-headings-h1-h2-h3 -e 'concat($url, ": ", count(//h1))' -f '//a[matches(@href, "http://[^/]*stackoverflow.com/")]'

The XPath expression in the -e argument tells it to count the h1 tags on each page, and the -f option tells it which links to follow to reach the other pages.

This is such a specific task that I would just recommend you write it yourself. The simplest thing you need is an XPath selector to give you the h1/h2/h3 tags.

Counting the headings:

  1. Pick any one of your favorite programming languages.
  2. Issue a web request for a page on your website (Ruby, Perl, PHP).
  3. Parse the HTML.
  4. Invoke the XPath heading selector and count the number of elements it returns (see the sketch below).
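
For example, here is a minimal sketch of those steps in Python (assuming the requests and lxml libraries are installed; the URL is a placeholder):

  import requests
  from lxml import html

  def count_headings(url):
      # Fetch the page and parse the HTML
      response = requests.get(url, timeout=10)
      tree = html.fromstring(response.content)
      # XPath union selector for the heading tags we care about
      return len(tree.xpath('//h1 | //h2 | //h3'))

  print(count_headings('http://example.com/'))  # placeholder URL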

Crawling your site:

Do steps 2 through 4 for all of your pages (you'll probably need a queue of pages that you want to crawl). If you want to crawl all of the pages, it will be just a little more complicated:

  1. Crawl your home page.
  2. Select all anchor tags.
  3. Extract the URL from each href and discard any URLs that don't point to your website.
  4. Perform a URL-seen test: if you have seen it before, then discard it; otherwise queue it for crawling (see the sketch after this list).
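
A crawl loop along those lines might look like this (same Python libraries as above; the heading count is inlined from the earlier sketch, and error handling and URL normalization are omitted):

  from collections import deque
  from urllib.parse import urljoin, urlparse

  import requests
  from lxml import html

  def crawl(start_url):
      site = urlparse(start_url).netloc
      queue = deque([start_url])  # pages waiting to be crawled
      seen = {start_url}          # the URL-seen test, explained below
      stats = {}                  # URL -> heading count
      while queue:
          url = queue.popleft()
          response = requests.get(url, timeout=10)
          tree = html.fromstring(response.content)
          stats[url] = len(tree.xpath('//h1 | //h2 | //h3'))
          # Select all anchor tags; keep only links that point to our site
          for href in tree.xpath('//a/@href'):
              link = urljoin(url, href)
              if urlparse(link).netloc == site and link not in seen:
                  seen.add(link)
                  queue.append(link)
      return stats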

URL-Seen test:

The URL-seen test is pretty simple: just add all the URLs you've seen so far to a hash map. If you run into a URL that is in your hash map, then you can ignore it. If it's not in the hash map, then add it to the crawl queue. The key for the hash map should be the URL and the value should be some kind of a structure that allows you to keep statistics for the headings:

  Key   = URL
  Value = struct{ h1Count, h2Count, h3Count, ... }
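
In Python, for instance, that could be a plain dict keyed by URL whose values are a small dataclass (the names here are illustrative):

  from dataclasses import dataclass

  @dataclass
  class HeadingStats:
      h1_count: int = 0
      h2_count: int = 0
      h3_count: int = 0

  pages = {}  # URL -> HeadingStats
  pages['http://example.com/'] = HeadingStats(h1_count=2, h2_count=5, h3_count=8)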

That should be about it. I know it seems like a lot, but it shouldn't be more than a few hundred lines of code!

You might use the xPather Chrome extension or similar, with the XPath query:

count(//*[self::h1 or self::h2 or self::h3])
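
If you would rather evaluate that query from a script than a browser extension, lxml can run it directly; note that count() returns a float (the markup below is a placeholder):

  from lxml import html

  doc = '<html><body><h1>a</h1><h2>b</h2><h3>c</h3></body></html>'
  tree = html.fromstring(doc)
  total = tree.xpath('count(//*[self::h1 or self::h2 or self::h3])')
  print(int(total))  # 3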

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow