Question

I am trying to work out the overhead of the ASP.NET auto-naming of server controls. I have a page which contains 7,000 lines of HTML rendered from hundreds of nested ASP.NET controls, many of which have id / name attributes that are hundreds of characters in length.

Ideally I would like something that extracts every HTML attribute value beginning with "ctl00" into a list. The regex Find function in Notepad++ would be perfect, if only I knew what the regex should be.

As an example, if the HTML is:
<input name="ctl00$Header$Search$Keywords" type="text" maxlength="50" class="search" />

I would like the output to be something like:
name="ctl00$Header$Search$Keywords"

A more advanced search might also include the element name (e.g. to show the control type):
input|name="ctl00$Header$Search$Keywords"

To cope with both id and name attributes, I will simply rerun the search looking for id instead of name (i.e. I don't need something that searches for both at the same time).

The final output will be an Excel report that lists the number of server controls on the page and the length of each control's name, possibly sorted by control type.

Solution 3

Answering my own question, the easiest way to do this is to use BeautifulSoup, the 'dirty HTML' Python parser whose tagline is:

"You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like. Neither does this parser."

It works, and it's available from http://crummy.com/software/BeautifulSoup
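For anyone who wants to try the parser-based approach without installing anything, here is a minimal sketch that swaps in Python's standard-library html.parser for BeautifulSoup (that substitution is mine, not part of the original solution). The class name Ctl00Collector, the collect helper and the "tag|attr=value" output format are hypothetical, chosen to mirror the example in the question.

```python
from collections import Counter
from html.parser import HTMLParser


class Ctl00Collector(HTMLParser):
    """Collect (tag, attribute, value) for id/name values with a given prefix."""

    def __init__(self, prefix="ctl00", attributes=("id", "name")):
        super().__init__()
        self.prefix = prefix          # assumed prefix, taken from the question
        self.attributes = attributes  # attribute names to inspect
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. The parser lower-cases
        # attribute names, and value is None for valueless attributes.
        for name, value in attrs:
            if name in self.attributes and value and value.startswith(self.prefix):
                self.found.append((tag, name, value))


def collect(html):
    """Parse the HTML text and return the matching (tag, attr, value) rows."""
    parser = Ctl00Collector()
    parser.feed(html)
    parser.close()
    return parser.found


sample = (
    '<input name="ctl00$Header$Search$Keywords" type="text" '
    'maxlength="50" class="search" />'
)
rows = collect(sample)

for tag, attr, value in rows:
    # One line per control: element name, attribute, and the name's length.
    print(f'{tag}|{attr}="{value}" ({len(value)} characters)')

# Number of matching controls per element type.
print(Counter(tag for tag, _, _ in rows))
```

The rows could then be written out with the csv module and opened in Excel for the final report.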

OTHER TIPS

Quick and dirty:

Search for

\w+\s*=\s*"ctl00[^"]*"

This will match any text that looks like an attribute, e.g. name="ctl00test" or attr = "ctl00longer text". It will not check whether the match really occurs within an HTML tag; that is a little more difficult to do and perhaps unnecessary. It will also not handle escaped quotes within the attribute's value. As usual with regexes, the complexity required depends on what exactly you want to match and what your input looks like...
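If you want to check the pattern outside Notepad++, a rough sketch with Python's re module could look like this. The function name find_ctl00_attributes and the sample string are only illustrative; the pattern itself is the one given above.

```python
import re

# Attribute name, optional whitespace around '=', then a double-quoted
# value that starts with "ctl00" and runs to the next double quote.
ATTR_RE = re.compile(r'\w+\s*=\s*"ctl00[^"]*"')


def find_ctl00_attributes(html):
    """Return every attribute-like match, in document order."""
    return ATTR_RE.findall(html)


sample = (
    '<input name="ctl00$Header$Search$Keywords" type="text" '
    'maxlength="50" class="search" />'
)

for match in find_ctl00_attributes(sample):
    print(match)
```

Like the Notepad++ search, this matches anywhere in the text, not only inside tags.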

"7000"? "Hundreds"? Dear god.

Since you're just looking at source in a text editor, try this... /(id|name)="ct[^"]*"/

I suggest XPath, as in this question.

Licensed under: CC-BY-SA with attribution