Question

I was thinking of writing a PHP script that would analyse a CMS'd page's content (i.e. database field) and then auto-generate (X)HTML META description & keyword tags, but as always there's no point reinventing the wheel so I'm wondering if anyone knows of such a beastie?

The former I imagine would be something like a relatively straightforward regex to grab the first sentence or two, whereas the latter would probably involve elimination of words against a common-words dictionary and then weighting of frequency or similar.


Solution

The problem you're describing is really two problems: keyword extraction and document summarization. The first, which I'd use for the keywords tag, has a very simple naive approach: take the most frequent words in the content after removing all stopwords (look these up on Wikipedia if you don't know what they are). There are many more advanced methods, including weighting for synonyms, for a word's location in the text or markup, and more. There are a few easy keyword extraction scripts in PHP that you can probably adopt without trouble. Search for something like "PHP keyword extraction" and you'll find a few.
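The naive approach above is small enough to sketch. This is an illustration in Python rather than PHP, using only the standard library; the function name `extract_keywords`, the tiny stopword list, and the minimum word length of three are all assumptions made for the example, and the same logic ports to PHP with `preg_match_all` and `array_count_values`.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration (an assumption);
# a real site would want a much fuller list for its language.
STOPWORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})


def extract_keywords(text, limit=5):
    """Return up to `limit` of the most frequent non-stopword words."""
    # Crudely drop anything that looks like a tag; a real HTML parser
    # would be more robust for messy CMS content.
    plain = re.sub(r"<[^>]+>", " ", text)
    # Lowercase and pull out runs of letters (allowing ' and - inside a word).
    words = re.findall(r"[a-z][a-z'-]*", plain.lower())
    # Count everything that is not a stopword and is longer than two letters.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


if __name__ == "__main__":
    sample = "<p>The cat sat on the mat. The cat likes fish, and the fish fears the cat.</p>"
    print(extract_keywords(sample, limit=2))
```

Weighting by position or markup (for example, counting words in headings more heavily) could be layered on by adding to the counter with different increments, but that goes beyond the naive version.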

The second problem, on the other hand, is a little more difficult, and is still the subject of a lot of academic work. You'd need summarization for a very thorough meta description tag. It may not be worth your time unless you're after a long-term AI project, and even then the output may come off as rigid or incoherent. Another approach would be a simple heuristic built on keyword extraction: "This article is about (first most common keyword), (second most common keyword), and (third most common keyword)." You at least get the benefit of fitting some of the content into both the keywords and the description. If you'd like to shake it up, use some synonyms instead. There is a semi-functional PHP implementation of WordNet, but I'd suggest outsourcing to the Natural Language Toolkit for Python for the heavy lifting there, as most of the work is already done for you.
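As a rough sketch of the cheaper options, the following Python shows both the first-sentence idea from the question and the keyword template described above, plus escaping of the result for the meta tags. The function names, the 160-character cap, and the sentence-splitting regex are assumptions chosen for illustration; the regex in particular will mis-split on abbreviations such as "e.g.", so treat it as a starting point rather than a real summarizer.

```python
import html
import re


def first_sentences(text, max_sentences=2, max_length=160):
    """Use the first sentence or two of the content as a description."""
    plain = html.unescape(re.sub(r"<[^>]+>", " ", text))  # crude tag removal
    plain = re.sub(r"\s+", " ", plain).strip()            # collapse whitespace
    # Split after ., ! or ? when followed by whitespace (naive on purpose).
    sentences = re.split(r"(?<=[.!?])\s+", plain)
    summary = " ".join(sentences[:max_sentences])
    if len(summary) > max_length:
        # Cut at the last word boundary before the cap and mark the truncation.
        summary = summary[:max_length].rsplit(" ", 1)[0].rstrip(",;:") + "..."
    return summary


def keyword_description(keywords):
    """Fill the 'This article is about ...' template with up to three keywords."""
    top = list(keywords)[:3]
    if not top:
        return ""
    if len(top) == 1:
        listed = top[0]
    elif len(top) == 2:
        listed = f"{top[0]} and {top[1]}"
    else:
        listed = f"{top[0]}, {top[1]}, and {top[2]}"
    return f"This article is about {listed}."


def meta_tags(description, keywords):
    """Render both tags, escaping quotes so the attributes stay well-formed."""
    desc = html.escape(description, quote=True)
    keys = html.escape(", ".join(keywords), quote=True)
    return (f'<meta name="description" content="{desc}" />\n'
            f'<meta name="keywords" content="{keys}" />')


if __name__ == "__main__":
    page = "<p>Hello world. This is a test! And a third sentence.</p>"
    print(meta_tags(first_sentences(page), ["hello", "world", "test"]))
```

One way to combine them would be to prefer `first_sentences` and fall back to `keyword_description` when the page has no usable opening text.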

I'd like to take a brief moment to encourage your research in this area and ignore the naysaying from Mr. Warnica. Meta information is important both for document classification and information extraction in the area of search. It would be foolish not to have the data, and it is, in fact, worthwhile to automate it for large-scale content management systems. Good luck with your efforts.

OTHER TIPS

The Yahoo Pipes Term Extractor module does something similar to what you want. Unfortunately, as far as I know, the source code for Pipes modules is not open.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow