Late to the party but I also needet this and wanted to use full text search.
Which conveniently removes html tags.
So basically you convert the text to a tsvector
and then use ts_stat
:
select word, nentry
from ts_stat($q$
select to_tsvector('simple', '<div id="main">a b c <b>b c</b></div>')
$q$)
order by nentry desc
Result:
|word|nentry|
|----|------|
|c |2 |
|b |2 |
|a |1 |
But this does not scale well, so here is what I endet up with:
Setup:
-- table with a gist index on the tsvector column
create table wordcloud_data (
html text not null,
tsv tsvector not null
);
create index on wordcloud_data using gist (tsv);
-- trigger to update the tsvector column
create trigger wordcloud_data_tsvupdate
before insert or update on wordcloud_data
for each row execute function tsvector_update_trigger(tsv, 'pg_catalog.simple', html);
-- a view for the wordcloud
create view wordcloud as select word, nentry from ts_stat('select tsv from wordcloud_data') order by nentry desc;
Usage:
-- insert some data
insert into wordcloud_data (html) values
('<div id="id1">aaa</div> <b>bbb</b> <i attribute="ignored">ccc</i>'),
('<div class="class1"><span>bbb</span> <strong>ccc</strong> <pre>ddd</pre></div>');
After that your wordcloud
view should look like this:
|word|nentry|
|----|------|
|ccc |2 |
|bbb |2 |
|ddd |1 |
|aaa |1 |
Bonus features:
Replace simple
with for example english
and postgres will strip out stop words and do stemming for you.