Postgres and Word Clouds

Question 1

There is a simple way, but it can be slow (depending on your table size). You can split your text into an array:

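-- "your_table" and its text column "words" are placeholders;
-- a table literally named "table" would need quoting, since TABLE is a reserved word
SELECT string_to_array(lower(words), ' ') FROM your_table;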

With those arrays, you can use unnest to aggregate them:

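WITH words AS (
    -- one row per space-separated token ("your_table" is a placeholder name)
    SELECT unnest(string_to_array(lower(words), ' ')) AS word
    FROM your_table
)
SELECT word, count(*) AS occurrences
FROM words
GROUP BY word
ORDER BY occurrences DESC;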

This is a simple approach, and it has some issues. For example, it only splits words on spaces, not on punctuation marks, so "cloud" and "cloud," are counted as different words.
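If you want to stay with plain string functions, one possible workaround for the punctuation issue is to split on runs of non-alphanumeric characters with regexp_split_to_table instead of on single spaces. This is only a sketch: the inline VALUES row stands in for your real table, and the regular expression is an assumption about what counts as a word separator.

```sql
-- Sketch: split on any run of non-alphanumeric characters.
-- The VALUES list is sample data standing in for a real table.
SELECT word, count(*) AS occurrences
FROM (VALUES ('Hello, world! Hello again.')) AS t(words),
     regexp_split_to_table(lower(t.words), '[^[:alnum:]]+') AS word
WHERE word <> ''          -- drop the empty piece left by trailing punctuation
GROUP BY word
ORDER BY occurrences DESC, word;
```

With the sample row this counts "hello" twice and "again" and "world" once each. It still knows nothing about stop words or word forms, which is where full text search helps.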

Another, and probably better, option is to use PostgreSQL full text search.

Question 2

Late to the party, but I also needed this and wanted to use full text search, which conveniently removes HTML tags.

So basically you convert the text to a tsvector and then use ts_stat:

select word, nentry 
from ts_stat($q$ 
    select to_tsvector('simple', '<div id="main">a b c <b>b c</b></div>') 
$q$)
order by nentry desc

Result:

|word|nentry|
|----|------|
|c   |2     |
|b   |2     |
|a   |1     |
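If you are curious why the tags disappear, ts_debug shows how the text search parser classifies each token. A small sketch using the same sample string: the markup should be reported with the alias tag and no lexemes, because the simple configuration has no dictionary mapped to that token type.

```sql
-- Show how the parser classifies each piece of the sample HTML.
-- Whitespace tokens (alias 'blank') are filtered out to keep the output short.
select alias, token, lexemes
from ts_debug('simple', '<div id="main">a b c <b>b c</b></div>')
where alias <> 'blank';
```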

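But calling to_tsvector inside the query does not scale well, because the text has to be parsed again every time the query runs. So here is what I ended up with: a table that stores the precomputed tsvector next to the HTML.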

Setup:

-- table with a gist index on the tsvector column
create table wordcloud_data (
    html text not null,
    tsv tsvector not null
);
create index on wordcloud_data using gist (tsv);

-- trigger to update the tsvector column
create trigger wordcloud_data_tsvupdate 
    before insert or update on wordcloud_data 
    for each row execute function tsvector_update_trigger(tsv, 'pg_catalog.simple', html);

-- a view for the wordcloud
create view wordcloud as
    select word, nentry
    from ts_stat('select tsv from wordcloud_data')
    order by nentry desc;

Usage:

-- insert some data
insert into wordcloud_data (html) values 
    ('<div id="id1">aaa</div> <b>bbb</b> <i attribute="ignored">ccc</i>'), 
    ('<div class="class1"><span>bbb</span> <strong>ccc</strong> <pre>ddd</pre></div>');

After that, your wordcloud view should look like this (the order of words with the same count is not guaranteed):

|word|nentry|
|----|------|
|ccc |2     |
|bbb |2     |
|ddd |1     |
|aaa |1     |
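For an actual word cloud you usually only want the most frequent words. A sketch of how that could look against the wordcloud_data table above: besides nentry (total occurrences), ts_stat also returns ndoc, the number of rows a word appears in. The limit of 50 is an arbitrary choice for illustration.

```sql
-- Top 50 words by total occurrences; ndoc is the number of rows containing the word.
select word, ndoc, nentry
from ts_stat('select tsv from wordcloud_data')
order by nentry desc, ndoc desc, word
limit 50;
```

The extra sort keys only make the order stable for words with equal counts.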

Bonus features:
Replace simple with, for example, english (in to_tsvector, or as pg_catalog.english in the trigger) and Postgres will strip out stop words and do stemming for you.
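A quick way to see the effect of the english configuration is to run ts_stat on a literal string. The sentence is made-up sample data:

```sql
-- With 'english', stop words ("the", "are", "and") are dropped and
-- "cats"/"cat" and "running"/"runs" are reduced to the same stems.
select word, nentry
from ts_stat($q$
    select to_tsvector('english', 'The cats are running and the cat runs')
$q$)
order by nentry desc, word;
```

This should leave just two words, cat and run, each with nentry 2. If you change the configuration in the trigger of an existing table, the stored tsv values are not recomputed by themselves; rows need to be updated (for example with update wordcloud_data set html = html) so the trigger fires again.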