Question

In a previous question, I asked the community how to count the frequency of each pair of consecutive words in a sentence, and I got a great answer! Now I'm trying to build a word cloud from the results using the pytagcloud package.

The issue I have is that the picture produced is crowded and the words are squashed together. Is there a function to separate the words and make them readable, or is there an alternative way to do this in Python?
Thanks!

My code is below. This is the link to the text I used for testing. I tried using a smaller number of word combinations, but that didn't make the text in the picture any less crowded.
I also tried changing "layout" and "size", and setting fontname='Lobster' and fontzoom=1, but none of these gave the result I'm after: a clean word cloud picture where the words are not crowded.

import operator
import urllib2

from roundup.backends.indexer_common import STOPWORDS
import requests, collections, bs4
Data = "TEXT FROM The link above- TEXT file"
# Split the text into words first; zipping the raw string would pair up characters.
words = Data.split()
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in collections.Counter(two_words).most_common() if f > 12}
sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1))

print sorted_wordscount

from pytagcloud import create_tag_image, create_html_data, make_tags, LAYOUT_HORIZONTAL, LAYOUTS, LAYOUT_MIX, LAYOUT_VERTICAL, LAYOUT_MOST_HORIZONTAL, LAYOUT_MOST_VERTICAL
from pytagcloud.colors import COLOR_SCHEMES
from pytagcloud.lang.counter import get_tag_counts

create_tag_image(make_tags(sorted_wordscount), 'filename.png', size=(1300,1150), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Molengo', rectangular=True)

This is an example of the output I get: HERE
The ideal result would be something similar to one of the images HERE


Solution

You are sorting the tags in ascending order, whereas pytagcloud probably expects them in descending order. Change the sorting line to:

sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1),reverse=True)

Once that is fixed, the key parameter is maxsize in make_tags:

create_tag_image(make_tags(sorted_wordscount[:],maxsize=200), 'filename.png', size=(1300,1150), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Molengo', rectangular=True)

If I understand correctly, this sets the maximum font size (that of the tag with the highest frequency), and all the other sizes are calculated relative to it. The other parameter that influences how the strings are distributed is the size of the image window (the size argument of create_tag_image).

You will have to play with these parameters.
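
To make the effect of maxsize more concrete, here is a minimal sketch of how a tag's font size can be derived from its frequency. It assumes plain linear interpolation between a smallest and a largest size; scale_font_size, the minimum size of 3, and the sample counts are all hypothetical, and pytagcloud's own scaling function may well use a different curve.

```python
def scale_font_size(count, mincount, maxcount, minsize, maxsize):
    """Map a frequency onto a font size by linear interpolation (illustrative only)."""
    if maxcount == mincount:
        # All tags are equally frequent: use the midpoint of the size range.
        return (minsize + maxsize) // 2
    fraction = float(count - mincount) / (maxcount - mincount)
    return int(round(minsize + fraction * (maxsize - minsize)))


# Hypothetical (tag, frequency) pairs, most frequent first.
sample_counts = [("united states", 40), ("shall be", 25), ("of the", 10)]
counts = [count for _, count in sample_counts]

# Compare a small and a large maximum font size for the same frequencies.
for maxsize in (36, 200):
    sizes = [scale_font_size(c, min(counts), max(counts), 3, maxsize) for c in counts]
    print("maxsize=%d -> %s" % (maxsize, sizes))
```

Under this assumption, raising maxsize enlarges the most frequent tag while the rarest one stays at the minimum, so the spread between tags widens. In a fixed-size image, that is the trade-off to balance against the size argument.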

Note that the library function get_tag_counts does more than just return frequencies: it also filters out common words, lowercases the text, and in general should give you a better distribution of tags than the simple sorting you are doing.
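
As a rough picture of that kind of preprocessing, the sketch below lowercases the text, drops a few common words, and returns (word, count) pairs with the most frequent first. count_tags and the tiny COMMON_WORDS set are hypothetical stand-ins, not pytagcloud's implementation, whose tokenisation and stop-word lists may differ.

```python
import collections
import re

# Hypothetical, deliberately tiny stop-word list for illustration.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "be", "shall"}


def count_tags(text, limit=None):
    """Return (word, count) pairs, most frequent first, ignoring common words."""
    # Lowercase first so "The" and "the" are counted as the same word.
    words = re.findall(r"[a-z']+", text.lower())
    counter = collections.Counter(w for w in words if w not in COMMON_WORDS)
    # most_common(None) returns every pair; a number caps the list.
    return counter.most_common(limit)


sample = "The Congress shall have power. The power of the Congress and the President."
print(count_tags(sample, limit=2))
```

A list in this shape can be passed to make_tags in place of the hand-sorted dictionary items, and the limit plays the same role as slicing the result to the first N tags.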

With these changes you should get something like this (obtained with get_tag_counts over the file you linked in your post, in a 1000x1000 window, maxsize=260 and capping to the first 50 tags):

[Image: the resulting tag cloud]

Edit - As requested, the code for creating the image above :

import collections
import operator

from pytagcloud import create_tag_image, make_tags, LAYOUT_MIX
from pytagcloud.lang.counter import get_tag_counts

# Data1 is the whole text in lowercase; Data is its list of words.
with open("./const11.txt") as fh:
  Data1 = fh.read().lower()
Data = Data1.split()

# Two-word counts as in the question, now sorted in descending order.
# They are not used for the image; the tags come from get_tag_counts instead.
two_words = [' '.join(ws) for ws in zip(Data, Data[1:])]
wordscount = {w:f for w, f in collections.Counter(two_words).most_common() if f > 5}
sorted_wordscount = sorted(wordscount.iteritems(), key=operator.itemgetter(1),reverse=True)

# Keep only the first 50 tags and allow a maximum font size of 260.
tags = make_tags(get_tag_counts(Data1)[:50],maxsize=260)
create_tag_image(tags,'filename.png', size=(1000,1000), background=(0, 0, 0, 255), layout=LAYOUT_MIX, fontname='Lobster', rectangular=True)

This was run with Python 2.7.5 on Ubuntu 13.04, with pygame installed via apt-get and the rest of the packages via pip. "const11.txt" is the text file linked in the question.

OTHER TIPS

EDIT: While the TAG_PADDING parameter referenced below in my answer might be of interest for some cases, vinaut's answer is clearly the better one to start with.


Looking at https://github.com/atizo/PyTagCloud/blob/master/pytagcloud/__init__.py, it looks like TAG_PADDING might be the parameter that controls the spacing between words.

Because it's set to a literal value in the source code and referenced in several places, you will either have to change that value in the source to one that suits you better (and repackage/reinstall), or else copy the source into your own project and alter it accordingly.

Licensed under: CC-BY-SA with attribution