Question

I've written a script which generates possible Twitter handles and checks them for availability. It simply iterates through different combinations of the allowed symbols: a-z, 0-9 and _. Currently it has checked 1926220 combinations, i.e. every one containing 1-5 symbols. In brief: 0 free handles with 1, 2 or 3 symbols, 750 free with 4, and 442711 with 5.
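
For reference, here is a minimal sketch of the enumeration described above, assuming the 37-character alphabet a-z, 0-9 and _ (the availability check itself is left out):

```python
# Sketch of the brute-force handle enumeration; the availability check
# against Twitter is not included here.
from itertools import product
from string import ascii_lowercase, digits

ALPHABET = ascii_lowercase + digits + "_"   # 37 allowed symbols

def candidate_handles(max_len=5):
    """Yield every candidate handle of length 1..max_len, shortest first."""
    for length in range(1, max_len + 1):
        for combo in product(ALPHABET, repeat=length):
            yield "".join(combo)
```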

I'm wondering whether it is possible to write an algorithm which will analyze these lists and find human-readable words among them. Here is an example:

elnsv
elnt8
eloq4
elosu
elq0_
elq15
elq46

The word elosu differs from the others, and it turns out that there is even a town in Spain called Elosu. How do humans distinguish such words? I think I could try building a dictionary of syllables from different languages and comparing the words against it. Can you help me with the formula, or with other ideas?
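
For concreteness, one way the syllable-comparison idea could look is to score each handle by the fraction of its characters that can be covered by syllables from such a dictionary. The syllable set below is only placeholder data, and the coverage formula is just one possibility:

```python
# Placeholder syllable set; a real one would be built from word lists
# in several languages.
SYLLABLES = {"el", "lo", "su", "os", "ta", "ra", "ne", "ko"}

def syllable_coverage(handle, syllables=SYLLABLES, max_len=3):
    """Return the fraction of characters coverable by dictionary syllables."""
    n = len(handle)
    covered = [0] * (n + 1)          # covered[i] = max chars covered in handle[:i]
    for i in range(1, n + 1):
        covered[i] = covered[i - 1]  # option: leave character i-1 uncovered
        for l in range(1, max_len + 1):
            if i - l >= 0 and handle[i - l:i] in syllables:
                covered[i] = max(covered[i], covered[i - l] + l)
    return covered[n] / n

# With the toy set above, syllable_coverage("elosu") == 0.8,
# while syllable_coverage("elq0_") == 0.4.
```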

Update: for those who want to try implementing it, here is the link to the 5-symbol handles.


Solution

I'd try to use the wisdom of the crowd to solve this.

  1. Google shows an approximate number of pages containing the query. For example, for me the query elnsv from your example (making sure not to accept the "did you mean..." correction) gives ~60k results, the query elq0_ has ~23k pages, and the "real" word elosu has ~330k matching pages. This is a strong indication that the word is more likely to be meaningful than the others. So, basically, this approach means: use some search engine and use its result counts to determine what is meaningful and what isn't.

  2. The word elosu has a Wikipedia article; even though it is not the Elosu you meant, it still helps. Note that the Wikipedia approach is very accurate for deciding that a term is a meaningful word, but problematic for eliminating terms, so I'd use it as the first-level 'judge' in a pipeline and feed the rest to other judges.
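
Here is a hedged sketch of such a two-judge pipeline, assuming the standard Wikipedia query API and Google's Custom Search JSON API. The API key, search-engine ID and threshold below are placeholders you would need to supply and tune:

```python
import requests

GOOGLE_API_KEY = "YOUR_KEY"   # placeholder credentials
GOOGLE_CX = "YOUR_CX"

def has_wikipedia_article(term, lang="en"):
    """True if an article with this exact title exists on Wikipedia."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "titles": term, "format": "json"},
        timeout=10,
    ).json()
    return "-1" not in resp["query"]["pages"]   # page id -1 means "missing"

def google_hit_count(term):
    """Approximate number of result pages, via the Custom Search JSON API."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": GOOGLE_API_KEY, "cx": GOOGLE_CX,
                "q": f'"{term}"'},              # quoted to discourage spell-correction
        timeout=10,
    ).json()
    return int(resp["searchInformation"]["totalResults"])

def looks_meaningful(term, hit_threshold=100_000):
    if has_wikipedia_article(term):
        return True                             # first-level judge
    return google_hit_count(term) >= hit_threshold  # fallback judge
```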

Other Tips

Well, you may have to think like a human when programming this: which string would you recognise first when you look at the list? For algorithms like this you should either use artificial intelligence or use a Google search API.

Let's take the example words given above. You have 5-character handles that may contain digits.

The words with the fewest numeric characters are the ones most easily recognised by a human. In your case I would follow this rule and build a program around it.

Words with higher priority, in descending order:

Words with 5 letters have the highest priority.

Words with 4 letters (exception: the digit should not appear in the first 4 positions).

Words with 3 letters (exception: the digits should not appear in the first 3 positions).

and so on....

The lowest priority goes to words with a special character in the first or last position.

Words with a special character in the middle should get no priority at all; a small scoring sketch based on these rules follows below.
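
The sketch below turns these rules into a sort key for 5-character handles; the exact weighting is my own interpretation of the rules above, not a fixed formula:

```python
def priority_key(handle):
    """Sort key: lower sorts first, i.e. more likely to read as a word."""
    letters = sum(c.isalpha() for c in handle)
    has_mid_underscore = "_" in handle[1:-1]
    has_edge_underscore = handle[0] == "_" or handle[-1] == "_"
    # earliest position of a non-letter; letters-only handles get len(handle)
    first_non_letter = next(
        (i for i, c in enumerate(handle) if not c.isalpha()), len(handle)
    )
    return (
        has_mid_underscore,    # underscore in the middle -> last
        has_edge_underscore,   # underscore at first/last position -> next to last
        -letters,              # more letters -> higher priority
        -first_non_letter,     # digits pushed towards the end rank better
    )

handles = ["elnsv", "elnt8", "eloq4", "elosu", "elq0_", "elq15", "elq46"]
for h in sorted(handles, key=priority_key):
    print(h)
# letters-only handles (elnsv, elosu) come first, elq0_ comes last
```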

I tried searching Google for elnsv and the result was auto-corrected to ensv, which is the stock symbol of ENSERVCO CORP. So I would either skip such a word or record the relationship.

In your case the algorithm goes like this: collect statistics on which words make sense and what they look like, e.g. whether words containing digits make sense at all. Put the words in an array and sort them (insertion sort will do). Use a dictionary array to find relationships, and skip words with special characters for the dictionary check. For the remaining words with special characters or digits, try a web-based search to see whether a meaning exists; the words left over at the end are probably not identifiable by either a human or a machine, so take the help of a search engine.
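
A rough sketch of that dictionary step, assuming a plain local word list (the path is just an example); handles containing digits or underscores are deferred to a web-search step like the one sketched in the accepted answer:

```python
def load_dictionary(path="/usr/share/dict/words"):
    """Load a plain word list into a set; the path is an assumption."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return {line.strip().lower() for line in f}

def classify(handle, dictionary):
    if not handle.isalpha():
        return "needs-web-search"   # digits/underscores: defer to a search engine
    if handle.lower() in dictionary:
        return "dictionary-word"
    return "unknown"

# usage:
# words = load_dictionary()
# print(classify("elosu", words), classify("elq0_", words))
```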

I don't know if my answer is correct, so I will definitely try my code on the list you provided.

Learn a Markov model for English words (using letters, bigrams, etc.) and check how probable the generated word is. This is, of course, not foolproof, but should give you decent results.
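
A minimal sketch of such a model, using character bigrams trained on any word list you have (the word-list path in the usage comment is an assumption):

```python
import math
from collections import defaultdict

def train_bigram_model(words):
    """Estimate P(next char | current char) from a word list."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        w = "^" + w.lower() + "$"            # start/end markers
        for a, b in zip(w, w[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nexts.values()) for b, c in nexts.items()}
            for a, nexts in counts.items()}

def avg_log_likelihood(handle, model, floor=1e-6):
    """Average log-probability per character transition; higher = more word-like."""
    w = "^" + handle.lower() + "$"
    score = sum(math.log(model.get(a, {}).get(b, floor)) for a, b in zip(w, w[1:]))
    return score / (len(w) - 1)

# usage (word-list path is an assumption):
# with open("/usr/share/dict/words") as f:
#     model = train_bigram_model(line.strip() for line in f if line.strip().isalpha())
# print(avg_log_likelihood("elosu", model), avg_log_likelihood("elq0_", model))
# "elosu" should score noticeably higher than "elq0_"
```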

The problem of generating pronounceable passwords is very similar, and there has been some work in that area. See, for example, this related question.

Licensed under: CC-BY-SA with attribution