Extracting all the words starting with a particular letter from wordnet

https://stackoverflow.com/questions/3429918

26-09-2019

Question

How can I extract all the words that start with a particular letter from WordNet? For example, if I type A, WordNet should return all the words that start with the letter A.

Solution

The easiest way I can see is to download their database from here and then parse the space-separated data files (data.adj, data.adv, data.noun, data.verb), taking the 5th element of each line and placing it into a suitable data structure.

Possibly a hash table with the starting letter as the key and, as the value, an array of the words that start with that letter.

Whether you use dynamic arrays, or regular arrays with a first pass over the file to count the words for each letter (the array size), is up to you.
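The hash table idea above can be sketched in Python as follows. This is only a rough sketch, not a polished implementation: the function names words_in_line and index_by_first_letter are made up for illustration. It assumes the layout described in the WordNet database format documentation, where the 4th field is a two-digit hexadecimal word count and each word is followed by a lex_id, so a synset with several words has more than just the 5th field to collect. It also assumes that the license header lines start with two spaces.

```python
from collections import defaultdict


def words_in_line(line):
    """Return the words stored in one WordNet data-file line (may be empty)."""
    if line.startswith("  "):        # assumed: license header lines start with two spaces
        return []
    fields = line.split(" ")
    if len(fields) < 5:
        return []
    w_cnt = int(fields[3], 16)       # word count is written in hexadecimal
    # Words sit at positions 4, 6, 8, ...; each one is followed by its lex_id.
    return fields[4:4 + 2 * w_cnt:2]


def index_by_first_letter(lines):
    """Group every word found in the given lines under its lowercased first letter."""
    index = defaultdict(set)
    for line in lines:
        for word in words_in_line(line):
            if word:
                index[word[0].lower()].add(word)
    return index
```

To use it on a real file, pass an open file object, for example index_by_first_letter(open("data.noun")), and then look up index["a"]. A set is used as the value so that a word appearing in several synsets is stored only once.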

The following code sample is written in C. It reads through a WordNet data file and prints the 5th field of each line, which is the word in question. It is by no means polished and was quickly made.

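#include <stdio.h>
#include <string.h>

int main(void)
{
  FILE *fp = fopen("data.noun", "r");
  if (fp == NULL)
  {
      perror("data.noun");
      return 1;
  }

  char line[3000];
  int at_line_start = 1;
  while (fgets(line, sizeof line, fp) != NULL)
  {
      /* A line longer than the buffer arrives in several chunks;
         only the first chunk holds the word. */
      int is_start = at_line_start;
      at_line_start = (strchr(line, '\n') != NULL);

      /* Skip continuation chunks and the license header lines,
         which start with a space. */
      if (!is_start || line[0] == ' ')
          continue;

      int count = 1;
      char *result = strtok(line, " ");
      while (result != NULL)
      {
          /* The 5th space-separated field is the first word of the synset. */
          if (count == 5)
          {
              printf("result is \"%s\"\n", result);
              break;
          }
          result = strtok(NULL, " ");
          count++;
      }
  }
  fclose(fp);
  return 0;
}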

For further documentation on the WordNet database format see here
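The database also ships index files (index.noun, index.verb, index.adj, index.adv) in which, as far as I know, each line begins with a lowercase lemma, with underscores in place of spaces. If that holds for your copy, filtering by first letter only needs the first field of each line. A minimal sketch in Python, where lemmas_starting_with is a hypothetical helper name and header lines are assumed to start with two spaces:

```python
def lemmas_starting_with(lines, letter):
    """Return the first field of every index line whose lemma starts with `letter`."""
    letter = letter.lower()
    return [
        line.split(" ", 1)[0]
        for line in lines
        # assumed: license header lines start with two spaces
        if not line.startswith("  ") and line[:1].lower() == letter
    ]
```

For example, lemmas_starting_with(open("index.noun"), "a") would collect the noun lemmas beginning with a. Unlike the data files, each lemma appears once per index file, so no de-duplication is needed.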

If you want to use the WordNet C API instead, see the findtheinfo function documented here, though I don't think that API call is designed to return the sort of information you want.

OTHER TIPS

In Python, after you've downloaded the .tab file from Open Multilingual Wordnet, you can try this recipe:

import codecs

# Read Open Multilingual Wordnet's .tab file
def readWNfile(wnfile, option="ss"):
  wn = {}
  with codecs.open(wnfile, "r", "utf8") as reader:
    for l in reader:
      # Skip comment lines and blank lines
      if l.startswith("#") or not l.strip(): continue
      fields = l.rstrip("\r\n").split("\t")
      if option == "ss":
        k = fields[0] # synset as key
        v = fields[2] # word as value
      else:
        v = fields[0] # synset as value
        k = fields[2] # word as key
      # Join multiple values for the same key with ";"
      if k in wn:
        wn[k] = wn[k] + ";" + v
      else:
        wn[k] = v
  return wn

princetonWN = readWNfile('wn-data-eng.tab', 'word')

# Print every word starting with "a", along with its synsets
for i in princetonWN:
    if i[0] == "a":
        print(i, princetonWN[i].split(";"))

Licensed under: CC-BY-SA with attribution
