Question

I currently have a (natural language) corpus, and these are the steps already taken:

  1. Generated the symbol table after concatenating the corpus into one big file:

    $ ngramsymbols <corpus.txt >corpus.syms
    
  2. Given this symbol table, converted the corpus to a binary FST archive (FAR):

    $ farcompilestrings -symbols=corpus.syms -keep_symbols=1 corpus.txt > corpus.far
    

I want to take the union of all the FSTs in the FAR, and compute the highest-weight path from start state to final state. To test from shell, this is what I did:

    $ farextract corpus.far # generates fst files corpus.txt-01, corpus.txt-02, ...
    $ fstarcsort --sort_type=olabel corpus.txt-01 1.fst
    $ fstarcsort --sort_type=ilabel corpus.txt-02 2.fst
    $ fstunion 1.fst 2.fst 12.fst

But I keep running into the following error:

WARNING: CompatSymbols: first symbol table present but second missing

ERROR: Union: input/output symbol tables of 1st argument do not match input/output symbol tables of 2nd argument

This error, of course, persists if I try to run a binary operation without sorting the FSTs first.

I think I am not sorting the FSTs correctly, or ... I have completely misunderstood how to use the symbol tables. Any idea why the union (or any other binary operation, for that matter) is failing like this?


Solution

When you extract the components from the FAR archive, the symbol table is attached only to the first FST in the archive. When combining FSTs, the symbol tables embedded in the individual FSTs need to match each other. For example, union requires the input symbols of all components to be the same and the output symbols of all components to be the same. Composition needs the output symbols of the left machine to match the input symbols of the right machine.
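To see which of the extracted files actually carry a symbol table, something like the following sketch may help. The function name show_symbol_tables is made up for illustration, and it assumes fstinfo reports the tables on lines containing the text "symbol table", which is how its output looks in the OpenFst versions I have seen.

```shell
# show_symbol_tables FST...: print the symbol-table lines of fstinfo per file.
# Assumption: fstinfo prints "input symbol table" / "output symbol table" rows.
show_symbol_tables() {
    for f in "$@"; do
        printf '%s\n' "$f"
        fstinfo "$f" | grep 'symbol table'
    done
}

# Usage with the files from the question:
#   show_symbol_tables corpus.txt-01 corpus.txt-02
```

If the first file names a table and the second shows none, that matches the CompatSymbols warning in the question.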

You can clear symbols from an FST using the fstsymbols command:

    fstsymbols --clear_isymbols --clear_osymbols with-syms.fst > no-syms.fst

Removing the symbols from corpus.txt-01 should solve this problem. Alternatively, you can compile the FAR file without the --keep_symbols flag.
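Applied to the two files from the question, the first fix could be sketched roughly as below. The helper name clear_and_union is hypothetical, and the sketch assumes the second operand carries no embedded symbol tables, as the warning in the question suggests.

```shell
# clear_and_union A B OUT: strip the symbol tables from A, then union with B.
# Assumption: B has no embedded symbol tables of its own.
clear_and_union() {
    first=$1; second=$2; out=$3
    tmp=$(mktemp) || return 1
    fstsymbols --clear_isymbols --clear_osymbols "$first" > "$tmp" &&
        fstunion "$tmp" "$second" "$out"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage with the files from the question:
#   clear_and_union corpus.txt-01 corpus.txt-02 12.fst
#   fstprint --isymbols=corpus.syms --osymbols=corpus.syms 12.fst
```

Since the result no longer embeds any symbols, the external corpus.syms table has to be passed to fstprint to get readable labels.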

For the union command you don't need to sort the arcs of the component machines before combining them; however, you would normally need to sort them before composing them.

If your text corpus is large, you might find it much quicker to construct the unioned FST directly from the text file using the C++ interface or other bindings such as pyfst.
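For a small corpus, staying in the shell is still workable. The sketch below folds fstunion over every extracted file; union_all is an illustrative name, and it assumes none of the inputs carries embedded symbol tables (clear them first if needed). Each step rewrites the whole accumulated machine, which is one reason this gets slow on a large corpus.

```shell
# union_all OUT FST...: fold fstunion over the given FST files.
# Assumption: no input has embedded symbol tables.
union_all() {
    out=$1; shift
    [ "$#" -ge 1 ] || { echo "usage: union_all OUT FST..." >&2; return 1; }
    acc=$(mktemp) || return 1
    next=$(mktemp) || { rm -f "$acc"; return 1; }
    cp "$1" "$acc"; shift
    for f in "$@"; do
        # Union the accumulator with the next file, then make that the accumulator.
        fstunion "$acc" "$f" "$next" || { rm -f "$acc" "$next"; return 1; }
        mv "$next" "$acc"
    done
    mv "$acc" "$out"
    rm -f "$next"
}

# Usage (after clearing symbols from corpus.txt-01):
#   union_all all.fst corpus.txt-*
```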

Licensed under: CC-BY-SA with attribution