Can someone explain how this mgiza script works?

https://stackoverflow.com/questions/5375980

28-10-2019

Question

$:~/mgizapp/scripts$ ./plain2snt-hasvcb.py
Error, the input should be 
./plain2snt-hasvcb.py evcb fvcb etxt ftxt esnt(out) fsnt(out) evcbx(out) fvcbx(out)
You should concatenate the evcbx and fvcbx to existing vcb files

Can someone explain all the arcane inputs to the plain2snt script? The script comes from MGIZA++, a word-alignment program: http://geek.kyloo.net/software/doku.php/mgiza:forcealignment

evcb = ?  # is it the source.vcb file?
fvcb = ?  # is it the target.vcb file?
esnt(out) = ?
fsnt(out) = ?
evcbx(out) = ?
fvcbx(out) = ?

Answer

I managed to get it to work with the following commands:

$mkcls -n10 -psourcelangfile.vcb -Vsourcelangfile.vcb.classes
$mkcls -n10 -ptargetlangfile.vcb -Vtargetlangfile.vcb.classes
$plain2snt sourcelangfile targetlangfile
$snt2cooc sourcelang_targetlang.cooc sourcelangfile.vcb targetlangfile.vcb sourcelangfile_targetlangfile.snt

Solution

Based on my experience with GIZA++ (not MGIZA itself) and the page you link to, I'd say evcb and fvcb are the "English" and "Foreign" vocabulary files you have already generated, and etxt and ftxt are the "English" and "Foreign" plain-text inputs. esnt and fsnt then appear to be the "English" and "Foreign" sentence output files (probably the sentences with each word replaced by its unique identifier from the vcb files). Finally, evcbx and fvcbx seem to be output files holding the vocabulary eXtensions, which you then concatenate onto the original vcb files.
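To make that guess concrete, here is a minimal Python sketch of the idea, not the actual plain2snt-hasvcb.py code. It assumes each vcb line has the form "id word count", which is how I remember GIZA++ vocabulary files looking. The function names load_vcb and encode_with_extension are hypothetical and chosen only for illustration. Words already in the vocabulary keep their ids, unseen words get fresh ids, and the new entries are collected separately, which is roughly what I would expect to land in evcbx/fvcbx.

```python
def load_vcb(lines):
    """Parse vocabulary lines, assumed to look like 'id word count'."""
    word_to_id = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            word_to_id[parts[1]] = int(parts[0])
    return word_to_id


def encode_with_extension(sentences, word_to_id):
    """Replace each word by its id; unseen words receive new ids.

    Returns the encoded sentences and the extension entries
    (new 'id word count' lines) for words missing from the vocabulary.
    """
    vocab = dict(word_to_id)  # copy so the caller's mapping is untouched
    next_id = max(vocab.values(), default=1) + 1
    new_counts = {}  # counts only for words that were not in the vocabulary
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in vocab:
                vocab[word] = next_id
                next_id += 1
                new_counts[word] = 0
            if word in new_counts:
                new_counts[word] += 1
            ids.append(vocab[word])
        encoded.append(ids)
    extension = [f"{vocab[w]} {w} {c}" for w, c in new_counts.items()]
    return encoded, extension


# Toy data, made up for this example.
evcb = load_vcb(["2 the 10", "3 cat 4"])
encoded, evcbx = encode_with_extension(["the cat sleeps", "the dog sleeps"], evcb)
print(encoded)  # id sequences, one list per sentence
print(evcbx)    # lines to append to the existing vcb file
```

If the real script behaves like this, appending the evcbx lines to the existing evcb file (and likewise for fvcbx and fvcb) would give a vocabulary that covers every id used in the snt output.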

I hope this helps, and I hope someone else who's used MGIZA can jump in and correct me if I am wrong.

Licensed under: CC-BY-SA with attribution
