Let’s imagine for a second that we have a massive gold standard annotated corpus of Guaraní. We could use that to assign different weights to the different analyses of our surface forms. For example, our generator might look like:
$ hfst-fst2strings grn.lexc.hfst
jagua<n>:jagua
ja<n><gen>:jagua
If you don’t already have them, add these two nouns to your grn.lexc file:
ja:ja N ; ! "ocasión"
jagua:jagua N ; ! "perro"
Let’s say that we saw jagua as “dog” 150 times in our corpus and jagua as “of the occasion” 3 times. It would be fairly straightforward to write a script to convert this corpus into a file with weights that looks like:
jagua<n>:jagua 0.0198
ja<n><gen>:jagua 3.9318
You don’t need to write the script now; just copy the above into a file called grn.weights. There should be a tab character between the surface form, jagua, and the weight.
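Remember that we define weights as negative log probabilities, e.g. -log(r), where r is the relative frequency in the corpus. With the counts above and the natural logarithm, -log(150/153) ≈ 0.0198 and -log(3/153) ≈ 3.9318.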
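As a rough illustration of the conversion script mentioned above, here is a minimal sketch using awk. The file name grn.counts and its layout (a string pair, a tab, then the corpus count) are assumptions made for this example, not something the corpus tooling provides. The relative frequency is computed per surface form.

```shell
# Hypothetical input: one "analysis:surface<TAB>count" pair per line.
printf 'jagua<n>:jagua\t150\nja<n><gen>:jagua\t3\n' > grn.counts

# Weight = -ln(count / total count for the same surface form).
# LC_ALL=C keeps the decimal separator a dot.
LC_ALL=C awk -F'\t' '
  {
    n = split($1, parts, ":")      # surface form is the text after the last colon
    pair[NR] = $1; surf[NR] = parts[n]; count[NR] = $2
    total[parts[n]] += $2
  }
  END {
    for (i = 1; i <= NR; i++)
      printf "%s\t%.4f\n", pair[i], -log(count[i] / total[surf[i]])
  }
' grn.counts > grn.weights

cat grn.weights
```

For the two counts used here this writes the same two lines as the grn.weights file shown above.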
First we make a transducer that contains the string-to-weight mapping from our corpus:
$ cat grn.weights | hfst-strings2fst -j -m grn.symbols -o grn.strweights.hfst
The file grn.symbols lists all of the multicharacter symbols (e.g. <n>), one per line, e.g.
$ cat grn.symbols
<n>
<gen>
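If you would rather not maintain the symbol list by hand, one possible way to derive it from the weights file is sketched below. It assumes every multicharacter symbol is written in angle brackets. The order of the lines in the resulting file differs from the listing above, which should not matter.

```shell
# Same two lines as the grn.weights file in this section.
printf 'jagua<n>:jagua\t0.0198\nja<n><gen>:jagua\t3.9318\n' > grn.weights

# Collect every distinct <...> tag that occurs in the weights file.
# Assumption: multicharacter symbols are always of the form <...>.
grep -o '<[^>]*>' grn.weights | LC_ALL=C sort -u > grn.symbols

cat grn.symbols
```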
Then we subtract the weighted analyses from the unweighted generator and give the remaining analyses a large weight (here 10):
$ hfst-subtract -1 grn.gen.hfst -2 grn.strweights.hfst | hfst-reweight -e -a 10 -o grn.unweighted.hfst
And finally we take the union (merging) of the weighted and the unweighted transducers:
$ hfst-union -1 grn.unweighted.hfst -2 grn.strweights.hfst -o grn.weighted.hfst
And then invert it to make an analyser:
$ hfst-invert grn.weighted.hfst -o grn.mor.weighted.hfst
$ echo "jagua" | hfst-lookup grn.mor.weighted.hfst
jagua jagua<n><nom> 0,019800
jagua ja<n><gen> 3,931800
$ echo "irũ" | hfst-lookup grn.mor.weighted.hfst
irũ irũ<n><nom> 10,000000
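Since the weights are negative log probabilities, exp(-w) turns a corpus-derived weight back into a relative frequency, which can be a handy sanity check. The sketch below hard-codes the two jagua lines from the lookup output instead of calling hfst-lookup, and assumes the output fields are tab-separated. The weight of 10 on irũ is an arbitrary penalty rather than a corpus estimate, so it is left out.

```shell
# Two lines copied from the lookup output (assumed tab-separated);
# in practice you would pipe hfst-lookup into awk instead.
printf 'jagua\tjagua<n><nom>\t0,019800\njagua\tja<n><gen>\t3,931800\n' |
LC_ALL=C awk -F'\t' '
  {
    w = $3
    sub(",", ".", w)               # accept a comma as decimal separator
    printf "%s\t%.4f\n", $2, exp(-w)
  }
' > grn.probs

cat grn.probs
```

The two values come out at roughly 0.9804 and 0.0196, i.e. 150/153 and 3/153.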
And once we have a weighted analyser we can use the beam option -b to prune the output; here -b 1 leaves only the analysis with the lowest weight:
$ echo "jagua" | hfst-lookup -b 1 grn.mor.weighted.hfst
jagua jagua<n><nom> 0,019800