Francis Tyers

Weighting

Analyses

Let’s imagine for a second that we have a massive gold-standard annotated corpus of Guaraní. We could use it to assign different weights to the different analyses of our surface forms. For example, the output of our generator might look like:

$ hfst-fst2strings grn.lexc.hfst 
jagua<n>:jagua
ja<n><gen>:jagua

If you don’t already have them, add these two nouns to your grn.lexc file:

ja:ja N ; ! "ocasión"
jagua:jagua N ; ! "perro"

Let’s say that we saw jagua as “dog” 150 times in our corpus and jagua as “of the occasion” 3 times. It would be fairly straightforward to write a script to convert this corpus into a file with weights that looks like:

jagua<n>:jagua	0.0198
ja<n><gen>:jagua	3.9318

You don’t need to create the script now, just copy the above into a file called grn.weights. There should be a single tab character between the analysis:surface pair (which ends in the surface form, jagua) and the weight.

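Remember that we define weights as negative log probabilities, i.e. w = -log(r), where r is the relative frequency in the corpus and log is the natural logarithm. For example, jagua as “dog” accounts for 150 of the 153 occurrences, so its weight is -log(150/153) ≈ 0.0198.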
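The conversion script mentioned above could be sketched roughly as follows. This is only an illustration: the function name corpus_weights and the counts dictionary are assumptions made for the example, and the counts are normalised over all the pairs given, which here are just the two readings of jagua.

```python
import math

def corpus_weights(counts):
    """Map each analysis:surface pair to -ln(relative frequency)."""
    total = sum(counts.values())
    return {pair: -math.log(n / total) for pair, n in counts.items()}

# Counts taken from the example above; in practice they would be
# collected from the annotated corpus (hypothetical data structure).
counts = {"jagua<n>:jagua": 150, "ja<n><gen>:jagua": 3}

weights = corpus_weights(counts)

# One line per pair, with a tab between the pair and the weight,
# which is the layout used in grn.weights.
lines = [f"{pair}\t{w:.4f}" for pair, w in weights.items()]
print("\n".join(lines))
```

Redirecting the printed lines to grn.weights would give the file described above.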

First we make a transducer that contains the string to weight mapping from our corpus:

$ cat grn.weights | hfst-strings2fst -j -m grn.symbols -o grn.strweights.hfst

The file grn.symbols lists all of the multicharacter symbols (e.g. <n>), one per line:

$ cat grn.symbols
<n>
<gen>
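If you would rather not write the symbol list by hand, something like the following sketch could collect the tags. It assumes that every multicharacter symbol is written in angle brackets, and it only finds the symbols that occur in the lines it is given; the function name multichar_symbols is made up for the example.

```python
import re

def multichar_symbols(lines):
    """Collect <...> tags in order of first appearance, without repeats."""
    seen = []
    for line in lines:
        for tag in re.findall(r"<[^<>\s]+>", line):
            if tag not in seen:
                seen.append(tag)
    return seen

# The two lines of grn.weights from above.
weight_lines = ["jagua<n>:jagua\t0.0198", "ja<n><gen>:jagua\t3.9318"]

symbols = multichar_symbols(weight_lines)
print("\n".join(symbols))
```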

Then we subtract the weighted analyses from the unweighted analyses, which leaves only the analyses that were never seen in the corpus, and give those a large weight (here 10):

$ hfst-subtract -1 grn.gen.hfst -2 grn.strweights.hfst | hfst-reweight -e -a 10 -o grn.unweighted.hfst

And finally we take the union (merging) of the weighted and the unweighted transducers:

$ hfst-union -1 grn.unweighted.hfst -2 grn.strweights.hfst -o grn.weighted.hfst
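To see what the subtract, reweight and union steps achieve, it can help to model the transducers as plain collections of analysis:surface pairs. The sketch below is only a conceptual model under that assumption and says nothing about how HFST works internally; the function name weight_generator and the pair for irũ are hypothetical, added for illustration.

```python
def weight_generator(all_pairs, corpus_weights, default=10.0):
    """Give corpus weights to seen pairs and a large default to the rest."""
    # Like hfst-subtract: pairs in the generator but not in the corpus.
    unweighted = set(all_pairs) - set(corpus_weights)
    # Like hfst-reweight -e -a 10: add the default weight to those pairs
    # (their weight is assumed to be 0 beforehand).
    merged = {pair: default for pair in unweighted}
    # Like hfst-union: put the corpus-weighted pairs back in.
    merged.update(corpus_weights)
    return merged

# Hypothetical generator contents: the two jagua pairs plus one unseen noun.
all_pairs = ["jagua<n>:jagua", "ja<n><gen>:jagua", "irũ<n>:irũ"]
seen = {"jagua<n>:jagua": 0.0198, "ja<n><gen>:jagua": 3.9318}

merged = weight_generator(all_pairs, seen)
for pair in all_pairs:
    print(f"{pair}\t{merged[pair]}")
```

Every pair of the generator survives, so the weighted transducer still covers words that never occurred in the corpus; they simply end up with the worst weight.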

And then invert it to make an analyser:

$ hfst-invert grn.weighted.hfst -o grn.mor.weighted.hfst
$ echo "jagua" | hfst-lookup grn.mor.weighted.hfst
jagua	jagua<n><nom>	0,019800
jagua	ja<n><gen>	3,931800

$ echo "irũ" | hfst-lookup grn.mor.weighted.hfst
irũ	irũ<n><nom>	10,000000

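And once we have a weighted analyser we can use the beam option -b to discard the worse analyses. With -b 1, only analyses whose weight is within 1 of the best one are kept, which here leaves just the analysis with the lowest weight: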

$ echo "jagua" | hfst-lookup -b 1 grn.mor.weighted.hfst
jagua	jagua<n><nom>	0,019800
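If you want to post-process the lookup output yourself, the same selection can be sketched in a few lines. This assumes the three tab-separated columns shown above (surface form, analysis, weight); the decimal comma in the weights presumably comes from the locale, so it is converted before parsing. The function names are made up for the example.

```python
def parse_lookup(output):
    """Parse tab-separated lookup lines into (surface, analysis, weight)."""
    results = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        surface, analysis, weight = fields
        # Accept both "0,019800" and "0.019800".
        results.append((surface, analysis, float(weight.replace(",", "."))))
    return results

def best_analysis(results):
    """Return the entry with the lowest weight."""
    return min(results, key=lambda r: r[2])

# The unfiltered lookup output for jagua shown earlier.
sample = "jagua\tjagua<n><nom>\t0,019800\njagua\tja<n><gen>\t3,931800\n"

best = best_analysis(parse_lookup(sample))
print(best)
```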