Segmentation

Extract 50 paragraphs of text at random from a Wikipedia dump of a language of your choice and compare two sentence segmenters. You can choose any two segmenters you like (including segmenters you've written yourself!).

Data

You will need the file whose name ends in -pages-articles.xml.bz2. You can find it on the Wikimedia dumps site; for better service, use the Swedish mirror. Choose an XXwiki folder, where XX is the two-letter language code.

To extract the text, you can use WikiExtractor. Then use the segmenters to segment the raw text output into sentences.
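For concreteness, here is a minimal sketch (not a reference solution) of how the comparison might be wired up in Python. It assumes WikiExtractor has written its plain-text output under a directory called extracted/ (one paragraph per line, plus <doc ...> markers), and it uses NLTK's Punkt model and a naive regular-expression splitter purely as example segmenters; the 'swedish' model name is a placeholder for whichever language you pick.

import glob
import random
import re

import nltk

# Which Punkt resource is needed depends on the NLTK version.
for resource in ("punkt", "punkt_tab"):
    nltk.download(resource, quiet=True)

# Collect paragraphs from WikiExtractor's output (assumed layout: extracted/AA/wiki_00, ...).
paragraphs = []
for path in glob.glob("extracted/**/wiki_*", recursive=True):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("<"):   # skip <doc>/</doc> markers
                paragraphs.append(line)

sample = random.sample(paragraphs, 50)

def regex_segmenter(text):
    # Naive baseline: split after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text)

for para in sample:
    punkt_sents = nltk.sent_tokenize(para, language="swedish")
    regex_sents = regex_segmenter(para)
    if punkt_sents != regex_sents:
        print("Punkt:", punkt_sents)
        print("Regex:", regex_sents)
        print()

Printing only the paragraphs where the two segmenters disagree gives a convenient starting point for the qualitative comparison.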

Suggested segmenters:

Report

The comparison should include:

Tokenisation

First download the UD treebank for Japanese (UD_Japanese-GSD) from the Universal Dependencies GitHub repository. Then implement the left-to-right longest-match algorithm (also known as maxmatch). For a description of the algorithm, see Section 3.9.1 in Jurafsky and Martin.
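As a starting point for the dictionary, you can read every surface form off the treebank's training split, i.e. column 2 (FORM) of the CoNLL-U file. A minimal sketch, assuming the repository is cloned locally and the training file keeps its UD name ja_gsd-ud-train.conllu:

def read_vocab(conllu_path):
    # Collect every surface form (FORM, column 2) from a CoNLL-U file.
    vocab = set()
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue                      # skip blank lines and comments
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                      # skip multiword-token and empty-node lines
            vocab.add(cols[1])
    return vocab

vocab = read_vocab("UD_Japanese-GSD/ja_gsd-ud-train.conllu")
print(len(vocab), "dictionary entries")

Dumping this set to a plain word list, one entry per line, gives you a dictionary file in exactly the format used in the example below.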

Hints: you might write a Python program which takes the dictionary file as a command-line argument and reads the text to tokenize from standard input, so that tokenization looks like this:
$ cat > dictionary-file
sentence
to
tokenize
^D

$ echo 'sentence to tokenize.' | python maxmatch.py dictionary-file
sentence
 
to
 
tokenize
.
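One possible shape of such a program, consistent with the transcript above, is sketched below: the dictionary file (one entry per line) is the first command-line argument, the text to tokenize comes from standard input, and every character not covered by a dictionary entry (spaces, punctuation, unknown characters) falls back to a single-character token.

# maxmatch.py -- a minimal sketch of left-to-right longest match.
import sys

def maxmatch(text, vocab, max_len):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])   # no dictionary entry matches: emit one character
            i += 1
    return tokens

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        vocab = {line.rstrip("\n") for line in f if line.strip()}
    max_len = max(map(len, vocab), default=1)
    for line in sys.stdin:
        for token in maxmatch(line.rstrip("\n"), vocab, max_len):
            print(token)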

If you have time, test the algorithm with other treebanks for languages that do not use word separators, e.g. Chinese or Thai.
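To quantify performance, one option is span-level precision and recall against the gold segmentation. The sketch below reuses read_vocab and maxmatch from the sketches above, assumes the test file keeps its UD name ja_gsd-ud-test.conllu, and rebuilds each sentence by simply concatenating its gold forms, ignoring SpaceAfter information; that simplification is harmless for Japanese but not necessarily for every treebank.

def read_sentences(conllu_path):
    # Return a list of sentences, each a list of gold surface forms.
    sents, current = [], []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                if current:
                    sents.append(current)
                    current = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if "-" not in cols[0] and "." not in cols[0]:
                    current.append(cols[1])
    if current:
        sents.append(current)
    return sents

def spans(tokens):
    # Map a token sequence to character-offset spans for span-level scoring.
    out, i = set(), 0
    for t in tokens:
        out.add((i, i + len(t)))
        i += len(t)
    return out

vocab = read_vocab("UD_Japanese-GSD/ja_gsd-ud-train.conllu")
max_len = max(map(len, vocab), default=1)
tp = fp = fn = 0
for gold in read_sentences("UD_Japanese-GSD/ja_gsd-ud-test.conllu"):
    text = "".join(gold)
    pred = maxmatch(text, vocab, max_len)
    g, p = spans(gold), spans(pred)
    tp += len(g & p)
    fp += len(p - g)
    fn += len(g - p)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.3f}  recall={recall:.3f}  "
      f"f1={2 * precision * recall / (precision + recall):.3f}")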

Report

Submit
  1. your implementation of maxmatch,
  2. instructions on how to use it,
  3. a brief description of its performance, with examples to support your findings.