Extract 50 paragraphs of text at random from a Wikipedia dump of a language of your choice and compare two sentence segmenters. You can choose any two segmenters you like (including segmenters you've written yourself!).
You will need the file whose name ends in -pages-articles.xml.bz2. You can find it on the Wikimedia dumps site. For better download speeds, use this Swedish mirror; choose an XXwiki folder, where XX is the 2-letter language code.
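For example, if you chose Swedish (language code sv), the dump could be fetched with something like the line below; the exact URL and dump date on the mirror will differ:
$ wget https://dumps.wikimedia.org/svwiki/latest/svwiki-latest-pages-articles.xml.bz2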
To extract the text, you can use WikiExtractor. Then use the segmenters to segment the raw text output into sentences.
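A typical WikiExtractor invocation looks like the line below (the output directory name is arbitrary, and the exact command differs between WikiExtractor versions):
$ python WikiExtractor.py -o extracted XXwiki-latest-pages-articles.xml.bz2
To pick the 50 random paragraphs, a small script along the following lines should do. It is only a sketch: the script name sample_paragraphs.py and the directory name extracted are placeholders, and it assumes the default WikiExtractor output layout, in which plain text wrapped in <doc> ... </doc> markers is written to files named wiki_* under the output directory.

import glob
import random
import sys

# Usage: python sample_paragraphs.py extracted > paragraphs.txt
paragraphs = []
for path in glob.glob(sys.argv[1] + "/**/wiki_*", recursive=True):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # Skip the <doc ...> / </doc> markers and blank lines;
            # everything else is a paragraph of article text.
            if not line or line.startswith("<doc") or line.startswith("</doc"):
                continue
            paragraphs.append(line)

# random.sample raises an error if there are fewer than 50 paragraphs.
for paragraph in random.sample(paragraphs, 50):
    print(paragraph)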
pragmatic_segmenter requires Ruby, which you should install with your package manager. To install pragmatic_segmenter itself, do the following:
$ git clone https://github.com/diasks2/pragmatic_segmenter
This will check out the code.
$ cd pragmatic_segmenter/lib
You will then need to put the following code into a file; call it segmenter.rb:
require 'pragmatic_segmenter'

# Default to English; pass a two-letter language code as the first argument to override.
lang = ARGV[0] || "en"

STDIN.each_line do |line|
  ps = PragmaticSegmenter::Segmenter.new(text: line, language: lang)
  ps.segment.each do |sentence|
    puts sentence
  end
end

You can then run it like this:
$ echo "This is a test.This is a test. This is a test." | ruby -I . segmenter.rb This is a test. This is a test. This is a test.If you get errors like cannot load such file -- unicode (LoadError), then you need to install ruby-unicode, on apt-based systems, you can use:
$ sudo apt-get install ruby-unicode
Install NLTK with your package manager; on Debian and Ubuntu:
$ sudo apt-get install python3-nltk
For example usage, see this tutorial starting at "How to use sentence tokenize in NLTK?"
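If the tutorial is unavailable, the following is a minimal sketch of sentence segmentation with NLTK's Punkt tokenizer, reading raw text from standard input so it can be run in the same way as segmenter.rb above; the nltk.download call is only needed the first time:

import sys

import nltk
nltk.download('punkt')                  # fetch the pre-trained Punkt models (first run only)
from nltk.tokenize import sent_tokenize

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # sent_tokenize uses the English Punkt model by default; pass
    # language='swedish', language='german', etc. for other languages.
    for sentence in sent_tokenize(line):
        print(sentence)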
The comparison should include:
First download the UD treebank for Japanese (UD_Japanese-GSD) from the Universal Dependencies GitHub repository. Then implement the left-to-right longest-match algorithm (also known as maxmatch). For a description of the algorithm, see Section 3.9.1 in Jurafsky and Martin.
Hints: you might write a Python program which reads the dictionary from a file given as a command-line argument, reads text from standard input, and prints the segmented tokens, for example:
$ cat > dictionary-file
sentence
to
tokenize
^D
$ echo 'sentence to tokenize.' | python maxmatch.py dictionary-file
sentence to tokenize .
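A sketch of maxmatch consistent with the example above might look like this; the dictionary format (one entry per line, for instance the word forms from the UD_Japanese-GSD training data), the skipping of whitespace, and the one-character fallback for unknown material are all assumptions:

import sys

def load_dictionary(path):
    # One dictionary entry per line (an assumed format).
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def maxmatch(text, dictionary, max_len):
    # Left-to-right longest match: at each position take the longest
    # dictionary entry that starts there, otherwise emit a single character.
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():              # skip whitespace in the input
            i += 1
            continue
        match = text[i]                    # fallback: one-character token
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens

if __name__ == "__main__":
    dictionary = load_dictionary(sys.argv[1])
    max_len = max(len(word) for word in dictionary)
    for line in sys.stdin:
        line = line.strip()
        if line:
            print(" ".join(maxmatch(line, dictionary, max_len)))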
If you have time, test the algorithm with other treebanks for languages which do not use word separators, e.g. Chinese, Thai.