Tokenizer

A text segmenter

Tokenizer segments a text into tokens, then into word-forms. Tokens match regular expressions, while word-forms match lexical entries compiled with lexed. A word-form is a concatenation of tokens forming a compound word. Ambiguity between simple and compound words is represented as a directed acyclic graph (DAG).
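For example, if "pomme de terre" is a lexical entry, the segmenter keeps both the compound reading and the simple-token reading as two paths of the DAG (a conceptual illustration only; the actual output format may differ):

pomme - de - terre     (three simple tokens)
pomme de terre         (one compound word-form)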

Downloading
tokenizer

Compilation

Caution: you need lexed >= 4.3.3 to build.
To install under Unix, type:
./configure [--prefix=<directory>] [--with-amalgam] [--with-composition] (run ./configure --help for help)
make
make install
make clean
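
For example, to build and install under a user-local prefix (the path below is only an illustration):

./configure --prefix=$HOME/local
make
make install
make clean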

Use

For help

tokenizer -h

To build an automaton

lexed [ -d <directory> ] [ -p <filename prefix> ] <lexicon1> <lexicon2> ...
Each line of a lexicon contains a word followed by its associated information, separated by a separator character (tabulation or space by default).
The default directory is ".".
The default filename prefix is "lexicon".

To configure the segmenter

You have to edit tokenizer.ll and rebuild.
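
For example, assuming tokenizer.ll is the flex source defining the token regular expressions, a rebuild after editing could look like this (the editor command is only an illustration):

$EDITOR tokenizer.ll
make
make install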

To use the segmenter

tokenizer [ -d <directory> ] [ -p <filename prefix> ] [ --encode <encoding> ] < inputfile > outputfile
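
For example, a complete run over a hypothetical corpus (file names, directory, and encoding value are only illustrations):

lexed -d data -p lexicon nouns.lex
tokenizer -d data -p lexicon --encode UTF-8 < corpus.txt > corpus.dag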