nlp - Bad tokenization in Stanford POS Tagger


I'm trying to use the Stanford POS Tagger to tag French text. To do that, I use the following command:

cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/french.tagger -sentenceDelimiter newline > output.txt

(There is one sentence per line.)

But I noticed that the tags are pretty bad, and the real issue comes from the French tokenization itself. I think the tokenization is being done by the English tokenizer.

So I tried to tokenize the text in French by doing this:

cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.international.french.process.FrenchTokenizer -sentenceDelimiter newline > tokenized.txt

And there the French tokens are good.

How can I tell the tagger to use the French model for tagging and the French tokenizer at the same time?

You can use the -tokenizerFactory and -tokenizerOptions flags to control tokenization. The "Tagging and Testing from the command line" section of the Javadoc for MaxentTagger has a complete list of available options.

I believe the following command will do what you want:

java -mx10000m -cp 'stanford-postagger.jar:' \
  edu.stanford.nlp.tagger.maxent.MaxentTagger \
  -model models/french.tagger \
  -tokenizerFactory 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory' \
  -sentenceDelimiter newline
