nlp - Bad tokenization in Stanford POS Tagger
I'm trying to use the Stanford POS Tagger to tag French text. To do that, I use the following command:
cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/french.tagger -sentenceDelimiter newline > output.txt
(There is one sentence per line.)
But I noticed that the tags are pretty bad, and that the real issue comes from the French tokenization itself. I think the tokenization is being done by the English tokenizer.
So I tried to tokenize the text in French by doing this:
cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.international.french.process.FrenchTokenizer -sentenceDelimiter newline > tokenized.txt
And there the French tokens are good.
How can I tell the tagger to use the French model for tagging and the French tokenizer at the same time?
You can use the -tokenizerFactory and -tokenizerOptions flags to control tokenization. The "Tagging and Testing from the command line" section of the Javadoc for MaxentTagger has a complete list of available options.
I believe the following command will do what you want:
java -mx10000m -cp 'stanford-postagger.jar:' \
    edu.stanford.nlp.tagger.maxent.MaxentTagger \
    -model models/french.tagger \
    -tokenizerFactory 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory' \
    -sentenceDelimiter newline
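If you want to run that command from a script rather than by hand, here is a minimal Python sketch that builds the same argument list and pipes a file through it. The function names (build_tagger_command, tag_file) and the default paths are my own, purely for illustration; the flags and class names are the ones from the command above. It assumes java is on your PATH and a Unix-style classpath separator (":").

```python
import subprocess

# Inner-class name of the French tokenizer factory. Passed as a list element,
# so no shell is involved and the "$" needs no quoting here.
TOKENIZER_FACTORY = (
    "edu.stanford.nlp.international.french.process."
    "FrenchTokenizer$FrenchTokenizerFactory"
)


def build_tagger_command(jar="stanford-postagger.jar",
                         model="models/french.tagger",
                         heap="10000m"):
    """Return the argv list equivalent to the tagger command above.

    The default jar and model paths are assumptions copied from the question.
    """
    return [
        "java", f"-mx{heap}",
        "-cp", f"{jar}:",
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,
        "-tokenizerFactory", TOKENIZER_FACTORY,
        "-sentenceDelimiter", "newline",
    ]


def tag_file(input_path, output_path):
    """Feed input_path to the tagger on stdin and write its stdout to output_path."""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        subprocess.run(build_tagger_command(), stdin=src, stdout=dst, check=True)
```

Calling tag_file("file.txt", "output.txt") would then be the equivalent of piping file.txt into the command and redirecting to output.txt. On the shell command line, the single quotes around the factory class name matter: without them the shell would try to expand $FrenchTokenizerFactory as a variable.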