machine learning - OpenNLP: Training a custom NER Model for multiple entities
I am trying to train a custom NER model for multiple entities. Here is the sample training data:
count <START:item_type> operating tables <END> on <START:location_id> third <END> <START:location_type> floor <END>
count <START:item_type> items <END> on <START:location_id> third <END> <START:location_type> floor <END>
how many <START:item_type> beds <END> in <START:location_type> room <END> <START:location_id> 2 <END>
The NameFinderME.train(..) method takes a String parameter, type. What is the use of this parameter? And how can I train a model for multiple entities (e.g. item_type, location_type, location_id in my case)?
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.Span;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class OpenNlpTest {

    // main must declare the exception because predict(..) throws it.
    public static void main(String[] args) throws Exception {
        String trainingDataFile = "/home/opennlptest/lib/training_data.txt";
        String outputModelFile = "/tmp/model.bin";
        String sentence = "how many beds in hospital";
        train(trainingDataFile, outputModelFile, "location_type");
        predict(sentence, outputModelFile);
    }

    private static void train(String trainingDataFile, String outputModelFile, String tagToFind) {
        File inFile = new File(trainingDataFile);
        NameSampleDataStream nss = null;
        try {
            nss = new NameSampleDataStream(new PlainTextByLineStream(new java.io.FileReader(inFile)));
        } catch (Exception e) {}

        TokenNameFinderModel model = null;
        int iterations = 100;
        int cutoff = 5;
        try {
            // Does the 'type' parameter mean the entity type I am trying to train the model for?
            // What if I need to train multiple entities?
            model = NameFinderME.train("en", tagToFind, nss, (AdaptiveFeatureGenerator) null,
                    Collections.<String, Object>emptyMap(), iterations, cutoff);
        } catch (Exception e) {}

        try {
            File outFile = new File(outputModelFile);
            FileOutputStream outFileStream = new FileOutputStream(outFile);
            model.serialize(outFileStream);
        } catch (Exception ex) {}
    }

    private static void predict(String sentence, String modelFile) throws Exception {
        // Tokenize the sentence with a pre-trained tokenizer model.
        FileInputStream modelInToken = new FileInputStream("/tmp/en-token.bin");
        TokenizerModel modelToken = new TokenizerModel(modelInToken);
        Tokenizer tokenizer = new TokenizerME(modelToken);
        String tokens[] = tokenizer.tokenize(sentence);

        // Load the custom NER model and print every span it finds.
        FileInputStream modelIn = new FileInputStream(modelFile);
        TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
        NameFinderME nameFinder = new NameFinderME(model);
        Span nameSpans[] = nameFinder.find(tokens);
        double[] spanProbs = nameFinder.probs(nameSpans);
        for (int i = 0; i < nameSpans.length; i++) {
            System.out.println(nameSpans[i]);
        }
    }
}
The type argument to NameFinderME.train is used as the default type when the training data does not include a type parameter. It is only relevant if you have a sample that looks like this:
<START> operating tables <END>
instead of this:
<START:item_type> operating tables <END>
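To make the fallback behaviour concrete, here is a minimal plain-Java sketch of the idea. It does not call OpenNLP and is not how the library parses samples internally; the class name DefaultTypeSketch, the method annotatedSpans, and the regular expression are all hypothetical, chosen only to show that an untyped tag picks up the default type while a typed tag keeps its own. It assumes the uppercase tag spelling that OpenNLP's training format uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DefaultTypeSketch {
    // Matches <START> or <START:some_type>, then the entity text, then <END>.
    // Hypothetical pattern for illustration only.
    private static final Pattern SPAN =
        Pattern.compile("<START(?::([^>]+))?>\\s*(.*?)\\s*<END>");

    /** Returns "type=text" for each annotated span; untyped tags get defaultType. */
    static List<String> annotatedSpans(String line, String defaultType) {
        List<String> result = new ArrayList<>();
        Matcher m = SPAN.matcher(line);
        while (m.find()) {
            // group(1) is null when the tag carries no ":type" suffix.
            String type = m.group(1) != null ? m.group(1) : defaultType;
            result.add(type + "=" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Untyped tag: the default type is applied.
        System.out.println(annotatedSpans("count <START> operating tables <END>", "item_type"));
        // Typed tag: the default type is ignored.
        System.out.println(annotatedSpans("on <START:location_id> third <END> floor", "item_type"));
    }
}
```

In this sketch, the value passed as defaultType plays the role of the type argument: it only matters for samples whose tags carry no type of their own.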
To train multiple types of entities, the developer documentation says:
A training file can contain multiple types. If the training file contains multiple types, the created model will be able to detect these multiple types. It is recommended to train single type models, since multi type support is still experimental.
So you could try training on the sample in your question, which includes multiple types, and see how well it works. In this mailing list message, someone asks about the status of training multiple types and gets this answer:
The code path is stable; the reason we put that there is that we didn't have good performance on English data.
Anyway, the performance there might highly depend on your data set and language.
If you don't like the performance of a model that handles multiple types, the alternative is to create multiple copies of the training data, with each copy modified to include only one type. Then train a separate model on each set of training data. At that point you should have (for example) an item_type model, a location_type model, and a location_id model. You can then run your input through each model to detect the different types.
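The copying step can be sketched in plain Java without touching OpenNLP. The sketch below is only an illustration under some assumptions: every annotation is typed and uses the uppercase tag spelling of OpenNLP's training format, and the names SingleTypeCopies, keepOnly and copiesPerType are hypothetical helpers, not part of any library. For each type it keeps that type's tags and replaces every other annotation with its bare words.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleTypeCopies {
    // Matches one typed annotation: <START:type> entity text <END>.
    private static final Pattern SPAN =
        Pattern.compile("<START:([^>]+)>\\s*(.*?)\\s*<END>");

    /** Keeps annotations of keepType and replaces all others with their bare text. */
    static String keepOnly(String line, String keepType) {
        Matcher m = SPAN.matcher(line);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1).equals(keepType)
                ? m.group()    // keep the whole annotation untouched
                : m.group(2);  // drop the tags, keep the words
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Builds one single-type copy of the training lines for each entity type. */
    static Map<String, List<String>> copiesPerType(List<String> lines, List<String> types) {
        Map<String, List<String>> copies = new LinkedHashMap<>();
        for (String type : types) {
            List<String> copy = new ArrayList<>();
            for (String line : lines) {
                copy.add(keepOnly(line, type));
            }
            copies.put(type, copy);
        }
        return copies;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "how many <START:item_type> beds <END> in <START:location_type> room <END> <START:location_id> 2 <END>");
        copiesPerType(lines, List.of("item_type", "location_type", "location_id"))
            .forEach((type, copy) -> System.out.println(type + ": " + copy));
    }
}
```

Each list in the resulting map could then be written to its own file and used to train one single-type model.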