python - Using semantic word representations (e.g. word2vec) to build a classifier
I want to build a classifier that automatically categorizes forum posts into defined categories (so multiclass classification, not binary classification) using semantic word representations. For this task I want to make use of word2vec and doc2vec, and check the feasibility of using these models to support fast selection of training data for the classifier. At the moment I have tried both models and they work like a charm. However, I do not want to manually label each sentence with the class it describes; I want to leave that task to the word2vec or doc2vec models. So, my question is: which algorithm can I use in Python for such a classifier? (I am thinking of applying clustering on the word2vec or doc2vec vectors and then manually labelling each cluster, but this requires time and is not the best solution.) Previously, I made use of LinearSVC (from svm) and OneVsRestClassifier; however, I labeled each sentence manually (by building the target vector "y_train") in order to predict which class a new test sentence belongs to. Which algorithm and method in Python should I use for this type of classifier (making use of semantic word representations of the training data)?
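For illustration, here is a minimal sketch of the clustering idea I mean, assuming gensim 4.x and scikit-learn; the toy corpus and the number of clusters are placeholders:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Toy corpus standing in for the forum posts (placeholder data)
posts = [
    "how do i reset my password",
    "the login page keeps timing out",
    "which graphics card works best for gaming",
    "recommendations for a budget gaming laptop",
]

# Train a small Doc2Vec model on the posts
tagged = [TaggedDocument(words=p.split(), tags=[i]) for i, p in enumerate(posts)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Cluster the document vectors; each cluster would then be labelled by hand
vectors = [model.dv[i] for i in range(len(posts))]
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)
print(list(zip(posts, kmeans.labels_)))
```

Once each cluster has a label, a supervised model such as LinearSVC wrapped in OneVsRestClassifier could be trained on the same document vectors, which is roughly what I did before with manual labels.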
The issue with things like word2vec/doc2vec and so on is that they are unsupervised and rely on context. So, for example, if you have the sentences "today is a hot day" and "today is a cold day", the model thinks "hot" and "cold" are very similar and should be in the same cluster.
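To illustrate the point (not part of the original answer), a small pretrained embedding loaded through gensim's downloader shows the effect; the model name below is just one of the downloadable options:

```python
import gensim.downloader as api

# Small pretrained GloVe vectors (~66 MB); any pretrained embedding shows the same effect
vectors = api.load("glove-wiki-gigaword-50")

print(vectors.similarity("hot", "cold"))      # high similarity despite opposite meanings
print(vectors.most_similar("hot", topn=5))    # "cold" usually appears near the top
```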
This makes for pretty bad tagging. Either way, there are implementations of doc2vec and word2vec in the gensim module for Python - you can use the prebuilt binary trained on the Google News dataset and test whether you get meaningful clusters.
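A rough sketch of that check, assuming you have downloaded the GoogleNews-vectors-negative300.bin file and have scikit-learn available; the file path, vocabulary slice, and cluster count are placeholders:

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Load the prebuilt Google News word2vec binary (path is a placeholder)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Cluster a slice of the vocabulary and eyeball whether the clusters are meaningful
words = wv.index_to_key[:5000]          # most frequent 5000 words (gensim 4.x attribute)
X = wv[words]
kmeans = KMeans(n_clusters=20, random_state=0, n_init=10).fit(X)

# Print a few words from each cluster for manual inspection
for c in range(5):
    members = [w for w, lbl in zip(words, kmeans.labels_) if lbl == c][:10]
    print(c, members)
```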
The other way is to implement a simple Lucene/Solr system on your computer and begin by tagging a few sentences randomly. Over time Lucene/Solr will suggest tags that clearly fit a document, and you can end up with pretty decent tags if your data is not too bad.
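One hedged sketch of that workflow using pysolr against a local Solr core (the core name, URL, and schema fields are assumptions): index the posts, hand-tag a few, then for an untagged post search for similar already-tagged posts and take their tags as suggestions. Solr's MoreLikeThis handler can serve the same purpose more directly.

```python
import pysolr

# Local Solr core; the URL and the "text"/"tags" schema fields are assumptions
solr = pysolr.Solr("http://localhost:8983/solr/posts", always_commit=True)

# Index posts, a handful of them hand-tagged
solr.add([
    {"id": "1", "text": "how do i reset my password", "tags": ["account"]},
    {"id": "2", "text": "the login page keeps timing out", "tags": ["account"]},
    {"id": "3", "text": "which graphics card works best for gaming"},   # untagged
])

# For an untagged post, find similar posts that already have tags and suggest those tags
untagged_text = "which graphics card works best for gaming"
results = solr.search(f"text:({untagged_text})", fq="tags:[* TO *]", rows=5)
suggested = [tag for doc in results for tag in doc.get("tags", [])]
print(suggested)
```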
The issue here is that the problem you're trying to solve isn't particularly easy, nor is it fully solvable - if you have good, clear data you may be able to auto-classify 80-90% of it, but if it is bad you won't be able to auto-classify much.