memory - Text analytics in Python -


I am working with large text data that has millions of lines in it. As a basic step of text analytics, I need to split the text into individual words and store the number of words in each line.

1) Is line.split() an efficient way to split text into words? (I'm not bothered about punctuation.)

2) What is an efficient way to store the word counts? Through arrays/lists/tuples? Which one is faster?

Sorry if this seems basic. I'm just getting started.

Have a look at NLTK for Python.

It handles operations such as tokenization (splitting text into words, including punctuation and other non-trivial cases) efficiently for large files, and provides cool features such as dispersion plots (where words occur in a text) and word counts.

An example of the latter (taken from this NLTK cheatsheet):

>>> import nltk
>>> len(text1)                     # number of words
>>> text1.count("heaven")          # how many times does a word occur?
>>> fd = nltk.FreqDist(text1)      # information about word frequency
>>> fd["the"]                      # how many occurrences of the word 'the'
>>> fd.plot(50, cumulative=False)  # generate a chart of the 50 most frequent words
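If installing NLTK is not an option, the counting part of the example above can be roughly approximated with the standard library. This is only a sketch: it swaps in `collections.Counter` and plain `str.split()` in place of NLTK's tokenizer, and the `sample` string is a made-up stand-in for a real corpus.

```python
from collections import Counter

# Hypothetical sample text standing in for a real corpus.
sample = "the cat sat on the mat and the dog sat too"

# Plain whitespace tokenization (no punctuation handling).
tokens = sample.split()

freq = Counter(tokens)          # maps each word to its number of occurrences
total_words = len(tokens)       # number of words in the text
the_count = freq["the"]         # how many times "the" occurs
top_two = freq.most_common(2)   # the two most frequent words, most frequent first

print(total_words, the_count, top_two)
```

`Counter` has no plotting method, so a chart like the one `fd.plot` draws would need a separate plotting library.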

About the second part of your question: it depends on how you want to further use these numbers. If you're only interested in the raw numbers, a list is fine:

word_count = [len(text1), len(text2), len(text3), ...]

# how many words per text on average?
print(sum(word_count) / len(word_count))
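For the per-line counts asked about in the question, the same list idea might look like the sketch below. The helper name `words_per_line` is hypothetical, and `io.StringIO` stands in for a real file; with an actual file you would pass the object returned by `open(...)` instead.

```python
import io


def words_per_line(lines):
    """Return a list with the number of whitespace-separated words in each line."""
    return [len(line.split()) for line in lines]


# io.StringIO is a stand-in for a real file object (assumption for this sketch).
fake_file = io.StringIO("one two three\n\nfour five\n")

counts = words_per_line(fake_file)   # one integer per line, blank lines count as 0
average = sum(counts) / len(counts)  # mean number of words per line

print(counts, average)
```

Iterating over a file object reads it one line at a time, so only the list of integer counts stays in memory rather than the whole text.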

If you want to store which text has how many words/tokens and want to access them by name, you may be better off with a dictionary:

word_count = {'first text': len(text1), 'second text': len(text2), ...}

# how many words are in the first text?
print(word_count['first text'])

When storing word counts as simple numbers, speed isn't really a matter of which data structure you're using; either a dict or a list is fine.
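As for the first part of the question, a small comparison may help show what "not bothered about punctuation" means in practice. This is a sketch only: the regular expression is a deliberately simple assumption, not NLTK's tokenizer, and the sample `line` is made up.

```python
import re

# Hypothetical sample line.
line = "Hello, world! It's a test."

# str.split() splits on whitespace only, so punctuation stays attached to words.
by_whitespace = line.split()

# A simple (assumed) pattern that keeps runs of letters and apostrophes.
by_regex = re.findall(r"[A-Za-z']+", line)

print(by_whitespace)
print(by_regex)
```

Both approaches give the same number of tokens for this line, so if only the per-line word count matters, plain `split()` is likely enough.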

