apache spark - How to transpose an RDD in PySpark (list is not a matrix)


I have an RDD that is a list of strings: ['abc', 'ccd', 'xyz'...'axd']

When I run "print rdd.take(2)", I expect it to return ['abc', 'ccd'], but instead it gives me everything. I am new to Spark and Python, so please forgive me if this is a dumb question. Is there a way to transpose the list into rows?

Eventually I need to convert this to a DataFrame and insert it into a Hive table.

Here is a piece of the code:

    domainsrdd = zonerdd.reduceByKey(lambda x, y: x + ' ' + y).map(lambda a: (a[0], a[1].split(' ')))

    print(domainsrdd.take(2))
    # [(u'cool', [u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool', u'vic.cool', u'atl.cool',...... u'booze.cool'])]

    def sampler(l, tldvar):
        # l is a (tld, [domains]) pair; tldvar.value maps each tld to a count
        tld = l[0]
        domain_data = l[1]
        domains = []
        ct = tldvar.value[tld]
        for item in domain_data:
            domains.append(item)
            # stop once the requested number of domains has been collected
            if len(domains) == ct:
                break
        return domains

    domainslist = domainsrdd.map(lambda l: sampler(l, tldvar))

    print(domainslist.take(2))
    # still returns
    # [[u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool',...]]

Long story short, I am trying to loop through a set of domains grouped by TLD and produce a sample of domain names. tldvar is a dictionary that holds how many domains I need to return for a specific TLD (TLD = com, net, org, etc.).
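One way the sampling step described above could be written is with a list slice instead of a manual loop. This is only a sketch: the name sample_domains, the int() conversion, and the sample data are assumptions for illustration, not part of the original code.

```python
def sample_domains(pair, limits):
    """Return at most limits[tld] domains for one (tld, domains) pair."""
    tld, domain_data = pair
    # int() guards against counts that were loaded as strings (an assumption).
    ct = int(limits[tld])
    # Slicing caps the list at ct items and also copes with shorter lists.
    return domain_data[:ct]


# Hypothetical record shaped like the domainsrdd output shown above.
record = ('cool', ['shirtmaker.cool', 'videocandy.cool', 'vic.cool'])
limits = {'cool': 2}  # plays the role of tldvar.value
print(sample_domains(record, limits))
```

In the Spark job this would presumably be applied as domainsrdd.map(lambda l: sample_domains(l, tldvar.value)), which still yields one list per TLD.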

domainslist is of type RDD[Array[String]], so when you call take, you are going to get an Array[Array[String]]. In your case it is filled with arrays that were never limited down, based on what you are saying (len(domains) == ct is never true).
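The difference between taking rows and taking individual domains can be modelled in plain Python, without a Spark cluster. The sketch below is only a stand-in: the nested list plays the role of domainslist, slicing plays the role of take, and itertools.chain plays the role of flatMap. The domain names are made-up sample data.

```python
from itertools import chain

# Plain-Python stand-in for the contents of domainslist: one list per TLD.
rows = [
    ['shirtmaker.cool', 'videocandy.cool', 'vic.cool'],
    ['example.com', 'sample.com'],
]

# take(2) on an RDD of lists behaves like slicing the outer list:
# it returns two whole lists, not two domain names.
first_two_rows = rows[:2]

# flatMap(lambda x: x) behaves like chaining the inner lists together,
# so each domain becomes its own element.
flat = list(chain.from_iterable(rows))
first_two_domains = flat[:2]

print(first_two_rows)
print(first_two_domains)
```

On the real RDD, the equivalent call would be domainslist.flatMap(lambda x: x).take(2), which should return two domain strings rather than two lists.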

