apache spark - How to transpose an RDD in PySpark (list is not a matrix)
I have an RDD that is a list of strings: ['abc', 'ccd', 'xyz'...'axd']
When I run "print rdd.take(2)", I expect it to return ['abc', 'ccd'], but instead it gives me everything. I am new to Spark and Python, so please forgive me if this is a dumb question. Is there a way to transpose a list into rows?
Eventually I need to convert this to a DataFrame and insert it into a Hive table.
Here is a piece of the code:
    domainsrdd = zonerdd.reduceByKey(lambda x, y: x + ' ' + y).map(lambda a: (a[0], a[1].split(' ')))
    print(domainsrdd.take(2))
    # [(u'cool', [u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool', u'vic.cool', u'atl.cool',...... u'booze.cool'])]

    def sampler(l, tldvar):
        tld = l[0]
        domain_data = l[1]
        domains = []
        # number of domains to keep for this TLD
        ct = tldvar.value[tld]
        for item in domain_data:
            domains.extend([item])
            if len(domains) == ct:
                break
        return domains

    domainslist = domainsrdd.map(lambda l: sampler(l, tldvar))
    print(domainslist.take(2))
    # still returns [[u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool',...]]

Long story short, I am trying to loop through a set of domains grouped by TLD and produce a sample of domain names. tldvar is a dictionary that holds the number of domains I need to return for each specific TLD (TLD = com, net, org, etc.).
domainslist is of type RDD[Array[String]], so when you call take, you are going to get an Array[Array[String]]. In your case, that result is filled with arrays that were never limited down, based on what you are saying (len(domains) == ct is never true).
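A minimal sketch of one way to make the limit robust, written in plain Python so it runs without a cluster. It assumes tldvar.value is a dict mapping each TLD to an integer count; the sample data, the limits dict, and the default of 0 for a missing TLD are hypothetical. Slicing with islice caps the list even when an equality check such as len(domains) == ct would never fire (for example, if the count were 0 or not an integer). The two list comprehensions only imitate what map and flatMap would produce on the RDD.

```python
from itertools import islice

def sampler(pair, limits):
    """Return at most limits[tld] domains for one (tld, domains) pair."""
    tld, domains = pair
    # Assumption: a TLD missing from limits gets no sample at all.
    count = int(limits.get(tld, 0))
    # islice stops after count items, so no equality check is needed.
    return list(islice(domains, count))

# Hypothetical stand-ins for domainsrdd.collect() and tldvar.value.
grouped = [
    ("cool", ["shirtmaker.cool", "videocandy.cool", "vic.cool"]),
    ("org", ["example.org", "sample.org"]),
]
limits = {"cool": 2, "org": 1}

# Like rdd.map(...): one list of domains per TLD.
per_tld = [sampler(pair, limits) for pair in grouped]

# Like rdd.flatMap(...): one domain per row, i.e. the list turned into rows.
rows = [domain for pair in grouped for domain in sampler(pair, limits)]
```

On the actual RDD, the same function could presumably be applied as domainsrdd.flatMap(lambda l: sampler(l, tldvar.value)), which yields one domain per element, so take(2) would then return two domain strings rather than two lists.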