python - Flatten an RDD in PySpark
I am trying to process data using PySpark. The following is sample code:
    rdd = sc.parallelize([[u'9', u'9', u'hf', u'63300001', u'in hf', u'03/09/2004', u'9', u'hf'],
                          [u'10', u'10', u'hf', u'63300001', u'in hf', u'03/09/2004', u'9', u'hf']])
    out = rdd.map(lambda l: (l[0:3], str(l[3]).zfill(8)[:4], l[4:]))
    out.take(2)

    [([u'9', u'9', u'hf'], '6330', [u'in hf', u'03/09/2004', u'9', u'hf']),
     ([u'10', u'10', u'hf'], '6330', [u'in hf', u'03/09/2004', u'9', u'hf'])]

Expected output:

    [[u'9', u'9', u'hf', '6330', u'in hf', u'03/09/2004', u'9', u'hf'],
     [u'10', u'10', u'hf', '6330', u'in hf', u'03/09/2004', u'9', u'hf']]
Is there a method to flatten an RDD in Spark?
You don't need anything Spark-specific here. This should be more than enough:
    out = rdd.map(lambda l: l[0:3] + [str(l[3]).zfill(8)[:4]] + l[4:])
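For reference, here is a minimal self-contained sketch of the same approach, assuming a local SparkContext (the variable names mirror the question):

    from pyspark import SparkContext

    sc = SparkContext("local", "flatten-rdd-example")

    rdd = sc.parallelize([
        [u'9', u'9', u'hf', u'63300001', u'in hf', u'03/09/2004', u'9', u'hf'],
        [u'10', u'10', u'hf', u'63300001', u'in hf', u'03/09/2004', u'9', u'hf'],
    ])

    # Concatenating the list slices keeps each row as one flat list
    # instead of a (list, str, list) tuple.
    out = rdd.map(lambda l: l[0:3] + [str(l[3]).zfill(8)[:4]] + l[4:])

    print(out.take(2))
    # [['9', '9', 'hf', '6330', 'in hf', '03/09/2004', '9', 'hf'],
    #  ['10', '10', 'hf', '6330', 'in hf', '03/09/2004', '9', 'hf']]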
Destructuring inside the lambda would be more readable, though. I mean something like this:
    rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])
    rdd.map(lambda (x, y, z): (x, str(y).zfill(8), z))
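Note that tuple parameter unpacking in a lambda, as in lambda (x, y, z): ..., is Python 2-only syntax; it was removed in Python 3 (PEP 3113). A sketch of an equivalent for Python 3, assuming the same three-column rows, unpacks inside a named function instead:

    def reformat(row):
        # Unpack the row explicitly, since Python 3 lambdas cannot destructure arguments.
        x, y, z = row
        return (x, str(y).zfill(8), z)

    rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])
    rdd.map(reformat).collect()
    # [(1, '00000002', 3), (4, '00000005', 6)]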