apache spark - RDD with no Partitioner & Partition size
I am curious to know the relation between an RDD with no partitioner and its partition size. Take the example of the map() transformation: the returned RDD has no partitioner (as expected).
scala> val input = sc.parallelize(List(1, 2, 2, 3))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val sum = input.map(x => x + 1)
sum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23

scala> sum.partitioner
res0: Option[org.apache.spark.Partitioner] = None
But when I try to find the partition size, I see a partition size of 8:
scala> sum.partitions.size
res1: Int = 8
Considering there is no partitioner for the RDD sum, I was expecting a partition size of 1 (i.e., no partitioning). How can it have a partition size > 1 without a partitioner for the RDD (sum)?
If the partitioner is None, it means that the partitioning is not based upon any characteristic of the data; the distribution is random but guaranteed to be uniform across nodes. That does not mean there is only one partition: partitions.size can still be > 1, because the partition count is a property separate from the partitioner. Here, sc.parallelize split the data across spark.default.parallelism partitions (8 on this machine, typically the number of available cores), and map simply preserved its parent's partition count while carrying no partitioner.
Reference: https://techmagie.wordpress.com/2015/12/19/understanding-spark-partitioning/
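To see that the partition count and the partitioner are set independently, here is a minimal sketch (assuming a Spark shell, so sc is the usual SparkContext; the choice of 4 partitions and a HashPartitioner is purely illustrative):

import org.apache.spark.HashPartitioner

// Explicitly ask for 4 partitions; there is still no partitioner.
val rdd = sc.parallelize(List(1, 2, 2, 3), 4)
println(rdd.partitions.size)    // 4
println(rdd.partitioner)        // None

// map() keeps the parent's partition count but attaches no partitioner.
val mapped = rdd.map(x => x + 1)
println(mapped.partitions.size) // 4
println(mapped.partitioner)     // None

// Only key-based operations such as partitionBy attach a partitioner.
val keyed = rdd.map(x => (x, 1)).partitionBy(new HashPartitioner(4))
println(keyed.partitions.size)  // 4
println(keyed.partitioner)      // Some(org.apache.spark.HashPartitioner@...)

You can also check sc.defaultParallelism to see where the 8 in the original session came from.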