apache spark - RDD with no Partitioner & partition size -


I am curious to know the relation between an RDD having no partitioner and its partition size. Take the example of the map() transformation: the returned RDD has no partitioner (as expected).

scala> val input = sc.parallelize(List(1, 2, 2, 3))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> val sum = input.map(x => x + 1)
sum: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:23

scala> sum.partitioner
res0: Option[org.apache.spark.Partitioner] = None

When I try to find the partition size, I see that it is 8:

scala> sum.partitions.size
res1: Int = 8

Considering there is no partitioner for the RDD sum, I was expecting a partition size of 1 (i.e., no partitioning). How can the RDD sum have a partition size > 1 without a partitioner?

If the partitioner is None, it means that the partitioning is not based upon a characteristic of the data, but the distribution is random and guaranteed to be uniform across nodes. So partitions.size can still be > 1: the number of partitions is determined by how the RDD was created (here, parallelize() without a numSlices argument uses the default parallelism), not by the presence of a partitioner.
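To illustrate, here is a minimal sketch, assuming a local[*] spark-shell on an 8-core machine (the exact RDD ids and res numbers in the output will differ). parallelize() splits the data across sc.defaultParallelism partitions by default, and a partitioner only appears after a key-based transformation such as partitionBy:

scala> sc.defaultParallelism  // parallelize() uses this many partitions by default; on local[*] it is the core count
res2: Int = 8

scala> val pairs = input.map(x => (x, x)).partitionBy(new org.apache.spark.HashPartitioner(4))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[2] at partitionBy at <console>:23

scala> pairs.partitioner  // key-based transformations set a partitioner
res3: Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@4)

scala> pairs.map(identity).partitioner  // map() may change the keys, so the partitioner is dropped
res4: Option[org.apache.spark.Partitioner] = None

Note that map() drops the partitioner even on an already-partitioned RDD, because the mapped function may change the keys; mapValues() preserves it for exactly that reason.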

Reference: https://techmagie.wordpress.com/2015/12/19/understanding-spark-partitioning/

