apache spark - How can I find the size of an RDD?


I have an RDD[Row] that needs to be persisted to a third-party repository. This third-party repository accepts a maximum of 5 MB per call.

So I want to create partitions based on the size of the data present in the RDD, not based on the number of rows it contains.

How can I find the size of an RDD and create partitions based on it?

As Justin and Wang mentioned, it is not straightforward to get the size of an RDD. We can just estimate it.

We can sample the RDD and then use SizeEstimator to get the size of the sample. As Wang and Justin mentioned, the estimate is based on extrapolating from data sampled offline: say X rows used Y GB offline, then Z rows at runtime may take Z*Y/X GB.
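As a purely illustrative sketch of that extrapolation, the numbers below are made up and not from the question:

// Hypothetical numbers, only to illustrate the Z * Y / X extrapolation
val offlineRows  = 1000L                // X: rows measured offline
val offlineBytes = 64L * 1024 * 1024    // Y: their measured size (64 MB)
val runtimeRows  = 50000L               // Z: rows in the RDD at runtime

// Estimated size at runtime: Z * Y / X
val estimatedBytes = runtimeRows * offlineBytes / offlineRows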

Here is sample Scala code to get the size/estimate of an RDD.

I am new to Scala and Spark, so the sample below may be written in a better way:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

def getTotalSize(rdd: RDD[Row]): Long = {
  // This can be a parameter
  val NO_OF_SAMPLE_ROWS = 10L
  val totalRows = rdd.count()
  var totalSize = 0L
  if (totalRows > NO_OF_SAMPLE_ROWS) {
    // sample() takes a fraction, not a row count, so convert the desired
    // sample size into a fraction of the total row count
    val fraction = NO_OF_SAMPLE_ROWS.toDouble / totalRows
    val sampleRDD = rdd.sample(withReplacement = true, fraction)
    val sampleRows = math.max(sampleRDD.count(), 1L)
    val sampleRDDSize = getRDDSize(sampleRDD)
    // Extrapolate the sampled size to the full RDD
    totalSize = sampleRDDSize * totalRows / sampleRows
  } else {
    // The RDD is smaller than the sample row count, so we can just calculate the total RDD size
    totalSize = getRDDSize(rdd)
  }

  totalSize
}

def getRDDSize(rdd: RDD[Row]): Long = {
  var rddSize = 0L
  val rows = rdd.collect()
  for (i <- 0 until rows.length) {
    // Estimate the in-memory size of this row's values
    rddSize += SizeEstimator.estimate(rows(i).toSeq.map(value => value.asInstanceOf[AnyRef]))
  }

  rddSize
}
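Going back to the original question, once the size is estimated, the 5 MB-per-call limit could be handled roughly as sketched below. The maxBytesPerCall value, the partition-count calculation, and the foreachPartition loop are my own assumptions and not part of the answer above; note also that SizeEstimator measures in-memory size, which is only a rough proxy for the serialized payload actually sent to the repository.

// Rough sketch (assumptions, not from the original answer): split the RDD so
// that each partition is expected to stay under the 5 MB per-call limit
val maxBytesPerCall = 5L * 1024 * 1024

val totalSize = getTotalSize(rdd)
// Round up so no partition is expected to exceed the limit
val numPartitions = math.max(1L, (totalSize + maxBytesPerCall - 1) / maxBytesPerCall).toInt

rdd.repartition(numPartitions).foreachPartition { partition =>
  val rows = partition.toSeq
  // rows for this partition would be sent to the third-party repository in a
  // single call here; the actual client call is not shown in the question
}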
