apache spark - How can I find the size of an RDD?
I have an RDD[Row] that needs to be persisted to a third-party repository. This third-party repository accepts a maximum of 5 MB per call.

So I want to create partitions based on the size of the data present in the RDD, not based on the number of rows in it.

How can I find the size of an RDD and create partitions based on it?
As Justin and Wang mentioned, it is not straightforward to get the size of an RDD; we can only estimate it.

We can sample the RDD and then use SizeEstimator to get the size of the sample. As Wang and Justin mentioned, extrapolate from data sampled offline: if X rows took Y GB offline, then Z rows at runtime may take about Z*Y/X GB.
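For instance, with made-up numbers purely to illustrate the formula: if an offline sample of X = 1,000 rows measures Y = 0.1 GB, then Z = 50,000 rows at runtime may take roughly 50,000 * 0.1 / 1,000 = 5 GB.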
Here is sample Scala code to get a size estimate of an RDD.

I am new to Scala and Spark, so the sample below may be written in a better way:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.util.SizeEstimator

def getTotalSize(rdd: RDD[Row]): Long = {
  // This can be a parameter
  val NO_OF_SAMPLE_ROWS = 10L
  val totalRows = rdd.count()
  var totalSize = 0L
  if (totalRows > NO_OF_SAMPLE_ROWS) {
    // Sample roughly NO_OF_SAMPLE_ROWS rows and extrapolate to the full RDD
    val fraction = NO_OF_SAMPLE_ROWS.toDouble / totalRows
    val sampleRDD = rdd.sample(withReplacement = true, fraction)
    val sampleCount = sampleRDD.count()
    val sampleRDDSize = getRDDSize(sampleRDD)
    totalSize = sampleRDDSize * totalRows / math.max(sampleCount, 1L)
  } else {
    // The RDD is smaller than the sample row count, so measure it directly
    totalSize = getRDDSize(rdd)
  }
  totalSize
}

def getRDDSize(rdd: RDD[Row]): Long = {
  var rddSize = 0L
  val rows = rdd.collect()
  for (i <- 0 until rows.length) {
    // Estimate the in-memory size of each row's values
    rddSize += SizeEstimator.estimate(rows(i).toSeq.map(value => value.asInstanceOf[AnyRef]))
  }
  rddSize
}
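The code above only estimates the size; it does not yet create the partitions asked about in the question. A minimal sketch of one way to do that, assuming the getTotalSize helper above and the 5 MB per-call limit from the question (maxBytesPerCall and partitionsFor are names I made up, not Spark APIs):

// 5 MB per-call limit of the third-party repository (assumed from the question)
val maxBytesPerCall = 5L * 1024 * 1024

// Choose a partition count so that, on average, each partition stays under the limit
def partitionsFor(rdd: RDD[Row]): Int =
  math.max(1, math.ceil(getTotalSize(rdd).toDouble / maxBytesPerCall).toInt)

// Usage: repartition before pushing each partition to the repository
// val repartitioned = rdd.repartition(partitionsFor(rdd))

Note that repartition balances partitions roughly by row count, so an individual partition can still exceed the limit if row sizes vary a lot, and the size estimate itself is approximate.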