Amazon Web Services - How can I make my Python code run on the AWS slave nodes using Apache-Spark?
I am learning the Apache Spark interface on AWS. I've created a master node on AWS with 6 slave nodes. I have the following Python code written with Spark:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("print_num").setMaster("aws_master_url")
sc = SparkContext(conf=conf)

# Make the list distributed.
rdd = sc.parallelize([1, 2, 3, 4, 5])

# I want each of the 5 slave nodes to do the mapping work.
temp = rdd.map(lambda x: x + 1)

# I want the other slave node to do the reducing work.
for x in temp.sample(False, 1).collect():
    print x
My question is: how can I set up the 6 slave nodes in AWS such that 5 slave nodes do the mapping work mentioned in the code, and the other slave node does the reducing work? I'd appreciate it if anyone could help me.
From what I understand, you cannot specify that 5 nodes serve as map nodes and 1 as a reduce node within a single Spark cluster.
You could have 2 clusters running, one with 5 nodes running the map tasks and one for the reduce tasks. Then, break the code into 2 different jobs and submit them to the 2 clusters sequentially, writing the results to disk in between. However, this will likely be less efficient than letting Spark handle the shuffle communication.
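For illustration, here is a minimal sketch of that two-job approach. The master URLs and the s3://your-bucket/mapped path are placeholders you would replace with your own cluster addresses and shared storage location:

# Job 1 -- submitted to the cluster doing the map work.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("map_job").setMaster("map_cluster_master_url")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4, 5])
mapped = rdd.map(lambda x: x + 1)
# Persist the intermediate results to shared storage so the second job can read them.
mapped.saveAsTextFile("s3://your-bucket/mapped")
sc.stop()

# Job 2 -- submitted to the cluster doing the reduce work.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("reduce_job").setMaster("reduce_cluster_master_url")
sc = SparkContext(conf=conf)

# Text files are read back as strings, so convert to int before reducing.
total = sc.textFile("s3://your-bucket/mapped").map(int).reduce(lambda a, b: a + b)
print total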
In Spark, the call to .map() is "lazy" in the sense that it does not execute until you call an "action." In your code, that is the call to .collect().
See https://spark.apache.org/docs/latest/programming-guide.html
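As a small illustration of that laziness (just a sketch reusing the sc variable from your code):

# No computation happens here; map() only records the transformation.
temp = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x + 1)

# collect() is an action, so this line is what actually triggers the distributed job.
print temp.collect()   # [2, 3, 4, 5, 6]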
Out of curiosity, is there a reason you want only 1 node to handle the reductions?
Also, based on the documentation, the .sample() function takes 3 parameters. Can you post the stderr and stdout from your code?
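For reference, a typical call looks like the sketch below (withReplacement, fraction, and an optional seed); note that the second argument is a fraction of the data, not a count, and the seed value here is just an example:

# Sample each element with probability 1.0, without replacement, using a fixed seed.
sampled = temp.sample(False, 1.0, 42)
print sampled.collect()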