hadoop - Does Spark not support ArrayList when writing to Elasticsearch?
I have the following structure:

    mylist = [{"key1": "val1"}, {"key2": "val2"}]
    myrdd = value_counts.map(lambda item: ('key', {'field': mylist}))
I get the following error:

    15/02/10 15:54:08 INFO scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 6) on executor ip-10-80-15-145.ec2.internal: org.apache.spark.SparkException (Data of type java.util.ArrayList cannot be used) [duplicate 1]
    rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes": "localhost",
            "es.port": "9200",
            "es.resource": "mboyd/mboydtype"
        })
What I want the document to end up as when it is written to ES is:

    { "field": [{"key1": "val1"}, {"key2": "val2"}] }
A bit late to the game, but this is the solution we came up with after running into this yesterday: add 'es.input.json': 'true' to the conf, and run json.dumps() on the data.

Modifying the example above, this looks like:
    import json

    rdd = sc.parallelize([{"key1": ["val1", "val2"]}])
    # saveAsNewAPIHadoopFile expects an RDD of (key, value) pairs; serialize each
    # document to a JSON string so elasticsearch-hadoop receives raw JSON instead
    # of having to convert the nested Python list itself.
    json_rdd = rdd.map(lambda doc: ('key', json.dumps(doc)))
    json_rdd.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes": "localhost",
            "es.port": "9200",
            "es.resource": "mboyd/mboydtype",
            "es.input.json": "true"
        })
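For completeness, here is a sketch of the same fix applied to the structure from the question (it assumes sc is an active SparkContext and reuses the mboyd/mboydtype resource from the original conf; the variable names are illustrative). The nested list stays inside a plain dict and the whole document is serialized with json.dumps() before saving, so the resulting ES document is { "field": [{"key1": "val1"}, {"key2": "val2"}] }:

    import json

    mylist = [{"key1": "val1"}, {"key2": "val2"}]

    # Build the document as a plain dict; the nested list is serialized as part of
    # the JSON string, so elasticsearch-hadoop never sees a java.util.ArrayList.
    docs = sc.parallelize([{"field": mylist}])
    json_docs = docs.map(lambda doc: ('key', json.dumps(doc)))

    json_docs.saveAsNewAPIHadoopFile(
        path='-',
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf={
            "es.nodes": "localhost",
            "es.port": "9200",
            "es.resource": "mboyd/mboydtype",
            "es.input.json": "true"
        })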