hadoop - How to get s3distcp to merge with newlines
I have many millions of small, one-line S3 files that I'm looking to merge together. I have the s3distcp syntax down; however, I've discovered that after merging the files, no newlines are contained in the merged set.
I was wondering if s3distcp includes an option to force a newline in, or if there is another way to accomplish this without modifying the source files directly (or copying them and doing the same).
If your text files begin/end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a very large number), then use sed with Hadoop streaming to add in the new lines. In the following example, each file contains a single JSON object (the filenames all begin with 0), and the sed command inserts a newline between each instance of }{:
    hadoop fs -mkdir hdfs:///tmpoutputfolder/
    hadoop fs -mkdir hdfs:///finaloutputfolder/

    hadoop jar lib/emr-s3distcp-1.0.jar \
      --src s3://inputfolder \
      --dest hdfs:///tmpoutputfolder \
      --targetSize 1000000000 \
      --groupBy ".*(0).*"

    hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
      -D mapred.reduce.tasks=1 \
      --input hdfs:///tmpoutputfolder \
      --output hdfs:///finaloutputfolder \
      --mapper /bin/cat \
      --reducer '/bin/sed "s/}{/}\n{/g"'
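Before running the full job, it may help to check the sed substitution locally on a small sample. The sketch below is only an illustration: the helper name split_json and the sample records are made up, and it assumes GNU sed, which treats \n in the replacement as a newline (other sed implementations may not). It also assumes the sequence }{ never occurs inside a JSON string value, since such a value would be split as well.

```shell
# Hypothetical helper (not part of s3distcp or Hadoop): put each concatenated
# JSON object on its own line by inserting a newline between "}" and "{".
# Assumes GNU sed, where \n in the replacement means a newline.
split_json() {
  sed 's/}{/}\n{/g'
}

# Made-up sample: three one-line files merged back to back with no newlines,
# which is the situation described in the question.
printf '%s' '{"id":1}{"id":2}{"id":3}' | split_json
```

With GNU sed this should print each of the three objects on its own line. The same expression is what the streaming reducer above applies to the merged file.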