hadoop - How to get s3distcp to merge with newlines

I have many millions of small, one-line S3 files that I'm looking to merge together. I have the s3distcp syntax down; however, I've discovered that after merging the files, no newlines are contained in the merged set.

I was wondering whether s3distcp includes an option to force a newline in, or whether there is another way to accomplish this without modifying the source files directly (or copying them and doing the same).

If your text files begin/end with a unique sequence of characters, you can first merge them into a single file with s3distcp (I did this by setting --targetSize to a large number), then use sed with Hadoop streaming to add in the newlines. In the following example, each file contains a single JSON object (the filenames all begin with 0), and the sed command inserts a newline between each instance of }{:

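    # Staging folder for the merged (still newline-free) output.
    hadoop fs -mkdir hdfs:///tmpoutputfolder/

    # Do not pre-create hdfs:///finaloutputfolder: the streaming job
    # refuses to run if its output directory already exists.

    # Step 1: merge every file whose name matches the pattern into one group.
    hadoop jar lib/emr-s3distcp-1.0.jar \
        --src s3://inputfolder \
        --dest hdfs:///tmpoutputfolder \
        --targetSize 1000000000 \
        --groupBy ".*(0).*"

    # Step 2: pass the merged data through sed to split adjacent JSON objects.
    hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
        -D mapred.reduce.tasks=1 \
        -input hdfs:///tmpoutputfolder \
        -output hdfs:///finaloutputfolder \
        -mapper /bin/cat \
        -reducer '/bin/sed "s/}{/}\n{/g"'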
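
Before running the full job, it may help to check the sed expression locally on a small sample. The sketch below is only an illustration: the function name split_json_objects and the sample records are made up for this example, and it assumes GNU sed, which interprets \n in the replacement as a newline (other sed implementations may not).

```shell
#!/bin/sh
# Hypothetical helper (not part of s3distcp or Hadoop): applies the same
# substitution as the streaming reducer above. Assumes GNU sed.
split_json_objects() {
  sed 's/}{/}\n{/g'
}

# Simulate what a merge without separators produces: three JSON objects
# concatenated with nothing between them.
printf '%s' '{"id":1}{"id":2}{"id":3}' | split_json_objects
```

One limitation of matching on }{ is that the same two characters inside a JSON string value would also be split, so this only suits data where that sequence cannot occur within a record.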
