regex - Pig - Remove embedded newlines and commas in gzip files
I have a gzip file whose data fields are separated by commas. I am using PigStorage to load the file, as shown below:
A = LOAD 'myfile.gz' USING PigStorage(',') AS (id, date, text);
The data in the gzip file has embedded characters: embedded newlines and commas. These characters exist in all three fields (id, date, and text). The embedded characters are always within "" quotes.
I would like to replace or remove these characters using Pig before doing any further processing.
I think I need to look for the first occurrence of the "" quotes. Once I find these quotes, I need to look at the string within them and search for commas and newline characters in it. Once found, I need to replace them with a space or remove them.
How can I achieve this via Pig?
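The idea described above (find each quoted string, then replace the commas and newlines inside it) can be sketched outside Pig to make the intended behaviour concrete. This is a minimal Python illustration, not Pig code; the function name `clean_quoted`, the sample record, and the assumption that a doubled "" inside a field is an escaped quote are all hypothetical.

```python
import re

# Matches one double-quoted field; a doubled "" inside it is assumed
# to be an escaped quote (an assumption, not stated in the question).
QUOTED = re.compile(r'"(?:[^"]|"")*"')


def clean_quoted(raw, replacement=" "):
    """Replace runs of commas/newlines that occur inside quoted fields."""
    def _clean(match):
        # Only the text of the quoted field is rewritten; separators
        # outside the quotes are left untouched.
        return re.sub(r"[,\r\n]+", replacement, match.group(0))
    return QUOTED.sub(_clean, raw)


# Hypothetical sample: one record whose quoted fields contain a comma
# and an embedded newline.
sample = '"1,0","2015-01-01","hello,\nworld"\n'
print(clean_quoted(sample))
```

Passing `replacement=""` removes the characters instead of replacing them with a space.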
Try this:
REGISTER piggybank.jar;
A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (id:chararray, date:chararray, text:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(id,'\n',''),',','') AS id, REPLACE(REPLACE(date,'\n',''),',','') AS date, REPLACE(REPLACE(text,'\n',''),',','') AS text;
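For readers who want to check the expected result of this load-then-replace approach on a small sample, here is a rough Python analogue. It is only a sketch: Python's `csv.reader` stands in for the quote-aware CSV loader, the nested `str.replace` calls mirror the nested REPLACE calls, and the function name `load_and_clean` and the sample data are hypothetical.

```python
import csv
import io


def load_and_clean(raw):
    """Parse quote-aware CSV text, then strip newlines and commas from each field."""
    # newline="" keeps embedded newlines inside quoted fields intact for the parser.
    reader = csv.reader(io.StringIO(raw, newline=""))
    return [
        [field.replace("\n", "").replace(",", "") for field in row]
        for row in reader
    ]


# Hypothetical sample: the first record has a comma and a newline inside quotes.
sample = '"1,0","2015-01-01","hello,\nworld"\n2,2015-01-02,plain\n'
for row in load_and_clean(sample):
    print(row)
```

The key point is the order of operations: the quoted fields are parsed first, so the commas and newlines inside them are already part of the field values by the time they are removed.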
You can use either org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().
Refer to the API links below for details.