regex - Pig - Remove embedded newlines and commas in gzip files -


I have a gzip file whose data fields are separated by commas. I am currently using PigStorage to load the file, as shown below:

A = LOAD 'myfile.gz' USING PigStorage(',') AS (id, date, text);

The data in the gzip file has embedded characters: embedded newlines and commas. These characters exist in all three fields: id, date, and text. The embedded characters are always within "" quotes.

I would like to replace or remove these characters using Pig before doing any further processing.

I think I need to look for the first occurrence of "" quotes. Once I find these quotes, I need to look at the string within them and search for commas and newline characters in it. Once found, I need to replace them with a space or remove them.
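To make the intended logic concrete, here is a minimal sketch of that quote-scanning idea in Python rather than Pig. It is only an illustration of the approach described above, not a Pig solution: the names `QUOTED` and `clean_quoted` are hypothetical, and it assumes that a doubled quote ("") inside a quoted span stands for an escaped quote.

```python
import re

# One double-quoted span; a doubled quote ("") inside the span is
# assumed to be an escaped quote and does not end the span.
QUOTED = re.compile(r'"(?:[^"]|"")*"')


def clean_quoted(raw, replacement=" "):
    """Replace runs of commas/newlines found inside double-quoted spans."""
    def _fix(match):
        # Only the text of the quoted span is rewritten; delimiters
        # and record separators outside quotes are left untouched.
        return re.sub(r"[,\r\n]+", replacement, match.group(0))
    return QUOTED.sub(_fix, raw)


if __name__ == "__main__":
    sample = '1,"2014-01-01,\n10:00","hello,\nworld"\n2,"2014-01-02","plain"\n'
    print(clean_quoted(sample))
```

Passing an empty string as `replacement` removes the embedded characters instead of replacing them with a space.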

How can I achieve this via Pig?

Try this:

REGISTER piggybank.jar;
A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (id:chararray, date:chararray, text:chararray);
B = FOREACH A GENERATE REPLACE(REPLACE(id,'\n',''),',','') AS id, REPLACE(REPLACE(date,'\n',''),',','') AS date, REPLACE(REPLACE(text,'\n',''),',','') AS text;

We can use either org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().

Refer to the API links below for more details:

  1. http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
  2. http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html
