mysql - speed up LOAD DATA INFILE with duplicates - 250 GB


I am looking for advice on whether there is a way to speed up the import of 250 GB of data into a MySQL table (InnoDB) from 8 source CSV files of approx. 30 GB each. The CSVs have no duplicates within themselves, but they do contain duplicates between files -- in fact, some individual records appear in all 8 CSV files. Those duplicates need to be removed at some point in the process. My current approach creates an empty table with a primary key, and then uses 8 "LOAD DATA INFILE [...] IGNORE" statements to sequentially load each CSV file while dropping the duplicate entries. This works great on small sample files. With the real data, the first file took 1 hour to load, the second took more than 2 hours, the third one more than 5, and the fourth one more than 9 hours, which is where I'm at right now. It appears that as the table grows, the time required to compare new data against the existing data keeps increasing... which of course makes sense. With 4 more files to go, it looks like it might take 4 or 5 days to complete if I let it run its course.
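For reference, each of the eight statements presumably looks something like the sketch below (the table name, column layout, and file path are hypothetical; the IGNORE keyword is what silently skips rows whose primary key already exists in the table):

    -- one of the eight sequential loads; rows whose key is already present are skipped
    LOAD DATA INFILE '/data/part1.csv'
        IGNORE
        INTO TABLE big_table
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n';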

Would I be better off importing with no indexes on the table and removing the duplicates afterwards? Or should I import each of the 8 CSVs into a separate temporary table and then use a UNION query to create a new consolidated table without duplicates? Or are those approaches going to take just as long?
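For what it's worth, the temp-table variant being considered would look roughly like the sketch below (names are hypothetical, and INSERT IGNORE into the keyed target table is used here in place of one large UNION query; it removes the cross-file duplicates the same way):

    -- one staging table per CSV, with no indexes so the raw loads run fast
    CREATE TABLE stage1 (
        name VARCHAR(255) NOT NULL,     -- the dedup key
        payload TEXT                    -- stand-in for the remaining columns
    ) ENGINE=InnoDB;

    LOAD DATA INFILE '/data/part1.csv'
        INTO TABLE stage1
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n';

    -- ...same for stage2 through stage8...

    -- consolidate into the keyed table; IGNORE drops rows whose key is already there
    INSERT IGNORE INTO big_table SELECT * FROM stage1;
    INSERT IGNORE INTO big_table SELECT * FROM stage2;
    -- ...and so on for the remaining staging tables...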

Plan A

You have a column you dedup on; let's call it name.

    CREATE TABLE new (
        name ...,
        ...
        PRIMARY KEY (name)   -- no other indexes
    ) ENGINE=InnoDB;

Then, one CSV at a time:

* sort the CSV by name (this makes the caching work better)
* LOAD DATA ... (see the sketch below)
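A sketch of one cycle, assuming name is the first CSV field and the sort has already been done outside MySQL (for example with the Unix sort utility); the LOAD DATA statement itself is unchanged, only the input order is:

    -- part1.sorted.csv is part1.csv pre-sorted on the name column
    LOAD DATA INFILE '/data/part1.sorted.csv'
        IGNORE
        INTO TABLE `new`
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n';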

Yes, a similar plan could be done with temp tables, but it might not be any faster.

Plan B

Sort the CSV files together before loading (probably the Unix "sort" utility can do it in a single command?).

Plan B would be the fastest, since it is extremely efficient in its I/O.
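As a sketch, assuming the eight files have first been merged, sorted on name, and deduplicated outside MySQL (for instance with something like sort -u keyed on that column; the file name is illustrative), a single sequential load then finishes the job, with IGNORE kept only as a safety net:

    -- merged.csv: all eight CSVs, sorted by name and already deduplicated
    LOAD DATA INFILE '/data/merged.csv'
        IGNORE
        INTO TABLE `new`
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\n';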

