mysql - speed up LOAD DATA INFILE with duplicates - 250 GB
I am looking for advice on whether there is a way to speed up the import of 250 GB of data into a MySQL table (InnoDB). The source is 8 CSV files of approx. 30 GB each. The CSVs have no duplicates within themselves, but they do contain duplicates between files -- in fact, some individual records appear in all 8 CSV files. Those duplicates need to be removed at some point in the process. My current approach creates an empty table with a primary key, and then uses 8 "LOAD DATA INFILE [...] IGNORE" statements to load each CSV file sequentially, dropping the duplicate entries along the way. This works great on small sample files. With the real data, the first file took 1 hour to load, the second more than 2 hours, the third more than 5, and the fourth more than 9 hours, which is where I am right now. It appears that as the table grows, the time required to compare the new data against the existing data keeps increasing... which of course makes sense. With 4 more files to go, it looks like it might take 4 or 5 days to complete if I let it run its course.
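For reference, the process is roughly the following (a sketch only; the database, table, and column names and the file paths are placeholders, and LOCAL loading assumes local_infile is enabled on both client and server):

# one table with a PRIMARY KEY, then 8 sequential loads; IGNORE silently
# skips any row whose primary key already exists in the table
for f in /data/part*.csv; do
  mysql --local-infile=1 mydb -e "
    LOAD DATA LOCAL INFILE '$f'
    IGNORE INTO TABLE big_table
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (id, col_a, col_b);"
done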
Would I be better off importing with no indexes on the table and removing the duplicates afterwards? Or should I import each of the 8 CSVs into a separate temporary table and then use a UNION query to create a new, consolidated table without duplicates? Or are those approaches going to take just as long?
Plan A
You have a column you are dedupping on; let's call it name.
CREATE TABLE new (
    name ...,
    ...
    PRIMARY KEY (name)    -- no other indexes
) ENGINE=InnoDB;
Then, one CSV at a time:
* Sort the CSV by name (this makes caching work better).
* LOAD DATA ... (see the sketch just below).
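A minimal sketch of that per-file loop, assuming a comma-separated file with name in the first column and placeholder database/column names and file paths (adjust the separator, column list, and paths to the real layout):

LC_ALL=C sort -t, -k1,1 part1.csv -o part1.sorted.csv   # sort by the dedup key
mysql --local-infile=1 mydb -e "LOAD DATA LOCAL INFILE 'part1.sorted.csv' IGNORE INTO TABLE new FIELDS TERMINATED BY ',' (name, col_a, col_b);"
# ...repeat for the remaining 7 files...

Because each sorted file arrives in primary-key order, InnoDB mostly appends to the clustered index instead of jumping around in it, which is why the caching works better.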
Yes, a similar plan could be done with temp tables, but it might not be any faster.
Plan B
Sort the CSV files together (probably the unix "sort" can do it in a single command?), then do a single load.
Plan B would be the fastest, since it is extremely efficient in I/O.
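A sketch of Plan B, assuming GNU sort, comma-separated files keyed on the first column, and duplicate records that share that key (the file names, temp directory, and the -S/-T/--parallel tuning knobs are placeholders):

# merge-sort all 8 files; with -k1,1 and -u, GNU sort keeps only the
# first line for each key, which removes the cross-file duplicates
LC_ALL=C sort -u -t, -k1,1 -S 8G -T /fast/tmp --parallel=4 \
    part1.csv part2.csv part3.csv part4.csv \
    part5.csv part6.csv part7.csv part8.csv > all_dedup.csv
# a single load into the empty table; no growing-index lookups this time
mysql --local-infile=1 mydb -e "LOAD DATA LOCAL INFILE 'all_dedup.csv' INTO TABLE new FIELDS TERMINATED BY ',' (name, col_a, col_b);"

Because the merged file is already sorted by name, this single load also inserts in primary-key order and gets the same caching benefit as Plan A.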