sql server - RData takes longer to load than querying the database again
I am running RStudio Server on a server with 256 GB of RAM, and MS SQL Server 2012 on another. The DB contains data that allows me to build a graph with ~100 million nodes and ~150 million edges.
I have timed how long it takes to build the graph from that data:
- 1st select query = ~22 million rows = 12 minutes = df1 (dataframe1)
- 2nd select query = ~30 million rows = 8 minutes = df2
- 3rd select query = ~32 million rows = 8 minutes = df3
- 4th select query = ~63 million rows = 70 minutes = df4
- edges = rbind(df1, df2, df3, df4) = 6 minutes
- mygraph = graph.data.frame(edges) = 30 minutes
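For reference, the timing of the last two steps can be reproduced on a small scale with system.time(). This is only a minimal sketch on synthetic data: make_edges() and the row counts are made up for illustration and stand in for the real query results, and the graph step uses igraph's graph_from_data_frame() (the current name for graph.data.frame()) only if igraph is installed.

```r
# Hypothetical helper: builds a small synthetic edge list with two node
# columns and two attribute columns, all strings (not the real query output).
make_edges <- function(n) {
  data.frame(
    node1 = sprintf("n%07d", sample.int(1000L, n, replace = TRUE)),
    node2 = sprintf("n%07d", sample.int(1000L, n, replace = TRUE)),
    attr1 = sprintf("a%05d", seq_len(n)),
    attr2 = sprintf("b%05d", seq_len(n)),
    stringsAsFactors = FALSE
  )
}

set.seed(1)
df1 <- make_edges(2200L)
df2 <- make_edges(3000L)
df3 <- make_edges(3200L)
df4 <- make_edges(6300L)

# system.time() reports user, system and elapsed time; "elapsed" is wall-clock.
t_rbind <- system.time(edges <- rbind(df1, df2, df3, df4))
print(t_rbind[["elapsed"]])

# Time the graph construction only when igraph is available.
if (requireNamespace("igraph", quietly = TRUE)) {
  t_graph <- system.time(mygraph <- igraph::graph_from_data_frame(edges))
  print(t_graph[["elapsed"]])
}
```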
So a little over 2 hours. Since the data is quite stable, I figured I could speed things up by saving mygraph to disk. But when I tried to load it, it just wouldn't. I gave up after a 4 hour wait, thinking something had gone wrong.
So I rebooted the server, deleted the .rstudio folder and started over, this time saving the dataframes from each SQL query plus the edges dataframe, in both RData and RDS formats (save() and saveRDS(), with compress = FALSE every time). After each save, I timed the load() and readRDS() times of the 5 dataframes. The times were pretty much the same for load() and readRDS():
- df1 = 1.1 gb file = 1 minute
- df2 = 1.4 gb file = 2 minutes
- df3 = 1.7 gb file = 6 minutes
- df4 = 3.1 gb file = 13 minutes
- edges = 6.8 gb file = 21 minutes
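The save-then-time procedure described above can be sketched roughly as follows. This is an illustration on a small synthetic dataframe, not the real data: the column names, row count and temp-file paths are assumptions. Note that load() takes a file path and restores objects under the names they were saved with, so the sketch loads into a separate environment.

```r
set.seed(42)
n <- 10000L

# Synthetic stand-in: two node columns plus two attribute columns, all strings.
df <- data.frame(
  col1 = sprintf("node%06d", sample.int(5000L, n, replace = TRUE)),
  col2 = sprintf("node%06d", sample.int(5000L, n, replace = TRUE)),
  col3 = sprintf("attr%04d", sample.int(9999L, n, replace = TRUE)),
  col4 = sprintf("w%05d", seq_len(n)),
  stringsAsFactors = FALSE
)

rdata_file <- tempfile(fileext = ".RData")
rds_file   <- tempfile(fileext = ".rds")

# Write both formats uncompressed, as in the post.
save(df, file = rdata_file, compress = FALSE)
saveRDS(df, file = rds_file, compress = FALSE)
sizes <- file.size(c(rdata_file, rds_file))

# load() returns the names of the restored objects; keep them out of the
# global environment so the original df is not overwritten.
env <- new.env()
t_load <- system.time(loaded_names <- load(rdata_file, envir = env))
t_rds  <- system.time(df_rds <- readRDS(rds_file))

print(c(load = t_load[["elapsed"]], readRDS = t_rds[["elapsed"]]))
print(sizes)

unlink(c(rdata_file, rds_file))
```

Running the same timing in a fresh R session, rather than right after the save, would be closer to the situation where the slow load was observed.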
Good enough, I thought. But today, when I started a new session and tried to load(df1) to make some changes to it, I again got the feeling that something was wrong. After 20 minutes of waiting for it to load, I gave up. Memory, disk and CPU shouldn't be issues, as I'm the only one using this server. I have rebooted the server and deleted the .rstudio folder, thinking maybe something in there was hanging my session, but the dataframe still won't load. While load() is supposedly running, iotop shows no disk activity and ps shows this:
ps -C rsession -o %cpu,%mem,cmd
%CPU %MEM CMD
99.5  0.3 /usr/lib/rstudio-server/bin/rsession -u myusername
I have no idea what to try next. It makes no sense to me that loading an RData file would take longer than querying a SQL database that lives on a different server. And if it did, why was it so fast when I was timing the load() and readRDS() times right after saving the dataframes?
It's the first time I ask here at StackOverflow, so sorry if I forgot to mention something important for you to be able to answer this question. If I did, please let me know.
EDIT: Additional info requested by Brandon in the comments. The OS is CentOS 7. The dataframes contain lists of edges in the first 2 columns (col1=node1; col2=node2) and 2 additional columns for edge attributes. All columns are strings, varying between 5 and 14 characters long. I have also added the approximate number of rows of each dataframe to the original post. Thanks!