High R memory usage with only small objects when web scraping
I'm scraping a website, calling my scraping function in a for-loop. Around iteration 4,000 of the loop, my computer warned me that RStudio was using too much memory. After breaking the loop with the Escape key, though, I don't see any large objects in my R environment.
I tried the tips in these two posts, but they don't reveal the cause. When I call mem_used() from the pryr package, I get:
2.3 GB
which aligns with what Windows Task Manager said initially. It said 2.3 GB, then dropped to 1.7 GB ten minutes after I terminated the loop, and to 1.2 GB twenty minutes after the loop. mem_used() continues to report 2.3 GB.
But my R objects are small, according to the lsos() function from the first post linked above:
    > lsos()
                           Type     Size  Rows Columns
    all_raw              tbl_df 17390736 89485      12
    all_clean            tbl_df 14693336 89485      15
    all_no_pavs          tbl_df 14180576 86050      15
    all_no_dupe_names    tbl_df 13346256 79646      15
    sample_in            tbl_df  1917128  9240      15
    testdat              tbl_df  1188152  5402      15
    username_res         tbl_df   792936  4091      14
    getusername        function   151992    NA      NA
    dupe_names           tbl_df   132040  2802       3
    time_per_iteration  numeric    65408  4073      NA

That says the largest object is 17 MB, nowhere near 2.3 GB. How can I find the culprit of the memory use and fix it? Is there something in the loop that is gradually tying up memory?
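The comparison described above (named objects versus total memory in use) can be sketched in base R. This is only a rough illustration: the helper names env_objects_mb and r_heap_used_mb are made up for this example and are not part of pryr or of the lsos() function from the linked post. It assumes that the second column of the matrix returned by gc() holds the "used" figures in Mb.

```r
# Total size, in MB, of the objects bound to names in an environment.
# object.size() is an estimate: it can count shared data more than once
# and it does not follow environments or external pointers.
env_objects_mb <- function(env = globalenv()) {
  obj_names <- ls(envir = env, all.names = TRUE)
  if (length(obj_names) == 0) return(0)
  sizes <- vapply(
    obj_names,
    function(n) as.numeric(object.size(get(n, envir = env))),
    numeric(1)
  )
  sum(sizes) / 1024^2
}

# Memory currently in use by R's own heap, in MB, after a garbage collection.
# Column 2 of the gc() matrix is assumed to be the "(Mb)" value for "used".
r_heap_used_mb <- function() {
  mem <- gc()
  sum(mem[, 2])
}

# Compare the two figures for the current workspace.
cat("Named objects:", round(env_objects_mb(), 1), "MB\n")
cat("R heap in use:", round(r_heap_used_mb(), 1), "MB\n")
```

If the second figure is far larger than the first, that suggests the memory is held by something that is not bound to a name in the workspace, which is the situation described in this question.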
Here is a reproducible test example, scraping imdb.com:

    library(rvest)   # rvest 0.2.0 is needed to produce the error; it is fixed in 0.3.0
    library(stringr)
    library(dplyr)

    search_list <- make.names(names(precip))

    scrape_top_titles <- function(search_string) {
      urltoget <- paste0("http://www.imdb.com/find?q=", search_string)
      print(urltoget)
      page <- html(urltoget)
      top_3_hits <- page %>%
        html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
        html_text %>%
        str_trim %>%
        .[1:3]
      result <- list(search_term = search_string,
                     hit_1 = top_3_hits[1],
                     hit_2 = top_3_hits[2],
                     hit_3 = top_3_hits[3],
                     page_length = nchar(page %>% html_text))
      result
    }

    # These 70 scrapes start filling memory
    scrapes <- bind_rows(
      lapply(search_list, function(x) {
        data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)
      })
    )

    # For more dramatic memory filling, scrape 770 pages instead
    longer_list <- as.vector(outer(search_list, names(mtcars), paste, sep = "_"))
    long_scrapes <- bind_rows(
      lapply(longer_list, function(x) {
        data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)
      })
    )

Update: This looks like a memory leak in the XML package called by rvest, similar to the one described in this question, and it manifests as in this other question. The 0.3.0 release of rvest calls the xml2 package instead and has solved the memory leak, so the above code no longer generates the error unless an old version of rvest is used.
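To check whether each call to a function such as scrape_top_titles leaves memory behind, one option is to record R's memory use after every iteration and look for steady growth. The sketch below is an assumption-laden illustration, not a definitive diagnostic: track_memory is a hypothetical helper name, and it relies on column 2 of the gc() matrix being the "used" amount in Mb. It only sees memory that R's own allocator tracks.

```r
# Call f on each element of inputs, discard the result, and record the
# memory in use (MB, as reported by gc()) after each call.
track_memory <- function(inputs, f) {
  used_mb <- numeric(length(inputs))
  for (i in seq_along(inputs)) {
    f(inputs[[i]])                 # result deliberately not kept
    used_mb[i] <- sum(gc()[, 2])   # force a collection, then read "used" Mb
  }
  used_mb
}

# Small offline demonstration with a function that keeps nothing.
demo <- track_memory(1:5, function(x) sum(rnorm(1e4)))
print(demo)

# With the example from the question (requires network access):
# growth <- track_memory(search_list, scrape_top_titles)
# plot(growth, type = "l", xlab = "iteration", ylab = "MB in use")
```

Because the results are thrown away, a curve that keeps climbing would point at memory retained inside the function or the packages it calls rather than at objects in the workspace.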
I'm still looking for an answer that describes what is going on here: can someone explain the "memory leak"? The problem is fixed, but I'm curious about what was happening.