High R memory usage with only small objects when web scraping


I'm scraping a website, calling my scraping function in a for loop. Around iteration 4,000 of the loop, my computer warned me that RStudio was using too much memory. But after breaking the loop with the Escape key, I don't see any large objects in my R environment.

I tried the tips in these two posts, but they don't reveal the cause. When I call mem_used() from the pryr package, I get:

    2.3 GB

which aligns with what Windows Task Manager said initially. It said 2.3 GB, then dropped to 1.7 GB ten minutes after I terminated the loop, and to 1.2 GB twenty minutes after the loop. mem_used() continues to report 2.3 GB.

But my R objects are small, according to the lsos() function from the first post linked above:

    > lsos()
                           type     size  rows columns
    all_raw              tbl_df 17390736 89485      12
    all_clean            tbl_df 14693336 89485      15
    all_no_pavs          tbl_df 14180576 86050      15
    all_no_dupe_names    tbl_df 13346256 79646      15
    sample_in            tbl_df  1917128  9240      15
    testdat              tbl_df  1188152  5402      15
    username_res         tbl_df   792936  4091      14
    getusername        function   151992    NA      NA
    dupe_names           tbl_df   132040  2802       3
    time_per_iteration  numeric    65408  4073      NA

That says the largest object is 17 MB, nowhere close to 2.3 GB. How can I find the culprit of the memory use and fix it? Is there something in the loop that is gradually tying up memory?
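One way to put a number on that gap is to compare the summed size of the named objects with what R's garbage collector reports as in use. This is only a minimal sketch using base R: `workspace_mb` and `r_mem_mb` are hypothetical helper names, not functions from pryr or from the posts mentioned above.

```r
# Hypothetical helper: total size (in MB) of all named objects in an environment.
workspace_mb <- function(env = globalenv()) {
  sizes <- vapply(
    ls(envir = env),
    function(nm) as.numeric(object.size(get(nm, envir = env))),
    numeric(1)
  )
  sum(sizes) / 1024^2
}

# Hypothetical helper: memory R reports as currently used, in MB.
# gc() forces a collection first; column 2 of its result is the "(Mb)" used
# column, with one row for Ncells and one for Vcells.
r_mem_mb <- function() {
  sum(gc()[, 2])
}

# A large difference between the two values means memory is being held by
# something that is not a named object in the workspace.
gap_mb <- r_mem_mb() - workspace_mb()
```

If `gap_mb` keeps growing as the loop runs while `workspace_mb()` stays flat, the memory is not being held by anything that lsos() can list.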

Here is a reproducible test example, scraping imdb.com:

    library(rvest)   # rvest 0.2.0 is needed to reproduce the error; it is fixed in 0.3.0
    library(stringr)
    library(dplyr)

    search_list <- make.names(names(precip))

    # Fetch the IMDb search page for one term and return its top three hits
    scrape_top_titles <- function(search_string) {
      urltoget <- paste0("http://www.imdb.com/find?q=", search_string)
      print(urltoget)
      page <- html(urltoget)
      top_3_hits <- page %>%
        html_nodes(xpath = '//td[contains(@class, "result_text")]') %>%
        html_text %>%
        str_trim %>%
        .[1:3]

      result <- list(search_term = search_string,
                     hit_1 = top_3_hits[1],
                     hit_2 = top_3_hits[2],
                     hit_3 = top_3_hits[3],
                     page_length = nchar(page %>% html_text))
      result
    }

    # These 70 scrapes will start filling memory
    scrapes <- bind_rows(
      lapply(search_list, function(x) {
        data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)
      })
    )

    # For more dramatic memory filling, scrape 770 pages instead
    longer_list <- as.vector(outer(search_list, names(mtcars), paste, sep = "_"))
    long_scrapes <- bind_rows(
      lapply(longer_list, function(x) {
        data.frame(scrape_top_titles(x), stringsAsFactors = FALSE)
      })
    )
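To see whether memory climbs with each call rather than with the results that are kept, the memory in use can be recorded after every iteration. The following is a minimal sketch, not part of the original code: `track_memory` is a hypothetical helper, and it assumes base R's gc() figures are an adequate proxy for what mem_used() reports.

```r
# Hypothetical helper: call `fun` on each input, discard the result, and
# record the memory R reports as used (MB) after a forced garbage collection.
track_memory <- function(inputs, fun) {
  used_mb <- numeric(length(inputs))
  for (i in seq_along(inputs)) {
    fun(inputs[[i]])               # result deliberately not stored
    used_mb[i] <- sum(gc()[, 2])   # Ncells + Vcells "(Mb)" used
  }
  data.frame(iteration = seq_along(inputs), used_mb = used_mb)
}
```

With the example above, this would be called as `track_memory(search_list, scrape_top_titles)`. Because no results are stored, a steadily rising `used_mb` column would point at something inside the scraping call itself rather than at the accumulated data frames.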

Update: this looks like a memory leak in the XML package, which is called by rvest, similar to the one described in this question and the way it manifests in this other question. The 0.3.0 release of rvest calls the xml2 package instead and has solved the memory leak, so the above code no longer generates the error unless an old version of rvest is used.

I'm still looking for an answer that describes what is going on here: can someone explain this "memory leak"? The problem is fixed, but I'm curious about what was happening.

