r - Why do save and saveRDS act differently inside dopar?


(This is my first time trying to create a reproducible example question - please feel free to comment with better ways to describe or illustrate the issues!)

Main issue statement

I am training ~25,000 models in parallel using foreach's %dopar% and caretList (from the caretEnsemble package). Due to R crashing and memory issues, I need to save each of the forecasts as an individual object, so my workflow looks something like this - see below for a reproducible example.

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)

multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
  tryCatch({
    results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitcontrol,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c() ## should be trimming out the trainingData
    }
    save(results,file="foreach_results.rdata") ## export each caretList as its own object
    1
  },
  error = function(e) {
    write.csv(e$message,file="foreach_failure.txt") ## monitor failures as needed
    0
  }
  )
}

(IRL, the project does not involve the mtcars data - each iteration of the foreach loop works on one of the data frames in a list and saves a new forecast object for each data frame.)
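For context, the per-data-frame pattern described above might be sketched roughly as follows. This is only an illustration under stated assumptions: `fit_and_save`, `data_list`, `out_dir`, and the file name patterns are hypothetical, `lm` stands in for caretList, and plain `vapply` stands in for the foreach loop so that the snippet runs with base R alone.

```r
# Hypothetical stand-in for the real loop body: fit one model per data frame
# and write it to its own file. lm() replaces caretList() to keep this base R.
fit_and_save <- function(i, data_list, out_dir) {
  tryCatch({
    fit <- lm(mpg ~ cyl, data = data_list[[i]])
    saveRDS(fit, file = file.path(out_dir, sprintf("forecast_%05d.rds", i)))
    1  # success flag, as in the foreach example
  },
  error = function(e) {
    # one failure log per iteration, so iterations never share a log file
    writeLines(conditionMessage(e),
               file.path(out_dir, sprintf("failure_%05d.txt", i)))
    0  # failure flag
  })
}

# Assumed toy input: mtcars split by gear, giving a list of three data frames.
data_list <- split(mtcars, mtcars$gear)
out_dir <- file.path(tempdir(), "forecasts")
dir.create(out_dir, showWarnings = FALSE)

status <- vapply(seq_along(data_list), fit_and_save, numeric(1),
                 data_list = data_list, out_dir = out_dir)
saved_files <- list.files(out_dir, pattern = "\\.rds$")
```

Giving every iteration its own file name keeps a later iteration from overwriting an earlier result, which the single fixed file name in the reproducible example would do with more than one iteration.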

When this object is saved inside the foreach loop, the saved file is approximately 160 kb in Windows after compression.

However, when the object is created and saved without using foreach, like so:

results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitcontrol,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
    results[[i]]$trainingData <- c()
}
save(results,file="no_foreach_results.rdata")

This object, which is otherwise the same, is approximately 136 kb in Windows. What is adding to the saved object size in the foreach case?

In my real workflow, the smaller non-foreach object is 4 MB on average and the larger foreach object is 10 MB on average, so this creates real storage issues when saving ~25,000 of these files.

  • Why is the object larger when it is saved within the foreach loop, and what, if anything, can I do about it?
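To put numbers on the trade-off between file size and save time, a rough sketch along these lines compares the compression types that saveRDS accepts. The object `obj` is made-up data (not a caretList) and `compare_compression` is a hypothetical helper, so the sizes and timings it reports will not match those of real models.

```r
# Made-up stand-in object; real caretList objects will compress differently.
set.seed(1)
obj <- list(coef = rnorm(1e4), labels = rep(letters, 2000))

# Hypothetical helper: save obj with one compression type, report size and time.
compare_compression <- function(obj, type) {
  path <- tempfile(fileext = ".rds")
  on.exit(unlink(path))  # remove the temporary file when the helper returns
  elapsed <- system.time(saveRDS(obj, path, compress = type))[["elapsed"]]
  data.frame(compress = type, bytes = file.size(path), seconds = elapsed)
}

sizes <- do.call(rbind,
                 lapply(c("gzip", "bzip2", "xz"), compare_compression, obj = obj))
print(sizes)
```

Running the same comparison on one or two real forecast objects would show whether the extra time spent on a stronger compression type is worth the storage it saves.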

Notes

  • My hypothesis is that save within foreach saves the entire environment: instead of saving only the object, even when it is told to do so using saveRDS (see below), there is some implicit saving of the environment that is exported to each of the workers.
  • trim doesn't seem to be working within caretList: the trim option of trainControl doesn't seem to be trimming what it's supposed to, so I had to manually add a command to trim the trainingData.
  • My current workaround is to set the save compression to xz: I need the foreach loop to take advantage of multiple cores, so I am stuck with the larger objects. This slows down the workflow by 3-4x, however, which is why I'm looking for a solution.
  • The PSOCK cluster is needed to work around an issue in caret parallelization: see the answer here.
  • saveRDS is not the issue: I've tested using saveRDS instead of save, and the difference in object sizes persists.
  • Removing tryCatch is not the issue: even without the tryCatch in the foreach loop, the difference in object size persists.
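The environment hypothesis in the notes above can be probed in isolation with base R. The sketch below does not establish what happens inside caretList or foreach; it only illustrates the general mechanism that a formula (or function) created inside a non-global frame carries that frame along when serialized, whereas the global environment is written as a reference. `make_formula` and `payload` are hypothetical names.

```r
# Hypothetical helper: builds a formula inside a function, so the formula's
# environment is the function's local frame (which also holds `payload`).
make_formula <- function(n) {
  payload <- rnorm(n)  # large object living in the local frame
  y ~ x
}

f_local <- make_formula(1e5)
f_global <- y ~ x
environment(f_global) <- globalenv()  # global environment is stored as a reference

size_local  <- length(serialize(f_local,  NULL))  # includes the captured frame
size_global <- length(serialize(f_global, NULL))  # formula only

# Re-pointing the environment drops the captured frame from the serialized form.
f_stripped <- f_local
environment(f_stripped) <- globalenv()
size_stripped <- length(serialize(f_stripped, NULL))

c(local = size_local, global = size_global, stripped = size_stripped)
```

If something similar is happening with the fitted models, re-pointing the environments of stored formulas or functions before saving might shrink the files, though doing so could break later calls that look up variables in those environments, so it would need testing against predict and similar functions.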

Technical details

Reproducible example:

library(caret)
library(caretEnsemble)

## train a caretList without a foreach loop
fitcontrol <- trainControl(## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10,
  trim=TRUE)

results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitcontrol,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
    results[[i]]$trainingData <- c()
}
object.size(results) ## returns 546536 bytes
save(results,file="no_foreach_results.rdata") ## in Windows, this object is 136 kb

## train a caretList with a foreach loop
library(doParallel)

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)

multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
  tryCatch({
    results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitcontrol,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c()
    }
    save(results,file="foreach_results.rdata") ## in Windows, this object is 160 kb
    ## loading this file back in and running object.size gives 546504 bytes, approximately the same
    1
  },
  error = function(e) {
    write.csv(e$message,file="foreach_failure.txt")
    0
  }
  )
}
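The near-identical object.size values noted in the comments above do not by themselves rule out a difference in what gets written to disk: the documentation for object.size states that associated space, such as the environment of a function, is not included in its calculation. A possible per-component check is sketched below using a made-up list in place of the caretList; `make_closure` and `components` are hypothetical names.

```r
# Hypothetical closure factory: the returned function keeps `big` alive in its
# enclosing environment even though it never uses it.
make_closure <- function() {
  big <- rnorm(1e5)
  function(x) x + 1
}

# Made-up stand-in for a list of fitted models.
components <- list(plain = 1:10, closure = make_closure())

# Size as reported by object.size (environments excluded).
in_memory <- vapply(components,
                    function(x) as.numeric(object.size(x)), numeric(1))
# Size of the serialized form, which is what save and saveRDS write out
# before compression (captured environments included).
serialized <- vapply(components,
                     function(x) as.numeric(length(serialize(x, NULL))), numeric(1))

cbind(in_memory, serialized)
```

Applying the same two measurements to each element of the loaded `results` from both files could show which component, if any, accounts for the extra bytes.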

sessionInfo() output:

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] doParallel_1.0.10   iterators_1.0.8     earth_4.4.4         plotmo_3.1.4        TeachingDemos_2.10
 [6] plotrix_3.6-2       glmnet_2.0-5        foreach_1.4.3       Matrix_1.2-4        caretEnsemble_2.0.0
[11] caret_6.0-64        ggplot2_2.1.0       RevoUtilsMath_8.0.1 RevoUtils_8.0.1     RevoMods_8.0.1
[16] RevoScaleR_8.0.1    lattice_0.20-33     rpart_4.1-10

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4        compiler_3.2.2     nloptr_1.0.4       plyr_1.8.3         tools_3.2.2
 [6] lme4_1.1-11        digest_0.6.9       nlme_3.1-126       gtable_0.2.0       mgcv_1.8-12
[11] SparseM_1.7        gridExtra_2.2.1    stringr_1.0.0      MatrixModels_0.4-1 stats4_3.2.2
[16] grid_3.2.2         nnet_7.3-12        data.table_1.9.6   pbapply_1.2-1      minqa_1.2.4
[21] reshape2_1.4.1     car_2.1-2          magrittr_1.5       scales_0.4.0       codetools_0.2-14
[26] MASS_7.3-45        splines_3.2.2      pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.21
[31] stringi_1.0-1      munsell_0.4.3      chron_2.3-47

