(This is my first time trying to create a reproducible example question - please feel free to comment with better ways to describe or illustrate the issues!)
Main issue statement
I am training ~25,000 models in parallel using foreach's %dopar% and caretList (from the caretEnsemble package). Due to R crashing and memory issues, I need to save each of the forecasts as an individual object, so my workflow looks something like this - see below for a reproducible example.
    cl <- makePSOCKcluster(4)
    clusterEvalQ(cl, library(foreach))
    registerDoParallel(cl)

    multiple.forecasts <- foreach(x = 1:1, .combine = 'rbind',
                                  .packages = c('zoo', 'earth', 'caret', "glmnet", "caretEnsemble")) %dopar% {
      tryCatch({
        results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                             methodList = c("glmnet", "lm", "earth"),
                             continue_on_fail = TRUE)
        for (i in seq_along(results)) {
          results[[i]]$trainingData <- c()  ## should be trimming out trainingData
        }
        save(results, file = "foreach_results.RData")  ## export each caretList as its own object
        1
      },
      error = function(e) {
        write.csv(e$message, file = "foreach_failure.txt")  ## monitor failures as needed
        0
      })
    }

(The IRL project does not involve the mtcars data - each iteration of the foreach loop iterates over one of the data frames in a list and saves a new forecast object for each data frame.)
When this object is saved inside the foreach loop, the saved file is approximately 160 KB in Windows (after compression).
However, when the object is created and saved without using foreach, like so:

    results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                         methodList = c("glmnet", "lm", "earth"),
                         continue_on_fail = TRUE)
    for (i in seq_along(results)) {
      results[[i]]$trainingData <- c()
    }
    save(results, file = "no_foreach_results.RData")

this object, which should be the same, is approximately 136 KB in Windows. What is adding to the saved object size in the foreach case?
In the real workflow, the smaller non-foreach object is 4 MB on average and the larger foreach object is 10 MB on average, so this creates real storage issues when saving ~25,000 of these files.
- Why is the object larger when it is saved within the foreach loop, and is there anything I can do about it?
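One way to narrow this down is to measure how many bytes each component needs when serialized, since that is what ends up in the saved file (before compression). The following is only a minimal sketch on a toy list, not on a real caretList; `serialized_size()` is a hypothetical helper name introduced here for illustration.

```r
# Hypothetical helper: number of bytes needed to serialize an object
# (uncompressed), which is what save()/saveRDS() write before compressing.
serialized_size <- function(x) length(serialize(x, NULL))

# Toy stand-in for a list of fitted models (NOT a real caretList).
toy <- list(
  small = 1:10,
  large = rnorm(1e4)
)

# Per-component serialized sizes, largest first.
sizes <- sort(vapply(toy, serialized_size, numeric(1)), decreasing = TRUE)
print(sizes)
```

Applying the same helper to each element of the `results` objects loaded from the two .RData files should show which model, if any, accounts for the extra bytes.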
Notes
- My hypothesis is that save within foreach saves the entire environment: instead of saving just the object, even when it's commanded using saveRDS (see below), there is some implicit saving of the environment exported to each of the workers.
- trim doesn't seem to be working within caretList: the trim trainControl option doesn't seem to be trimming what it's supposed to, so I had to manually add a command to trim the trainingData.
- My current workaround is to set save compression to xz: I need the foreach loop to take advantage of multiple cores, so I am stuck with the larger objects. This slows down the workflow 3-4x, however, which is why I'm looking for a solution.
- The PSOCK cluster is needed to work around an issue in caret parallelization: see the answer here.
- saveRDS is not the issue: I've tested using saveRDS instead of save, and the difference in object sizes persists.
- Removing tryCatch is not the issue: even without tryCatch in the foreach loop, the difference in object size persists.
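The environment hypothesis in the notes above can be probed in isolation. In R, a formula (or closure) carries a reference to the environment it was created in, and serialize() writes out the contents of that environment unless it is the global environment or a package namespace. The sketch below shows that effect on a toy formula only; `serialized_size()` and `make_formula()` are hypothetical names, and whether the same thing happens inside the models fitted on a PSOCK worker is an assumption that has not been verified here.

```r
# Hypothetical helper: bytes needed to serialize an object (uncompressed).
serialized_size <- function(x) length(serialize(x, NULL))

# A formula built inside a function keeps that function's frame as its
# environment, including objects the formula never uses.
make_formula <- function() {
  big <- rnorm(1e5)  # unrelated large local object
  y ~ x
}
f_local <- make_formula()

# Same formula, but pointed at the global environment, which serialize()
# stores as a reference instead of copying its contents.
f_global <- f_local
environment(f_global) <- globalenv()

size_local  <- serialized_size(f_local)
size_global <- serialized_size(f_global)
cat("local env: ", size_local, "bytes\n")
cat("global env:", size_global, "bytes\n")
```

If the formulas or closures stored in the fitted models pick up a non-global environment when they are created on a worker, that could account for a larger file even though object.size() reports nearly the same value, because object.size() does not count the contents of environments.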
Technical details
Reproducible example:

    library(caret)
    library(caretEnsemble)

    ## train caretList without the foreach loop
    fitControl <- trainControl(## 10-fold CV
                               method = "repeatedcv",
                               number = 10,
                               ## repeated ten times
                               repeats = 10,
                               trim = TRUE)

    results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                         methodList = c("glmnet", "lm", "earth"),
                         continue_on_fail = TRUE)
    for (i in seq_along(results)) {
      results[[i]]$trainingData <- c()
    }
    object.size(results)  ## returns 546536 bytes
    save(results, file = "no_foreach_results.RData")  ## in Windows, the saved file is 136 KB

    ## train caretList with the foreach loop
    library(doParallel)
    cl <- makePSOCKcluster(4)
    clusterEvalQ(cl, library(foreach))
    registerDoParallel(cl)

    multiple.forecasts <- foreach(x = 1:1, .combine = 'rbind',
                                  .packages = c('zoo', 'earth', 'caret', "glmnet", "caretEnsemble")) %dopar% {
      tryCatch({
        results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                             methodList = c("glmnet", "lm", "earth"),
                             continue_on_fail = TRUE)
        for (i in seq_along(results)) {
          results[[i]]$trainingData <- c()
        }
        save(results, file = "foreach_results.RData")  ## in Windows, the saved file is 160 KB
        ## loading this file back in and running object.size() gives 546504 bytes, approximately the same
        1
      },
      error = function(e) {
        write.csv(e$message, file = "foreach_failure.txt")
        0
      })
    }
    stopCluster(cl)  ## release the workers when done

sessionInfo() output:
    R version 3.2.2 (2015-08-14)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows Server 2012 x64 (build 9200)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] doParallel_1.0.10   iterators_1.0.8     earth_4.4.4         plotmo_3.1.4        TeachingDemos_2.10
     [6] plotrix_3.6-2       glmnet_2.0-5        foreach_1.4.3       Matrix_1.2-4        caretEnsemble_2.0.0
    [11] caret_6.0-64        ggplot2_2.1.0       RevoUtilsMath_8.0.1 RevoUtils_8.0.1     RevoMods_8.0.1
    [16] RevoScaleR_8.0.1    lattice_0.20-33     rpart_4.1-10

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.4        compiler_3.2.2     nloptr_1.0.4       plyr_1.8.3         tools_3.2.2
     [6] lme4_1.1-11        digest_0.6.9       nlme_3.1-126       gtable_0.2.0       mgcv_1.8-12
    [11] SparseM_1.7        gridExtra_2.2.1    stringr_1.0.0      MatrixModels_0.4-1 stats4_3.2.2
    [16] grid_3.2.2         nnet_7.3-12        data.table_1.9.6   pbapply_1.2-1      minqa_1.2.4
    [21] reshape2_1.4.1     car_2.1-2          magrittr_1.5       scales_0.4.0       codetools_0.2-14
    [26] MASS_7.3-45        splines_3.2.2      pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.21
    [31] stringi_1.0-1      munsell_0.4.3      chron_2.3-47