(This is my first time trying to create a reproducible example question - please feel free to comment with better ways to describe or illustrate the issues!)
Main issue statement
I am training ~25,000 models in parallel using foreach's %dopar% and caretList (from the caretEnsemble package). Because of R crashing and memory issues, I need to save each of the forecasts as an individual object, so my workflow looks like this - see below for a reproducible example.
    cl <- makePSOCKcluster(4)
    clusterEvalQ(cl, library(foreach))
    registerDoParallel(cl)

    ## fitControl is defined in the reproducible example below
    multiple.forecasts <- foreach(x = 1:1, .combine = 'rbind',
                                  .packages = c('zoo', 'earth', 'caret', "glmnet", "caretEnsemble")) %dopar% {
      tryCatch({
        results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                             methodList = c("glmnet", "lm", "earth"),
                             continue_on_fail = TRUE)
        for (i in 1:length(results)) {
          results[[i]]$trainingData <- c()  ## should be trimming out trainingData
        }
        save(results, file = "foreach_results.RData")  ## export each caretList as its own object
        1
      },
      error = function(e) {
        write.csv(e$message, file = "foreach_failure.txt")  ## monitor failures as needed
        0
      })
    }
(IRL the project does not involve the mtcars data - each iteration of the foreach loop iterates over one of the data frames in a list and saves a new forecast object for each data frame.)
When this object is saved inside the foreach loop, the file size is approximately 156 KB in Windows (after compression).
However, when the object is created and saved without using foreach, like so:

    results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                         methodList = c("glmnet", "lm", "earth"),
                         continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c()
    }
    save(results, file = "no_foreach_results.RData")

this object, which is otherwise the same, is approximately 136 KB in Windows. So what is adding to the saved object size inside foreach?
In the real workflow, the smaller non-foreach objects are 4 MB on average and the larger foreach objects are 10 MB on average, so this creates real storage issues when saving ~25,000 of these files.
- Why is the object larger when it is saved within the foreach loop, and what, if anything, can I do about it?
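For anyone reproducing the comparison, the sizes involved can be measured directly in R rather than read off the file explorer. This is a minimal base-R sketch, not part of my workflow: the toy_results list is a hypothetical stand-in for a caretList, and the file names are temporary files. It also shows the compress argument of save(), since xz compression typically trades extra CPU time for smaller files.

```r
# Hypothetical stand-in for a fitted model list (NOT a real caretList)
toy_results <- list(coef = runif(1000), fitted = rep(mtcars$mpg, 500))

gz_file <- tempfile(fileext = ".RData")
xz_file <- tempfile(fileext = ".RData")

save(toy_results, file = gz_file)                   # default (gzip) compression
save(toy_results, file = xz_file, compress = "xz")  # slower, usually smaller

sizes <- c(
  in_memory  = as.numeric(object.size(toy_results)),  # does not count environments
  serialized = length(serialize(toy_results, NULL)),  # uncompressed serialized bytes
  gzip_file  = file.size(gz_file),
  xz_file    = file.size(xz_file)
)
print(sizes)
```

If in_memory is about the same for two objects but serialized differs, the extra bytes are in something object.size() does not count.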
Notes
- My hypothesis is that save within foreach saves the entire environment: instead of saving just the object, even when it is saved using saveRDS (see below), there is some implicit saving of the environment exported to each of the workers.
- trim doesn't seem to be working within caretList: the trim option of trainControl doesn't seem to be trimming what it's supposed to, so I had to manually add a command to trim trainingData.
- My current workaround is to set the save compression to xz: I need the foreach loop to take advantage of multiple cores, so I am stuck with the larger objects. This slows the workflow down 3-4x, however, which is why I'm looking for a solution.
- The PSOCK cluster is needed to work around an issue in caret parallelization: see the answer here.
- saveRDS is not the issue: I've tested using saveRDS instead of save, and the difference in object sizes persists.
- Removing tryCatch is not the issue: even without tryCatch in the foreach loop, the difference in object size persists.
Technical details
Reproducible example:
    library(caret)
    library(caretEnsemble)

    ## train caretList without foreach loop
    fitControl <- trainControl(## 10-fold CV
                               method = "repeatedcv",
                               number = 10,
                               ## repeated ten times
                               repeats = 10,
                               trim = TRUE)

    results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                         methodList = c("glmnet", "lm", "earth"),
                         continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c()
    }
    object.size(results)  ## returns 546536 bytes
    save(results, file = "no_foreach_results.RData")  ## in Windows, this object is 136 KB

    ## train caretList with foreach loop
    library(doParallel)
    cl <- makePSOCKcluster(4)
    clusterEvalQ(cl, library(foreach))
    registerDoParallel(cl)

    multiple.forecasts <- foreach(x = 1:1, .combine = 'rbind',
                                  .packages = c('zoo', 'earth', 'caret', "glmnet", "caretEnsemble")) %dopar% {
      tryCatch({
        results <- caretList(mpg ~ cyl, data = mtcars, trControl = fitControl,
                             methodList = c("glmnet", "lm", "earth"),
                             continue_on_fail = TRUE)
        for (i in 1:length(results)) {
          results[[i]]$trainingData <- c()
        }
        save(results, file = "foreach_results.RData")  ## in Windows, this object is 160 KB
        ## loading this file back in and running object.size gives 546504 bytes, approximately the same
        1
      },
      error = function(e) {
        write.csv(e$message, file = "foreach_failure.txt")
        0
      })
    }

    stopCluster(cl)  ## release the workers when done
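One way to narrow down where the extra bytes live would be to compare the serialized size of each component of the two saved objects. The following is a rough base-R sketch of that idea: serialized_sizes is a hypothetical helper (not from caret or caretEnsemble), and the toy list only stands in for a loaded results object.

```r
# Hypothetical helper: serialized size, in bytes, of each element of a list
serialized_sizes <- function(x) {
  vapply(x, function(el) as.numeric(length(serialize(el, NULL))), numeric(1))
}

# Toy stand-in for a loaded results object
toy <- list(small = 1:10, large = rnorm(1e4))
sizes_by_element <- serialized_sizes(toy)
print(sizes_by_element)
```

Applied to the real data, this would mean loading foreach_results.RData and no_foreach_results.RData into separate environments, calling serialized_sizes() on each results list (and then on individual models), and looking for the element whose size differs.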
sessionInfo() output:

    R version 3.2.2 (2015-08-14)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows Server 2012 x64 (build 9200)

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
     [1] doParallel_1.0.10   iterators_1.0.8     earth_4.4.4         plotmo_3.1.4        TeachingDemos_2.10
     [6] plotrix_3.6-2       glmnet_2.0-5        foreach_1.4.3       Matrix_1.2-4        caretEnsemble_2.0.0
    [11] caret_6.0-64        ggplot2_2.1.0       RevoUtilsMath_8.0.1 RevoUtils_8.0.1     RevoMods_8.0.1
    [16] RevoScaleR_8.0.1    lattice_0.20-33     rpart_4.1-10

    loaded via a namespace (and not attached):
     [1] Rcpp_0.12.4        compiler_3.2.2     nloptr_1.0.4       plyr_1.8.3         tools_3.2.2
     [6] lme4_1.1-11        digest_0.6.9       nlme_3.1-126       gtable_0.2.0       mgcv_1.8-12
    [11] SparseM_1.7        gridExtra_2.2.1    stringr_1.0.0      MatrixModels_0.4-1 stats4_3.2.2
    [16] grid_3.2.2         nnet_7.3-12        data.table_1.9.6   pbapply_1.2-1      minqa_1.2.4
    [21] reshape2_1.4.1     car_2.1-2          magrittr_1.5       scales_0.4.0       codetools_0.2-14
    [26] MASS_7.3-45        splines_3.2.2      pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.21
    [31] stringi_1.0-1      munsell_0.4.3      chron_2.3-47