Reading the Spark docs, I see:
"The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important."
I understand why broadcast variables are useful when re-using them in multiple tasks: you don't want to re-send them with the closures.
However, the second part, in bold, says "when caching the data in deserialized form is important". When is that important, and why? If you're going to use the data in one task, it will still be serialized and deserialized once, no?
I think you ignored the following part:
"and deserialized before running each task."
A single stage typically consists of multiple tasks (it is not common to have a single partition, is it?), and multiple tasks belonging to the same stage can be processed by the same executor. Since deserialization can be quite expensive, you may prefer to perform it only once per executor rather than once per task.
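The cost argument can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not Spark's actual implementation: it uses `pickle` as a stand-in for the serializer, and the names `lookup`, `deserialize`, `run_task_per_task` and `run_task_shared` are hypothetical. It simulates four tasks of one stage running on the same executor and counts how often the data is deserialized in each scenario.

```python
import pickle

# Hypothetical lookup table standing in for the data shipped to executors.
lookup = {i: i * i for i in range(1000)}
serialized = pickle.dumps(lookup)  # the form cached on the (simulated) executor

deserialize_calls = 0


def deserialize(blob):
    """Unpickle the blob and count how often this happens."""
    global deserialize_calls
    deserialize_calls += 1
    return pickle.loads(blob)


def run_task_per_task(partition, blob):
    """Simulates a task that deserializes the data before it runs."""
    table = deserialize(blob)
    return sum(table[x] for x in partition)


def run_task_shared(partition, table):
    """Simulates a task that reuses an already deserialized copy."""
    return sum(table[x] for x in partition)


# Four partitions, i.e. four tasks of one stage on the same executor.
partitions = [range(0, 250), range(250, 500), range(500, 750), range(750, 1000)]

# Scenario 1: data is cached serialized and deserialized before each task.
deserialize_calls = 0
per_task_results = [run_task_per_task(p, serialized) for p in partitions]
per_task_calls = deserialize_calls

# Scenario 2: data is deserialized once and the object is shared by all tasks.
deserialize_calls = 0
shared_table = deserialize(serialized)
shared_results = [run_task_shared(p, shared_table) for p in partitions]
shared_calls = deserialize_calls

print(per_task_calls, shared_calls)  # prints "4 1"
```

Both scenarios compute the same results. The only difference is the number of deserializations, which grows with the number of tasks in the first scenario and stays at one in the second. In a real job the gap matters more the larger the object is and the more tasks each executor runs.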