serialization - When is it useful to have broadcast data in deserialized form?


Reading the Spark docs, I see:

The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

I understand why broadcast variables are useful when re-using them in multiple tasks: you don't want to re-send them with the closures.

However, the second part says "when caching the data in deserialized form is important". When and why is this important? If you're going to use the data in only 1 task, it will still be serialized/deserialized once, no?

I think you ignored the following part:

and deserialized before running each task.

A single stage typically consists of multiple tasks (it is not common to have a single partition, is it?), and multiple tasks belonging to the same stage can be processed by the same executor. Since deserialization can be quite expensive, you may prefer to perform it only once.
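
To make the cost argument concrete, here is a rough toy model in plain Python. It does not use Spark at all: `pickle` stands in for Spark's serializer, the list `partitions` stands in for the tasks one executor runs within a single stage, and the function names `run_per_task` and `run_cached` are made up for this sketch. It only counts how often the data is deserialized under each strategy.

```python
import pickle

# Stand-in for the data every task needs (a hypothetical lookup table).
lookup = {i: i * i for i in range(1000)}
payload = pickle.dumps(lookup)  # the serialized form that gets shipped or cached

# Four "tasks" of one stage, all assumed to run on the same executor.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]


def run_per_task(payload, partitions):
    """Deserialize the payload again before running each task."""
    deserializations = 0
    results = []
    for part in partitions:
        table = pickle.loads(payload)  # repeated for every task
        deserializations += 1
        results.append([table[k] for k in part])
    return results, deserializations


def run_cached(payload, partitions):
    """Deserialize once and share the object across all tasks."""
    table = pickle.loads(payload)  # done a single time on the executor
    deserializations = 1
    results = [[table[k] for k in part] for part in partitions]
    return results, deserializations


per_task_results, per_task_count = run_per_task(payload, partitions)
cached_results, cached_count = run_cached(payload, partitions)

print(per_task_count, cached_count)  # 4 1
```

Both strategies compute the same results; only the number of deserializations differs, and the gap grows with the number of tasks an executor handles per stage.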

