Over the past few days I've been working on trying to understand how Spark executors know how to use a module by a given name upon import. I am working on AWS EMR. The situation: I initialize PySpark on EMR by typing
pyspark --master yarn
Then, in PySpark, I run:
import numpy as np  ## notice the naming

def myfun(x):
    n = np.random.rand(1)
    return x*n

rdd = sc.parallelize([1,2,3,4], 2)
rdd.map(lambda x: myfun(x)).collect()  ## works!
My understanding is that when I import numpy as np, the master node is the only node importing numpy and identifying it through np. However, on an EMR cluster (with 2 worker nodes), if I run the map function on the RDD, the driver program sends the function to the worker nodes, the function is executed on each item in the list (for each partition), and a successful result is returned.
My question is this: how do the workers know that numpy should be imported as np? Each worker already has numpy installed, but I've not explicitly defined a way for each node to import the module as np.
Please refer to the following post from Cloudera for further details on dependencies: http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
Under "Complex Dependency" they have an example (code) where the pandas module is explicitly imported on each node; that pattern is sketched below.
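A minimal sketch of that pattern (assuming pandas is installed on every node; process_partition is a hypothetical name, not code from the post): the import statement lives inside the function, so it executes in each executor process rather than only on the driver.

def process_partition(iterator):
    import pandas as pd  ## executes on the worker when the task runs
    rows = list(iterator)
    df = pd.DataFrame(rows, columns=['value'])
    return [float(df['value'].mean())]  ## one summary value per partition

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)
rdd.mapPartitions(process_partition).collect()  ## e.g. [2.0, 5.0]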
One theory I've heard being thrown around is that the driver program distributes whatever code is passed in the PySpark interactive shell. I am skeptical of this. The counter-example I bring up against that idea is: if on the master node I type
print "hello"
is every worker node printing "hello"? I don't think so, but maybe I am wrong on this.
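For what it's worth, here is a minimal sketch of one way to check this on a multi-node cluster: top-level statements run only in the driver process, while the function passed to map runs inside the executors.

import socket

print(socket.gethostname())  ## runs once, on the driver only

rdd = sc.parallelize(range(4), 2)
rdd.map(lambda x: socket.gethostname()).collect()
## on a real cluster this typically returns worker hostnames, not the driver's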
When a function is serialized, there are a number of objects being saved alongside it:
- code
- globals
- defaults
- closure
- dict
These can later be used to restore the complete environment required by a given function.
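This is easy to see with the standalone cloudpickle package (PySpark bundles its own copy; this is a sketch of the idea, not PySpark internals verbatim):

import pickle
import numpy as np
import cloudpickle

def myfun(x):
    n = np.random.rand(1)
    return x*n

payload = cloudpickle.dumps(myfun)  ## captures code, globals, defaults, closure, dict
restored = pickle.loads(payload)    ## ordinary pickle can read it back
restored(2)                         ## works: the 'np' binding travels with the function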
Since np is referenced in the function, it can be extracted from its code object:
from pyspark.cloudpickle import CloudPickler

CloudPickler.extract_code_globals(myfun.__code__)
## {'np'}
and its binding can be extracted from the function's globals:
myfun.__globals__['np']
## <module 'numpy' ...
So the serialized closure (in the broad sense) captures all the information required to restore the environment. Of course, the modules accessed in the closure have to be importable on every worker machine.
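Conceptually, restoring such a module binding on the worker boils down to an import by name. The sketch below is a hypothetical helper to illustrate the idea, not an actual cloudpickle function:

import importlib

def restore_module_binding(func_globals, name, module_name):
    ## re-import the module on the worker and rebind it under the
    ## name the function's code expects (e.g. 'np' -> 'numpy')
    func_globals[name] = importlib.import_module(module_name)

## if numpy is not installed on a worker, this import is what fails,
## which is why every module referenced by the function must be
## installed on every executor node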
Everything else is just reading and writing machinery.
On a side note, the master node shouldn't execute any Python code. It is responsible for resource allocation, not for running application code.