Spark is a distributed computing framework built on a few core concepts: the cluster manager (resource manager), applications, executors, and tasks.
- Distributed computing across a cluster typically involves (1) requesting resources and (2) scheduling tasks; Spark does both through a uniform interface called the cluster manager. Examples include YARN (Hadoop, MapR, EC2), Mesos, Spark's standalone manager, and local mode (multi-core on a single machine).
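The choice of cluster manager is usually expressed through the `--master` option of `spark-submit`. A sketch of the common forms (host names, ports, and `app.py` are placeholders, not values from this document):

```shell
# Standalone cluster manager (host and port are placeholders)
spark-submit --master spark://master-host:7077 app.py

# YARN (the ResourceManager address is taken from the Hadoop configuration)
spark-submit --master yarn app.py

# Mesos (host and port are placeholders)
spark-submit --master mesos://mesos-host:5050 app.py

# Local mode: multi-core on a single machine (here, 4 worker threads)
spark-submit --master "local[4]" app.py
```

The same master URL can instead be set in code or in `spark-defaults.conf`; `spark-submit` is simply the most common place to choose it.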
- The interface between the Spark framework and a cluster manager is the `SparkContext` object in your driver program.
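A minimal sketch of creating that `SparkContext` in PySpark (the app name and master URL are placeholder values, and this assumes `pyspark` is installed; running it also requires a local JVM):

```python
from pyspark import SparkConf, SparkContext

# The master URL tells the SparkContext which cluster manager to contact;
# "local[2]" runs Spark in local mode with 2 worker threads.
conf = SparkConf().setAppName("my-app").setMaster("local[2]")
sc = SparkContext(conf=conf)

# ... define RDDs and run jobs here ...

sc.stop()  # release the executors and disconnect from the cluster manager
```

In a packaged application the master is usually left out of the code and supplied by `spark-submit`, so the same program can run against any cluster manager.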
- Specifically, to run on a cluster, the SparkContext connects to one of several types of cluster managers (Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications.
- Once connected, Spark acquires executors on nodes in the cluster; these are processes that run computations and store data for your application. Next, it sends your application code (defined by the JAR or Python files passed to SparkContext) to the executors. Finally, the SparkContext sends tasks for the executors to run.
- Each Spark application has its own SparkContext (on the scheduling side) and its own set of executors on different nodes, each running in its own JVM (on the executor side). Data sharing between different applications can only be done via an external storage system.
- Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).