In this lecture, I’ll introduce how Apache Spark evolved from Hadoop and how Databricks builds on Apache Spark with additional benefits.
Hadoop was a significant advancement in the ability to store and process large datasets. However, it has a few limitations.
First, Hadoop’s processing is relatively slow because MapReduce reads and writes data to and from disk at every stage. This per-stage I/O overhead is especially costly for iterative jobs and smaller datasets, where it can dominate the actual computation time.
Additionally, Hadoop uses a batch processing engine, so it can’t process real-time data. It is also difficult to use and administer, requiring relatively complicated, verbose code.
This is where Apache Spark comes in.
Spark is a unified analytics engine for large-scale distributed data processing. Its design philosophy centers around four key characteristics:
1. Speed: Unlike MapReduce, which reads and writes data to disk at every stage, Spark processes data in memory (RAM). This can make Spark up to 100 times faster for certain workloads.
2. Ease of Use: Spark is easier to use, offering a high-level DataFrame abstraction and APIs for Java, Scala, and Python, as well as Spark SQL, which simplifies coding (see the short PySpark sketch after this list).
3. Modularity: Spark operations can be applied across many workloads and expressed in various programming languages (Scala, Java, Python, R).
Spark’s unified libraries, with well-documented APIs, include core components such as Spark SQL, Spark Structured Streaming, MLlib, and GraphX, allowing multiple workloads to run on one engine.
4. Extensibility: Spark focuses on fast parallel computation rather than storage, unlike Hadoop, which includes both storage and compute. Spark decouples the two, enabling it to read data from multiple sources.
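To make the speed and ease-of-use points concrete, here is a minimal PySpark sketch. The file name and column name (events.csv, country) are hypothetical placeholders; it reads a CSV into a DataFrame, caches it in memory, and runs the same aggregation through both the DataFrame API and Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a CSV into a DataFrame -- the high-level structure mentioned above.
# "events.csv" and the "country" column are placeholders for illustration.
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in memory, so repeated queries avoid disk I/O --
# the in-memory processing behind Spark's speed advantage.
df.cache()

# The same computation expressed through the DataFrame API...
df.groupBy("country").agg(F.count("*").alias("events")).show()

# ...and through Spark SQL, both running on the same engine.
df.createOrReplaceTempView("events")
spark.sql("SELECT country, COUNT(*) AS events FROM events GROUP BY country").show()

spark.stop()
```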
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program, known as the driver program.
The SparkContext can connect to several types of cluster managers, which are responsible for allocating resources across the cluster. Spark supports four cluster managers (see the sketch after this list):
– The standalone cluster manager built into Spark.
– Apache Hadoop YARN.
– Apache Mesos (deprecated as of Spark 3.2).
– Kubernetes.
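The cluster manager is selected through the master URL when the application starts. Here is a minimal PySpark sketch; the hostnames and ports are placeholders, and you would pick exactly one master:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    # Pick exactly one master URL (hostnames/ports below are placeholders):
    .master("local[*]")                    # run locally on all cores (no cluster manager)
    # .master("spark://host:7077")         # Spark's standalone cluster manager
    # .master("yarn")                      # Apache Hadoop YARN
    # .master("mesos://host:5050")         # Apache Mesos
    # .master("k8s://https://host:6443")   # Kubernetes
    .getOrCreate()
)
```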
At a high level, the Spark architecture involves the driver program orchestrating parallel operations on the Spark cluster. Each worker node in the cluster runs a Spark executor, which communicates with the driver and executes tasks on that worker.
Typically, there is one executor per worker node, and each executor provides a number of task slots determined by its CPU cores. The driver assigns units of work, called tasks, to these slots for parallel execution, allowing the cluster to scale horizontally by adding new nodes as the workload increases.
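To see how this maps to configuration, here is a sketch of sizing executors and cores; the numbers are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sizing-demo")
    .config("spark.executor.instances", "4")   # e.g. one executor on each of 4 workers (assumed)
    .config("spark.executor.cores", "8")       # 8 task slots per executor (assumed)
    .config("spark.executor.memory", "16g")    # memory per executor (assumed)
    .getOrCreate()
)

# With these settings, the driver can schedule up to
# 4 executors x 8 cores = 32 tasks in parallel; adding worker nodes
# (and executors) scales this out horizontally.
```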
Databricks, founded by the creators of Apache Spark in 2013, addresses some limitations of Spark.
For instance, managing Spark environments can be complex and time-consuming, involving setting up clusters, provisioning virtual machines, and installing Spark and libraries.
Databricks offers a fully managed, cloud-based data analytics platform that simplifies these tasks. It provides a collaborative environment with an easy-to-use notebook interface for coding, automatic cluster scaling, and integration with other platforms and data sources.
There are many additional benefits to using Databricks, which I’ll cover in the next lecture.