In the previous lesson, I touched on Apache Spark. Spark is the engine behind Databricks, so it’s worth exploring in more detail.
In this lesson, I’ll cover the main components of Spark.
Starting with the high-level architecture from the official documentation, at the base, we have Spark Core. This is the underlying execution engine supporting APIs for SQL, Python, Scala, and Java. The main abstraction of Spark Core is the Resilient Distributed Dataset (RDD). An RDD is a collection of elements partitioned across the nodes in a cluster, allowing for parallel operations. RDDs are immutable, meaning they are read-only, and they form the fundamental data structure of Apache Spark.
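To give a feel for the low-level API, here is a minimal sketch of working with an RDD in Python. It assumes a local SparkSession; the values and variable names are made up purely for illustration.

```python
# A minimal sketch of the low-level RDD API, assuming a local SparkSession.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
sc = spark.sparkContext  # the SparkContext exposes the RDD API

# Distribute a small Python list across the cluster as an RDD (2 partitions).
numbers = sc.parallelize([1, 2, 3, 4, 5], numSlices=2)

# Transformations (map, filter) build a new, immutable RDD; nothing runs yet.
squares = numbers.map(lambda x: x * x).filter(lambda x: x > 4)

# The collect() action triggers execution and returns the results to the driver.
print(squares.collect())  # [9, 16, 25]

spark.stop()
```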
This low-level layer, Spark Core, is quite complex and was the primary way to program Spark in its early days. However, working with it directly is generally no longer recommended: the RDD API is harder to use and does not benefit from the query optimizations that Spark applies to the higher-level APIs.
The top layer provides a higher-level abstraction over the Spark Core API and is widely used by data engineers, developers, and data scientists. Internally, all of these components run on top of Spark Core, but they offer a far more user-friendly interface, which is why they are the recommended way to work with Spark today.
Spark SQL and DataFrames will be the primary focus of this course. While Spark Core uses RDDs, this API employs DataFrames as its data structure. Like RDDs, DataFrames are immutable, distributed collections of data. However, DataFrames organize data into named columns, similar to tables in a relational database, making them more intuitive and easier to work with.
DataFrames are designed to simplify the processing of large datasets, allowing developers to impose structure onto a distributed collection of data. We will extensively use Spark SQL and DataFrame APIs in this course, invoked through Python. These APIs can also be used with Scala, Java, and R.
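As a preview, here is a minimal sketch, in Python, of the DataFrame and Spark SQL APIs. The sample rows, column names, and the temporary view name are made up for illustration; it again assumes a local SparkSession.

```python
# A minimal sketch of the DataFrame and Spark SQL APIs in Python (PySpark).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# Build a DataFrame with named columns, much like a relational table.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)

# DataFrame API: transformations return a new, immutable DataFrame.
df.filter(df.age > 30).select("name").show()

# Spark SQL: register the DataFrame as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

spark.stop()
```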
Additionally, this higher-level layer includes APIs for stream processing (Structured Streaming), machine learning (MLlib), and graph computation (GraphX).
With that overview, let’s move on to the next lecture.