### Lesson: Initial Setup for Spark Structured Streaming Demo
#### 1. **Setting Up the Data Lake Storage**
Before we begin working with Spark Structured Streaming, we need to prepare our environment by setting up the necessary storage and directories.
**Step 1: Create a New Container in Data Lake Gen2**
– Navigate to your Data Lake Gen2 storage account.
– Create a new container named **`streaming-demo`** and click **Create** to add it.
**Step 2: Create Directories within the Container**
– Inside the **`streaming-demo`** container, create a new directory named **`full_dataset`**.
– This directory will hold the full dataset that we’ll be working with.
**Step 3: Upload the Dataset**
– Download the **`orders_full.csv`** file from the course resources to your local computer.
– Navigate to the **`full_dataset`** directory.
– Click **Upload**, browse for the file on your computer, and upload it.
– Return to the **`streaming-demo`** container’s main directory.
– Create another directory named **`streaming_dataset`**. You can leave this directory empty for now. (If you prefer to script these storage steps instead of using the portal, see the sketch after this list.)
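For reference, the portal steps above can also be scripted. The sketch below uses the `azure-storage-file-datalake` Python SDK; it is an optional alternative, not part of the course flow. The account URL and credential are placeholders, and it assumes `orders_full.csv` is in your working directory.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholders: substitute your storage account URL and an account key
# (or another credential type the SDK supports).
service = DataLakeServiceClient(
    account_url="https://<account-name>.dfs.core.windows.net",
    credential="<account-key>",
)

# Create the container and both directories.
fs = service.create_file_system("streaming-demo")
fs.create_directory("full_dataset")
fs.create_directory("streaming_dataset")  # left empty for now

# Upload the dataset into full_dataset.
file_client = fs.get_file_client("full_dataset/orders_full.csv")
with open("orders_full.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```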
#### 2. **Mounting the Container in Databricks**
Next, we’ll mount the **`streaming-demo`** container to Databricks so that we can easily access the data stored in it.
**Step 1: Access the Databricks Workspace**
– In your Databricks workspace, create a new folder named **`14_Spark_Structured_Streaming_and_Auto_Loader`**.
– Inside this folder, create a new notebook titled **`00_Mounting_Streaming_Demo_Container`**.
**Step 2: Prepare the Notebook**
– This notebook will contain the code necessary to mount the **`streaming-demo`** container to the Databricks File System (DBFS).
– Make sure the notebook contains the code needed to mount the container (a minimal sketch follows this list), which should reference:
  – **Container Name:** `streaming-demo`
  – **Account Name:** your Data Lake Gen2 storage account name (e.g., `datalake639`; Azure storage account names are lowercase)
  – **Mount Point:** `/mnt/streaming-demo`
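Here is a minimal mounting sketch, assuming service principal (OAuth) authentication, which may differ from how your environment grants access. The application ID, client secret, and tenant ID below are placeholders to replace with your own values (ideally read from a Databricks secret scope rather than hard-coded).

```python
container_name = "streaming-demo"
account_name = "datalake639"  # replace with your storage account name
mount_point = "/mnt/streaming-demo"

# Placeholder service principal credentials -- substitute your own.
application_id = "<application-id>"
client_secret = "<client-secret>"
tenant_id = "<tenant-id>"

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": application_id,
    "fs.azure.account.oauth2.client.secret": client_secret,
    "fs.azure.account.oauth2.client.endpoint":
        f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
}

# Mount the container; returns True on success.
dbutils.fs.mount(
    source=f"abfss://{container_name}@{account_name}.dfs.core.windows.net/",
    mount_point=mount_point,
    extra_configs=configs,
)
```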
**Step 3: Run the Notebook**
– Execute the cell in your notebook to mount the container.
– When the cell completes successfully, the mount call returns `True`, confirming that the container has been mounted.
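Note that re-running the notebook will raise an error if the mount point already exists. A small guard, assuming the same mount point as above, is to unmount first:

```python
# Unmount first if /mnt/streaming-demo is already mounted, so the notebook is safe to re-run.
if any(m.mountPoint == "/mnt/streaming-demo" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/streaming-demo")
```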
#### 3. **Verifying the Mount in Databricks**
After mounting, it’s essential to verify that the container and its directories are accessible:
– Navigate to the **Data** section in Databricks.
– Browse **DBFS** and open the **`mnt`** directory.
– You should see the **`streaming-demo`** mount listed.
– Inside, you should find the **`full_dataset`** directory containing the **`orders_full.csv`** file, as well as the **`streaming_dataset`** directory.
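You can also verify the mount from a notebook cell. A quick check, assuming the mount point defined earlier:

```python
# List the container root and the full_dataset directory through the mount.
display(dbutils.fs.ls("/mnt/streaming-demo"))
display(dbutils.fs.ls("/mnt/streaming-demo/full_dataset"))  # should show orders_full.csv
```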
With this setup complete, you are now ready to proceed with the Spark Structured Streaming section of the course, where we’ll work with streaming data and implement real-time data processing.