In our previous lesson, we learned how to access data files in our data lake using access keys. Today, we’ll explore another method using SAS tokens.
SAS stands for Shared Access Signature. It’s a secure way to grant limited, time-bound access to resources in Azure, such as data stored in Blob Storage or Azure Data Lake Storage Gen2.
For reference, see the Databricks documentation on connecting to Azure storage: https://learn.microsoft.com/en-us/azure/databricks/connect/storage/azure-storage
Let’s go through the steps to generate a SAS token in the Azure portal:

1. Open your storage account and, under Security + networking, select Shared access signature.
2. Choose the allowed services, resource types, and permissions the token should grant.
3. Set the start and expiry times, keeping the validity window as short as practical.
4. Click Generate SAS and connection string, then copy the generated SAS token.
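If you prefer to create the token programmatically rather than through the portal, here’s a minimal sketch using the `azure-storage-blob` Python SDK. The account name, account key, permissions, and 24-hour expiry below are placeholder assumptions; adjust them to your needs:

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import AccountSasPermissions, ResourceTypes, generate_account_sas

# Placeholder values -- substitute your own storage account name and key.
account_name = "adlsv2"
account_key = "<your-account-key>"

# Grant read/list access to containers and blobs, valid for 24 hours.
sas_token = generate_account_sas(
    account_name=account_name,
    account_key=account_key,
    resource_types=ResourceTypes(container=True, object=True),
    permission=AccountSasPermissions(read=True, list=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=24),
)
print(sas_token)  # the same "sv=...&sig=..." query string the portal produces
```

The resulting string is what you’ll supply to Databricks in the next step.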
Now, let’s integrate the SAS token into our Databricks environment. Here’s a simplified example of using it to read a CSV file from ADLS Gen2:
spark.conf.set("fs.azure.account.auth.type.adlsv2.dfs.core.windows.net", "SAS") spark.conf.set("fs.azure.sas.token.provider.type.adlsv2.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider") #spark.conf.set("fs.azure.sas.fixed.token.adlsv2.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>")) spark.conf.set("fs.azure.sas.fixed.token.adlsv2.dfs.core.windows.net","sv=2024-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2024-07-11T02:03:28Z&st=2024-07-10T18:03:28Z&spr=https,http&sig=prapuKnCdsUVtyePxhv4T%2F2PF9RNNJyaqsQYZVF0peg%3D41") spark.read.csv("abfss://data@adlsv2.dfs.core.windows.net/Employees.csv", header=True).display()
Be sure to adjust the storage account name (`adlsv2`), container (`data`), and file name (`Employees.csv`) to match your own storage structure.
With SAS tokens, you now have an alternative method for securely accessing data stored in Azure from Databricks. Keep in mind that the read method depends on the file format: `spark.read.csv` works for CSV files, while other formats such as Parquet require their own readers, like `spark.read.parquet` (sketched below).
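As a minimal sketch, assuming the SAS configuration from the cell above is already in place and using a hypothetical `Employees.parquet` file:

```python
# Assumes the SAS configuration from the cell above has already been applied.
# "Employees.parquet" is a hypothetical file used purely for illustration.
df = spark.read.parquet("abfss://data@adlsv2.dfs.core.windows.net/Employees.parquet")
df.display()
```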
Because SAS tokens are scoped to specific permissions and expire automatically, this approach gives you controlled, time-limited access to your data lake resources, enhancing how you manage data access in your cloud environment.