Assignment 1

Okay, so now I’ll walk through the solutions to the assignment exercise.

So I have a notebook here with all of the code ready, along with additional notes and documentation.

So I’ll run through all of that.

And this will be included as part of the course resources for you to review and refer to later on.

So here we have a summary of the first requirement, which was to create the orders table.

So for the first step, I’m going to import all of the relevant data types and I’ll run that cell.

I’ll then I’ll then read the orders CSV file into a dataframe called orders.

As you can see, I’ve defined the schema.

So all of the columns, apart from order underscore date time are the final data type that I’d like

them to be.

Or to underscore datetime is a string type column.

This is because if I tried to read it as a timestamp data type, it would fail.

This is because the format is not correct and it needs to be changed before I can do that.

So let me first run this cell.

And now let’s display the orders, DataFrame.

So here is the orders DataFrame.

So currently, as you can see, the order underscore date time is not in the appropriate format to be

converted to a timestamp.

So to do that, we first must take this string column and convert it to a timestamp using the to underscore

timestamp function.

So let me run the cell here that imports the function.

Okay.

So now, in the following code, I’ve used the select method to select all of the relevant columns.

However, notice that I’m using the to underscore timestamp function in this line here and the to underscore

timestamp function takes two arguments.

The first argument is the column that I’d like to convert to a timestamp.

And the second argument is the format that the data in this column is in.

Now the column must be a string column.

And as you can see.

The format here mirrors the format here.

And in the comments for this code, I’ve got the references to the to underscore datetime to underscore

timestamp function and the date time patterns.

So you can just copy this and paste it into your URL so you can see all of the relevant details as well.

Great.

So let me run this cell.

I note that it’s assigned to the variable orders.

So now let’s check the data types.

So as you can see, the data type is a timestamp.

Now, let me display that.

And as you can see, the format has changed because it’s been converted to a timestamp.

Another ask was to include records where the order status.

Is complete.

So let’s run the cell here that filters the data frame to include only the complete records, and it’s

assigned back to the orders variable.

And another ask was to have the store name instead of the store ID.

So we have the stores table that I’m reading in here from the bronze layer.

And I’ll run this cell.

So I’ve assigned all the data types here.

And as a note, even in the previous even in the previous schema, I’ve made sure that the third argument

is false.

So this is for the column to be knowledgeable.

Remember, that was a requirement as well.

So I’ll run the cell.

And now I’ll display the store’s data frame.

So now, as you can see, we can join the water table here to the stores table here using the store

underscore ID column.

And that is what the cell is doing.

So it’s joining the tables via left join because the orders table is the left table and I’m doing a

left join because I want to retain all of the records from the orders table.

And then finally, I do select where I’m selecting the columns that I’d like that I’d like to retain.

And it’s all being assigned to the variable orders.

So let’s run the cell and then the next cell as well, which displays the records.

So now, as you can see, we have the store name as well.

So now we’re ready to save this into our silver layer.

So I’m going to run this cell which writes the dataframe as a parquet file.

Note the path.

So I’m specifying a silver folder, and then the file name, which is orders.

And at present, this doesn’t exist.

So if I go to my So if I go to data, I’ll open this a new tab so you can see Databricks file system

file store.

So at present we only have the bronze layer.

So I’ll go back here and now I will run the cell.

And now if I go back here, go to tables.

You can see we’ve got a silver layer with the orders table in file format.

Okay, So another task was to write the order, underscore items parquet file to the silver layer.

So let me run this cell.

It reads the Bruns file and specifies the schema and it signs it to the dataframe order underscore items.

Now let’s display the dataframe.

Now, as you can see, here is our dataframe.

All we need to do here is drop the line on the school item, on the school ID column, and everything

else is in order.

So let me execute this L which uses the drop method to drop the column and assigns it back to the data

frame and then display the records.

Now, as you can see, this is in the required format.

So again, I’ll use the right top park method.

Remember that the mode should be overwrite just in case you want to run the cell multiple times.

So at present the file doesn’t exist.

So let me run this cell.

And now, as you can see, it’s been completed.

So when I refresh this, we have now got two folders.

So we’ve got the order items here as well.

We have a product file now as well, which I’m copying from the bronze layer.

I’ve got the schema and I’ve assigned it to the variable products.

So now let’s display that dataframe.

And our requirement is just to write this as is.

We’ve specified the data types here according to the requirements, and now let’s write that to the

silver layer.

And that’s done.

So now if I refer to this, you can see we’ve got products as well.

Similarly, let’s do the same for customers.

So I’ll run the cell, which reads into a variable called customers.

And again, I’ve assigned the schema as per the requirements.

Let’s display that dataframe.

And now let’s run the cell, which writes it as a parquet file in the silver layer.

I don’t need this cell.

It’s just duplicate.

So now, as you can see.

We have got the four tables that were required.

Okay, great.

So I hope you found that walkthrough useful.

And please remember to review this documentation by accessing the course Resources folder.

Azure Databricks with PySpark

Curriculum

Assignment 1

Leave a Reply Cancel reply

Modal title