We’ve started a new section of the course, and the focus now is on orchestrating your workflows and tasks with Databricks Jobs.
Databricks Jobs enables you to schedule and automate the execution of data processing tasks and workflows on the Databricks platform.
It provides an easy way to run your code or notebooks in a controlled and repeatable manner, making it easier to manage and schedule your data workloads.
To create a job, you go to Workflows, and then under Jobs, you can simply click on Create Job.
You can create individual tasks by giving your task a name and then specifying the type of that task.
I’m currently in the Databricks Premium service that I created in the last section, but if you’re using the Standard service, then you won’t be able to see some of these options.
For example, at the time of recording you might not see the SQL or Spark Submit task types, but for the purposes of this demo, I’ll be using Databricks notebooks only.
Okay.
So let’s create some notebooks which we can run via Databricks Jobs.
So in my workspace, I’ll create a new folder called 16. Databricks jobs.
Inside it, I’ll create a notebook called Notebook 01 containing a simple print statement, and then another notebook called Notebook 02.
And that will also contain a print statement.
Great.
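Each notebook just needs a single cell with a print statement; the transcript doesn’t show the exact text, so the messages below are only an example of what those cells might contain.

    # Notebook 01 - single cell (message text is just an example)
    print("Hello from Notebook 01")

    # Notebook 02 - single cell (message text is just an example)
    print("Hello from Notebook 02")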
So now we can use Databricks Jobs to run both of these notebooks.
So let’s go to Workflows and click on Create Job.
At the top here, you can add a name for your job; I’ll call it Demo Job 01.
Currently we have an unnamed task.
I’ll use this task to execute the notebook we created called Notebook 01.
So I’ll call this task notebook_01, and because you can’t have spaces in task names, I’m using underscores instead.
The type is Notebook, because we want to run a notebook.
The source is Workspace, but this can also be a Git repository.
So now let’s select the notebook.
It’s in the section 16 folder, and here’s Notebook 01, so click on Confirm.
Okay, so we now need to select the cluster.
This is the cluster that will execute the task.
We can select an all-purpose cluster or a job cluster.
For Databricks Jobs, it’s advised to use a job cluster.
All-purpose clusters are used for general-purpose computing tasks such as data exploration, data engineering and analytics.
They provide flexible and scalable infrastructure to handle various workloads.
Job clusters, on the other hand, are clusters created specifically for running Databricks jobs.
These clusters are dedicated to executing scheduled jobs and are separate from all-purpose clusters.
They are automatically created and terminated for each job run, ensuring isolation and resource optimization.
So, for Databricks Jobs, you should use job clusters.
They’re specifically designed for executing jobs and won’t interfere with existing workloads.
So let’s create a new job cluster, because the default job cluster is too big; we don’t need 36 cores.
I’ll select a single-node cluster, and I’ll call this cluster Job Cluster 01.
And as you can see, this is 0.75 DBU per hour.
Great.
So click on Confirm.
And here’s our cluster.
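If you prefer to define the same kind of cluster in code, here is a minimal sketch using the Databricks SDK for Python; the runtime version, node type and key name are assumptions for illustration, and the spark_conf and custom_tags entries are the usual single-node convention rather than something shown in the demo.

    # Sketch only: a single-node job cluster definition (databricks-sdk for Python).
    # All concrete values below are illustrative assumptions, not the exact demo settings.
    from databricks.sdk.service import compute, jobs

    single_node_job_cluster = jobs.JobCluster(
        job_cluster_key="Job_Cluster_01",
        new_cluster=compute.ClusterSpec(
            spark_version="13.3.x-scala2.12",         # example Databricks runtime
            node_type_id="Standard_DS3_v2",           # example Azure VM size
            num_workers=0,                            # single node: the driver does the work
            spark_conf={"spark.master": "local[*]"},  # conventional single-node setting
            custom_tags={"ResourceClass": "SingleNode"},
        ),
    )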
We have other options too, such as adding dependent libraries or passing parameters into the task.
You can even set up email notifications.
You can specify how many times you want to retry if a task fails and the maximum timeout in seconds.
By default, this is unlimited.
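These options map onto fields in the job definition; the following is a hedged sketch of a task with retries, a timeout and a failure notification, where the field and class names follow my understanding of the Jobs API and the path, values and email address are placeholders.

    # Sketch only: per-task retries, timeout and email notification (databricks-sdk for Python).
    from databricks.sdk.service import jobs

    task_with_options = jobs.Task(
        task_key="notebook_01",
        notebook_task=jobs.NotebookTask(
            notebook_path="/Workspace/16. Databricks jobs/Notebook 01"  # illustrative path
        ),
        job_cluster_key="Job_Cluster_01",
        max_retries=2,          # retry a failed task up to two times
        timeout_seconds=3600,   # fail the task after one hour; 0 means unlimited (the default)
        email_notifications=jobs.TaskEmailNotifications(on_failure=["you@example.com"]),
    )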
Great.
So if we click here, we can create the first task.
So the first task has been created.
Before I add an additional task for Notebook 02, let’s go over some of the additional details on the side here.
So each job has its own job ID, as you can see here, and this is unique for each job.
You can also configure a Git repo, and you can add a schedule.
So let’s say you wanted to run this job on a daily basis; you can set that up like so.
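Behind the scenes the schedule is stored as a quartz cron expression; here is a hedged sketch of setting a daily schedule programmatically, where the job ID, time of day and timezone are placeholders.

    # Sketch only: add a daily schedule to an existing job (databricks-sdk for Python).
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()  # assumes your Databricks authentication is already configured
    w.jobs.update(
        job_id=123456789,  # placeholder job ID
        new_settings=jobs.JobSettings(
            schedule=jobs.CronSchedule(
                quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
                timezone_id="UTC",
            )
        ),
    )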
Each task is assigned a compute cluster.
You can use the same compute cluster for every task or a separate compute cluster for each task.
You can also manage notifications and permissions.
So currently my user account is the owner, but you can also edit that here as well.
So if you’re an owner, you get full permission.
You can also assign a new owner, but you can only have one owner at a time, and it has to be an individual; it can’t be a group.
The other permission levels are Can Manage Run, Can View, and Can Manage.
And as you can see, the detail of those permissions is listed on the side.
While only an individual can be an owner, you can assign a group to any of the other permissions.
Finally, we have maximum concurrent runs.
This determines how many runs of the same job can execute in parallel.
And this is set to one by default.
Great.
So let’s add a new task.
Click here on Add a new Task.
And that’s a notebook task.
So I’ll call this task notebook_02, select Workspace as the source, and then select the notebook, which is this Notebook 02 here.
Great.
So I’ll keep the cluster as Job Cluster 01, and notice here we have a Depends on field.
So this task depends on notebook_01, which means it will run after the notebook_01 task finishes.
So if I remove this, as you can see, they would both run in parallel, but let me just select notebook_01 again.
Great.
So, let’s create this task.
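For reference, the whole job we’ve just built in the UI, including the single-node job cluster, the two notebook tasks and the dependency, could be expressed roughly like this with the Databricks SDK for Python; treat it as a sketch, since the notebook paths, cluster settings and names are assumptions rather than the exact demo values.

    # Sketch only: a two-task job with a dependency, equivalent in spirit to the UI setup above.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import compute, jobs

    w = WorkspaceClient()  # assumes authentication is already configured

    job = w.jobs.create(
        name="Demo Job 01",
        max_concurrent_runs=1,  # the default: only one run of this job at a time
        job_clusters=[jobs.JobCluster(
            job_cluster_key="Job_Cluster_01",
            new_cluster=compute.ClusterSpec(
                spark_version="13.3.x-scala2.12",         # illustrative runtime
                node_type_id="Standard_DS3_v2",           # illustrative node type
                num_workers=0,                            # single-node cluster
                spark_conf={"spark.master": "local[*]"},
                custom_tags={"ResourceClass": "SingleNode"},
            ),
        )],
        tasks=[
            jobs.Task(
                task_key="notebook_01",
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Workspace/16. Databricks jobs/Notebook 01"  # illustrative path
                ),
                job_cluster_key="Job_Cluster_01",
            ),
            jobs.Task(
                task_key="notebook_02",
                notebook_task=jobs.NotebookTask(
                    notebook_path="/Workspace/16. Databricks jobs/Notebook 02"  # illustrative path
                ),
                job_cluster_key="Job_Cluster_01",
                depends_on=[jobs.TaskDependency(task_key="notebook_01")],  # run after notebook_01
            ),
        ],
    )
    print("job id:", job.job_id)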
And now we can run this job which executes both of the notebooks.
To run the job, you can click here on Run Now.
So the run has been triggered.
And here is the run ID.
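You can trigger the same run programmatically too; the sketch below is an assumption based on the databricks-sdk, where the job ID is a placeholder and .result() simply waits for the run to finish.

    # Sketch only: trigger a job run and read its run ID (databricks-sdk for Python).
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    run = w.jobs.run_now(job_id=123456789).result()  # placeholder job ID; blocks until the run completes
    print("run id:", run.run_id)
    print("result:", run.state.result_state)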
You can view the run by clicking here or simply clicking here on runs.
So if you scroll up here, you can see the job ID.
And if you look here, you get additional details such as the start time, the duration, and how it was launched.
So this was manually launched and you’ve got the run ID here.
This is unique for each run of the job.
This diagram here visualizes the execution time and each individual task.
So this square will go green on successful completion, or red on a failure.
You can hover here for additional detail.
So you’ve got the job ID, the job run ID, and so on.
For each task, you can also see additional details specific to that task.
So notice that a task has got a task run; each task will have its own task run ID.
So as you can see, this task has got a task run ID of 61332, and this one has got 61288.
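Those task run IDs are also available programmatically; here is a hedged sketch of listing them for a given job run, where the run ID and field names are assumptions based on the databricks-sdk.

    # Sketch only: list the task run ID and result of each task in a job run.
    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()
    run = w.jobs.get_run(run_id=61332)  # placeholder run ID
    for task in run.tasks:
        print(task.task_key, task.run_id, task.state.result_state)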
If you click here on the start time for that run ID, then you can see each task.
This will turn green when it’s successfully completed or it will turn red if it’s failed.
And then you’ve got the lineage here, so you can go back to the job level, and you can go back to Jobs here as well.
And here is the job which is running.
So if I click back into it, as you can see, here’s the additional detail again.
So the job has completed.
We can now click into each task by clicking on the square.
And as you can see, this is the notebook, Notebook 01, and here’s the cell; it’s just the simple print statement.
And then you can see this view here as well.
So again, they’re both green because they were successful.
And if I click here, I can access Notebook 02.
And again, it’s a simple print statement.
So if I go back to Demo Job 01, you can see the additional details this time when you hover over it, such as the cluster information as well.
So you can see the details, the Spark UI, the logs and the metrics of the cluster for each task.
So before I wrap up this lecture, let me run the job again, but I’ll run it twice.
So I’ll click Run Now, and then I’ll click Run Now again.
So as you can see, we’ve got two runs of the same job triggered concurrently, but because we’ve got concurrency set to one, let’s see what happens.
And as you can see, this turns red because it’s been skipped, and that is because the limit of one maximum concurrent run has been reached.
So I just wanted to show you what happens when you try to run the same job concurrently.
So you’d have to change this to two.
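In code, that would just mean raising the concurrency limit on the job; a hedged sketch, again with a placeholder job ID.

    # Sketch only: allow two runs of the same job to execute in parallel.
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service import jobs

    w = WorkspaceClient()
    w.jobs.update(
        job_id=123456789,  # placeholder job ID
        new_settings=jobs.JobSettings(max_concurrent_runs=2),
    )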
Great.
So that was a lecture on how you can use Databricks Jobs to run your workflows.
In the folder here, I’ll add a new notebook called Links and Resources.
This will contain additional links and resources that you can review in your own time, so be sure to check that out in the course resources.