This article introduces Databricks from a Gurobi perspective. It explains the main concepts and the underlying architecture at a high level, so that you understand where your (Gurobi) code runs when using Databricks. This understanding helps when considering Gurobi architecture and licensing options for Databricks environments, and when getting started with Gurobi on Databricks.
Databricks is a general-purpose front end to cloud resources (AWS, Azure, GCP) that lets teams collaborate on shared data. It was founded by the people behind Apache Spark, an engine for running code on clusters of any size in order to process large amounts of data efficiently. Databricks brings this technology to the main cloud platforms and provides a layer of abstraction over the compute resources offered on those platforms.
Ultimately, Databricks is used for writing and executing code. In that sense, it doesn't matter whether the code performs machine learning, optimization, or any other task. There are different ways to run code on Databricks:
- An easy way to get started with Databricks is using notebooks. These are essentially Jupyter notebooks which can be run cell-by-cell.
- Notebooks, as well as other work like plain Python scripts, can be turned into jobs. A job consists of one or more tasks (e.g. "run this notebook first, then this Python script"). Jobs can be triggered manually or scheduled to run periodically.
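As an illustration, the "notebook first, then Python script" example could be written down as a Python dictionary shaped like a Databricks Jobs API job definition. This is a sketch only: the job name, paths, and schedule are hypothetical placeholders, and the field names should be checked against the Jobs API reference for your workspace before use.

```python
import json


def build_job_spec(notebook_path, script_path, cron=None):
    """Build a job definition with two tasks that run one after the other."""
    spec = {
        "name": "gurobi-example-job",  # hypothetical job name
        "tasks": [
            {
                # First task: run a notebook.
                "task_key": "prepare_data",
                "notebook_task": {"notebook_path": notebook_path},
            },
            {
                # Second task: run a Python script once the notebook has finished.
                "task_key": "solve_model",
                "depends_on": [{"task_key": "prepare_data"}],
                "spark_python_task": {"python_file": script_path},
            },
        ],
    }
    if cron is not None:
        # Optional periodic trigger; without it the job is started manually.
        spec["schedule"] = {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
        }
    return spec


# Hypothetical paths and schedule, for illustration only.
job_spec = build_job_spec(
    "/Workspace/Users/someone@example.com/prepare_data",
    "dbfs:/scripts/solve_model.py",
    cron="0 0 6 * * ?",
)
print(json.dumps(job_spec, indent=2))
```

The `depends_on` entry is what expresses the ordering between the two tasks. Compute settings for the tasks are left out here to keep the sketch short.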
Although abstraction layers hide the complexities of the underlying hardware, it's essential to understand what's going on behind the scenes when your code gets executed. Databricks uses the term compute to refer to machines that can execute code. You can define different types of compute. For each of the options below, you can choose from various machine types (memory-, storage-, or compute-optimized, with different numbers of cores, amounts of RAM, etc.).
- All-purpose clusters can run jobs and notebooks. They are created when needed and terminate when idle.
- Job clusters can only run jobs. They relate to a specific job. Machines are destroyed once the job is finished.
- Pools are collections of machines that are pre-started before there is actually work to perform. When a cluster needs resources, it can claim them from a pool instead of managing its own compute resources.
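To make these options more concrete, the following sketch builds a cluster definition as a Python dictionary shaped like a Databricks Clusters API request. The runtime version, machine type, and pool ID are hypothetical placeholders, and the field names are given under the assumption that they match the Clusters API reference for your workspace.

```python
def build_cluster_spec(spark_version, num_workers, node_type_id=None,
                       pool_id=None, idle_minutes=30):
    """Build an all-purpose cluster definition.

    Machines come either from a pool (pool_id) or are provisioned
    directly with a chosen machine type (node_type_id).
    """
    if (node_type_id is None) == (pool_id is None):
        raise ValueError("Specify exactly one of node_type_id or pool_id")

    spec = {
        "spark_version": spark_version,
        "num_workers": num_workers,
        # Terminate the cluster after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }
    if pool_id is not None:
        # Claim pre-started machines from a pool.
        spec["instance_pool_id"] = pool_id
    else:
        # Provision machines of the given type directly.
        spec["node_type_id"] = node_type_id
    return spec


# Placeholder values for illustration only.
direct_cluster = build_cluster_spec("13.3.x-scala2.12", 2, node_type_id="m5.xlarge")
pooled_cluster = build_cluster_spec("13.3.x-scala2.12", 2, pool_id="my-pool-id")
print(direct_cluster)
print(pooled_cluster)
```

A job cluster would use a similar definition attached to a specific job rather than created on its own.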
Ultimately, each compute type triggers the creation of virtual machines, which are fully managed by Databricks (you would never access the individual machines yourself). There is also the option to use Docker and specify base images for your nodes. In that scenario, your code runs inside containers.
Workers and distributed computing
Databricks is built on top of Apache Spark. A key capability of Apache Spark is executing code in a distributed fashion. For that reason, we distinguish between driver and worker nodes in a cluster. The driver node is the entry point where your code runs. Libraries that build on top of Apache Spark offload part of their work to the worker nodes using Spark.
When defining a cluster, you choose between two options:
- Single-node clusters consist only of a driver node and no worker nodes. The driver runs Apache Spark but always performs calculations locally.
- Multi-node clusters consist of a driver node and at least one worker node.
Note that multi-node clusters are only useful when you intend to distribute calculations to worker nodes using Apache Spark - either explicitly, or by using libraries that manage this for you. Have a look at the next article on architecture and licensing to understand how this relates to Gurobi.
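The difference between running on the driver and explicitly distributing work can be sketched as follows. The function names are hypothetical, and `run_on_workers` assumes a `spark` session (as predefined in Databricks notebooks) on a cluster where `spark.sparkContext` is available. In a real setup, `describe_execution` would be replaced by the actual per-task work.

```python
import socket


def describe_execution(task_id):
    """Return the task id together with the hostname of the machine running this call."""
    return task_id, socket.gethostname()


def run_on_driver(task_ids):
    # Plain Python: every call executes on the driver node,
    # regardless of how many worker nodes the cluster has.
    return [describe_execution(t) for t in task_ids]


def run_on_workers(spark, task_ids):
    # Explicit distribution: Spark ships describe_execution to its executors.
    # On a multi-node cluster these run on the worker nodes; on a
    # single-node cluster they run on the driver machine itself.
    rdd = spark.sparkContext.parallelize(task_ids, numSlices=max(1, len(task_ids)))
    return rdd.map(describe_execution).collect()


# Without Spark, all tasks report the same (driver) hostname.
local_results = run_on_driver([0, 1, 2])
print(local_results)
```

In a notebook attached to a multi-node cluster, calling `run_on_workers(spark, [0, 1, 2])` would be expected to report worker hostnames, while `run_on_driver` always reports the driver.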