Prerequisites
To set up a cluster of distributed workers as an academic user, you must have access to a license with distributed optimization enabled (in the User Portal, the LIMITS tab for your license should show "Distributed Limit: 100"). If your license does not have this property, please submit a request.
Alternatively, your institution might have an academic floating site license managed by a Gurobi token server (please contact your institution's IT administrator for additional information).
Creating the cluster
In the following, we create two workers on two separate machines and connect them to form a cluster. A client machine (optionally requesting a license token from the Gurobi token server when using a site license) can then submit a distributed optimization job to the two worker machines.
We strongly recommend using one machine per worker when forming a cluster of distributed workers. If you wish to start multiple workers on the same machine for testing purposes, each instance of Gurobi Remote Services (grb_rs) must be started on a different port and with a different data directory. The command "grb_rs init" helps with this by copying the default configuration and data directory into the current working directory. Please refer to the Connecting Nodes documentation for more details.
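As an illustration, a testing-only setup on a single machine might look like the following sketch. Directory names and ports are placeholders; it relies only on the flags shown in this article, with "grb_rs init" copying the default configuration and data directory into each working directory:

```shell
# Testing only: two workers on one machine, each started from its own
# directory (so each gets its own data directory) and on its own port.
mkdir worker1 worker2

cd worker1
grb_rs init                                           # copy config + data dir here
grb_rs --worker --port=12345 &                        # first worker

cd ../worker2
grb_rs init
grb_rs --worker --port=12346 --join=localhost:12345 & # second worker joins the first
```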
1. Starting the first worker
In this example, we start the Remote Services workers as processes. To instead start the workers as services, please refer to the Installing a Cluster Node documentation.
We start the first worker on machine1 from the command line via
grb_rs --worker --port=12345
This starts a grb_rs worker process on port 12345 (you can choose any port number allowed by your IT administrator; note that ports below 1024 usually require root permissions). Instead of passing "--worker --port=12345" on the command line, you can alternatively add the following lines to the grb_rs.cnf configuration file, located in the bin subdirectory of the Remote Services installation:
WORKER=TRUE
PORT=12345
Executing the grb_rs command above should produce output similar to the following:
info : Gurobi Remote Services starting...
info : Platform is linux
info : Version is 11.0.0 (build v11.0.0rc2)
info : Worker mode is limited to 1 job, no queue
info : Variable GRB_LICENSE_FILE is not set
info : Node address is machine1:12345
info : Node FQN is machine1.domain.com
info : Node has 4 cores
info : Using data directory <path_to_gurobi>/gurobi_server1100/linux64/bin/data
info : Node ID is bd9e74d1-ce8d-472f-8069-06d152940033
info : Available runtimes: [10.0.0 10.0.1 10.0.2 10.0.3 11.0.0 9.5.0 9.5.1 9.5.2]
info : Accepting worker registration on port 42443...
info : Public root is <path_to_gurobi>/gurobi_server1100/linux64/resources/grb_rs/public
info : Starting API server (HTTP) on port 12345...
2. Starting additional workers
We continue the example by starting a second worker on machine2 and connecting it to the first worker on machine1. Note that the two worker machines and the client must be able to reach each other via their hostnames. If this is not possible, use IP addresses instead of hostnames.
On machine2, we execute
grb_rs --worker --port=12346 --join=machine1:12345
This starts a grb_rs worker process on port 12346 and joins the already running grb_rs process on machine1, creating a cluster. Alternatively, we could add the worker, port, and join settings to the grb_rs.cnf configuration file as follows:
WORKER=TRUE
PORT=12346
JOIN=machine1:12345
Starting Remote Services on machine2 in this manner produces output like the following:
info : Gurobi Remote Services starting...
info : Platform is linux
info : Version is 11.0.0 (build v11.0.0rc2)
info : Worker mode is limited to 1 job, no queue
info : Variable GRB_LICENSE_FILE is not set
info : Node address is machine2:12346
info : Node FQN is machine2.domain.com
info : Node has 4 cores
info : Using data directory <path_to_gurobi>/gurobi_server1100/linux64/bin/data
info : Node ID is 7a31c8f2-9b44-4c5e-a1d6-3e8f02574b19
info : Available runtimes: [10.0.0 10.0.1 10.0.2 10.0.3 11.0.0 9.5.0 9.5.1 9.5.2]
info : Accepting worker registration on port 42443...
info : Public root is <path_to_gurobi>/gurobi_server1100/linux64/resources/grb_rs/public
info : Starting API server (HTTP) on port 12346...
info : Node machine1:12345, added to the cluster
On machine1, we can see an additional log line recording the new addition to the cluster:
info : Node machine2:12346, added to the cluster
With this, we have successfully created a cluster of two worker nodes.
Checking the status of the cluster
To check the current status of the cluster, use the grbcluster command to log in to the first worker machine:
grbcluster login --server=machine1:12345
When prompted for a password, press Enter to use the default password.
Once you have logged into the cluster, the command "grbcluster nodes" generates a table with the status of the cluster nodes:
ID ADDRESS STATUS TYPE LICENSE PROCESSING #Q #R JL IDLE %MEM %CPU
08e7e51b machine1:12345 ALIVE WORKER N/A ACCEPTING 0 0 1 <1m 1.06 0.00
bb5942d0 machine2:12346 ALIVE WORKER N/A ACCEPTING 0 0 1 1m 1.22 0.24
Submitting a job to the cluster
The license file must be available on the client machine, e.g., in one of the default locations. If you are using an academic floating site license, create a token server client license file containing the address and port of the machine running the Gurobi token server.
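For illustration, such a token server client license file (gurobi.lic) contains little more than the token server's hostname and, if it differs from the default, its port. The hostname below is a placeholder:

```
TOKENSERVER=tokenserver.mydomain.com
PORT=41954
```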
To submit a job to the cluster, set the DistributedMIPJobs parameter together with the WorkerPool parameter. For example, we can use the Gurobi command-line tool to submit a distributed optimization job to the cluster with the following command:
gurobi_cl DistributedMIPJobs=2 WorkerPool=machine1.domain.com:12345 glass4.mps
The resulting console output should state that a distributed job has been submitted and show which machines are being used:
Starting distributed worker jobs...
Started distributed worker on machine1.domain.com:12345
Started distributed worker on machine2.domain.com:12346
Distributed MIP job count: 2
Please note that it is sufficient to provide a single worker address in the WorkerPool parameter; it can be any machine in the distributed worker cluster.
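Equivalently, the same distributed job can be submitted from a program instead of gurobi_cl. The following gurobipy sketch mirrors the command above; it assumes gurobipy is installed, a valid license is available, glass4.mps is in the current directory, and the machine name is a placeholder for a member of your cluster:

```python
import gurobipy as gp

# Read the model, then point it at the distributed worker cluster.
m = gp.read("glass4.mps")
m.Params.WorkerPool = "machine1.domain.com:12345"  # any one cluster member
m.Params.DistributedMIPJobs = 2                    # number of distributed workers
m.optimize()
m.dispose()
```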
Common errors and warnings
1. The error
error : Error creating storage service, was grb_rs already started?: Error opening data store: timeout
occurs when two grb_rs processes try to access the same data directory. This can happen when a previous grb_rs process was not stopped properly; in that case, kill all grb_rs processes on the machine and try again. If you intend to run two grb_rs processes on the same machine or file system, please refer to the Connecting Nodes documentation for more details.
2. The error
Unable to start worker - Couldn't connect to server (code 7, command POST http://machine1.domain.com/api/v1/cluster/jobs)
occurs when the address given in the WorkerPool parameter is wrong or incomplete (e.g., machine1.domain.com without the port instead of machine1.domain.com:12345).
3. The warning
Unable to start worker - Job was rejected because there is no capacity in the cluster
Distributed MIP job count: 2
Job count limited by machine availability
occurs when the value of the DistributedMIPJobs parameter exceeds the number of available worker nodes in the cluster.