Running a compute node on an academic license
Hi,
Since I'm running in an LSF environment, worker nodes can be suspended or killed by the HPC without notice. In a self-managed cluster where all nodes are workers, if the node that is solving the problem gets killed by the HPC, the entire solve is aborted. I was hoping that using a compute node would help alleviate this issue.
However, when starting a compute node using grb_rs, the node log reports: Token Server (*******): Compute server not enabled.
The license file currently contains these lines:
TYPE=TOKEN
VERSION=9
TOKENSERVER=*******
HOSTNAME=*******
HOSTID=*******
SOCKETS=2
EXPIRATION=*******
USELIMIT=4096
DISTRIBUTED=100
KEY=*******
CKEY=*******
What am I doing wrong?
Thank you
Hi Shlomo,
To set up a compute cluster with an academic floating license, you need to start every grb_rs process as a worker. This can be done with
grb_rs --worker
or by setting the WORKER parameter in the grb_rs.cnf configuration file accordingly. Then, you can let additional workers join this first server by specifying the "--join=<servername>" option when starting them. This is also explained here. All workers need to have the same Token defined in the configuration file.
The worker nodes do not need to access the token server; only the client needs to acquire a token. To send jobs to such a cluster of workers, you also need to specify the WorkerPool parameter when submitting a job via gurobi_cl or the API.
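As a rough illustration of the client side, here is a minimal Python sketch that only assembles a gurobi_cl command line with the WorkerPool parameter; it does not call Gurobi itself. The host name server1, the port 61000, the file model.mps, and the helper build_gurobi_cl_command are hypothetical placeholders. DistributedMIPJobs is included on the assumption that the goal is a distributed MIP solve; adjust or drop it for your setup.

```python
import shlex


def build_gurobi_cl_command(model_file, worker_pool, distributed_jobs):
    """Assemble a gurobi_cl argument list that targets a pool of grb_rs workers.

    worker_pool is the address of one worker in the cluster (hypothetical
    here), passed through the WorkerPool parameter mentioned above.
    """
    if distributed_jobs < 1:
        raise ValueError("distributed_jobs must be at least 1")
    return [
        "gurobi_cl",
        f"WorkerPool={worker_pool}",               # which worker cluster to use
        f"DistributedMIPJobs={distributed_jobs}",  # assumption: distributed MIP
        model_file,
    ]


# Hypothetical values; replace with your own worker address and model file.
cmd = build_gurobi_cl_command("model.mps", "server1:61000", 2)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run on a machine where gurobi_cl is installed and the client license can acquire a token. If the workers are password protected, the matching password parameter would need to be added as well.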
Best regards,
Matthias
Thanks Matthias,
I've already done that. The issue is that if, for some reason, LSF decides to kill or suspend the worker job, the solve immediately fails.
I was hoping that using a compute node would help alleviate this issue. From your response I understand that either an academic license cannot be used to start a compute node, or a compute node would not solve this issue. Please correct me if I'm wrong, or let me know if there is another academic license that does allow a compute node and whether using one would indeed help.
Thank you,
Shlomo
Hi Shlomo,
I am not familiar with LSF environments. With an academic license, you cannot start a full Compute Server environment, but you can still use distributed computing by connecting multiple grb_rs workers, as I explained above.
All this is worthless, though, if your cluster management is randomly killing jobs. You should talk to your system administrator to figure out the reason for this. I don't really know how else we can help you here.
Cheers,
Matthias
Thanks, I'll take this up with our sys admin.