Transient errors in compute server jobs
We are using Gurobi Compute Server with Cluster Manager, and the client is a Python application.
Every now and then, some jobs randomly fail with one of the following two errors:
1. ERROR_CSWORKER = 10030. Gurobi state is failed
2. ERROR_NETWORK = 10022
We typically have the client submit the job again when such an error occurs and the retry finishes without any problem, which is why we think these are transient errors.
What is the best way to handle these errors so that we can avoid submitting an entirely new job (read in data, generate model, solve)? The error occurs in the solve step.
We tried simply running the solve_model step again, and that does not seem to work. Do we need to create a new Compute Server job, since these errors are related to the CS worker/environment and not to the model solve itself?
read_in_data
generate_model
try:
    solve_model
except GurobiException as error:
    solve_model
I see this in the Compute Server documentation:
"The one scenario you may need to guard against is the situation where you lose the connection to the server while the portion of your program that builds and solves an optimization model is running. Gurobi Compute Server will automatically route queued jobs to another server, but jobs that are running when the server goes down are interrupted (the client will receive a NETWORK error). If you want your program to be able to survive such failures, you will need to architect it in such a way that it will rebuild and resolve the optimization model in response to a NETWORK error. The exact steps for doing so are application dependent, but they generally involve encapsulating the code between the initial Gurobi environment creation and the last Gurobi call into a function that can be reinvoked in case of an error."
So it looks like we will actually have to use the exception handling below?
read_in_data
try:
    generate_model
    solve_model
except GurobiException as error:
    generate_model
    solve_model
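In Python, the rebuild-and-resolve pattern from the documentation excerpt might look roughly like the sketch below. It is only an illustration under stated assumptions: `SolverError` is a hypothetical stand-in for the Gurobi exception class so the snippet runs without a license, `run_with_retry` and `build_and_solve` are made-up names, and the two error codes are the ones listed at the top of this post. The callable passed in is assumed to create the environment, generate the model, and solve it, so each retry starts a fresh job.

```python
import time

# Error codes quoted earlier in this post.
ERROR_NETWORK = 10022
ERROR_CSWORKER = 10030
TRANSIENT_CODES = {ERROR_NETWORK, ERROR_CSWORKER}


class SolverError(Exception):
    """Hypothetical stand-in for the Gurobi exception; carries an errno."""

    def __init__(self, errno, message=""):
        super().__init__(message)
        self.errno = errno


def run_with_retry(build_and_solve, max_attempts=3, delay_seconds=0.0,
                   error_type=SolverError):
    """Call build_and_solve() until it succeeds or attempts run out.

    build_and_solve is assumed to create a fresh environment, generate
    the model, solve it, and return the result.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return build_and_solve()
        except error_type as error:
            code = getattr(error, "errno", None)
            # Re-raise anything that is not a known transient code,
            # and give up once the last attempt has failed.
            if code not in TRANSIENT_CODES or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

With gurobipy installed, one would presumably pass `error_type=gurobipy.GurobiError`, which exposes an `errno` attribute, and keep read_in_data outside the retried function so the input data is loaded only once.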
Is there any way to use the model object generated in the first try to start a second job session, so that we don't have to spend all that time again adding variables, constraints, etc.? Maybe something like:
read_in_data
try:
    generate_model
    solve_model
except GurobiException as error:
    new_job_session_from_previous_model
    solve_model
Hello Mohan,
Thank you for contacting us! Since you have a commercial account with us, you qualify for one-on-one support. We converted your post into a ticket in our Help Center and will be in contact regarding your question shortly.
Best regards,
Dan