The error "Job state is 'FAILED'" occurs when a job submitted to Gurobi's Remote Services (Compute Server, Cluster Manager, or Instant Cloud) encounters a critical issue that prevents it from completing successfully.
In Gurobi's Remote Services environment, jobs can have several states:
- RUNNING: Job is currently being processed
- COMPLETED: Job finished successfully
- ABORTED: Job was manually terminated
- FAILED: Job encountered an error and could not complete
Common Causes and Solutions
Resource issues
Two common reasons for a Compute Server worker crashing during the solve are:
- the worker machine runs out of memory
- the CPU on the worker machine is overloaded
This type of crash may be intermittent, as it also depends on the other loads running on the server.
Memory exhaustion
- Worker machine runs out of memory during optimization
- Multiple jobs competing for limited memory resources
- Large models requiring more memory than available
CPU overload
- Worker machine CPU is overloaded
- Too many concurrent jobs on the same worker
- Insufficient processing capacity for the job requirements
Resolving resource issues
- Monitor system resources on compute nodes to identify bottlenecks
- Configure memory limits using Remote Services parameters:
  - Set MEMLIMIT in grb_rs.cnf to limit total memory per job (in GB)
  - Set SOFTMEMLIMIT for graceful termination when the memory limit is approached
- Adjust job concurrency:
  - Review the JOBLIMIT parameter in grb_rs.cnf to control simultaneous jobs
  - Consider HARDJOBLIMIT for strict job count enforcement
- Optimize memory usage in your models:
  - Reduce the Threads parameter to decrease memory per job
  - Set NodefileStart=0.5 to write nodes to disk when memory usage exceeds 0.5 GB
  - Use NodefileDir to specify the disk location for node files
  - See How do I avoid an out-of-memory condition? for further suggestions.
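The Threads, NodefileStart, and NodefileDir settings listed above can be applied from client code. The following is a minimal sketch, not a recommended configuration: the helper names `memory_saving_params` and `apply_params` are hypothetical, and the default values (4 threads, the current directory for node files) are placeholders to adapt to your worker machines. It only assumes the model object exposes `setParam(name, value)`, as a gurobipy `Model` does.

```python
def memory_saving_params(threads=4, nodefile_start_gb=0.5, nodefile_dir="."):
    """Build a dict of Gurobi parameters that tend to lower memory use per job.

    The defaults are illustrative assumptions, not tuned values.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if nodefile_start_gb < 0:
        raise ValueError("nodefile_start_gb must be non-negative")
    return {
        "Threads": threads,                  # fewer threads, less memory per job
        "NodefileStart": nodefile_start_gb,  # GB of node memory before spilling to disk
        "NodefileDir": nodefile_dir,         # where node files are written
    }


def apply_params(model, params):
    """Apply each setting through the model's setParam(name, value) method."""
    for name, value in params.items():
        model.setParam(name, value)
```

With a gurobipy model in hand, a call such as `apply_params(model, memory_saving_params(threads=2))` before `model.optimize()` would apply the settings to that job.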
Network issues
Connection timeouts
Large model updates can cause a connection timeout because sending large models to a Compute Server requires a lot of memory on both the client and the server and a high-bandwidth network connection.
- Large model updates causing network timeouts
- Insufficient bandwidth between client and server
- Network instability affecting job communication
Resolving network issues
- Optimize model updates:
  - Call model.update() more frequently to send smaller incremental updates
  - Balance between update frequency and network overhead
- Verify network infrastructure:
  - Test network connectivity between client and server
  - Check firewall settings and port accessibility
  - Validate server addresses and ports in configuration
  - Test with simpler jobs first to isolate network vs. model complexity issues
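The advice to call model.update() more often can be sketched as a batching loop. This is only an illustration under stated assumptions: `add_vars_in_batches` is a hypothetical helper, the batch size of 10000 is an arbitrary starting point, and the model is assumed to expose `addVar(name=...)` and `update()` as a gurobipy `Model` does. The right batch size depends on your network and model, so treat it as something to tune.

```python
def add_vars_in_batches(model, names, batch_size=10000):
    """Add one variable per name, calling model.update() after each full batch.

    Returns the number of update() calls made, which is handy when tuning
    the trade-off between update size and per-update network overhead.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updates = 0
    pending = 0
    for name in names:
        model.addVar(name=name)
        pending += 1
        if pending == batch_size:
            model.update()  # push this batch instead of one large update
            updates += 1
            pending = 0
    if pending:
        model.update()      # flush the final partial batch
        updates += 1
    return updates
```

A smaller `batch_size` means more round trips but smaller payloads. A larger one does the opposite.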
Process for Diagnosing a FAILED job
The first things to check are the extended client and server logs from the run in which the error occurred.
Enable Detailed Logging for Compute Server workers
If you are using Compute Server or Cluster Manager, you can enable detailed logging by modifying the configuration file and restarting the service.
- Add VERBOSE=1 to your grb_rs.cnf file
- Restart the Remote Services: grb_rs restart
- Check logs at:
  - Windows: ..\win64\bin\service.log
  - Linux: $GUROBI_SERVER_HOME/bin/service.log
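If the configuration file is managed by a script, the VERBOSE=1 edit can be automated. The sketch below assumes grb_rs.cnf consists of KEY=VALUE lines with `#` marking comments. The helper `set_config_property` is hypothetical and works on the file's text only, so reading the file, writing it back, and restarting the service remain separate steps.

```python
def set_config_property(config_text, key, value):
    """Return config_text with key=value set, replacing an existing entry.

    Assumes one KEY=VALUE property per line; blank lines and lines
    starting with '#' are left untouched.
    """
    new_line = f"{key}={value}"
    lines = config_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

For example, `set_config_property(text, "VERBOSE", 1)` leaves other properties in place and either updates the existing VERBOSE line or appends a new one.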
Enable extended logging for Compute Server or Instant Cloud
To enable extended logging, you need to set the environment parameter CSClientLog to 3. This parameter has to be set before the environment is started (see code snippet below for an example):
from gurobipy import Env

# Create an empty environment so parameters can be set before it starts
env = Env(empty=True)
env.setParam('CSClientLog', 3)
...
env.start()
Monitoring Server Performance
When reviewing the extended logs, pay attention to server performance indicators that may appear in the output:
- CPU Usage: Look for patterns like (4 running, 95 %CPU, 95 %MEM)
- Connection Status: Monitor connection establishment and job submission messages
- Queue Information: Check for job queuing and capacity messages
- Error Messages: Watch for network timeouts, authentication failures, or capacity issues
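Scanning long logs for the usage pattern shown above can be scripted. The following sketch assumes usage entries look like the example `(4 running, 95 %CPU, 95 %MEM)`; the actual log format may differ between versions. The function name `find_overloaded_lines` and the 90% thresholds are hypothetical choices for illustration.

```python
import re

# Matches entries shaped like "(4 running, 95 %CPU, 95 %MEM)".
USAGE_PATTERN = re.compile(
    r"\((\d+)\s+running,\s*(\d+(?:\.\d+)?)\s*%CPU,\s*(\d+(?:\.\d+)?)\s*%MEM\)"
)


def find_overloaded_lines(log_lines, cpu_threshold=90.0, mem_threshold=90.0):
    """Return (running_jobs, cpu_pct, mem_pct) for entries at or above a threshold."""
    hits = []
    for line in log_lines:
        match = USAGE_PATTERN.search(line)
        if match is None:
            continue  # line carries no usage summary
        running = int(match.group(1))
        cpu = float(match.group(2))
        mem = float(match.group(3))
        if cpu >= cpu_threshold or mem >= mem_threshold:
            hits.append((running, cpu, mem))
    return hits
```

Passing the lines of a client or server log to this function gives a quick list of the moments when the worker was close to saturation, which can then be compared against the time of the FAILED job.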