The error "Job state is 'FAILED'" occurs when a job submitted to Gurobi's Remote Services (Compute Server, Cluster Manager, or Instant Cloud) encounters a critical issue that prevents it from completing successfully.

In Gurobi's Remote Services environment, jobs can have several states:

RUNNING: Job is currently being processed
COMPLETED: Job finished successfully
ABORTED: Job was manually terminated
FAILED: Job encountered an error and could not complete

Common Causes and Solutions

Resource issues

Two common reasons for a Compute Server worker crashing during the solve are:

the worker machine runs out of memory
the CPU on the worker machine is overloaded

This type of crash may be intermittent, as it also depends on other loads running on the server.

Memory exhaustion

Worker machine runs out of memory during optimization
Multiple jobs competing for limited memory resources
Large models requiring more memory than available

CPU overload

Worker machine CPU is overloaded
Too many concurrent jobs on the same worker
Insufficient processing capacity for the job requirements

Resolving resource issues

Monitor system resources on compute nodes to identify bottlenecks
Configure memory limits using Remote Services parameters:
- Set MEMLIMIT in grb_rs.cnf to limit total memory per job (in GB)
- Set SOFTMEMLIMIT for graceful termination when memory limit is approached
Adjust job concurrency:
- Review the JOBLIMIT parameter in grb_rs.cnf to control the number of simultaneous jobs
- Consider HARDJOBLIMIT for strict job count enforcement
Optimize memory usage in your models:
- Reduce Threads parameter to decrease memory per job
- Set NodefileStart=0.5 to write nodes to disk once the memory used to store nodes exceeds 0.5 GB
- Use NodefileDir to specify disk location for node files
- See "How do I avoid an out-of-memory condition?" for further suggestions.
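
As a rough illustration of the model-side settings listed above, the sketch below collects Threads, NodefileStart, and NodefileDir into a dictionary and applies them through setParam. The helper names memory_saving_params and apply_params, the default values, and the /tmp directory are hypothetical and chosen only for this example; suitable values depend on your model and hardware.

```python
def memory_saving_params(threads=4, nodefile_start_gb=0.5, nodefile_dir="."):
    """Collect the memory-related solver parameters into one dict.

    The defaults are illustrative assumptions, not recommendations.
    """
    return {
        "Threads": threads,                  # fewer threads, less memory per job
        "NodefileStart": nodefile_start_gb,  # node storage threshold in GB
        "NodefileDir": nodefile_dir,         # where node files are written
    }


def apply_params(target, params):
    """Call target.setParam(name, value) for every entry in params.

    target can be any object with a setParam method, such as a
    gurobipy Env or Model.
    """
    for name, value in params.items():
        target.setParam(name, value)


# Usage with gurobipy (not executed here; needs a working installation):
#   import gurobipy as gp
#   env = gp.Env(empty=True)
#   apply_params(env, memory_saving_params(threads=2, nodefile_dir="/tmp"))
#   env.start()
```

Keeping the values in one place makes it easier to try different settings while investigating a failing job.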

Network issues

Connection timeouts

Sending a large model to a Compute Server requires a lot of memory on both the client and the server, as well as a high-bandwidth network connection. As a result, large model updates can cause a connection timeout.

Large model updates causing network timeouts
Insufficient bandwidth between client and server
Network instability affecting job communication

Resolving network issues

Optimize model updates:
- Call model.update() more frequently to send smaller incremental updates
- Balance between update frequency and network overhead
Verify network infrastructure:
- Test network connectivity between client and server
- Check firewall settings and port accessibility
- Validate server addresses and ports in configuration
Test with simpler jobs first to isolate network vs. model complexity issues
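
A basic reachability test for the connectivity checks above can be sketched with Python's standard socket module. The function name can_connect is hypothetical, and the host and port in the usage comment are placeholders; substitute the address and port from your own configuration. This only confirms that a TCP connection can be opened. It does not verify authentication or that a Remote Services process is answering.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors
        return False


# Usage (placeholder host and port):
#   if not can_connect("server.example.com", 61000):
#       print("Server is not reachable; check firewall and port settings")
```

If this check fails from the client machine but succeeds on the server itself, a firewall or routing problem is a likely cause.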

Process for Diagnosing a FAILED job

The first things to look at are the extended client and server logs for the run in which this error occurred.

Enable Detailed Logging for Compute Server workers

If you are using Compute Server or Cluster Manager, you can enable detailed logging by modifying the configuration file and restarting the service.

Add VERBOSE=1 to your grb_rs.cnf file
Restart the Remote Services: grb_rs restart
Check logs at:
- Windows: ..\win64\bin\service.log
- Linux: $GUROBI_SERVER_HOME/bin/service.log
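
Before restarting, it can help to confirm what the configuration file actually contains. The sketch below assumes that grb_rs.cnf consists of simple KEY=VALUE lines with optional # comments; the helpers parse_cnf and verbose_enabled are hypothetical names used only for this illustration.

```python
def parse_cnf(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Assumes a simple KEY=VALUE layout for grb_rs.cnf.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().upper()] = value.strip()
    return settings


def verbose_enabled(settings):
    """Return True if VERBOSE is set to 1 in the parsed settings."""
    return settings.get("VERBOSE") == "1"


# Usage (the path is whatever location your grb_rs.cnf lives in):
#   with open("grb_rs.cnf") as fh:
#       settings = parse_cnf(fh.read())
#   print(verbose_enabled(settings), settings.get("JOBLIMIT"))
```

The same parsed dictionary can be used to review other properties mentioned in this article, such as JOBLIMIT.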

Enable extended logging for Compute Server or Instant Cloud

To enable extended logging, set the environment parameter CSClientLog to 3. This parameter must be set before the environment is started, as in the following snippet:

from gurobipy import Env

# Create an empty environment so parameters can be set before it starts
env = Env(empty=True)
env.setParam('CSClientLog', 3)
...
env.start()

Monitoring Server Performance

When reviewing the extended logs, pay attention to server performance indicators that may appear in the output:

CPU and Memory Usage: Look for patterns like (4 running, 95 %CPU, 95 %MEM)
Connection Status: Monitor connection establishment and job submission messages
Queue Information: Check for job queuing and capacity messages
Error Messages: Watch for network timeouts, authentication failures, or capacity issues
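
For long logs, scanning for the load pattern by hand is tedious. The sketch below searches log lines for the "(4 running, 95 %CPU, 95 %MEM)" form shown above and flags lines where CPU or memory usage is at or above a threshold. The regular expression is an assumption based only on that one example, so the actual log format may differ; the function names and the 90% threshold are hypothetical.

```python
import re

# Pattern inferred from the example "(4 running, 95 %CPU, 95 %MEM)"
LOAD_PATTERN = re.compile(
    r"\((\d+) running, (\d+(?:\.\d+)?) ?%CPU, (\d+(?:\.\d+)?) ?%MEM\)"
)


def extract_load(line):
    """Return running-job count, CPU % and MEM % from a line, or None."""
    match = LOAD_PATTERN.search(line)
    if match is None:
        return None
    return {
        "running": int(match.group(1)),
        "cpu": float(match.group(2)),
        "mem": float(match.group(3)),
    }


def overloaded_lines(lines, threshold=90.0):
    """Return (line, load) pairs where CPU or MEM is at or above threshold."""
    flagged = []
    for line in lines:
        load = extract_load(line)
        if load and (load["cpu"] >= threshold or load["mem"] >= threshold):
            flagged.append((line, load))
    return flagged
```

Lines flagged shortly before the job fails point toward the resource issues described earlier rather than a network problem.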
