parallel processing in Gurobi

I am working in C++ with Gurobi version 11.0.3. The project is about the parallelization of solvers for stochastic problems.
This is my cut generation class:

CGO(const shared_ptr<Network>& networkPtr_, const GRBEnv& env_):
    networkPtr{networkPtr_},
    environment(env_),
    model {environment} {
    n = networkPtr_->n;
    model.set(GRB_IntParam_OutputFlag, 0);
    model.set(GRB_IntParam_Threads, 1);
    model.set(GRB_IntParam_InfUnbdInfo, 1);

It also initializes the variables and adds the constraints to the model. Since it is designed for stochastic programming, it will receive a solution from the Benders master problem and solves 50 sub-problems by just changing the objective function in a for loop using reset, setobjectives, and optimize() functions. I have written the following tests to evaluate the performance of parallelization, and I cannot see why it is performing so poorly.

void ParallelSolver::test1() {
   vector<int> wnums = {16};
   for (auto item : wnums) {
       omp_set_num_threads(item);
       #pragma omp parallel
       {
           GRBEnv envsub = GRBEnv(true);
           envsub.set(GRB_IntParam_LogToConsole, 1);
           envsub.start();
           CGO subsolver(networkPtr,std::move(envsub));
           vi solution = {1, 1, 1, 2 ,1 ,-1 ,0 ,-1, -1, -1 ,-1 ,-1 ,-1, 1, -1, -1, -1, 2 ,-1, 0 ,-1 ,-1 ,-1, 0, -1};
           auto y_bar = w2y(solution);
           #pragma omp critical
           {
               auto start = std::chrono::high_resolution_clock::now();
               auto cut = subsolver.solveSubProblem(y_bar);
               auto end = std::chrono::high_resolution_clock::now();
               std::chrono::duration<double> duration = (end - start);
               cout << "Execution time: " << duration.count() << " seconds" << endl;
           }
       }
   }
}
void ParallelSolver::test2() {
   vector<int> wnums = {16};
   for (auto item : wnums) {
       omp_set_num_threads(item);
       #pragma omp parallel
       {
           GRBEnv envsub = GRBEnv(true);
           envsub.start();
           CGO subsolver(networkPtr,std::move(envsub));
           vi solution = {1, 1, 1, 2 ,1 ,-1 ,0 ,-1, -1, -1 ,-1 ,-1 ,-1, 1, -1, -1, -1, 2 ,-1, 0 ,-1 ,-1 ,-1, 0, -1};
           auto y_bar = w2y(solution);
           auto start = std::chrono::high_resolution_clock::now();
           auto cut = subsolver.solveSubProblem(y_bar);
           auto end = std::chrono::high_resolution_clock::now();
           std::chrono::duration<double> duration = (end - start);
           cout << "Execution time: " << duration.count() << " seconds" << endl;
       }
   }
}

The function subsolver.solveSubProblem() adjusts the objective functions and solves the 50 scenarios. Here are the results from the two tests. I am using 16 cores(not threads) out of 24 on my machine(AMD Ryzen Threadripper PRO 5965WX 24-Cores (3.80 GHz)).

from test1:

Execution time: 3.09092 seconds
Execution time: 3.63771 seconds
Execution time: 4.00157 seconds
Execution time: 2.99066 seconds
Execution time: 2.9729 seconds
Execution time: 2.81628 seconds
Execution time: 2.82811 seconds
Execution time: 2.84136 seconds
Execution time: 2.81278 seconds
Execution time: 2.8417 seconds
Execution time: 2.81909 seconds
Execution time: 2.81132 seconds
Execution time: 2.91263 seconds
Execution time: 2.83684 seconds
Execution time: 2.92789 seconds
Execution time: 2.83579 seconds

from test2:

Execution time: 14.9 seconds
Execution time: 14.9554 seconds
Execution time: 15.0151 seconds
Execution time: 15.038 seconds
Execution time: 15.0729 seconds
Execution time: 15.1317 seconds
Execution time: 15.293 seconds
Execution time: 15.3157 seconds
Execution time: 15.2754 seconds
Execution time: 15.3682 seconds
Execution time: 15.455 seconds
Execution time: 15.4082 seconds
Execution time: 15.4379 seconds
Execution time: 15.4913 seconds
Execution time: 15.5538 seconds
Execution time: 15.6591 seconds

As shown in the test2 results, it is taking significantly longer than around 3-4 seconds, which is expected to happen.
In terms of license, I am using a university HPC license form that I received from my school.

Riley Clement

Gurobi Staff

2025年11月21日 09:52

Hi Seyed,

In test1 the subproblems are running sequentially because of #pragma omp critical, and the solves are not competing for resources. The total time required will be at least the sum of these runtimes, so 48s+.

In test2 your subproblems are running in parallel, but they are competing for resources, so each subproblem solves slower. However the total time required for test2 will be closer to 16s, so overall it is faster and I would not say it is performing poorly.

I think the difference can be explained by the resource sharing. You have set Threads=1 for your subproblem but threads are not the only shared resource. How big are the subproblems in terms of rows/columns/nonzeros?

- Riley

Seyed Khademnia

2025年11月21日 21:00

It is about 400 rows of constraints and about 3000 variables.
About the performance, since the machine has more than needed actual cores for this test, what do you think is the resource that is limiting the performance? Would it be rational to think the bottleneck is the communication between RAM and CPU? If so, is there any method to reduce the size of the Gurobi solver since it is going to only solve LP problems, and Gurobi has an arsenal of tools for MILP, which are not needed in my case?

Thank you for your response.

Best

2025年11月22日 14:43

what do you think is the resource that is limiting the performance? Would it be rational to think the bottleneck is the communication between RAM and CPU?

Yes I think so - L3 cache and memory bandwidth would be the main culprits I think. Perhaps i/o if writing to file. I would not be surprised if there are other shared resources I am not aware of.

is there any method to reduce the size of the Gurobi solver since it is going to only solve LP problems, and Gurobi has an arsenal of tools for MILP, which are not needed in my case?

It does this automatically. If you are solving LPs then Gurobi is not going to be setting up a Branch and Bound tree, running heuristics etc. It's only going to call routines which are relevant for LPs.

As an aside, if you're not already experimenting with the Method parameter for your subproblems then I certainly would. With one thread the default LP algorithm will be dual simplex. You may want to try primal simplex too. I'd be careful about using barrier, if you subproblem is found to be infeasible or unbounded during barrier (before crossover) then the Farkas dual or unbounded ray won't be available.

2025年11月25日 01:33

Thanks! I will check out the methods parameter.

One thing I figured out here is that the .reset() method is not necessary for changing the objective and reoptimizing the model, so I removed it. Now the same test as above returns the solutions in about 8 seconds. There must be some serialization factor that occurs in .reset(), which I can now avoid and achieve better performance.

parallel processing in Gurobi

コメント

お探しのものを見つけられませんでしたか？