Gurobi with multiprocessing has few benefits

回答済み

2022年08月12日 04:11

Problem:

I follow the tutorial in https://support.gurobi.com/hc/en-us/articles/360043111231-How-do-I-use-multiprocessing-in-Python-with-Gurobi- to use multiprocessing with gurobi, to solve multiple problems in parallel. Expected and unexcepted phenomenons appear together:

Expected phenomenon appears: log file of each problem is written at the same time, and the number of busy cpu cores is close to the number of problems. So the multiprocessing module is working.
Unexpected phenomenon appears: when I solve 4 problems with 4 processes, a single problem costs 3 minutes, and when I solve 8 problems with 8 processes, a single problem costs 6 minutes, which is strange since single problem should cost nearly 3 minutes with multiprocessing. This phenomenon indicates there are few benefits to use multiprocessing with Gurobi.

Possible reasons that have been excluded:

The numebr of physical cores and logic cores are 12 and 24 respectively, which are larger than tested scale. So it is not because the problems are too many and cpu cores are too few.
The space of remaining main memory is still large (several GBs) in all tested scales, so it is not because the main memory is run up and virtual memory is used.
I ever tried to write a python script for each problem respectively (namely, 8 problems then 8 python script files), and run 8 python scripts at the same time. But above problem still exists. So it is not because multiprocessing module has some issue, and starting 8 processes directly by 8 python scripts also leads to such situation.

Possible reason that has NOT been excluded:

Maybe Gurobi has a singleton library that is shared by all solving processes, and this singleton library is locked when it is used by a process. Thus multiprocessing has few benefits.

Code in Google Drive:

If above illustration cannot help find the reason, then the codes below may help. The codes consists of two files:

A file modelling TSP (travelling salesman problem), which is originated from official example and modified by myself to experiment.
A file using multiprocessing to solve multiple TSP in parallel to repeat the problem.

Code of TSP (this code has no logic problem, so just copy):

https://drive.google.com/file/d/1BrWSppre5QHG2Njpqs3gKkKprr-AEaQw/view?usp=sharing

Code of using multiprocessing:

https://drive.google.com/file/d/1JLLeHNSF29Z3Bkuo59OtrnODQkZxwa_7/view?usp=sharing

10件のコメント

正式なコメント
Simranjit Kaur

Gurobi Staff

2025年11月16日 23:30
This post is more than three years old. Some information may not be up to date. For current information, please check the Gurobi Documentation or Knowledge Base. If you need more help, please create a new post in the community forum, or try Gurobot, our chatbot interface offering instant, expert-level support.
Jaromił Najman

Gurobi Staff

2022年08月12日 08:25 編集済み
Hi,

Unexpected phenomenon appears: when I solve 4 problems with 4 processes, a single problem costs 3 minutes, and when I solve 8 problems with 8 processes, a single problem costs 6 minutes, which is strange since single problem should cost nearly 3 minutes with multiprocessing. This phenomenon indicates there are few benefits to use multiprocessing with Gurobi.

What you observe might be caused by cores with hyperthreading. At least I was able to reproduce your observation on a machine with 4 physical cores of which each uses hyperthreading providing 8 logical processors. When I run your code with \(\texttt{caseNum=8}\) vs \(\texttt{caseNum=4}\) I can observe the same behavior you described. This is because Gurobi (and other Software packages) usually can get out the most efficiency out of physical cores instead of logical processors, i.e., running 8 cases on 8 logical processors will be slower than running 4 cases on 4 physical cores, because you don't have to push 4 additional cases onto the virtual/logical processors. I am not able to observe this issue when comparing \(\texttt{caseNum=4}\) vs \(\texttt{caseNum=2}\), because then all cases can use physical cores.It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).

Best regards,
Jaromił
1
junyan qiu

2022年08月12日 10:49
Hi, Jaromił:

I am not able to observe this issue when comparing caseNum=4 vs caseNum=2, because then all cases can use physical cores.It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).

I also observe this behavior. When the number of cases is smaller than (or equal to) 4, all cases can be parallelized as expected. And when it is greater than 4, additional cases will compete for processors with others, which leads to longer solving time.

But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention of processors. Maybe this is the most strange phenomenon.
0
Jaromił Najman

Gurobi Staff

2022年08月12日 11:08 編集済み
But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention of processors. Maybe this is the most strange phenomenon.

This is indeed strange. Currently, you are using randomly generated TSP models. Could you please fix the TSP data and try again? Maybe you are just getting really unlucky with the randomly generated data. If this is not it, could you please post a log snippet of one of the 8 cases runs vs the same in the 4 cases runs. Please use fixed TSP data for this experiment to avoid non-determinism.
0

junyan qiu

2022年08月12日 12:40

Please use fixed TSP data for this experiment to avoid non-determinism.

I use a benchmark in TSPLIB as data, which consists of 200 points. Every TSP case is using the same data. Then the the number of cases and solving time of single case are:

1 case, 130 seconds
2 cases, 135 seconds
4 cases, 150 seconds
8 cases, 320 seconds

The log snippet (62 lines about solving process) under 4 cases is:

Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
Model fingerprint: 0x7fc99f6a
Variable types: 0 continuous, 19900 integer (19900 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [1e+01, 4e+03]
  Bounds range     [1e+00, 1e+00]
  RHS range        [2e+00, 2e+00]
Presolve time: 0.03s
Presolved: 200 rows, 19900 columns, 39800 nonzeros
Variable types: 0 continuous, 19900 integer (19900 binary)

Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0 27056.3397    0   44          - 27056.3397      -     -    0s
     0     0 27715.8457    0   34          - 27715.8457      -     -    1s
     0     0 27810.6488    0   66          - 27810.6488      -     -    1s
     0     0 27891.0050    0   46          - 27891.0050      -     -    1s
     0     0 27891.0050    0   46          - 27891.0050      -     -    1s
     0     2 28269.9233    0   40          - 28269.9233      -     -    2s
   330   332 29266.8831  292    6          - 28269.9233      -   2.2    5s
*  797   797             702    33913.632716 28269.9233  16.6%   2.6    9s
*  801   795             704    33445.732380 28269.9233  15.5%   2.7    9s
H  808   755                    33436.614637 28741.9356  14.0%   2.7   10s
   810   756 32748.8882  639   20 33436.6146 28805.9819  13.8%   2.7   15s
H  812   719                    32980.622476 28901.6668  12.4%   2.7   15s
H  815   685                    32955.955635 28941.8916  12.2%   2.7   15s
H  822   655                    32729.995194 29088.6910  11.1%   2.7   22s
H  823   623                    32656.972270 29088.6910  10.9%   3.3   23s
H  927   660                    32256.122317 29088.6910  9.82%   3.7   24s
  1005   714 29710.3775  101   10 32256.1223 29088.6910  9.82%   4.1   25s
H 1031   699                    31924.196415 29088.6910  8.88%   4.2   25s
  1496   996 29757.7544  105   16 31924.1964 29088.6910  8.88%   5.5   30s
  2217  1434 29862.7351   89   10 31924.1964 29088.6910  8.88%   6.3   35s
  2980  2084 30887.6736  262   14 31924.1964 29088.6910  8.88%   6.5   40s
H 3012  2089                    31802.363117 29088.6910  8.53%   6.5   40s
  3724  2773 31449.2284  299   12 31802.3631 29088.6910  8.53%   6.4   45s
H 4012  3021                    31769.138981 29088.6910  8.44%   6.5   47s
  4629  3616 30281.8051  188   12 31769.1390 29088.6910  8.44%   6.2   50s
* 4648  2371             207    30609.785858 29088.6910  4.97%   6.2   50s
* 4666  2363             222    30599.716103 29088.6910  4.94%   6.2   50s
  5890  3439 29232.5910   17   34 30599.7161 29090.3735  4.93%   6.1   55s
  7045  4485 29827.9924   92    6 30599.7161 29100.2090  4.90%   6.0   60s
* 7785  4273             153    30347.881419 29112.7718  4.07%   6.0   63s
  8306  4698 29470.2863   36    8 30347.8814 29119.3132  4.05%   5.9   65s
  9685  5925 29872.4161  101    6 30347.8814 29127.0345  4.02%   6.0   70s
*10721  6814             135    30343.316106 29131.2986  3.99%   5.9   73s
*10722  6788             135    30334.946511 29131.2986  3.97%   5.9   73s
 11069  7083 29659.9582   41   12 30334.9465 29132.5935  3.96%   5.9   75s
*11453  3909              92    29790.702290 29136.7929  2.20%   5.9   76s
*11466  3756              97    29761.636800 29136.7929  2.10%   5.9   76s
*11535  3716              70    29752.022242 29137.1662  2.07%   5.9   76s
*11536  3714              70    29751.714716 29137.1662  2.07%   5.9   77s
*11561  3266              42    29678.142637 29137.4403  1.82%   5.9   77s
H11873  3452                    29669.274324 29140.0459  1.78%   5.9   78s
*11874  3357              40    29655.125980 29140.0459  1.74%   5.9   78s
*12278  3569              40    29641.631650 29142.3223  1.68%   5.9   79s
*12289  2802              44    29548.615021 29142.3223  1.37%   5.9   79s
 12430  2888 29361.3788   36   38 29548.6150 29149.4528  1.35%   5.9   80s
*13435  2641              42    29464.215146 29159.7805  1.03%   6.0   82s
 14252  3095 29453.4811   47   39 29464.2151 29167.5177  1.01%   6.1   85s
 16208  4199 29337.9802   22   47 29464.2151 29195.7398  0.91%   6.1   90s
 18016  5148 29406.1668   30   12 29464.2151 29208.6554  0.87%   6.3   95s
*18500  3027              35    29375.914940 29211.5626  0.56%   6.3   96s
 19643  3338 29322.9716   18   78 29375.9149 29223.8827  0.52%   6.5  100s
 21310  3654     cutoff   48      29375.9149 29238.1204  0.47%   7.1  105s
 22914  3901     cutoff   30      29375.9149 29251.5040  0.42%   7.6  110s
 24569  4131     cutoff   21      29375.9149 29263.2067  0.38%   8.0  115s
 26153  4181     cutoff   34      29375.9149 29274.6952  0.34%   8.5  120s
 27786  4209 29352.1452   19   74 29375.9149 29286.0138  0.31%   9.0  125s
 29418  4194 29329.9900   28   28 29375.9149 29297.0957  0.27%   9.4  130s
H30860  3843                    29369.407047 29305.3776  0.22%   9.8  134s
 30891  3837 29350.4595   32   18 29369.4070 29305.7943  0.22%   9.8  135s
 32497  3322     cutoff   26      29369.4070 29319.5672  0.17%  10.2  140s
 34275  2502     cutoff   35      29369.4070 29332.9114  0.12%  10.6  145s
 36362   941     cutoff   33      29369.4070 29353.0792  0.06%  10.9  150s

The log snippet (95 lines about solving process) under 8 cases is:

Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
Model fingerprint: 0x7fc99f6a
Variable types: 0 continuous, 19900 integer (19900 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [1e+01, 4e+03]
  Bounds range     [1e+00, 1e+00]
  RHS range        [2e+00, 2e+00]
Presolve time: 0.04s
Presolved: 200 rows, 19900 columns, 39800 nonzeros
Variable types: 0 continuous, 19900 integer (19900 binary)

Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)

    Nodes    |    Current Node    |     Objective Bounds      |     Work
 Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

     0     0 27056.3397    0   44          - 27056.3397      -     -    0s
     0     0 27715.8457    0   34          - 27715.8457      -     -    1s
     0     0 27810.6488    0   66          - 27810.6488      -     -    1s
     0     0 27891.0050    0   46          - 27891.0050      -     -    1s
     0     0 27891.0050    0   46          - 27891.0050      -     -    1s
     0     2 28269.9233    0   40          - 28269.9233      -     -    3s
   118   120 28386.9552  104   18          - 28269.9233      -   2.1    5s
   490   492 29871.9338  438   12          - 28269.9233      -   2.2   10s
   766   768 32888.5359  678    8          - 28269.9233      -   2.6   15s
*  797   797             702    33913.632716 28269.9233  16.6%   2.6   15s
*  801   795             704    33445.732380 28269.9233  15.5%   2.7   16s
H  808   755                    33436.614637 28741.9356  14.0%   2.7   17s
   810   756 32748.8882  639   20 33436.6146 28805.9819  13.8%   2.7   22s
H  812   719                    32980.622476 28901.6668  12.4%   2.7   22s
H  815   685                    32955.955635 28941.8916  12.2%   2.7   23s
H  822   655                    32729.995194 29088.6910  11.1%   2.7   30s
H  823   623                    32656.972270 29088.6910  10.9%   3.3   30s
H  927   660                    32256.122317 29088.6910  9.82%   3.7   32s
H 1031   699                    31924.196415 29088.6910  8.88%   4.2   34s
  1045   711 29783.8275  121    6 31924.1964 29088.6910  8.88%   4.2   35s
  1282   868 31747.8521  240   10 31924.1964 29088.6910  8.88%   5.1   40s
  1586  1056 30057.6787  150   13 31924.1964 29088.6910  8.88%   5.7   45s
  1923  1281 31251.0126  318   12 31924.1964 29088.6910  8.88%   6.1   50s
  2305  1493 30682.6351  133    6 31924.1964 29088.6910  8.88%   6.5   55s
  2688  1792 29843.1081   99   21 31924.1964 29088.6910  8.88%   6.7   60s
H 3012  2089                    31802.363117 29088.6910  8.53%   6.5   64s
  3047  2126 31441.0030  295    8 31802.3631 29088.6910  8.53%   6.6   65s
  3490  2539 30375.7221  203   12 31802.3631 29088.6910  8.53%   6.5   70s
  3836  2867 29663.8587   64   12 31802.3631 29088.6910  8.53%   6.5   75s
H 4012  3021                    31769.138981 29088.6910  8.44%   6.5   77s
  4209  3220 31153.2319  230    6 31769.1390 29088.6910  8.44%   6.4   80s
* 4648  2371             207    30609.785858 29088.6910  4.97%   6.2   84s
* 4666  2363             222    30599.716103 29088.6910  4.94%   6.2   84s
  4730  2423 30283.5904   81    6 30599.7161 29088.6910  4.94%   6.2   85s
  5318  2944 29337.5764   33   26 30599.7161 29088.6910  4.94%   6.2   90s
  5946  3495 29804.6214   73   12 30599.7161 29090.3735  4.93%   6.0   95s
  6509  3990 30553.8642  166    6 30599.7161 29091.8567  4.93%   6.0  100s
  7034  4474 29679.9788   81   19 30599.7161 29100.2090  4.90%   6.0  105s
  7585  4964 29895.3347   64   14 30599.7161 29108.5532  4.87%   6.0  110s
* 7785  4273             153    30347.881419 29112.7718  4.07%   6.0  111s
  8211  4649 30128.2273  147   12 30347.8814 29118.3349  4.05%   5.9  115s
  8905  5234 29438.7625   44    6 30347.8814 29121.4767  4.04%   5.8  120s
  9503  5758 29248.8773   21   64 30347.8814 29126.8141  4.02%   6.0  125s
 10174  6347 29684.6457   43   21 30347.8814 29129.0950  4.02%   5.9  130s
*10721  6814             135    30343.316106 29131.2986  3.99%   5.9  133s
*10722  6788             135    30334.946511 29131.2986  3.97%   5.9  133s
 10911  6957 30255.7660  170    6 30334.9465 29131.8342  3.97%   5.9  135s
*11453  3909              92    29790.702290 29136.7929  2.20%   5.9  139s
*11466  3756              97    29761.636800 29136.7929  2.10%   5.9  139s
*11535  3716              70    29752.022242 29137.1662  2.07%   5.9  140s
*11536  3714              70    29751.714716 29137.1662  2.07%   5.9  140s
*11561  3266              42    29678.142637 29137.4403  1.82%   5.9  140s
H11873  3452                    29669.274324 29140.0459  1.78%   5.9  142s
*11874  3357              40    29655.125980 29140.0459  1.74%   5.9  143s
 12164  3585 29507.0146   57    8 29655.1260 29141.6644  1.73%   5.9  145s
*12278  3569              40    29641.631650 29142.3223  1.68%   5.9  145s
*12289  2802              44    29548.615021 29142.3223  1.37%   5.9  145s
 12876  3207 29249.2750   19   43 29548.6150 29156.2383  1.33%   6.0  150s
*13435  2641              42    29464.215146 29159.7805  1.03%   6.0  153s
 13639  2746 29455.1680   31   14 29464.2151 29161.1308  1.03%   6.0  155s
 14452  3215 29243.1795   26   62 29464.2151 29171.4097  0.99%   6.1  160s
 15262  3682 29358.2684   37   77 29464.2151 29182.8030  0.96%   6.1  165s
 16040  4125 29296.7689   21   52 29464.2151 29194.5989  0.92%   6.1  170s
 16844  4521 29332.2775   27   51 29464.2151 29200.1569  0.90%   6.2  175s
 17586  4918 29289.5524   29   60 29464.2151 29206.7207  0.87%   6.2  180s
 18197  5239 29398.8549   44   48 29464.2151 29210.3335  0.86%   6.3  185s
*18500  3027              35    29375.914940 29211.5626  0.56%   6.3  187s
 18844  3140 29345.7430   22   36 29375.9149 29215.5909  0.55%   6.4  190s
 19552  3317 29357.9260   29   14 29375.9149 29222.2520  0.52%   6.5  195s
 20251  3496 29349.7913   26   20 29375.9149 29228.8713  0.50%   6.7  200s
 20972  3608     cutoff   32      29375.9149 29234.2899  0.48%   7.0  205s
 21689  3711 29374.7460   20   26 29375.9149 29240.7951  0.46%   7.2  210s
 22369  3821 29367.0213   28  110 29375.9149 29246.9056  0.44%   7.5  215s
 23068  3926 29303.0486   22   34 29375.9149 29252.5605  0.42%   7.6  220s
 23787  4016 29360.2784   32   51 29375.9149 29258.2020  0.40%   7.8  225s
 24481  4129 29333.0553   24   59 29375.9149 29262.3270  0.39%   8.0  230s
 25216  4155 29352.3492   22   16 29375.9149 29267.6988  0.37%   8.2  235s
 25886  4189 29371.1291   28   28 29375.9149 29272.6217  0.35%   8.4  240s
 26603  4213 29363.8694   20   18 29375.9149 29277.6986  0.33%   8.6  245s
 27338  4212 29349.8996   28   55 29375.9149 29283.4633  0.31%   8.8  250s
 28082  4207 29361.7154   29    8 29375.9149 29287.9059  0.30%   9.1  255s
 28837  4203 29365.2478   21   53 29375.9149 29292.1512  0.29%   9.2  260s
 29575  4193 infeasible   25      29375.9149 29297.3943  0.27%   9.4  265s
 30293  4179 29346.8473   35   41 29375.9149 29301.7524  0.25%   9.6  270s
H30860  3843                    29369.407047 29305.3776  0.22%   9.8  273s
 31019  3796     cutoff   27      29369.4070 29307.2726  0.21%   9.8  275s
 31730  3595 29344.6533   35   56 29369.4070 29313.4701  0.19%  10.0  280s
 32468  3338     cutoff   25      29369.4070 29319.2369  0.17%  10.2  285s
 33062  3094 infeasible   23      29369.4070 29323.8124  0.16%  10.4  290s
 33418  2929     cutoff   20      29369.4070 29326.5822  0.15%  10.4  295s
 33918  2718 29360.2605   25   67 29369.4070 29329.6431  0.14%  10.5  300s
 34400  2429 29346.1475   27   53 29369.4070 29333.9145  0.12%  10.6  305s
 34973  2067     cutoff   24      29369.4070 29338.9773  0.10%  10.7  310s
 35716  1497     cutoff   26      29369.4070 29345.8693  0.08%  10.8  315s
 36228  1066     cutoff   27      29369.4070 29351.3598  0.06%  10.9  320s

Jaromił Najman

Gurobi Staff

2022年08月12日 13:50
It looks like something is interfering with the optimization tasks in the case of 8 parallel runs because reaching the first feasible point takes ~50% more time than with 4 parallel runs but the solution path looks the same. Are you sure that all 8 cores are fully available and nothing is running in the background? Are all cores equal?
0
junyan qiu

2022年08月12日 14:40
- I am sure 8 cores are fully available. I can see the number of busy cores is also 8 (with 100% usage), and there is no other heavy task in the background when I experiment.
- I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?
0
Jaromił Najman

Gurobi Staff

2022年08月12日 15:28
I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?

I just tested your setting on a 24 core machine and could not observe the behavior you describe.

I would guess that the L-Cache of your cores is full when you run more processes. Note that often L1 and L2 Cache are small compared to L3 Cache and RAM but are the fastest access points on a processor. The more processes you start, the bigger is the probability that the smaller L1 and L2 Caches do not suffice any longer and your processes have to take the "long" way to access the L3 Cache and/or RAM. This can result in significant slow downs depending on speed differences in accessing the different Cache Levels.

Best regards,
Jaromił
1
David Torres Sanchez

Gurobi Staff

2022年08月12日 16:10
Also posted here: https://or.stackexchange.com/q/8872/4752
0
junyan qiu

2022年08月12日 16:48
Thanks for your responses, which are professional. When the number of cases goes up, the ratio of cache-miss increases quickly, which can explain the significant slowing down of solving speed.
0

投稿コメントは受け付けていません。

Gurobi with multiprocessing has few benefits

コメント

お探しのものを見つけられませんでしたか？