
Gurobi with multiprocessing has few benefits

Answered


  • Official comment
    Simranjit Kaur
    • Gurobi Staff
    This post is more than three years old. Some information may not be up to date. For current information, please check the Gurobi Documentation or Knowledge Base. If you need more help, please create a new post in the community forum, or try Gurobot, our chatbot interface offering instant, expert-level support.
  • Jaromił Najman
    • Gurobi Staff

    Hi,

    An unexpected phenomenon appears: when I solve 4 problems with 4 processes, a single problem takes 3 minutes, and when I solve 8 problems with 8 processes, a single problem takes 6 minutes. This is strange, since a single problem should still take about 3 minutes with multiprocessing. This indicates there are few benefits to using multiprocessing with Gurobi.

    What you observe might be caused by hyperthreading. I was able to reproduce your observation on a machine with 4 physical cores, each using hyperthreading, which gives 8 logical processors. When I run your code with \(\texttt{caseNum=8}\) vs \(\texttt{caseNum=4}\), I observe the same behavior you described. This is because Gurobi (like other software packages) usually runs most efficiently on physical cores rather than logical processors, i.e., running 8 cases on 8 logical processors will be slower than running 4 cases on 4 physical cores, because in the latter case no additional cases have to be pushed onto the virtual/logical processors. I am not able to observe this issue when comparing \(\texttt{caseNum=4}\) vs \(\texttt{caseNum=2}\), because then all cases can use physical cores. It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).
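
The advice above could be sketched in Python roughly as follows. This is only an illustration under assumptions: it assumes a Linux machine whose /proc/cpuinfo lists `physical id` and `core id` fields (common on x86, but not guaranteed), and the helper names `count_physical_cores` and `choose_num_workers` are hypothetical, not part of Gurobi or of the original script.

```python
import os


def count_physical_cores(cpuinfo_text):
    """Count distinct (physical id, core id) pairs in /proc/cpuinfo-style text.

    Returns 0 if the text carries no such fields.
    """
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            # Hyperthreads of one core share the same (socket, core) pair.
            cores.add((physical_id, value))
    return len(cores)


def choose_num_workers(num_cases, physical_cores):
    """Run at most one case per physical core, and at least one worker."""
    return max(1, min(num_cases, physical_cores))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            physical = count_physical_cores(fh.read())
    except OSError:
        physical = 0
    # Assumption: fall back to the logical count if the physical count is unknown.
    physical = physical or os.cpu_count() or 1
    print(choose_num_workers(8, physical))
```

The returned value would then be used as the size of the process pool instead of the raw number of cases.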

    Best regards, 
    Jaromił

    1
  • junyan qiu
    • Gurobi-versary
    • First Comment
    • First Question

    Hi, Jaromił:

    I am not able to observe this issue when comparing caseNum=4 vs caseNum=2, because then all cases can use physical cores. It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).

    I also observe this behavior. When the number of cases is smaller than (or equal to) 4, all cases can be parallelized as expected. And when it is greater than 4, additional cases will compete for processors with others, which leads to longer solving time. 

    But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention for processors. Maybe this is the strangest part.

    0
  • Jaromił Najman
    • Gurobi Staff

    But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention for processors. Maybe this is the strangest part.

    This is indeed strange. Currently, you are using randomly generated TSP models. Could you please fix the TSP data and try again? Maybe you are just getting really unlucky with the randomly generated data. If this is not it, could you please post a log snippet of one of the 8 cases runs vs the same in the 4 cases runs. Please use fixed TSP data for this experiment to avoid non-determinism.
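
One way to fix randomly generated data is to draw it from a seeded generator, so that every process and every rerun sees the same instance. The sketch below is an assumption about how that might look: the original generation code is not shown in this thread, and the function name `make_tsp_points`, the seed value, and the 0 to 1000 coordinate range are all hypothetical.

```python
import random


def make_tsp_points(n, seed):
    """Return n reproducible (x, y) points; the same seed gives the same data."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [(rng.randint(0, 1000), rng.randint(0, 1000)) for _ in range(n)]


if __name__ == "__main__":
    points = make_tsp_points(200, seed=0)
    print(points[:3])
```

Passing the same seed to every worker keeps the model identical across cases, so differences in solve time can be attributed to the machine rather than to the data.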

    0
  • junyan qiu

    Please use fixed TSP data for this experiment to avoid non-determinism. 

    I use a TSPLIB benchmark instance with 200 points as data, and every TSP case uses the same data. The number of cases and the solving time of a single case are:

    • 1 case, 130 seconds
    • 2 cases, 135 seconds
    • 4 cases, 150 seconds
    • 8 cases, 320 seconds
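
To make these figures easier to compare, the following sketch computes the per-case slowdown relative to a single run and the resulting throughput. The timings are the ones listed above; the function names are only illustrative.

```python
# Per-case solve times reported above: number of parallel cases -> seconds.
timings = {1: 130, 2: 135, 4: 150, 8: 320}


def slowdown(timings, baseline=1):
    """Per-case time relative to the baseline run."""
    return {k: t / timings[baseline] for k, t in timings.items()}


def throughput_per_hour(timings):
    """Cases completed per hour when k cases run side by side."""
    return {k: k * 3600 / t for k, t in timings.items()}


if __name__ == "__main__":
    slow = slowdown(timings)
    rate = throughput_per_hour(timings)
    for k in timings:
        print(k, round(slow[k], 2), round(rate[k], 1))
```

With these numbers, throughput grows up to 4 parallel cases (96 cases per hour) and then drops slightly at 8 (90 cases per hour), while the per-case time at 8 cases is about 2.46 times that of a single case.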

    The log snippet (62 lines about solving process) under 4 cases is:

    Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
    Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
    Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
    Model fingerprint: 0x7fc99f6a
    Variable types: 0 continuous, 19900 integer (19900 binary)
    Coefficient statistics:
      Matrix range     [1e+00, 1e+00]
      Objective range  [1e+01, 4e+03]
      Bounds range     [1e+00, 1e+00]
      RHS range        [2e+00, 2e+00]
    Presolve time: 0.03s
    Presolved: 200 rows, 19900 columns, 39800 nonzeros
    Variable types: 0 continuous, 19900 integer (19900 binary)

    Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)

        Nodes    |    Current Node    |     Objective Bounds      |     Work
     Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

         0     0 27056.3397    0   44          - 27056.3397      -     -    0s
         0     0 27715.8457    0   34          - 27715.8457      -     -    1s
         0     0 27810.6488    0   66          - 27810.6488      -     -    1s
         0     0 27891.0050    0   46          - 27891.0050      -     -    1s
         0     0 27891.0050    0   46          - 27891.0050      -     -    1s
         0     2 28269.9233    0   40          - 28269.9233      -     -    2s
       330   332 29266.8831  292    6          - 28269.9233      -   2.2    5s
    *  797   797             702    33913.632716 28269.9233  16.6%   2.6    9s
    *  801   795             704    33445.732380 28269.9233  15.5%   2.7    9s
    H  808   755                    33436.614637 28741.9356  14.0%   2.7   10s
       810   756 32748.8882  639   20 33436.6146 28805.9819  13.8%   2.7   15s
    H  812   719                    32980.622476 28901.6668  12.4%   2.7   15s
    H  815   685                    32955.955635 28941.8916  12.2%   2.7   15s
    H  822   655                    32729.995194 29088.6910  11.1%   2.7   22s
    H  823   623                    32656.972270 29088.6910  10.9%   3.3   23s
    H  927   660                    32256.122317 29088.6910  9.82%   3.7   24s
      1005   714 29710.3775  101   10 32256.1223 29088.6910  9.82%   4.1   25s
    H 1031   699                    31924.196415 29088.6910  8.88%   4.2   25s
      1496   996 29757.7544  105   16 31924.1964 29088.6910  8.88%   5.5   30s
      2217  1434 29862.7351   89   10 31924.1964 29088.6910  8.88%   6.3   35s
      2980  2084 30887.6736  262   14 31924.1964 29088.6910  8.88%   6.5   40s
    H 3012  2089                    31802.363117 29088.6910  8.53%   6.5   40s
      3724  2773 31449.2284  299   12 31802.3631 29088.6910  8.53%   6.4   45s
    H 4012  3021                    31769.138981 29088.6910  8.44%   6.5   47s
      4629  3616 30281.8051  188   12 31769.1390 29088.6910  8.44%   6.2   50s
    * 4648  2371             207    30609.785858 29088.6910  4.97%   6.2   50s
    * 4666  2363             222    30599.716103 29088.6910  4.94%   6.2   50s
      5890  3439 29232.5910   17   34 30599.7161 29090.3735  4.93%   6.1   55s
      7045  4485 29827.9924   92    6 30599.7161 29100.2090  4.90%   6.0   60s
    * 7785  4273             153    30347.881419 29112.7718  4.07%   6.0   63s
      8306  4698 29470.2863   36    8 30347.8814 29119.3132  4.05%   5.9   65s
      9685  5925 29872.4161  101    6 30347.8814 29127.0345  4.02%   6.0   70s
    *10721  6814             135    30343.316106 29131.2986  3.99%   5.9   73s
    *10722  6788             135    30334.946511 29131.2986  3.97%   5.9   73s
     11069  7083 29659.9582   41   12 30334.9465 29132.5935  3.96%   5.9   75s
    *11453  3909              92    29790.702290 29136.7929  2.20%   5.9   76s
    *11466  3756              97    29761.636800 29136.7929  2.10%   5.9   76s
    *11535  3716              70    29752.022242 29137.1662  2.07%   5.9   76s
    *11536  3714              70    29751.714716 29137.1662  2.07%   5.9   77s
    *11561  3266              42    29678.142637 29137.4403  1.82%   5.9   77s
    H11873  3452                    29669.274324 29140.0459  1.78%   5.9   78s
    *11874  3357              40    29655.125980 29140.0459  1.74%   5.9   78s
    *12278  3569              40    29641.631650 29142.3223  1.68%   5.9   79s
    *12289  2802              44    29548.615021 29142.3223  1.37%   5.9   79s
     12430  2888 29361.3788   36   38 29548.6150 29149.4528  1.35%   5.9   80s
    *13435  2641              42    29464.215146 29159.7805  1.03%   6.0   82s
     14252  3095 29453.4811   47   39 29464.2151 29167.5177  1.01%   6.1   85s
     16208  4199 29337.9802   22   47 29464.2151 29195.7398  0.91%   6.1   90s
     18016  5148 29406.1668   30   12 29464.2151 29208.6554  0.87%   6.3   95s
    *18500  3027              35    29375.914940 29211.5626  0.56%   6.3   96s
     19643  3338 29322.9716   18   78 29375.9149 29223.8827  0.52%   6.5  100s
     21310  3654     cutoff   48      29375.9149 29238.1204  0.47%   7.1  105s
     22914  3901     cutoff   30      29375.9149 29251.5040  0.42%   7.6  110s
     24569  4131     cutoff   21      29375.9149 29263.2067  0.38%   8.0  115s
     26153  4181     cutoff   34      29375.9149 29274.6952  0.34%   8.5  120s
     27786  4209 29352.1452   19   74 29375.9149 29286.0138  0.31%   9.0  125s
     29418  4194 29329.9900   28   28 29375.9149 29297.0957  0.27%   9.4  130s
    H30860  3843                    29369.407047 29305.3776  0.22%   9.8  134s
     30891  3837 29350.4595   32   18 29369.4070 29305.7943  0.22%   9.8  135s
     32497  3322     cutoff   26      29369.4070 29319.5672  0.17%  10.2  140s
     34275  2502     cutoff   35      29369.4070 29332.9114  0.12%  10.6  145s
     36362   941     cutoff   33      29369.4070 29353.0792  0.06%  10.9  150s

    The log snippet (95 lines about solving process) under 8 cases is:

    Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
    Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
    Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
    Model fingerprint: 0x7fc99f6a
    Variable types: 0 continuous, 19900 integer (19900 binary)
    Coefficient statistics:
      Matrix range     [1e+00, 1e+00]
      Objective range  [1e+01, 4e+03]
      Bounds range     [1e+00, 1e+00]
      RHS range        [2e+00, 2e+00]
    Presolve time: 0.04s
    Presolved: 200 rows, 19900 columns, 39800 nonzeros
    Variable types: 0 continuous, 19900 integer (19900 binary)

    Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)

        Nodes    |    Current Node    |     Objective Bounds      |     Work
     Expl Unexpl |  Obj  Depth IntInf | Incumbent    BestBd   Gap | It/Node Time

         0     0 27056.3397    0   44          - 27056.3397      -     -    0s
         0     0 27715.8457    0   34          - 27715.8457      -     -    1s
         0     0 27810.6488    0   66          - 27810.6488      -     -    1s
         0     0 27891.0050    0   46          - 27891.0050      -     -    1s
         0     0 27891.0050    0   46          - 27891.0050      -     -    1s
         0     2 28269.9233    0   40          - 28269.9233      -     -    3s
       118   120 28386.9552  104   18          - 28269.9233      -   2.1    5s
       490   492 29871.9338  438   12          - 28269.9233      -   2.2   10s
       766   768 32888.5359  678    8          - 28269.9233      -   2.6   15s
    *  797   797             702    33913.632716 28269.9233  16.6%   2.6   15s
    *  801   795             704    33445.732380 28269.9233  15.5%   2.7   16s
    H  808   755                    33436.614637 28741.9356  14.0%   2.7   17s
       810   756 32748.8882  639   20 33436.6146 28805.9819  13.8%   2.7   22s
    H  812   719                    32980.622476 28901.6668  12.4%   2.7   22s
    H  815   685                    32955.955635 28941.8916  12.2%   2.7   23s
    H  822   655                    32729.995194 29088.6910  11.1%   2.7   30s
    H  823   623                    32656.972270 29088.6910  10.9%   3.3   30s
    H  927   660                    32256.122317 29088.6910  9.82%   3.7   32s
    H 1031   699                    31924.196415 29088.6910  8.88%   4.2   34s
      1045   711 29783.8275  121    6 31924.1964 29088.6910  8.88%   4.2   35s
      1282   868 31747.8521  240   10 31924.1964 29088.6910  8.88%   5.1   40s
      1586  1056 30057.6787  150   13 31924.1964 29088.6910  8.88%   5.7   45s
      1923  1281 31251.0126  318   12 31924.1964 29088.6910  8.88%   6.1   50s
      2305  1493 30682.6351  133    6 31924.1964 29088.6910  8.88%   6.5   55s
      2688  1792 29843.1081   99   21 31924.1964 29088.6910  8.88%   6.7   60s
    H 3012  2089                    31802.363117 29088.6910  8.53%   6.5   64s
      3047  2126 31441.0030  295    8 31802.3631 29088.6910  8.53%   6.6   65s
      3490  2539 30375.7221  203   12 31802.3631 29088.6910  8.53%   6.5   70s
      3836  2867 29663.8587   64   12 31802.3631 29088.6910  8.53%   6.5   75s
    H 4012  3021                    31769.138981 29088.6910  8.44%   6.5   77s
      4209  3220 31153.2319  230    6 31769.1390 29088.6910  8.44%   6.4   80s
    * 4648  2371             207    30609.785858 29088.6910  4.97%   6.2   84s
    * 4666  2363             222    30599.716103 29088.6910  4.94%   6.2   84s
      4730  2423 30283.5904   81    6 30599.7161 29088.6910  4.94%   6.2   85s
      5318  2944 29337.5764   33   26 30599.7161 29088.6910  4.94%   6.2   90s
      5946  3495 29804.6214   73   12 30599.7161 29090.3735  4.93%   6.0   95s
      6509  3990 30553.8642  166    6 30599.7161 29091.8567  4.93%   6.0  100s
      7034  4474 29679.9788   81   19 30599.7161 29100.2090  4.90%   6.0  105s
      7585  4964 29895.3347   64   14 30599.7161 29108.5532  4.87%   6.0  110s
    * 7785  4273             153    30347.881419 29112.7718  4.07%   6.0  111s
      8211  4649 30128.2273  147   12 30347.8814 29118.3349  4.05%   5.9  115s
      8905  5234 29438.7625   44    6 30347.8814 29121.4767  4.04%   5.8  120s
      9503  5758 29248.8773   21   64 30347.8814 29126.8141  4.02%   6.0  125s
     10174  6347 29684.6457   43   21 30347.8814 29129.0950  4.02%   5.9  130s
    *10721  6814             135    30343.316106 29131.2986  3.99%   5.9  133s
    *10722  6788             135    30334.946511 29131.2986  3.97%   5.9  133s
     10911  6957 30255.7660  170    6 30334.9465 29131.8342  3.97%   5.9  135s
    *11453  3909              92    29790.702290 29136.7929  2.20%   5.9  139s
    *11466  3756              97    29761.636800 29136.7929  2.10%   5.9  139s
    *11535  3716              70    29752.022242 29137.1662  2.07%   5.9  140s
    *11536  3714              70    29751.714716 29137.1662  2.07%   5.9  140s
    *11561  3266              42    29678.142637 29137.4403  1.82%   5.9  140s
    H11873  3452                    29669.274324 29140.0459  1.78%   5.9  142s
    *11874  3357              40    29655.125980 29140.0459  1.74%   5.9  143s
     12164  3585 29507.0146   57    8 29655.1260 29141.6644  1.73%   5.9  145s
    *12278  3569              40    29641.631650 29142.3223  1.68%   5.9  145s
    *12289  2802              44    29548.615021 29142.3223  1.37%   5.9  145s
     12876  3207 29249.2750   19   43 29548.6150 29156.2383  1.33%   6.0  150s
    *13435  2641              42    29464.215146 29159.7805  1.03%   6.0  153s
     13639  2746 29455.1680   31   14 29464.2151 29161.1308  1.03%   6.0  155s
     14452  3215 29243.1795   26   62 29464.2151 29171.4097  0.99%   6.1  160s
     15262  3682 29358.2684   37   77 29464.2151 29182.8030  0.96%   6.1  165s
     16040  4125 29296.7689   21   52 29464.2151 29194.5989  0.92%   6.1  170s
     16844  4521 29332.2775   27   51 29464.2151 29200.1569  0.90%   6.2  175s
     17586  4918 29289.5524   29   60 29464.2151 29206.7207  0.87%   6.2  180s
     18197  5239 29398.8549   44   48 29464.2151 29210.3335  0.86%   6.3  185s
    *18500  3027              35    29375.914940 29211.5626  0.56%   6.3  187s
     18844  3140 29345.7430   22   36 29375.9149 29215.5909  0.55%   6.4  190s
     19552  3317 29357.9260   29   14 29375.9149 29222.2520  0.52%   6.5  195s
     20251  3496 29349.7913   26   20 29375.9149 29228.8713  0.50%   6.7  200s
     20972  3608     cutoff   32      29375.9149 29234.2899  0.48%   7.0  205s
     21689  3711 29374.7460   20   26 29375.9149 29240.7951  0.46%   7.2  210s
     22369  3821 29367.0213   28  110 29375.9149 29246.9056  0.44%   7.5  215s
     23068  3926 29303.0486   22   34 29375.9149 29252.5605  0.42%   7.6  220s
     23787  4016 29360.2784   32   51 29375.9149 29258.2020  0.40%   7.8  225s
     24481  4129 29333.0553   24   59 29375.9149 29262.3270  0.39%   8.0  230s
     25216  4155 29352.3492   22   16 29375.9149 29267.6988  0.37%   8.2  235s
     25886  4189 29371.1291   28   28 29375.9149 29272.6217  0.35%   8.4  240s
     26603  4213 29363.8694   20   18 29375.9149 29277.6986  0.33%   8.6  245s
     27338  4212 29349.8996   28   55 29375.9149 29283.4633  0.31%   8.8  250s
     28082  4207 29361.7154   29    8 29375.9149 29287.9059  0.30%   9.1  255s
     28837  4203 29365.2478   21   53 29375.9149 29292.1512  0.29%   9.2  260s
     29575  4193 infeasible   25      29375.9149 29297.3943  0.27%   9.4  265s
     30293  4179 29346.8473   35   41 29375.9149 29301.7524  0.25%   9.6  270s
    H30860  3843                    29369.407047 29305.3776  0.22%   9.8  273s
     31019  3796     cutoff   27      29369.4070 29307.2726  0.21%   9.8  275s
     31730  3595 29344.6533   35   56 29369.4070 29313.4701  0.19%  10.0  280s
     32468  3338     cutoff   25      29369.4070 29319.2369  0.17%  10.2  285s
     33062  3094 infeasible   23      29369.4070 29323.8124  0.16%  10.4  290s
     33418  2929     cutoff   20      29369.4070 29326.5822  0.15%  10.4  295s
     33918  2718 29360.2605   25   67 29369.4070 29329.6431  0.14%  10.5  300s
     34400  2429 29346.1475   27   53 29369.4070 29333.9145  0.12%  10.6  305s
     34973  2067     cutoff   24      29369.4070 29338.9773  0.10%  10.7  310s
     35716  1497     cutoff   26      29369.4070 29345.8693  0.08%  10.8  315s
     36228  1066     cutoff   27      29369.4070 29351.3598  0.06%  10.9  320s
    0
  • Jaromił Najman
    • Gurobi Staff

    It looks like something is interfering with the optimization tasks when 8 runs execute in parallel: the solution path looks the same, but reaching the first feasible point takes noticeably longer (15s instead of 9s) than with 4 parallel runs. Are you sure that all 8 cores are fully available and nothing is running in the background? Are all cores equal?
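
A comparison of this kind can be automated with a small log parser. The sketch below is a hedged illustration: it relies only on the log layout visible in the snippets above (new-incumbent lines start with `H` or `*`, the first number is the explored-node count, and the last column is the elapsed time), and the function names are hypothetical.

```python
def incumbent_times(log_text):
    """Map explored-node count -> elapsed seconds for new-incumbent lines."""
    times = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line[0] not in "H*":
            continue
        fields = line[1:].split()
        if len(fields) < 2 or not fields[-1].endswith("s"):
            continue
        try:
            nodes, seconds = int(fields[0]), int(fields[-1][:-1])
        except ValueError:
            continue  # not a node-log line after all
        times.setdefault(nodes, seconds)  # keep the first report per node count
    return times


def time_ratios(log_a, log_b):
    """Ratio of elapsed time (b / a) at incumbents that appear in both logs."""
    a, b = incumbent_times(log_a), incumbent_times(log_b)
    return {n: b[n] / a[n] for n in sorted(a.keys() & b.keys()) if a[n] > 0}
```

Applied to the two snippets above, the first incumbent (node 797) appears at 9s vs 15s and the last one (node 30860) at 134s vs 273s, so the identical search path is traversed at a markedly lower speed in the 8-case run.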

    0
  • junyan qiu
    • I am sure 8 cores are fully available. I can see the number of busy cores is also 8 (with 100% usage), and there is no other heavy task in the background when I experiment.
    • I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?
    0
  • Jaromił Najman
    • Gurobi Staff

    I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?

    I just tested your setting on a 24 core machine and could not observe the behavior you describe.

    I would guess that the caches of your cores fill up when you run more processes. Note that the L1 and L2 caches are often small compared to the L3 cache and RAM, but they are the fastest memory a processor can access. The more processes you start, the higher the probability that the smaller L1 and L2 caches no longer suffice and your processes have to take the "long" way to the L3 cache and/or RAM. This can result in significant slowdowns, depending on the speed differences between the cache levels.

    Best regards, 
    Jaromił

    1
  • David Torres Sanchez
    • Gurobi Staff

    Also posted here: https://or.stackexchange.com/q/8872/4752

    0
  • junyan qiu

    Thanks for your professional responses. When the number of cases goes up, the cache-miss ratio increases quickly, which can explain the significant slowdown in solving speed.

    0
