Gurobi with multiprocessing has few benefits
回答済みProblem:
I follow the tutorial in https://support.gurobi.com/hc/en-us/articles/360043111231-How-do-I-use-multiprocessing-in-Python-with-Gurobi- to use multiprocessing with gurobi, to solve multiple problems in parallel. Expected and unexcepted phenomenons appear together:
- Expected phenomenon appears: log file of each problem is written at the same time, and the number of busy cpu cores is close to the number of problems. So the multiprocessing module is working.
- Unexpected phenomenon appears: when I solve 4 problems with 4 processes, a single problem costs 3 minutes, and when I solve 8 problems with 8 processes, a single problem costs 6 minutes, which is strange since single problem should cost nearly 3 minutes with multiprocessing. This phenomenon indicates there are few benefits to use multiprocessing with Gurobi.
Possible reasons that have been excluded:
- The numebr of physical cores and logic cores are 12 and 24 respectively, which are larger than tested scale. So it is not because the problems are too many and cpu cores are too few.
- The space of remaining main memory is still large (several GBs) in all tested scales, so it is not because the main memory is run up and virtual memory is used.
- I ever tried to write a python script for each problem respectively (namely, 8 problems then 8 python script files), and run 8 python scripts at the same time. But above problem still exists. So it is not because multiprocessing module has some issue, and starting 8 processes directly by 8 python scripts also leads to such situation.
Possible reason that has NOT been excluded:
- Maybe Gurobi has a singleton library that is shared by all solving processes, and this singleton library is locked when it is used by a process. Thus multiprocessing has few benefits.
Code in Google Drive:
If above illustration cannot help find the reason, then the codes below may help. The codes consists of two files:
- A file modelling TSP (travelling salesman problem), which is originated from official example and modified by myself to experiment.
- A file using multiprocessing to solve multiple TSP in parallel to repeat the problem.
Code of TSP (this code has no logic problem, so just copy):
https://drive.google.com/file/d/1BrWSppre5QHG2Njpqs3gKkKprr-AEaQw/view?usp=sharing
Code of using multiprocessing:
https://drive.google.com/file/d/1JLLeHNSF29Z3Bkuo59OtrnODQkZxwa_7/view?usp=sharing
-
正式なコメント
This post is more than three years old. Some information may not be up to date. For current information, please check the Gurobi Documentation or Knowledge Base. If you need more help, please create a new post in the community forum, or try Gurobot, our chatbot interface offering instant, expert-level support. -
Hi,
Unexpected phenomenon appears: when I solve 4 problems with 4 processes, a single problem costs 3 minutes, and when I solve 8 problems with 8 processes, a single problem costs 6 minutes, which is strange since single problem should cost nearly 3 minutes with multiprocessing. This phenomenon indicates there are few benefits to use multiprocessing with Gurobi.
What you observe might be caused by cores with hyperthreading. At least I was able to reproduce your observation on a machine with 4 physical cores of which each uses hyperthreading providing 8 logical processors. When I run your code with \(\texttt{caseNum=8}\) vs \(\texttt{caseNum=4}\) I can observe the same behavior you described. This is because Gurobi (and other Software packages) usually can get out the most efficiency out of physical cores instead of logical processors, i.e., running 8 cases on 8 logical processors will be slower than running 4 cases on 4 physical cores, because you don't have to push 4 additional cases onto the virtual/logical processors. I am not able to observe this issue when comparing \(\texttt{caseNum=4}\) vs \(\texttt{caseNum=2}\), because then all cases can use physical cores.It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).
Best regards,
Jaromił1 -
Hi, Jaromił:
I am not able to observe this issue when comparing caseNum=4 vs caseNum=2, because then all cases can use physical cores.It should be best to limit the number of cases run in parallel to the number of physical cores your machine has (probably 4).
I also observe this behavior. When the number of cases is smaller than (or equal to) 4, all cases can be parallelized as expected. And when it is greater than 4, additional cases will compete for processors with others, which leads to longer solving time.
But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention of processors. Maybe this is the most strange phenomenon.
0 -
But I actually have 12 physical cores, so running 4 cases or 8 cases should NOT lead to contention of processors. Maybe this is the most strange phenomenon.
This is indeed strange. Currently, you are using randomly generated TSP models. Could you please fix the TSP data and try again? Maybe you are just getting really unlucky with the randomly generated data. If this is not it, could you please post a log snippet of one of the 8 cases runs vs the same in the 4 cases runs. Please use fixed TSP data for this experiment to avoid non-determinism.
0 -
Please use fixed TSP data for this experiment to avoid non-determinism.
I use a benchmark in TSPLIB as data, which consists of 200 points. Every TSP case is using the same data. Then the the number of cases and solving time of single case are:
- 1 case, 130 seconds
- 2 cases, 135 seconds
- 4 cases, 150 seconds
- 8 cases, 320 seconds
The log snippet (62 lines about solving process) under 4 cases is:
Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
Model fingerprint: 0x7fc99f6a
Variable types: 0 continuous, 19900 integer (19900 binary)
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e+01, 4e+03]
Bounds range [1e+00, 1e+00]
RHS range [2e+00, 2e+00]
Presolve time: 0.03s
Presolved: 200 rows, 19900 columns, 39800 nonzeros
Variable types: 0 continuous, 19900 integer (19900 binary)
Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 27056.3397 0 44 - 27056.3397 - - 0s
0 0 27715.8457 0 34 - 27715.8457 - - 1s
0 0 27810.6488 0 66 - 27810.6488 - - 1s
0 0 27891.0050 0 46 - 27891.0050 - - 1s
0 0 27891.0050 0 46 - 27891.0050 - - 1s
0 2 28269.9233 0 40 - 28269.9233 - - 2s
330 332 29266.8831 292 6 - 28269.9233 - 2.2 5s
* 797 797 702 33913.632716 28269.9233 16.6% 2.6 9s
* 801 795 704 33445.732380 28269.9233 15.5% 2.7 9s
H 808 755 33436.614637 28741.9356 14.0% 2.7 10s
810 756 32748.8882 639 20 33436.6146 28805.9819 13.8% 2.7 15s
H 812 719 32980.622476 28901.6668 12.4% 2.7 15s
H 815 685 32955.955635 28941.8916 12.2% 2.7 15s
H 822 655 32729.995194 29088.6910 11.1% 2.7 22s
H 823 623 32656.972270 29088.6910 10.9% 3.3 23s
H 927 660 32256.122317 29088.6910 9.82% 3.7 24s
1005 714 29710.3775 101 10 32256.1223 29088.6910 9.82% 4.1 25s
H 1031 699 31924.196415 29088.6910 8.88% 4.2 25s
1496 996 29757.7544 105 16 31924.1964 29088.6910 8.88% 5.5 30s
2217 1434 29862.7351 89 10 31924.1964 29088.6910 8.88% 6.3 35s
2980 2084 30887.6736 262 14 31924.1964 29088.6910 8.88% 6.5 40s
H 3012 2089 31802.363117 29088.6910 8.53% 6.5 40s
3724 2773 31449.2284 299 12 31802.3631 29088.6910 8.53% 6.4 45s
H 4012 3021 31769.138981 29088.6910 8.44% 6.5 47s
4629 3616 30281.8051 188 12 31769.1390 29088.6910 8.44% 6.2 50s
* 4648 2371 207 30609.785858 29088.6910 4.97% 6.2 50s
* 4666 2363 222 30599.716103 29088.6910 4.94% 6.2 50s
5890 3439 29232.5910 17 34 30599.7161 29090.3735 4.93% 6.1 55s
7045 4485 29827.9924 92 6 30599.7161 29100.2090 4.90% 6.0 60s
* 7785 4273 153 30347.881419 29112.7718 4.07% 6.0 63s
8306 4698 29470.2863 36 8 30347.8814 29119.3132 4.05% 5.9 65s
9685 5925 29872.4161 101 6 30347.8814 29127.0345 4.02% 6.0 70s
*10721 6814 135 30343.316106 29131.2986 3.99% 5.9 73s
*10722 6788 135 30334.946511 29131.2986 3.97% 5.9 73s
11069 7083 29659.9582 41 12 30334.9465 29132.5935 3.96% 5.9 75s
*11453 3909 92 29790.702290 29136.7929 2.20% 5.9 76s
*11466 3756 97 29761.636800 29136.7929 2.10% 5.9 76s
*11535 3716 70 29752.022242 29137.1662 2.07% 5.9 76s
*11536 3714 70 29751.714716 29137.1662 2.07% 5.9 77s
*11561 3266 42 29678.142637 29137.4403 1.82% 5.9 77s
H11873 3452 29669.274324 29140.0459 1.78% 5.9 78s
*11874 3357 40 29655.125980 29140.0459 1.74% 5.9 78s
*12278 3569 40 29641.631650 29142.3223 1.68% 5.9 79s
*12289 2802 44 29548.615021 29142.3223 1.37% 5.9 79s
12430 2888 29361.3788 36 38 29548.6150 29149.4528 1.35% 5.9 80s
*13435 2641 42 29464.215146 29159.7805 1.03% 6.0 82s
14252 3095 29453.4811 47 39 29464.2151 29167.5177 1.01% 6.1 85s
16208 4199 29337.9802 22 47 29464.2151 29195.7398 0.91% 6.1 90s
18016 5148 29406.1668 30 12 29464.2151 29208.6554 0.87% 6.3 95s
*18500 3027 35 29375.914940 29211.5626 0.56% 6.3 96s
19643 3338 29322.9716 18 78 29375.9149 29223.8827 0.52% 6.5 100s
21310 3654 cutoff 48 29375.9149 29238.1204 0.47% 7.1 105s
22914 3901 cutoff 30 29375.9149 29251.5040 0.42% 7.6 110s
24569 4131 cutoff 21 29375.9149 29263.2067 0.38% 8.0 115s
26153 4181 cutoff 34 29375.9149 29274.6952 0.34% 8.5 120s
27786 4209 29352.1452 19 74 29375.9149 29286.0138 0.31% 9.0 125s
29418 4194 29329.9900 28 28 29375.9149 29297.0957 0.27% 9.4 130s
H30860 3843 29369.407047 29305.3776 0.22% 9.8 134s
30891 3837 29350.4595 32 18 29369.4070 29305.7943 0.22% 9.8 135s
32497 3322 cutoff 26 29369.4070 29319.5672 0.17% 10.2 140s
34275 2502 cutoff 35 29369.4070 29332.9114 0.12% 10.6 145s
36362 941 cutoff 33 29369.4070 29353.0792 0.06% 10.9 150sThe log snippet (95 lines about solving process) under 8 cases is:
Gurobi Optimizer version 9.5.2 build v9.5.2rc0 (linux64)
Thread count: 12 physical cores, 24 logical processors, using up to 1 threads
Optimize a model with 200 rows, 19900 columns and 39800 nonzeros
Model fingerprint: 0x7fc99f6a
Variable types: 0 continuous, 19900 integer (19900 binary)
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e+01, 4e+03]
Bounds range [1e+00, 1e+00]
RHS range [2e+00, 2e+00]
Presolve time: 0.04s
Presolved: 200 rows, 19900 columns, 39800 nonzeros
Variable types: 0 continuous, 19900 integer (19900 binary)
Root relaxation: objective 2.705634e+04, 286 iterations, 0.01 seconds (0.00 work units)
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 27056.3397 0 44 - 27056.3397 - - 0s
0 0 27715.8457 0 34 - 27715.8457 - - 1s
0 0 27810.6488 0 66 - 27810.6488 - - 1s
0 0 27891.0050 0 46 - 27891.0050 - - 1s
0 0 27891.0050 0 46 - 27891.0050 - - 1s
0 2 28269.9233 0 40 - 28269.9233 - - 3s
118 120 28386.9552 104 18 - 28269.9233 - 2.1 5s
490 492 29871.9338 438 12 - 28269.9233 - 2.2 10s
766 768 32888.5359 678 8 - 28269.9233 - 2.6 15s
* 797 797 702 33913.632716 28269.9233 16.6% 2.6 15s
* 801 795 704 33445.732380 28269.9233 15.5% 2.7 16s
H 808 755 33436.614637 28741.9356 14.0% 2.7 17s
810 756 32748.8882 639 20 33436.6146 28805.9819 13.8% 2.7 22s
H 812 719 32980.622476 28901.6668 12.4% 2.7 22s
H 815 685 32955.955635 28941.8916 12.2% 2.7 23s
H 822 655 32729.995194 29088.6910 11.1% 2.7 30s
H 823 623 32656.972270 29088.6910 10.9% 3.3 30s
H 927 660 32256.122317 29088.6910 9.82% 3.7 32s
H 1031 699 31924.196415 29088.6910 8.88% 4.2 34s
1045 711 29783.8275 121 6 31924.1964 29088.6910 8.88% 4.2 35s
1282 868 31747.8521 240 10 31924.1964 29088.6910 8.88% 5.1 40s
1586 1056 30057.6787 150 13 31924.1964 29088.6910 8.88% 5.7 45s
1923 1281 31251.0126 318 12 31924.1964 29088.6910 8.88% 6.1 50s
2305 1493 30682.6351 133 6 31924.1964 29088.6910 8.88% 6.5 55s
2688 1792 29843.1081 99 21 31924.1964 29088.6910 8.88% 6.7 60s
H 3012 2089 31802.363117 29088.6910 8.53% 6.5 64s
3047 2126 31441.0030 295 8 31802.3631 29088.6910 8.53% 6.6 65s
3490 2539 30375.7221 203 12 31802.3631 29088.6910 8.53% 6.5 70s
3836 2867 29663.8587 64 12 31802.3631 29088.6910 8.53% 6.5 75s
H 4012 3021 31769.138981 29088.6910 8.44% 6.5 77s
4209 3220 31153.2319 230 6 31769.1390 29088.6910 8.44% 6.4 80s
* 4648 2371 207 30609.785858 29088.6910 4.97% 6.2 84s
* 4666 2363 222 30599.716103 29088.6910 4.94% 6.2 84s
4730 2423 30283.5904 81 6 30599.7161 29088.6910 4.94% 6.2 85s
5318 2944 29337.5764 33 26 30599.7161 29088.6910 4.94% 6.2 90s
5946 3495 29804.6214 73 12 30599.7161 29090.3735 4.93% 6.0 95s
6509 3990 30553.8642 166 6 30599.7161 29091.8567 4.93% 6.0 100s
7034 4474 29679.9788 81 19 30599.7161 29100.2090 4.90% 6.0 105s
7585 4964 29895.3347 64 14 30599.7161 29108.5532 4.87% 6.0 110s
* 7785 4273 153 30347.881419 29112.7718 4.07% 6.0 111s
8211 4649 30128.2273 147 12 30347.8814 29118.3349 4.05% 5.9 115s
8905 5234 29438.7625 44 6 30347.8814 29121.4767 4.04% 5.8 120s
9503 5758 29248.8773 21 64 30347.8814 29126.8141 4.02% 6.0 125s
10174 6347 29684.6457 43 21 30347.8814 29129.0950 4.02% 5.9 130s
*10721 6814 135 30343.316106 29131.2986 3.99% 5.9 133s
*10722 6788 135 30334.946511 29131.2986 3.97% 5.9 133s
10911 6957 30255.7660 170 6 30334.9465 29131.8342 3.97% 5.9 135s
*11453 3909 92 29790.702290 29136.7929 2.20% 5.9 139s
*11466 3756 97 29761.636800 29136.7929 2.10% 5.9 139s
*11535 3716 70 29752.022242 29137.1662 2.07% 5.9 140s
*11536 3714 70 29751.714716 29137.1662 2.07% 5.9 140s
*11561 3266 42 29678.142637 29137.4403 1.82% 5.9 140s
H11873 3452 29669.274324 29140.0459 1.78% 5.9 142s
*11874 3357 40 29655.125980 29140.0459 1.74% 5.9 143s
12164 3585 29507.0146 57 8 29655.1260 29141.6644 1.73% 5.9 145s
*12278 3569 40 29641.631650 29142.3223 1.68% 5.9 145s
*12289 2802 44 29548.615021 29142.3223 1.37% 5.9 145s
12876 3207 29249.2750 19 43 29548.6150 29156.2383 1.33% 6.0 150s
*13435 2641 42 29464.215146 29159.7805 1.03% 6.0 153s
13639 2746 29455.1680 31 14 29464.2151 29161.1308 1.03% 6.0 155s
14452 3215 29243.1795 26 62 29464.2151 29171.4097 0.99% 6.1 160s
15262 3682 29358.2684 37 77 29464.2151 29182.8030 0.96% 6.1 165s
16040 4125 29296.7689 21 52 29464.2151 29194.5989 0.92% 6.1 170s
16844 4521 29332.2775 27 51 29464.2151 29200.1569 0.90% 6.2 175s
17586 4918 29289.5524 29 60 29464.2151 29206.7207 0.87% 6.2 180s
18197 5239 29398.8549 44 48 29464.2151 29210.3335 0.86% 6.3 185s
*18500 3027 35 29375.914940 29211.5626 0.56% 6.3 187s
18844 3140 29345.7430 22 36 29375.9149 29215.5909 0.55% 6.4 190s
19552 3317 29357.9260 29 14 29375.9149 29222.2520 0.52% 6.5 195s
20251 3496 29349.7913 26 20 29375.9149 29228.8713 0.50% 6.7 200s
20972 3608 cutoff 32 29375.9149 29234.2899 0.48% 7.0 205s
21689 3711 29374.7460 20 26 29375.9149 29240.7951 0.46% 7.2 210s
22369 3821 29367.0213 28 110 29375.9149 29246.9056 0.44% 7.5 215s
23068 3926 29303.0486 22 34 29375.9149 29252.5605 0.42% 7.6 220s
23787 4016 29360.2784 32 51 29375.9149 29258.2020 0.40% 7.8 225s
24481 4129 29333.0553 24 59 29375.9149 29262.3270 0.39% 8.0 230s
25216 4155 29352.3492 22 16 29375.9149 29267.6988 0.37% 8.2 235s
25886 4189 29371.1291 28 28 29375.9149 29272.6217 0.35% 8.4 240s
26603 4213 29363.8694 20 18 29375.9149 29277.6986 0.33% 8.6 245s
27338 4212 29349.8996 28 55 29375.9149 29283.4633 0.31% 8.8 250s
28082 4207 29361.7154 29 8 29375.9149 29287.9059 0.30% 9.1 255s
28837 4203 29365.2478 21 53 29375.9149 29292.1512 0.29% 9.2 260s
29575 4193 infeasible 25 29375.9149 29297.3943 0.27% 9.4 265s
30293 4179 29346.8473 35 41 29375.9149 29301.7524 0.25% 9.6 270s
H30860 3843 29369.407047 29305.3776 0.22% 9.8 273s
31019 3796 cutoff 27 29369.4070 29307.2726 0.21% 9.8 275s
31730 3595 29344.6533 35 56 29369.4070 29313.4701 0.19% 10.0 280s
32468 3338 cutoff 25 29369.4070 29319.2369 0.17% 10.2 285s
33062 3094 infeasible 23 29369.4070 29323.8124 0.16% 10.4 290s
33418 2929 cutoff 20 29369.4070 29326.5822 0.15% 10.4 295s
33918 2718 29360.2605 25 67 29369.4070 29329.6431 0.14% 10.5 300s
34400 2429 29346.1475 27 53 29369.4070 29333.9145 0.12% 10.6 305s
34973 2067 cutoff 24 29369.4070 29338.9773 0.10% 10.7 310s
35716 1497 cutoff 26 29369.4070 29345.8693 0.08% 10.8 315s
36228 1066 cutoff 27 29369.4070 29351.3598 0.06% 10.9 320s0 -
It looks like something is interfering with the optimization tasks in the case of 8 parallel runs because reaching the first feasible point takes ~50% more time than with 4 parallel runs but the solution path looks the same. Are you sure that all 8 cores are fully available and nothing is running in the background? Are all cores equal?
0 -
- I am sure 8 cores are fully available. I can see the number of busy cores is also 8 (with 100% usage), and there is no other heavy task in the background when I experiment.
- I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?
0 -
I guess the problem may be the "magic" number 4. When I switch to another machine with 8 physical cores, the parallelism behaves the same weird way when the number of cases exceeds 4. Maybe 4 is written in the source code of Gurobi?
I just tested your setting on a 24 core machine and could not observe the behavior you describe.
I would guess that the L-Cache of your cores is full when you run more processes. Note that often L1 and L2 Cache are small compared to L3 Cache and RAM but are the fastest access points on a processor. The more processes you start, the bigger is the probability that the smaller L1 and L2 Caches do not suffice any longer and your processes have to take the "long" way to access the L3 Cache and/or RAM. This can result in significant slow downs depending on speed differences in accessing the different Cache Levels.
Best regards,
Jaromił1 -
Also posted here: https://or.stackexchange.com/q/8872/4752
0 -
Thanks for your responses, which are professional. When the number of cases goes up, the ratio of cache-miss increases quickly, which can explain the significant slowing down of solving speed.
0
投稿コメントは受け付けていません。
コメント
10件のコメント