figuring out resource requirements
I'm having trouble getting a MIP to solve even the root node. I'm running on a university cluster, and I need to request a specific amount of memory, nodes, and tasks per node when I submit my job. Currently I'm using 1 node, 1 task per node, and 96 GB of memory, but I don't get any feasible solution at the root node within 150 hours (the scheduler killed the job after this time elapsed). I'm not sure how to figure out what my bottleneck is. That is, would requesting more memory, nodes, or tasks per node help? The maximum time I can request per university rules is 168 hours.
Here is the Gurobi output:
Changed value of parameter method to 4
Prev: -1 Min: -1 Max: 5 Default: -1
Changed value of parameter mipgap to 0.1
Prev: 0.0001 Min: 0.0 Max: 1e+100 Default: 0.0001
Changed value of parameter timelimit to 9.99999999999e+11
Prev: 1e+100 Min: 0.0 Max: 1e+100 Default: 1e+100
Optimize a model with 1746714 rows, 7510471 columns and 25685499 nonzeros
Variable types: 5494157 continuous, 2016314 integer (0 binary)
Coefficient statistics:
Matrix range [1e+00, 1e+00]
Objective range [1e-02, 1e+00]
Bounds range [0e+00, 0e+00]
RHS range [2e-08, 2e+01]
Presolve removed 1379600 rows and 193993 columns (presolve time = 5s) ...
Presolve removed 1379600 rows and 194963 columns (presolve time = 12s) ...
Presolve removed 1379600 rows and 194963 columns (presolve time = 15s) ...
Presolve removed 1379600 rows and 194963 columns (presolve time = 20s) ...
Presolve removed 1384062 rows and 391291 columns (presolve time = 32s) ...
Presolve removed 1384062 rows and 391291 columns
Presolve time: 31.72s
Presolved: 362652 rows, 7119180 columns, 23175688 nonzeros
Variable types: 5170352 continuous, 1948828 integer (0 binary)
Deterministic concurrent LP optimizer: primal simplex, dual simplex, and barrier
Showing barrier log only...
Presolve removed 4 rows and 0 columns (presolve time = 9s) ...
Presolve removed 4 rows and 0 columns (presolve time = 16s) ...
Presolve removed 4 rows and 0 columns (presolve time = 25s) ...
Presolve removed 4 rows and 0 columns (presolve time = 25s) ...
Presolve removed 62350 rows and 0 columns (presolve time = 42s) ...
Presolve removed 62350 rows and 0 columns (presolve time = 46s) ...
Presolve removed 62350 rows and 0 columns (presolve time = 50s) ...
Presolve removed 62350 rows and 0 columns
Presolved: 300302 rows, 7119180 columns, 21067634 nonzeros
Root barrier log...
Elapsed ordering time = 5s
Elapsed ordering time = 10s
Elapsed ordering time = 15s
Elapsed ordering time = 20s
Elapsed ordering time = 25s
Elapsed ordering time = 30s
Elapsed ordering time = 35s
Elapsed ordering time = 40s
Elapsed ordering time = 45s
Elapsed ordering time = 50s
Elapsed ordering time = 55s
Elapsed ordering time = 60s
Elapsed ordering time = 65s
Elapsed ordering time = 70s
Elapsed ordering time = 75s
Elapsed ordering time = 80s
Elapsed ordering time = 85s
Elapsed ordering time = 90s
Elapsed ordering time = 95s
Elapsed ordering time = 100s
Elapsed ordering time = 105s
Elapsed ordering time = 110s
Elapsed ordering time = 115s
Elapsed ordering time = 120s
Elapsed ordering time = 125s
Elapsed ordering time = 130s
Elapsed ordering time = 135s
Elapsed ordering time = 140s
Elapsed ordering time = 145s
Elapsed ordering time = 150s
Elapsed ordering time = 155s
Elapsed ordering time = 160s
Elapsed ordering time = 165s
Elapsed ordering time = 170s
Elapsed ordering time = 175s
Elapsed ordering time = 180s
Elapsed ordering time = 185s
Elapsed ordering time = 190s
Elapsed ordering time = 195s
Elapsed ordering time = 200s
Elapsed ordering time = 205s
Elapsed ordering time = 210s
Elapsed ordering time = 215s
Elapsed ordering time = 220s
Elapsed ordering time = 225s
Elapsed ordering time = 230s
Elapsed ordering time = 235s
Elapsed ordering time = 240s
Elapsed ordering time = 245s
Elapsed ordering time = 250s
Elapsed ordering time = 255s
Elapsed ordering time = 260s
Elapsed ordering time = 265s
Elapsed ordering time = 270s
Elapsed ordering time = 275s
Elapsed ordering time = 280s
Elapsed ordering time = 285s
Elapsed ordering time = 290s
Elapsed ordering time = 295s
Elapsed ordering time = 300s
Elapsed ordering time = 305s
Elapsed ordering time = 310s
Elapsed ordering time = 315s
Elapsed ordering time = 320s
Elapsed ordering time = 325s
Elapsed ordering time = 330s
Elapsed ordering time = 335s
Elapsed ordering time = 340s
Elapsed ordering time = 345s
Elapsed ordering time = 350s
Elapsed ordering time = 355s
Elapsed ordering time = 360s
Elapsed ordering time = 365s
Elapsed ordering time = 370s
Elapsed ordering time = 375s
Elapsed ordering time = 380s
Elapsed ordering time = 385s
Elapsed ordering time = 390s
Elapsed ordering time = 395s
Elapsed ordering time = 400s
Elapsed ordering time = 405s
Elapsed ordering time = 410s
Elapsed ordering time = 415s
Elapsed ordering time = 420s
Elapsed ordering time = 425s
Elapsed ordering time = 430s
Elapsed ordering time = 435s
Elapsed ordering time = 440s
Elapsed ordering time = 445s
Elapsed ordering time = 450s
Elapsed ordering time = 455s
Elapsed ordering time = 460s
Elapsed ordering time = 465s
Elapsed ordering time = 470s
Elapsed ordering time = 475s
Elapsed ordering time = 480s
Elapsed ordering time = 485s
Elapsed ordering time = 490s
Elapsed ordering time = 495s
Elapsed ordering time = 500s
Elapsed ordering time = 505s
Elapsed ordering time = 510s
Elapsed ordering time = 515s
Elapsed ordering time = 520s
Elapsed ordering time = 525s
Elapsed ordering time = 530s
Elapsed ordering time = 535s
Elapsed ordering time = 540s
Elapsed ordering time = 545s
Elapsed ordering time = 550s
Elapsed ordering time = 555s
Elapsed ordering time = 560s
Elapsed ordering time = 565s
Elapsed ordering time = 570s
Elapsed ordering time = 575s
Elapsed ordering time = 580s
Ordering time: 582.59s
Barrier statistics:
AA' NZ : 1.606e+07
Factor NZ : 2.338e+08 (roughly 5.0 GBytes of memory)
Factor Ops : 4.625e+11 (roughly 6 seconds per iteration)
Threads : 21
Objective Residual
Iter Primal Dual Primal Dual Compl Time
0 2.14104074e+06 -4.81895447e+02 5.09e+02 0.00e+00 8.97e-01 801s
1 7.11378286e+05 -2.74727434e+02 1.17e+02 1.64e-01 2.71e-01 879s
2 8.92325467e+04 -8.37988613e+01 1.07e+01 5.21e-02 2.94e-02 961s
3 1.01690978e+04 2.23879878e+02 9.13e-01 1.41e-02 2.97e-03 1052s
4 6.87606154e+03 3.70929185e+02 5.77e-01 1.04e-02 1.93e-03 1135s
5 5.47958402e+03 4.66425951e+02 4.40e-01 9.00e-03 1.50e-03 1209s
6 3.92370482e+03 6.01129221e+02 3.01e-01 6.55e-03 1.01e-03 1290s
7 3.21463794e+03 6.50872480e+02 2.34e-01 6.06e-03 8.03e-04 1363s
8 2.81367772e+03 7.09854823e+02 1.97e-01 5.43e-03 6.74e-04 1443s
9 2.50785585e+03 7.88832411e+02 1.67e-01 4.40e-03 5.66e-04 1535s
10 2.30515291e+03 8.18416820e+02 1.45e-01 4.03e-03 4.94e-04 1619s
11 2.18872973e+03 8.41786747e+02 1.33e-01 3.66e-03 4.52e-04 1692s
12 1.98698996e+03 8.53225317e+02 1.11e-01 3.43e-03 3.80e-04 1765s
13 1.70591951e+03 8.73261434e+02 8.16e-02 3.14e-03 2.83e-04 1844s
14 1.50524627e+03 9.13061585e+02 5.90e-02 2.65e-03 2.06e-04 1934s
15 1.43077793e+03 9.38829753e+02 4.92e-02 2.22e-03 1.73e-04 2017s
16 1.34078220e+03 9.55502013e+02 3.81e-02 1.94e-03 1.36e-04 2090s
17 1.30443141e+03 9.64365372e+02 3.37e-02 1.80e-03 1.21e-04 2163s
18 1.26188097e+03 9.71776399e+02 2.84e-02 1.69e-03 1.03e-04 2243s
19 1.24318173e+03 9.81006477e+02 2.60e-02 1.56e-03 9.41e-05 2315s
20 1.23136682e+03 9.87226936e+02 2.42e-02 1.47e-03 8.80e-05 2397s
21 1.19107842e+03 9.96032252e+02 1.92e-02 1.34e-03 7.07e-05 2487s
22 1.18468316e+03 9.98817684e+02 1.84e-02 1.30e-03 6.77e-05 2567s
23 1.17675659e+03 1.00284307e+03 1.73e-02 1.25e-03 6.40e-05 2641s
24 1.16697507e+03 1.00584840e+03 1.60e-02 1.22e-03 5.94e-05 2715s
25 1.16393862e+03 1.01000412e+03 1.56e-02 1.17e-03 5.78e-05 2794s
26 1.15679740e+03 1.01237785e+03 1.46e-02 1.13e-03 5.42e-05 2881s
27 1.15309755e+03 1.01398380e+03 1.40e-02 1.08e-03 5.21e-05 2962s
28 1.15051453e+03 1.01588494e+03 1.36e-02 1.06e-03 5.08e-05 3036s
29 1.14698752e+03 1.01796793e+03 1.31e-02 1.03e-03 4.89e-05 3109s
30 1.13865080e+03 1.02460250e+03 1.20e-02 9.35e-04 4.44e-05 3189s
31 1.13588473e+03 1.02726898e+03 1.14e-02 8.71e-04 4.23e-05 3264s
32 1.11413570e+03 1.03094468e+03 8.28e-03 7.95e-04 3.15e-05 3346s
33 1.10863243e+03 1.03585426e+03 7.26e-03 7.16e-04 2.78e-05 3439s
34 1.10711859e+03 1.03644003e+03 7.01e-03 7.06e-04 2.69e-05 3517s
35 1.10617944e+03 1.03751652e+03 6.84e-03 6.88e-04 2.63e-05 3592s
36 1.10438535e+03 1.03850885e+03 6.51e-03 6.71e-04 2.51e-05 3667s
37 1.10194933e+03 1.03962376e+03 6.06e-03 6.60e-04 2.36e-05 3749s
38 1.09602995e+03 1.04622090e+03 5.07e-03 5.33e-04 1.95e-05 3838s
39 1.08833816e+03 1.04950192e+03 3.76e-03 4.65e-04 1.48e-05 3920s
40 1.08636339e+03 1.05236409e+03 3.23e-03 4.08e-04 1.29e-05 3994s
41 1.08482185e+03 1.05525674e+03 2.85e-03 3.46e-04 1.13e-05 4069s
42 1.08046365e+03 1.05648756e+03 2.13e-03 3.18e-04 8.77e-06 4150s
43 1.07909499e+03 1.05819151e+03 1.76e-03 2.82e-04 7.41e-06 4223s
44 1.07733449e+03 1.06099040e+03 1.33e-03 2.17e-04 5.71e-06 4305s
45 1.07441917e+03 1.06305913e+03 8.04e-04 1.68e-04 3.67e-06 4395s
46 1.07329827e+03 1.06439729e+03 5.76e-04 1.35e-04 2.74e-06 4477s
47 1.07283534e+03 1.06513192e+03 4.83e-04 1.16e-04 2.34e-06 4551s
48 1.07239447e+03 1.06560822e+03 3.99e-04 1.04e-04 1.99e-06 4625s
49 1.07170791e+03 1.06656638e+03 2.94e-04 7.84e-05 1.49e-06 4708s
50 1.07079037e+03 1.06774017e+03 1.70e-04 4.86e-05 8.72e-07 4798s
51 1.07055198e+03 1.06812932e+03 1.34e-04 3.66e-05 6.89e-07 4878s
52 1.07042692e+03 1.06823270e+03 1.18e-04 3.36e-05 6.17e-07 4951s
53 1.07021427e+03 1.06839200e+03 8.88e-05 2.94e-05 4.90e-07 5026s
54 1.06997158e+03 1.06872209e+03 5.86e-05 1.96e-05 3.30e-07 5108s
55 1.06986978e+03 1.06886642e+03 4.79e-05 1.54e-05 2.67e-07 5182s
56 1.06974904e+03 1.06892977e+03 3.60e-05 1.36e-05 2.10e-07 5262s
57 1.06961974e+03 1.06902069e+03 2.39e-05 1.09e-05 1.48e-07 5352s
58 1.06954436e+03 1.06903387e+03 1.66e-05 1.05e-05 1.17e-07 5432s
59 1.06953670e+03 1.06907585e+03 1.57e-05 9.15e-06 1.07e-07 5509s
60 1.06950703e+03 1.06910945e+03 1.27e-05 8.14e-06 9.06e-08 5586s
61 1.06948231e+03 1.06915488e+03 1.03e-05 6.72e-06 7.42e-08 5675s
62 1.06945894e+03 1.06921007e+03 8.14e-06 4.96e-06 5.71e-08 5781s
63 1.06944723e+03 1.06923197e+03 7.07e-06 4.28e-06 4.95e-08 5867s
64 1.06942867e+03 1.06924407e+03 5.46e-06 3.88e-06 4.10e-08 5946s
65 1.06942097e+03 1.06925561e+03 4.67e-06 3.51e-06 3.62e-08 6022s
66 1.06941312e+03 1.06926961e+03 3.89e-06 3.06e-06 3.10e-08 6097s
67 1.06940701e+03 1.06928080e+03 3.32e-06 2.74e-06 2.70e-08 6183s
68 1.06939554e+03 1.06929785e+03 2.53e-06 2.14e-06 2.08e-08 6281s
69 1.06939075e+03 1.06931139e+03 2.10e-06 1.70e-06 1.70e-08 6367s
70 1.06938553e+03 1.06932304e+03 1.71e-06 1.31e-06 1.35e-08 6443s
71 1.06938049e+03 1.06933206e+03 1.28e-06 1.01e-06 1.03e-08 6521s
72 1.06937552e+03 1.06934113e+03 9.38e-07 7.05e-07 7.42e-09 6608s
73 1.06936820e+03 1.06935027e+03 3.78e-07 3.99e-07 3.60e-09 6706s
74 1.06936675e+03 1.06935511e+03 2.67e-07 2.40e-07 2.38e-09 6792s
75 1.06936378e+03 1.06935871e+03 7.49e-08 1.18e-07 9.38e-10 6870s
76 1.06936276e+03 1.06936177e+03 2.60e-08 1.45e-08 2.09e-10 6947s
77 1.06936233e+03 1.06936202e+03 5.22e-09 6.27e-09 5.95e-11 7024s
78 1.06936228e+03 1.06936216e+03 3.63e-09 1.41e-09 2.32e-11 7110s
79 1.06936224e+03 1.06936218e+03 3.80e-09 7.23e-10 1.09e-11 7202s
80 1.06936222e+03 1.06936218e+03 2.32e-09 6.93e-10 7.81e-12 7288s
81 1.06936221e+03 1.06936220e+03 7.19e-10 1.04e-10 1.09e-12 7365s
Barrier solved model in 81 iterations and 7365.88 seconds
Optimal objective 1.06936221e+03
Root crossover log...
283099 DPushes remaining with DInf 0.0000000e+00 7391s
66660 DPushes remaining with DInf 1.5065884e-05 7423s
12685 DPushes remaining with DInf 0.0000000e+00 7436s
4892 DPushes remaining with DInf 0.0000000e+00 7445s
1543 DPushes remaining with DInf 0.0000000e+00 7452s
180 DPushes remaining with DInf 0.0000000e+00 7458s
7 DPushes remaining with DInf 0.0000000e+00 7462s
0 DPushes remaining with DInf 0.0000000e+00 7466s
9566 PPushes remaining with PInf 5.2399968e-06 7468s
8314 PPushes remaining with PInf 5.2399968e-06 7472s
8096 PPushes remaining with PInf 0.0000000e+00 7475s
5711 PPushes remaining with PInf 0.0000000e+00 7481s
4031 PPushes remaining with PInf 0.0000000e+00 7487s
2317 PPushes remaining with PInf 0.0000000e+00 7493s
464 PPushes remaining with PInf 0.0000000e+00 7502s
0 PPushes remaining with PInf 0.0000000e+00 7506s
Push phase complete: Pinf 0.0000000e+00, Dinf 5.1343183e-12 7508s
Root simplex log...
Iteration Objective Primal Inf. Dual Inf. Time
292423 1.0693622e+03 0.000000e+00 0.000000e+00 7530s
292423 1.0693622e+03 0.000000e+00 0.000000e+00 7569s
Concurrent spin time: 3510.74s (can be avoided by choosing Method=3)
Solved with barrier
Root relaxation: objective 1.069362e+03, 292423 iterations, 11026.43 seconds
Total elapsed time = 11088.89s
Total elapsed time = 19114.32s
Total elapsed time = 19702.05s
Total elapsed time = 21256.24s
Total elapsed time = 22426.80s
Nodes | Current Node | Objective Bounds | Work
Expl Unexpl | Obj Depth IntInf | Incumbent BestBd Gap | It/Node Time
0 0 1069.36220 0 3894 - 1069.36220 - - 23376s
0 0 1069.48074 0 4429 - 1069.48074 - - 69517s
0 0 1069.48074 0 4421 - 1069.48074 - - 70458s
0 0 1069.93352 0 4477 - 1069.93352 - - 183238s
0 0 1070.06944 0 4512 - 1070.06944 - - 208352s
0 0 1070.06966 0 4455 - 1070.06966 - - 210142s
0 0 1070.06967 0 4460 - 1070.06967 - - 210265s
0 0 1070.74120 0 4670 - 1070.74120 - - 322142s
0 0 1070.83038 0 4632 - 1070.83038 - - 336979s
0 0 1070.83141 0 4595 - 1070.83141 - - 341137s
0 0 1070.83143 0 4602 - 1070.83143 - - 341724s
0 0 1071.49069 0 4796 - 1071.49069 - - 521069s
<scheduler killed job>
The error message given by the scheduler when the job timed out was the following:
State: TIMEOUT (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 12-11:38:25
CPU Efficiency: 99.88% of 12-12:00:08 core-walltime
Job Wall-clock time: 6-06:00:04
Memory Utilized: 84.66 GB
Memory Efficiency: 90.30% of 93.75 GB
(I'm not sure why it says I'm using 2 cores per node, since my script requests 1?)
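The scheduler's figures can be cross-checked with a little arithmetic. The sketch below converts the three durations from the report into seconds and derives how many cores were allocated and how many were busy on average. The helper name `slurm_seconds` is made up for this illustration, and the `D-HH:MM:SS` format is assumed from the values shown above. It only confirms that two cores were accounted for and kept busy; it cannot tell why the scheduler allocated two.

```python
def slurm_seconds(stamp: str) -> int:
    """Convert a duration written as 'D-HH:MM:SS' or 'HH:MM:SS' to seconds."""
    days, _, clock = stamp.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


# Values copied from the scheduler report above.
cpu_used = slurm_seconds("12-11:38:25")   # "CPU Utilized"
wall = slurm_seconds("6-06:00:04")        # "Job Wall-clock time"
core_wall = slurm_seconds("12-12:00:08")  # core-walltime on the "CPU Efficiency" line

allocated_cores = core_wall / wall   # cores the scheduler accounted for
avg_busy_cores = cpu_used / wall     # cores that were actually busy on average
efficiency = cpu_used / core_wall    # should reproduce the reported percentage

print(f"wall-clock time : {wall / 3600:.1f} hours")
print(f"allocated cores : {allocated_cores:.2f}")
print(f"avg busy cores  : {avg_busy_cores:.2f}")
print(f"CPU efficiency  : {efficiency:.2%}")
```

The wall-clock time works out to 150 hours, which matches the requested limit, and the core-walltime is exactly twice the wall-clock time.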
-
There are a few things I notice from your log:
First of all, the error message from the scheduler says TIMEOUT. So I guess more time would help. :-)
Why does it time out? Is there a hard limit on the duration of such a run on your cluster?
Moreover, it looks as though your model is very difficult.
According to your log, you have 2016314 integer and 0 binary variables, and your bounds range is [0e+00, 0e+00]. Is it possible that your (integer) variables are unbounded? Unbounded integer variables make a model extremely hard to solve (in particular, if there is a huge number of them). If this is the case, you should either add bounds (as tight as possible) or relax the integrality.
In addition to that, it could be possible to speed up the solver by setting the right parameters. To start, I think you could save some time by choosing Method=2 instead of 4.
Silke
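A minimal sketch of the parameter change suggested above, assuming the model is driven from Python. `Model.setParam` is the gurobipy call for changing a parameter; the function name `use_barrier_only` and the `RecordingModel` stand-in are made up for this illustration so that the snippet runs without a Gurobi installation.

```python
class RecordingModel:
    """Stand-in for gurobipy.Model (assumption: only setParam is needed here)."""

    def __init__(self):
        self.params = {}

    def setParam(self, name, value):
        # A real gurobipy.Model exposes the same method name and argument order.
        self.params[name] = value


def use_barrier_only(model, mip_gap=0.1):
    """Solve LP relaxations with barrier (Method=2) instead of concurrent (Method=4)."""
    model.setParam("Method", 2)        # 2 = barrier
    model.setParam("MIPGap", mip_gap)  # keep the 10% gap used in the posted log
    return model


demo = use_barrier_only(RecordingModel())
print(demo.params)
```

With gurobipy installed, the same function can be applied to a real model, for example one loaded with `gurobipy.read("model.mps")` (a hypothetical file name), before calling `optimize()`.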
0 -
Hi Silke,
Thanks very much for the response. Yes, the university imposes a time limit of 168 hours. I accidentally set this experiment to 150 hours, so there's a chance those extra 18 hours will make a big difference (I'm running it again with the longer time limit), but I'm not optimistic.
By setting bounds on the variables, you mean that I should tell the solver if I know the final values should be within some range?
You suggest method=2 instead of method=3? As I was writing the message above I noticed that the log said the concurrent spin time can be avoided by choosing method=3...
Thanks again,
Jennifer
0 -
Hi Jennifer,
Yes, with the bounds I meant exactly what you say. Find an upper and a lower bound (if the lower bound is not already 0) on the integer variables. Otherwise, the bounds will be assumed to be +/- 2 billion (2,000,000,000), and I think this is not reasonable in almost all practical applications.
On the other hand, if an integer variable does need to have a very big range, you should reconsider whether it really needs to be integer or whether you can remove the integrality condition and round the result in the end. E.g., if such a variable models an amount of money (say in cents) and needs to have a very large range (because your problem involves millions of bucks), then you can make it continuous and round to the nearest cent (or dollar, or multiple of 100 dollars) without affecting the real-world solution quality. Does that make sense?
Setting good bounds or relaxing the integrality should make a much bigger difference to the running time than any parameter settings. (As an experiment, you could try to just make all your variables continuous and see whether this helps.)
As for the method, 3 will choose non-deterministic concurrent (i.e., multiple algorithms in parallel), 2 will choose the barrier. Setting it to 3 should provide a speedup, but since we already know from your first run that barrier wins, you might as well set it to use barrier only. (This could provide another, albeit probably small, speedup since the barrier then does not have to share resources with the other algorithms.)
Silke
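Both suggestions above, tightening bounds and dropping integrality as an experiment, can be sketched in a few lines. The code relies on the gurobipy variable attributes `VType`, `LB` and `UB`; the function names, the bound of 50, and the `SimpleNamespace` stand-ins for variables are assumptions made so the example runs without Gurobi. A sensible bound has to come from knowledge of the actual model.

```python
from types import SimpleNamespace

INTEGER = "I"     # value of gurobipy.GRB.INTEGER
CONTINUOUS = "C"  # value of gurobipy.GRB.CONTINUOUS


def tighten_integer_bounds(variables, upper, lower=0.0):
    """Clamp every integer variable to [lower, upper]; return how many were touched."""
    touched = 0
    for var in variables:
        if var.VType == INTEGER:
            var.LB = max(var.LB, lower)
            var.UB = min(var.UB, upper)
            touched += 1
    return touched


def relax_integrality(variables):
    """Experiment: turn every integer variable into a continuous one."""
    relaxed = 0
    for var in variables:
        if var.VType == INTEGER:
            var.VType = CONTINUOUS
            relaxed += 1
    return relaxed


# Stand-ins for gurobipy variables; 1e100 is how Gurobi represents "no bound".
demo_vars = [
    SimpleNamespace(VType="I", LB=0.0, UB=1e100),
    SimpleNamespace(VType="C", LB=0.0, UB=1e100),
    SimpleNamespace(VType="I", LB=0.0, UB=1e100),
]
print(tighten_integer_bounds(demo_vars, upper=50.0), "integer variables bounded")
print(relax_integrality(demo_vars), "integer variables relaxed")
```

On a real model, the list to pass would be `model.getVars()`. Running the relaxed version first is the cheaper experiment, since it shows how much of the running time is due to integrality at all.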
0 -
Thank you so much for the very helpful explanation! Am I correct that requesting more nodes or tasks per node will not help? Do you think more memory will help? The log says Factor NZ takes roughly 5.0 GB of memory, but the error message from the cluster says I used 84.66 GB. I'm not sure if there is an easy way to convert the Factor NZ memory amounts to total amount of memory needed for the whole computation?
0 -
I think that more nodes or more tasks might make sense if you wanted to do concurrent optimization, but since the solver gets stuck in the root, I don't expect this to help.
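To see how the time was distributed, here is a small sketch that uses only numbers copied from the log above: the root relaxation time and the (time, best bound) pairs of the twelve rows printed at node 0. Reading those repeated rows as successive rounds of root processing (cuts and heuristics) is my interpretation of the log, not something the log states.

```python
# (elapsed seconds, best bound) for each "0 0" row in the posted log.
root_rows = [
    (23376, 1069.36220), (69517, 1069.48074), (70458, 1069.48074),
    (183238, 1069.93352), (208352, 1070.06944), (210142, 1070.06966),
    (210265, 1070.06967), (322142, 1070.74120), (336979, 1070.83038),
    (341137, 1070.83141), (341724, 1070.83143), (521069, 1071.49069),
]
root_lp_seconds = 11026.43  # "Root relaxation: ... 11026.43 seconds"

last_time, last_bound = root_rows[-1]
lp_share = root_lp_seconds / last_time
bound_gain = last_bound - root_rows[0][1]

# Time between consecutive rows: how long each round at the root took.
gaps = [later[0] - earlier[0] for earlier, later in zip(root_rows, root_rows[1:])]

print(f"root LP share of logged time : {lp_share:.1%}")
print(f"bound improvement at the root: {bound_gain:.5f}")
print(f"longest single round         : {max(gaps) / 3600:.1f} hours")
```

The root LP accounts for only about 2% of the logged time. The rest was spent at the root node after the relaxation was solved, with no incumbent found.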
As for the memory, Gurobi does not log (or monitor) the total memory usage. So it is hard to say when these 84.66 GB were used, but I would guess this happened during the root relaxation (because of the barrier). Afterward, memory usage typically goes down significantly and then starts growing again. Can you get onto the machine and see (e.g. using top) how much memory Gurobi is using? If so, you should take note of these numbers and compare them for different times (e.g. towards the end of the root relaxation and afterward).
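As an alternative to watching top by hand, memory can be sampled from the /proc filesystem. This is a Linux-only sketch: it reads the `VmRSS` (current resident memory) and `VmHWM` (peak resident memory) fields of `/proc/<pid>/status`. The function names are made up for this illustration, and the sample text contains invented numbers so that the snippet runs on any machine.

```python
import re
from pathlib import Path

KB_PER_GIB = 1024 ** 2


def parse_status(text):
    """Extract VmRSS and VmHWM from /proc/<pid>/status text, converted to GiB."""
    pattern = r"^(VmRSS|VmHWM):\s+(\d+)\s+kB"
    return {
        name: int(kilobytes) / KB_PER_GIB
        for name, kilobytes in re.findall(pattern, text, flags=re.MULTILINE)
    }


def read_memory(pid="self"):
    """Read the memory fields of a running process (Linux only)."""
    return parse_status(Path(f"/proc/{pid}/status").read_text())


# Invented sample in the layout of /proc/<pid>/status, not taken from the job above.
sample = "Name:\tpython\nVmHWM:\t10485760 kB\nVmRSS:\t 5242880 kB\nThreads:\t2\n"
print(parse_status(sample))
```

Calling `read_memory(pid)` with the process ID of the Gurobi job at intervals, for example near the end of the root relaxation and again later, gives the comparison described above.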