Solve speed difference between Linux, Windows, and WSL2
Simply put, my question is: why is Gurobi on Windows significantly slower compared to Linux (and even WSL2)? And one step further: is there a simple setting I'm missing that would make Windows match the performance of Linux?
I was mostly using Gurobi directly on my Windows machine, but for fun I also tried the Compute Server docker image running in WSL2, just to see what the overhead of WSL2 was. It turns out WSL2 was _faster_ at solving than native Windows. This piqued my interest, so going one step further I also tried running on a Ubuntu Live USB. This turned out to be faster still. Note that this means I'm testing all of this on the exact same hardware.
I would expect some overhead when using WSL2, but otherwise I would expect native Windows and native Linux to be very similar in performance. I certainly would not expect native Windows to be that slow. The work units seem to not differ _as_ much as the wall time (especially for the seed runs of cost266-UUE), although I'm not sure what this means.
I tested with two models from the MIPLIB 2017 set, selected because they had ~5 minute runtimes (i.e. slow enough to measure differences in speed, but not so slow that I'd have to wait hours for results): bnatt400 and cost266-UUE.
Below I've tried to summarize the results:
bnatt400
--------
Windows CS: Explored 37709 nodes (38178539 simplex iterations) in 427.84 seconds (725.70 work units)
WSL2 CS: Explored 13446 nodes (12109196 simplex iterations) in 270.33 seconds (358.05 work units)
Linux CS: Explored 13446 nodes (12109196 simplex iterations) in 187.45 seconds (358.05 work units)
cost266-UUE
-----------
Windows CS: Explored 174342 nodes (16041645 simplex iterations) in 478.56 seconds (287.37 work units)
WSL2 CS: Explored 168890 nodes (16051099 simplex iterations) in 259.68 seconds (291.48 work units)
Linux CS: Explored 168890 nodes (16051099 simplex iterations) in 188.09 seconds (291.48 work units)
cost266-UUE seed runs
---------------------
Windows - default seed: Explored 174342 nodes (16041645 simplex iterations) in 478.56 seconds (287.37 work units)
Windows - 100: Explored 164319 nodes (17885279 simplex iterations) in 424.16 seconds (276.17 work units)
Windows - 1000: Explored 200445 nodes (21141094 simplex iterations) in 613.85 seconds (435.11 work units)
Windows - 10000: Explored 221163 nodes (21536535 simplex iterations) in 712.02 seconds (400.61 work units)
wsl2 - default seed: Explored 168890 nodes (16051099 simplex iterations) in 259.68 seconds (291.48 work units)
wsl2 - 100: Explored 170931 nodes (18453681 simplex iterations) in 277.35 seconds (272.70 work units)
wsl2 - 1000: Explored 215428 nodes (21704406 simplex iterations) in 372.57 seconds (420.57 work units)
wsl2 - 10000: Explored 157469 nodes (14988594 simplex iterations) in 234.69 seconds (284.30 work units)
-
Hi Timo,
This is an interesting observation.
Let me try to explain what might be happening here.
You can see that on WSL2 and Linux, the path taken by Gurobi is the same (or at least seems to be). This can be seen from the exact same number of explored nodes, simplex iterations, and work units. Here, native Linux is faster, which is expected because WSL2 is "just" a Windows Subsystem and may not be as performant as native Linux.
Now, let's compare Windows to Linux in your tests. For case bnatt400, the work units / second ratio for Windows is 725.7/427.84 ~ 1.7 work units per second. For WSL2 we have 358.05/270.33 ~ 1.3. For Linux we have 358.05/187.45 ~ 1.9. So even though Windows took more seconds to solve the model, the solution speed, i.e., work units per second, seems comparable to Linux. The difference in seconds needed to solve the model can be explained by the fact that on Windows a different solution path has been taken. A different path can be caused by the operating system itself, since different operating systems use the hardware slightly differently.
For case cost266-UUE, we have 0.6 work units per second on Windows, 1.1 for WSL2, and 1.5 for Linux. This definitely sounds surprising, but it can still happen for single instances (outliers). It may be that the different path chosen on Windows for the cost266-UUE case causes this drop in work units per second. However, without a larger overall benchmark, it is very hard to tell.
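For reference, the work-unit throughput figures above can be recomputed directly from the summary lines quoted in this thread (a small sketch; the numbers are copied from the logs above):

```python
# Work-unit throughput (WU/sec) from the log summaries quoted in this thread.
runs = {
    "bnatt400": {
        "Windows": (725.70, 427.84),  # (work units, seconds)
        "WSL2":    (358.05, 270.33),
        "Linux":   (358.05, 187.45),
    },
    "cost266-UUE": {
        "Windows": (287.37, 478.56),
        "WSL2":    (291.48, 259.68),
        "Linux":   (291.48, 188.09),
    },
}

for model, oses in runs.items():
    for os_name, (wu, sec) in oses.items():
        print(f"{model:12s} {os_name:8s} {wu / sec:.1f} WU/sec")
```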
You say that you tested all 3 operating systems on the same machine. Are you sure that Windows was able to use all its resources and that no background processes were running? This may be quite difficult to achieve on Windows without particular permissions.
I hope this helps.
Best regards,
Jaromił
> Are you sure that Windows was able to use all its resources and that no background processes were running? This may be quite difficult to achieve on Windows without particular permissions.
I mean, WSL2 runs alongside the same background processes, so surely that can't explain the difference between 0.6 WU/sec and 1.1 WU/sec. There are certainly more background processes running on Windows, but nothing particularly resource intensive. And with 8 cores, and MIP solving not scaling perfectly, some of those cores are typically unused (i.e. free for the background processes to use).
> For case cost266-UUE, we have 0.6 work units per second on Windows, 1.1 for WSL2, and 1.5 for Linux. This definitely sounds surprising, but it can still happen for single instances (outliers). It may be that the different path chosen on Windows for the cost266-UUE case causes this drop in work units per second. However, without a larger overall benchmark, it is very hard to tell.
If it is indeed outliers, I would expect Windows to sometimes be _faster_ than Linux as well. I'll see if I can take a look at a bigger data set, i.e. more MIPLIB 2017 cases (but skipping the timeout ones in https://plato.asu.edu/ftp/milp_tables/8threads.res ).
Back with more testing. I got a new PC with a Ryzen 7700X, which I freshly installed with Windows 11. Then I performed the Linux tests using the docker image running on an Ubuntu 23.04 Live USB. And lastly, I installed WSL2 in Windows to run the docker image there for the WSL2 tests.
As mentioned above, I ran 227 cases from the MIPLIB 2017 benchmark set, skipping the ones with timeouts to save a bit of time. I ran every case with 4 different seeds (default, 10, 100, and 1000).
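Roughly, the test driver just loops over all case files and seeds. A minimal sketch of how the run commands can be generated (the case names and file naming below are illustrative placeholders; `Seed` and `LogFile` are real gurobi_cl parameters, and the default seed is obtained by simply not passing `Seed`):

```python
# Generate one gurobi_cl command per (case, seed) combination.
# The two case names here are just examples; the real benchmark had 227 cases.
def make_commands(cases, seeds):
    commands = []
    for case in cases:
        for seed in seeds:
            cmd = "gurobi_cl"
            if seed is not None:  # None = leave Gurobi's default seed untouched
                cmd += f" Seed={seed}"
            cmd += f" LogFile={case}_seed{seed}.log {case}.mps.gz"
            commands.append(cmd)
    return commands

cmds = make_commands(["bnatt400", "cost266-UUE"], [None, 10, 100, 1000])
print(len(cmds))  # 2 cases x 4 seeds = 8 runs
```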
Summary:
Time:
- Linux is about 25% faster than Windows on average; median is about 30% faster.
- WSL2 is about 5% slower than Windows on average; median is about 2% slower.
WU/sec:
- Linux is about 70% faster than Windows on average; median is about 40% faster.
- WSL2 is about 20% faster than Windows on average; median is about 1-2% slower.
Pictures:
I have all the log files, but pictures are easier to understand (and I cannot upload zip files anyway). The vertical axis is performance; the horizontal axis is the average run time of the particular problem on Windows. Note that the latter is log scale.
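The numbers behind the plots come from the final summary line of each log. A sketch of extracting them (the line format matches the log lines quoted earlier in this thread):

```python
import re

# Matches Gurobi's final MIP summary line, e.g.:
# "Explored 13446 nodes (12109196 simplex iterations) in 187.45 seconds (358.05 work units)"
SUMMARY = re.compile(
    r"Explored (\d+) nodes \((\d+) simplex iterations\) "
    r"in ([\d.]+) seconds \(([\d.]+) work units\)"
)

def parse_summary(line):
    m = SUMMARY.search(line)
    if m is None:
        return None
    nodes, iters = int(m.group(1)), int(m.group(2))
    seconds, work_units = float(m.group(3)), float(m.group(4))
    return nodes, iters, seconds, work_units

line = ("Explored 13446 nodes (12109196 simplex iterations) "
        "in 187.45 seconds (358.05 work units)")
print(parse_summary(line))  # (13446, 12109196, 187.45, 358.05)
```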
And just to show that the docker runs are deterministic:
Some interesting unrelated findings:
- Case gmu-35-50 is _a lot_ slower on Linux/Docker compared to Windows: at minimum 4.5 times the work units; the max was 50 times.
- Conversely, gfd-schedulen180f7d50m30k18 is a lot faster on Linux. At minimum it's about the same (1.03 times the work units), because then Windows is _also_ slow, but in 3 runs Windows was 600 times slower than Linux.
Questions I have:
This is going very much into detail on software, compilation, scheduling, etc, but:
- I would expect Gurobi to spend most of its time in hand-written assembly routines (or at the very least intrinsics). I can try to figure that out myself, but is this not the case?
- If this _is_ the case, where else could the performance difference come from? A 25% performance (time) difference on average is a lot; surely the scheduler on Windows can't be that bad?
I don't really expect answers to these, but hopefully there's someone at Gurobi who likes performance/profiling/assembly/schedulers as much as I do :)