By Connor Mowry (cmowry), Wenqi Deng (wenqid)
Our project focuses on parallelizing the BFS and DFS phases of Dinic's max-flow algorithm using OpenMP and MPI to improve performance. Over the past few weeks, we have made significant progress implementing the sequential baseline and the OpenMP version of Dinic's algorithm and measuring their performance.
We started by implementing and testing the sequential version of Dinic's algorithm, establishing a solid foundation for subsequent parallelization. In the first phase of parallelization with OpenMP, we focused on the BFS component, constructing the layer graph in parallel and exploring strategies at different granularities (parallelizing by vertex and by edge) to balance load and optimize performance.
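To make the vertex-level strategy concrete, the following is a minimal sketch of how the layer graph can be built level by level with OpenMP. It assumes a CSR-style graph layout and uses a compare-and-swap to claim each newly discovered vertex exactly once; the names (`Graph`, `build_levels`) are illustrative placeholders, not our actual code.

```cpp
#include <omp.h>
#include <vector>

struct Graph {
    int n;                       // number of vertices
    std::vector<int> offsets;    // CSR row offsets, size n + 1
    std::vector<int> edges;      // edge targets
    std::vector<int> cap;        // residual capacity of each edge
};

// Builds level[] for the layer graph; returns true if the sink is reachable.
bool build_levels(const Graph& g, int source, int sink, std::vector<int>& level) {
    level.assign(g.n, -1);
    level[source] = 0;
    std::vector<int> frontier = {source};

    while (!frontier.empty()) {
        std::vector<int> next;
        // Each thread expands a chunk of the current frontier (by vertex)
        // and gathers its discoveries privately to limit contention.
        #pragma omp parallel
        {
            std::vector<int> local_next;
            #pragma omp for schedule(dynamic, 64) nowait
            for (long i = 0; i < (long)frontier.size(); ++i) {
                int u = frontier[i];
                for (int e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
                    if (g.cap[e] <= 0) continue;              // skip saturated edges
                    int v = g.edges[e];
                    int expected = -1;
                    // Claim v exactly once via compare-and-swap (GCC/Clang builtin).
                    if (__atomic_compare_exchange_n(&level[v], &expected, level[u] + 1,
                                                    false, __ATOMIC_RELAXED, __ATOMIC_RELAXED)) {
                        local_next.push_back(v);
                    }
                }
            }
            #pragma omp critical
            next.insert(next.end(), local_next.begin(), local_next.end());
        }
        frontier.swap(next);
    }
    return level[sink] != -1;
}
```

The edge-based variant differs only in that the inner loop over each vertex's edges becomes the unit of parallel work, producing many more but much smaller tasks.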
For the DFS phase, where blocking flows are computed, we addressed the dependencies created by concurrent capacity updates with two approaches: (1) locking edges along the path being explored to prevent conflicts, and (2) letting threads explore freely and updating residual capacities only after a path successfully reaches the sink. These correspond to the forward-lock and reverse-lock variants reported in the tables below.
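As an illustration of the first, edge-locking approach, here is a rough sketch of a DFS push that holds a per-edge lock while it extends the path. It assumes the CSR layout from the previous sketch plus one `omp_lock_t` per edge (initialized elsewhere with `omp_init_lock`); `ResidualGraph` and `dfs_push` are placeholder names, not our exact implementation.

```cpp
#include <omp.h>
#include <algorithm>
#include <vector>

struct ResidualGraph {
    int n;
    std::vector<int> offsets, edges, cap;   // CSR layout as before
    std::vector<int> rev;                   // rev[e] = index of the reverse edge of e
    std::vector<omp_lock_t> edge_lock;      // one lock per edge, omp_init_lock'd elsewhere
    std::vector<int> level;                 // levels from the BFS phase
};

// Tries to push up to `limit` units of flow from u toward the sink along the
// layer graph. Each edge is locked before its capacity is read or updated, so
// two threads never race on the same capacity entry; forward locks are taken
// in increasing level order along a path, which avoids deadlock.
int dfs_push(ResidualGraph& g, int u, int sink, int limit) {
    if (u == sink) return limit;
    for (int e = g.offsets[u]; e < g.offsets[u + 1]; ++e) {
        int v = g.edges[e];
        if (g.level[v] != g.level[u] + 1) continue;   // stay on the layer graph
        omp_set_lock(&g.edge_lock[e]);
        int avail = std::min(limit, g.cap[e]);
        if (avail > 0) {
            int pushed = dfs_push(g, v, sink, avail); // edge lock held along the path
            if (pushed > 0) {
                g.cap[e] -= pushed;                   // safe: we hold this edge's lock
                omp_unset_lock(&g.edge_lock[e]);
                omp_set_lock(&g.edge_lock[g.rev[e]]);
                g.cap[g.rev[e]] += pushed;            // reverse edge updated under its own lock
                omp_unset_lock(&g.edge_lock[g.rev[e]]);
                return pushed;
            }
        }
        omp_unset_lock(&g.edge_lock[e]);
    }
    return 0;
}
```

Each thread would repeatedly call `dfs_push` from the source inside a parallel region until no augmenting path remains; the second approach instead records the path without holding locks and applies the capacity updates only once the sink is reached.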
In addition, we have started implementing a basic MPI version to parallelize the BFS phase. In this approach, we divide the graph's vertices across processors, with each processor exploring the vertices assigned to it. Our next steps will focus on reducing communication overhead and improving the overall efficiency of the MPI BFS, and on starting the MPI implementation of DFS.
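The sketch below shows the overall shape of this distributed, level-synchronous BFS under a few assumptions: vertices are block-partitioned across ranks, each rank stores the adjacency lists of only the vertices it owns, and the frontier exchange uses `MPI_Alltoallv`. Function and variable names (`mpi_bfs_levels`, `owner`, etc.) are placeholders rather than our actual interface.

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Owner of a globally numbered vertex under a block partition.
static int owner(int v, int verts_per_rank) { return v / verts_per_rank; }

void mpi_bfs_levels(const std::vector<std::vector<int>>& local_adj, // adjacency of owned vertices
                    int first_owned, int verts_per_rank,
                    int source, std::vector<int>& local_level,      // level of each owned vertex
                    MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    std::fill(local_level.begin(), local_level.end(), -1);
    std::vector<int> frontier;                              // owned vertices on the current level
    if (owner(source, verts_per_rank) == rank) {
        local_level[source - first_owned] = 0;
        frontier.push_back(source);
    }

    for (int depth = 1; ; ++depth) {
        // 1. Expand the local frontier; bucket discovered neighbors by owner.
        std::vector<std::vector<int>> outgoing(nprocs);
        for (int u : frontier)
            for (int v : local_adj[u - first_owned])
                outgoing[owner(v, verts_per_rank)].push_back(v);

        // 2. Exchange candidate vertices with an all-to-all.
        std::vector<int> sendcounts(nprocs), recvcounts(nprocs);
        for (int r = 0; r < nprocs; ++r) sendcounts[r] = (int)outgoing[r].size();
        MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);

        std::vector<int> sdispls(nprocs, 0), rdispls(nprocs, 0);
        for (int r = 1; r < nprocs; ++r) {
            sdispls[r] = sdispls[r - 1] + sendcounts[r - 1];
            rdispls[r] = rdispls[r - 1] + recvcounts[r - 1];
        }
        std::vector<int> sendbuf(sdispls[nprocs - 1] + sendcounts[nprocs - 1]);
        for (int r = 0; r < nprocs; ++r)
            std::copy(outgoing[r].begin(), outgoing[r].end(), sendbuf.begin() + sdispls[r]);
        std::vector<int> recvbuf(rdispls[nprocs - 1] + recvcounts[nprocs - 1]);
        MPI_Alltoallv(sendbuf.data(), sendcounts.data(), sdispls.data(), MPI_INT,
                      recvbuf.data(), recvcounts.data(), rdispls.data(), MPI_INT, comm);

        // 3. Mark newly discovered owned vertices and form the next frontier.
        frontier.clear();
        for (int v : recvbuf) {
            int local = v - first_owned;
            if (local_level[local] == -1) {
                local_level[local] = depth;
                frontier.push_back(v);
            }
        }

        // 4. Stop once every rank has an empty frontier.
        int local_work = frontier.empty() ? 0 : 1, any_work = 0;
        MPI_Allreduce(&local_work, &any_work, 1, MPI_INT, MPI_MAX, comm);
        if (!any_work) break;
    }
}
```

Reducing the cost of this frontier exchange, for example by switching to point-to-point messages between ranks that actually share edges, is one of the optimizations we plan to evaluate.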
We observed some speedup from parallelizing the BFS phase when testing on large networks. For most of the test cases, parallelizing by vertex performs better than parallelizing by edge, likely because the edge-based approach creates a very large number of small tasks and the overhead of managing them becomes significant. However, on most of the Erdős–Rényi networks, parallelizing by edge outperforms the vertex-based approach, potentially because a small number of vertices carry a large share of the edges in these graphs, causing load imbalance across threads in the vertex-based approach.
Although the speedup of the parallel DFS phase over the sequential version was only slight in some test cases, we observed measurable improvements when running with more threads compared to one thread. The limited speedup over the sequential baseline can be attributed to the high overhead of managing a large number of locks and, in the reverse-lock approach, to threads doing work that does not contribute to the final solution (for example, when an edge on a thread's path is later saturated by another thread).
We measured the computation time for calculating the max flow on networks of varying models and sizes, using both the GHC and PSC machines. Each table below corresponds to one test network; all times are reported in milliseconds, and the column headers give the number of threads.

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 0.229 | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.245 | 0.264 | 0.334 | 0.343 |
| OpenMP BFS (by edge) | 0.425 | 0.505 | 0.529 | 0.389 |
| OpenMP DFS (forward lock) | 0.433 | 0.446 | 0.464 | 0.403 |
| OpenMP DFS (reverse lock) | 3.31 | 2.45 | 2.05 | 1.83 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 0.121 | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.161 | 0.259 | 0.566 | 0.804 |
| OpenMP BFS (by edge) | 1.427 | 1.005 | 0.729 | 0.981 |
| OpenMP DFS (forward lock) | 0.259 | 0.426 | 0.327 | 0.615 |
| OpenMP DFS (reverse lock) | 0.540 | 0.269 | 0.302 | 0.337 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 0.128 | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.321 | 0.213 | 0.201 | 0.310 |
| OpenMP BFS (by edge) | 1.234 | 0.416 | 0.379 | 0.510 |
| OpenMP DFS (forward lock) | 0.038 | 0.098 | 0.144 | 0.189 |
| OpenMP DFS (reverse lock) | 0.160 | 0.164 | 0.252 | 0.307 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 66.7 | NA | NA | NA |
| OpenMP BFS (by vertex) | 68.3 | 54.2 | 51.8 | 50.5 |
| OpenMP BFS (by edge) | 124.1 | 71.2 | 44.4 | 30.2 |
| OpenMP DFS (forward lock) | 162.7 | 116.2 | 93.6 | 83.2 |
| OpenMP DFS (reverse lock) | 1649.2 | 1460.3 | 1277.1 | 1156.3 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 8.76 | NA | NA | NA |
| OpenMP BFS (by vertex) | 5.58 | 4.63 | 4.33 | 4.36 |
| OpenMP BFS (by edge) | 84.6 | 48.3 | 29.7 | 20.1 |
| OpenMP DFS (forward lock) | 12.7 | 10.3 | 8.88 | 8.68 |
| OpenMP DFS (reverse lock) | 72.2 | 54.1 | 47.1 | 43.7 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 1.43 | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.54 | 1.28 | 0.27 | 1.35 |
| OpenMP BFS (by edge) | 149.4 | 90.4 | 60.5 | 44.7 |
| OpenMP DFS (forward lock) | 0.56 | 0.69 | 0.72 | 0.80 |
| OpenMP DFS (reverse lock) | 2.53 | 2.93 | 3.11 | 3.19 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 228.7 | NA | NA | NA |
| OpenMP BFS (by vertex) | 236.4 | 194.4 | 184.7 | 179.3 |
| OpenMP BFS (by edge) | 387.4 | 225.0 | 144.7 | 101.5 |
| OpenMP DFS (forward lock) | 592.2 | 407.1 | 311.9 | 276.1 |
| OpenMP DFS (reverse lock) | 7519.04 | 6589.8 | 6536.5 | 6613.1 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 19.7 | NA | NA | NA |
| OpenMP BFS (by vertex) | 19.5 | 16.7 | 16.1 | 16.2 |
| OpenMP BFS (by edge) | 234.4 | 136.8 | 88.1 | 61.6 |
| OpenMP DFS (forward lock) | 46.0 | 35.9 | 28.9 | 28.8 |
| OpenMP DFS (reverse lock) | 251.8 | 168.2 | 142.8 | 132.7 |

| Implementation | 1 | 2 | 4 | 8 |
|---|---|---|---|---|
| Sequential | 0.136 | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.242 | 0.506 | 3.91 | 1.150 |
| OpenMP BFS (by edge) | 110.2 | 60.8 | 36.4 | 23.2 |
| OpenMP DFS (forward lock) | 0.296 | 0.395 | 0.403 | 0.453 |
| OpenMP DFS (reverse lock) | 1.27 | 0.919 | 0.877 | 0.955 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 1284.5 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 1348.2 | 931.3 | 793.8 | 733.3 | 793.4 | 810.5 | 823.8 |
| OpenMP BFS (by edge) | 2025.3 | 1413.8 | 1223.9 | 817.4 | 879.7 | 954.8 | 771.3 |
| OpenMP DFS (forward lock) | 3535.5 | 2460.7 | 1841.3 | 1573.2 | 1893.9 | 1839.9 | 1791.6 |
| OpenMP DFS (reverse lock) | 47954 | 34909 | 31844 | 27034 | 25403 | 86543 | 86407 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 160.1 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 158.6 | 121.5 | 103.0 | 100.1 | 124.4 | 149.5 | 151.4 |
| OpenMP BFS (by edge) | 1753.4 | 1383.5 | 986.8 | 612.3 | 591.2 | 611.8 | 400.2 |
| OpenMP DFS (forward lock) | 347.2 | 270.4 | 226.3 | 210.6 | 259.4 | 303.2 | 297.5 |
| OpenMP DFS (reverse lock) | 3247.1 | 2640.3 | 2675.7 | 3345.8 | 3106.4 | 5328.0 | 9336.0 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 0.453 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.735 | 2.509 | 2.55 | 0.828 | 2.31 | 8.83 | 9.27 |
| OpenMP BFS (by edge) | 772.3 | 826.2 | 494.5 | 313.2 | 334.7 | 339.3 | 227.5 |
| OpenMP DFS (forward lock) | 1.62 | 2.94 | 3.53 | 3.77 | 4.34 | 5.89 | 6.78 |
| OpenMP DFS (reverse lock) | 5.83 | 8.53 | 10.2 | 11.2 | 15.6 | 16.8 | 16.7 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 1764.9 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 1825.5 | 1391.2 | 1194.6 | 1114.2 | 1168.2 | 1204.2 | 1195.7 |
| OpenMP BFS (by edge) | 2546.9 | 1775.5 | 1564.7 | 1060.5 | 1105.2 | 1325.7 | 1124.0 |
| OpenMP DFS (forward lock) | 5272.2 | 3586.0 | 2567.6 | 2033.0 | 2214.6 | 2219.8 | 2063.5 |
| OpenMP DFS (reverse lock) | 80849 | 57050 | 51330 | 59283.9 | 42305 | 121861 | 143007 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 250.1 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 241.0 | 183.3 | 151.2 | 149.1 | 171.5 | 201.1 | 200.2 |
| OpenMP BFS (by edge) | 2774.0 | 2371.1 | 1732.0 | 934.1 | 933.0 | 946.5 | 626.4 |
| OpenMP DFS (forward lock) | 574.0 | 409.0 | 342.1 | 309.7 | 387.1 | 425.6 | 451.9 |
| OpenMP DFS (reverse lock) | 5254.4 | 4784.1 | 4456.0 | 4732.6 | 5173.3 | 10112 | 8030.3 |

| Implementation | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|
| Sequential | 0.630 | NA | NA | NA | NA | NA | NA |
| OpenMP BFS (by vertex) | 0.914 | 6.58 | 4.21 | 4.65 | 2.57 | 8.68 | 5.57 |
| OpenMP BFS (by edge) | 1152.7 | 884.8 | 853.5 | 515.0 | 485.0 | 498.8 | 326.3 |
| OpenMP DFS (forward lock) | 2.17 | 3.97 | 5.87 | 6.31 | 5.79 | 7.42 | 8.00 |
| OpenMP DFS (reverse lock) | 6.80 | 10.5 | 15.1 | 20.0 | 21.4 | 21.6 | 13.45 |

From the benchmarking results, we observed the trends summarized above: vertex-based BFS parallelization is usually faster than the edge-based approach, with the Erdős–Rényi networks as the main exception, and the DFS phase achieves only modest speedup because of locking overhead.
Our primary goal was to parallelize both the BFS and DFS phases of Dinic's algorithm using OpenMP for shared-memory systems and MPI for distributed-memory systems. We aimed to implement efficient parallel strategies for each phase, optimizing for both performance and scalability.
The original deliverables included:
Stretch goals included exploring more extensive MPI optimizations (e.g., non-blocking communication and load balancing) and implementing Dinic's algorithm on a GPU using CUDA to investigate performance gains on massively parallel architectures.
We have made significant progress toward our original goals, particularly with the BFS phase. However, the DFS phase has proven challenging to parallelize effectively because of the dependencies created by capacity updates during flow computation, which limit opportunities for parallel execution.
Achievements So Far:
Updated Plan:

- Evaluate MPI communication strategies for the BFS phase (e.g., `MPI_Alltoall` vs. point-to-point communication).

Key Concerns:
Remaining Unknowns:

- The relative performance of `MPI_Alltoall` and point-to-point communication for the BFS frontier exchange (see the sketch below).

At this stage, our main concerns center on parallelizing the DFS phase effectively and evaluating MPI strategies. We aim to address these challenges while continuing progress on BFS optimizations and other deliverables.
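To make that comparison concrete, below is a hedged sketch of the point-to-point alternative to the collective exchange used in the BFS sketch earlier. The function name `exchange_p2p` and the `outgoing` buckets are assumed, not part of our current code, and a small `MPI_Alltoall` is still used to swap message sizes so receivers know how much data to expect.

```cpp
#include <mpi.h>
#include <vector>

// `outgoing[r]` holds the frontier candidates destined for rank r, as in the
// BFS sketch above; returns the flattened list of vertices received.
std::vector<int> exchange_p2p(const std::vector<std::vector<int>>& outgoing, MPI_Comm comm) {
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    // Exchange counts first so each rank can post matching receives.
    std::vector<int> sendcounts(nprocs), recvcounts(nprocs);
    for (int r = 0; r < nprocs; ++r) sendcounts[r] = (int)outgoing[r].size();
    MPI_Alltoall(sendcounts.data(), 1, MPI_INT, recvcounts.data(), 1, MPI_INT, comm);

    std::vector<MPI_Request> reqs;
    std::vector<std::vector<int>> inbox(nprocs);
    for (int r = 0; r < nprocs; ++r) {
        if (r == rank || recvcounts[r] == 0) continue;   // only talk to ranks with data
        inbox[r].resize(recvcounts[r]);
        reqs.emplace_back();
        MPI_Irecv(inbox[r].data(), recvcounts[r], MPI_INT, r, 0, comm, &reqs.back());
    }
    for (int r = 0; r < nprocs; ++r) {
        if (r == rank || sendcounts[r] == 0) continue;
        reqs.emplace_back();
        MPI_Isend(outgoing[r].data(), sendcounts[r], MPI_INT, r, 0, comm, &reqs.back());
    }
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);

    // Flatten everything received, plus the candidates we already own locally.
    std::vector<int> received(outgoing[rank]);
    for (int r = 0; r < nprocs; ++r)
        received.insert(received.end(), inbox[r].begin(), inbox[r].end());
    return received;
}
```

The benchmark we plan to run is simply to swap this exchange in for the `MPI_Alltoallv` step of the BFS and compare end-to-end BFS times on the same networks.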
For the poster session, we plan to showcase:
By adjusting our focus and emphasizing BFS optimizations, we aim to deliver a robust analysis of parallelization strategies for Dinic's algorithm while addressing the challenges encountered in DFS parallelization.