15-418 Final Project Milestone Report

By Connor Mowry (cmowry), Wenqi Deng (wenqid)

Summary of Progress

Our project parallelizes the BFS and DFS phases of Dinic's max-flow algorithm using OpenMP and MPI. Over the past few weeks, we have made significant progress implementing the sequential baseline and the OpenMP versions of the algorithm and measuring their performance.

We started by implementing and testing the sequential version of Dinic's algorithm, establishing a solid foundation for subsequent parallelization. In the first phase of parallelization with OpenMP, we focused on the BFS component, constructing the layer graph in parallel and exploring different granularities, such as parallelizing by vertex and by edge, to balance load and improve performance.
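
To make the by-vertex granularity concrete, below is a minimal sketch of that strategy, not our exact code: the identifiers (`Edge`, `adj`, `level`, `build_levels`) are illustrative, and it assumes an adjacency-list residual graph. Each thread expands a share of the current frontier, and a newly reached vertex is claimed with a compare-and-swap so it joins the next frontier exactly once; the by-edge variant instead distributes the individual outgoing edges of the frontier across threads.

```cpp
#include <omp.h>
#include <vector>

// Residual edge: destination and remaining capacity (reverse-edge index omitted).
struct Edge { int to; int cap; };

// Build the BFS layer graph, parallelizing over the vertices of the current
// frontier.  Each newly reached vertex is claimed with a compare-and-swap
// (GCC/Clang builtin) so it enters the next frontier exactly once.
void build_levels(const std::vector<std::vector<Edge>>& adj,
                  int source, std::vector<int>& level) {
    level.assign(adj.size(), -1);
    level[source] = 0;
    std::vector<int> frontier{source};

    while (!frontier.empty()) {
        std::vector<int> next;
        #pragma omp parallel
        {
            std::vector<int> local;                 // per-thread buffer, merged below
            #pragma omp for nowait
            for (size_t i = 0; i < frontier.size(); ++i) {
                int u = frontier[i];
                for (const Edge& e : adj[u]) {
                    if (e.cap <= 0) continue;       // saturated: not in the layer graph
                    int expected = -1;
                    if (__atomic_compare_exchange_n(&level[e.to], &expected,
                                                    level[u] + 1, false,
                                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED))
                        local.push_back(e.to);
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
    }
}
```

Merging per-thread buffers inside a critical section avoids contention on the shared next frontier; the merge could also be done with a parallel prefix sum.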

For the DFS phase, where blocking flows are computed, we addressed the dependencies created by concurrent capacity updates with two approaches: locking edges along the path as a thread explores it, so conflicting updates are prevented (the forward-lock approach in the tables below), and letting threads explore freely while only applying the capacity updates after a path successfully reaches the sink (the reverse-lock approach).
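
As a rough sketch of the forward-lock variant (assuming every residual edge carries its own `omp_lock_t`; `Edge`, `adj`, and `level` are illustrative names rather than our actual interfaces), a thread locks each edge before extending its path through it and applies the capacity update while still holding that lock; the paired reverse edge, whose lock is not held, is updated atomically.

```cpp
#include <omp.h>
#include <vector>
#include <algorithm>

// Residual edge with its own lock; `rev` is the index of the paired reverse
// edge in adj[to].  Locks must be initialized with omp_init_lock() at build time.
struct Edge {
    int to, rev, cap;
    omp_lock_t lock;
};

// Try to push up to `limit` units from u toward sink t along layer-graph edges
// (the level must increase by exactly one).  Returns the amount actually pushed.
int dfs(std::vector<std::vector<Edge>>& adj, const std::vector<int>& level,
        int u, int t, int limit) {
    if (u == t) return limit;
    for (Edge& e : adj[u]) {
        if (level[e.to] != level[u] + 1) continue;
        omp_set_lock(&e.lock);                     // claim the edge before using it
        if (e.cap > 0) {
            int pushed = dfs(adj, level, e.to, t, std::min(limit, e.cap));
            if (pushed > 0) {
                e.cap -= pushed;                   // safe: this edge's lock is held
                #pragma omp atomic
                adj[e.to][e.rev].cap += pushed;    // reverse edge updated atomically
                omp_unset_lock(&e.lock);
                return pushed;
            }
        }
        omp_unset_lock(&e.lock);                   // dead end: release and try the next edge
    }
    return 0;
}
```

Because levels strictly increase along any path in the layer graph, locks are acquired in a consistent order, which should rule out deadlock between threads. The reverse-lock approach differs by deferring the locking and capacity updates until a path has reached the sink.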

In addition, we have started implementing a basic MPI version of the BFS phase. In this approach, the graph's vertices are divided across processors, and each processor explores only the vertices assigned to it. Our next steps will focus on reducing communication overhead, improving the overall efficiency of the MPI BFS implementation, and starting the MPI implementation of DFS.
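
As an illustration of this partitioning scheme (a simplified sketch, not our actual implementation), the function below performs one BFS round: each rank expands only the frontier vertices it owns under a block partition, and the newly discovered vertices are shared with every rank via `MPI_Allgatherv`. Residual-capacity checks are omitted, and all identifiers (`bfs_round`, `adj`, `level`) are placeholders.

```cpp
#include <mpi.h>
#include <vector>

// One round of the distributed layer-graph BFS.  Vertices are block-partitioned
// across ranks; each rank expands only the frontier vertices it owns, then every
// rank receives the union of newly discovered vertices via MPI_Allgatherv.
// adj[u] is assumed to hold only edges with remaining residual capacity.
std::vector<int> bfs_round(const std::vector<std::vector<int>>& adj,
                           std::vector<int>& level, int depth,
                           const std::vector<int>& frontier,
                           int rank, int nprocs) {
    const int n = (int)adj.size();
    const int block = (n + nprocs - 1) / nprocs;   // size of each rank's vertex block

    // Expand the frontier vertices owned by this rank.
    std::vector<int> found;
    for (int u : frontier) {
        if (u / block != rank) continue;           // another rank handles this vertex
        for (int v : adj[u])
            if (level[v] == -1) found.push_back(v);
    }

    // Share the discoveries: first the per-rank counts, then the vertices themselves.
    int cnt = (int)found.size();
    std::vector<int> counts(nprocs), displs(nprocs, 0);
    MPI_Allgather(&cnt, 1, MPI_INT, counts.data(), 1, MPI_INT, MPI_COMM_WORLD);
    for (int i = 1; i < nprocs; ++i) displs[i] = displs[i - 1] + counts[i - 1];
    std::vector<int> all(displs.back() + counts.back());
    MPI_Allgatherv(found.data(), cnt, MPI_INT,
                   all.data(), counts.data(), displs.data(), MPI_INT,
                   MPI_COMM_WORLD);

    // Assign levels; duplicates discovered by several ranks are kept only once.
    std::vector<int> next;
    for (int v : all)
        if (level[v] == -1) { level[v] = depth + 1; next.push_back(v); }
    return next;
}
```

One direction for reducing communication overhead is to replace this all-to-all sharing with point-to-point messages that send each discovered vertex only to its owning rank.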

Attempted Approaches

Benchmarking

We measured the time to compute the max flow on graphs generated from three models (Erdős–Rényi, Barabási–Albert, and grid) at a range of sizes, using both the GHC and PSC machines. All times are reported in milliseconds; the columns give the number of threads.

On GHC Machines:

  1. Small Erdős–Rényi Graph (n=128, e=15457)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 0.229 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.245 | 0.264 | 0.334 | 0.343 |
     | OpenMP BFS (by edge) | 0.425 | 0.505 | 0.529 | 0.389 |
     | OpenMP DFS (forward lock) | 0.433 | 0.446 | 0.464 | 0.403 |
     | OpenMP DFS (reverse lock) | 3.31 | 2.45 | 2.05 | 1.83 |

  2. Small Barabási–Albert Graph (n=128, e=1694)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 0.121 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.161 | 0.259 | 0.566 | 0.804 |
     | OpenMP BFS (by edge) | 1.427 | 1.005 | 0.729 | 0.981 |
     | OpenMP DFS (forward lock) | 0.259 | 0.426 | 0.327 | 0.615 |
     | OpenMP DFS (reverse lock) | 0.540 | 0.269 | 0.302 | 0.337 |

  3. Small Grid Graph (n=121, e=440)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 0.128 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.321 | 0.213 | 0.201 | 0.310 |
     | OpenMP BFS (by edge) | 1.234 | 0.416 | 0.379 | 0.510 |
     | OpenMP DFS (forward lock) | 0.038 | 0.098 | 0.144 | 0.189 |
     | OpenMP DFS (reverse lock) | 0.160 | 0.164 | 0.252 | 0.307 |

  4. Medium Erdős–Rényi Graph (n=2048, e=3982965)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 66.7 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 68.3 | 54.2 | 51.8 | 50.5 |
     | OpenMP BFS (by edge) | 124.1 | 71.2 | 44.4 | 30.2 |
     | OpenMP DFS (forward lock) | 162.7 | 116.2 | 93.6 | 83.2 |
     | OpenMP DFS (reverse lock) | 1649.2 | 1460.3 | 1277.1 | 1156.3 |

  5. Medium Barabási–Albert Graph (n=2048, e=400670)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 8.76 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 5.58 | 4.63 | 4.33 | 4.36 |
     | OpenMP BFS (by edge) | 84.6 | 48.3 | 29.7 | 20.1 |
     | OpenMP DFS (forward lock) | 12.7 | 10.3 | 8.88 | 8.68 |
     | OpenMP DFS (reverse lock) | 72.2 | 54.1 | 47.1 | 43.7 |

  6. Medium Grid Graph (n=2025, e=7920)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 1.43 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.54 | 1.28 | 0.27 | 1.35 |
     | OpenMP BFS (by edge) | 149.4 | 90.4 | 60.5 | 44.7 |
     | OpenMP DFS (forward lock) | 0.56 | 0.69 | 0.72 | 0.80 |
     | OpenMP DFS (reverse lock) | 2.53 | 2.93 | 3.11 | 3.19 |

  7. Large Erdős–Rényi Graph (n=4096, e=15934587)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 228.7 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 236.4 | 194.4 | 184.7 | 179.3 |
     | OpenMP BFS (by edge) | 387.4 | 225.0 | 144.7 | 101.5 |
     | OpenMP DFS (forward lock) | 592.2 | 407.1 | 311.9 | 276.1 |
     | OpenMP DFS (reverse lock) | 7519.04 | 6589.8 | 6536.5 | 6613.1 |

  8. Large Barabási–Albert Graph (n=4096, e=1595310)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 19.7 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 19.5 | 16.7 | 16.1 | 16.2 |
     | OpenMP BFS (by edge) | 234.4 | 136.8 | 88.1 | 61.6 |
     | OpenMP DFS (forward lock) | 46.0 | 35.9 | 28.9 | 28.8 |
     | OpenMP DFS (reverse lock) | 251.8 | 168.2 | 142.8 | 132.7 |

  9. Large Grid Graph (n=4096, e=16128)

     | Threads | 1 | 2 | 4 | 8 |
     |---|---|---|---|---|
     | Sequential | 0.136 | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.242 | 0.506 | 3.91 | 1.150 |
     | OpenMP BFS (by edge) | 110.2 | 60.8 | 36.4 | 23.2 |
     | OpenMP DFS (forward lock) | 0.296 | 0.395 | 0.403 | 0.453 |
     | OpenMP DFS (reverse lock) | 1.27 | 0.919 | 0.877 | 0.955 |

On PSC Machines:

  1. Extreme Erdős–Rényi Graph (n=8192, e=63743938)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 1284.5 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 1348.2 | 931.3 | 793.8 | 733.3 | 793.4 | 810.5 | 823.8 |
     | OpenMP BFS (by edge) | 2025.3 | 1413.8 | 1223.9 | 817.4 | 879.7 | 954.8 | 771.3 |
     | OpenMP DFS (forward lock) | 3535.5 | 2460.7 | 1841.3 | 1573.2 | 1893.9 | 1839.9 | 1791.6 |
     | OpenMP DFS (reverse lock) | 47954 | 34909 | 31844 | 27034 | 25403 | 86543 | 86407 |

  2. Extreme Barabási–Albert Graph (n=8192, e=6381240)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 160.1 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 158.6 | 121.5 | 103.0 | 100.1 | 124.4 | 149.5 | 151.4 |
     | OpenMP BFS (by edge) | 1753.4 | 1383.5 | 986.8 | 612.3 | 591.2 | 611.8 | 400.2 |
     | OpenMP DFS (forward lock) | 347.2 | 270.4 | 226.3 | 210.6 | 259.4 | 303.2 | 297.5 |
     | OpenMP DFS (reverse lock) | 3247.1 | 2640.3 | 2675.7 | 3345.8 | 3106.4 | 5328.0 | 9336.0 |

  3. Extreme Grid Graph (n=8190, e=32398)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 0.453 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.735 | 2.509 | 2.55 | 0.828 | 2.31 | 8.83 | 9.27 |
     | OpenMP BFS (by edge) | 772.3 | 826.2 | 494.5 | 313.2 | 334.7 | 339.3 | 227.5 |
     | OpenMP DFS (forward lock) | 1.62 | 2.94 | 3.53 | 3.77 | 4.34 | 5.89 | 6.78 |
     | OpenMP DFS (reverse lock) | 5.83 | 8.53 | 10.2 | 11.2 | 15.6 | 16.8 | 16.7 |

  4. Impossible Erdős–Rényi Graph (n=12040, e=99609033)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 1764.9 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 1825.5 | 1391.2 | 1194.6 | 1114.2 | 1168.2 | 1204.2 | 1195.7 |
     | OpenMP BFS (by edge) | 2546.9 | 1775.5 | 1564.7 | 1060.5 | 1105.2 | 1325.7 | 1124.0 |
     | OpenMP DFS (forward lock) | 5272.2 | 3586.0 | 2567.6 | 2033.0 | 2214.6 | 2219.8 | 2063.5 |
     | OpenMP DFS (reverse lock) | 80849 | 57050 | 51330 | 59283.9 | 42305 | 121861 | 143007 |

  5. Impossible Barabási–Albert Graph (n=12040, e=9961472)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 250.1 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 241.0 | 183.3 | 151.2 | 149.1 | 171.5 | 201.1 | 200.2 |
     | OpenMP BFS (by edge) | 2774.0 | 2371.1 | 1732.0 | 934.1 | 933.0 | 946.5 | 626.4 |
     | OpenMP DFS (forward lock) | 574.0 | 409.0 | 342.1 | 309.7 | 387.1 | 425.6 | 451.9 |
     | OpenMP DFS (reverse lock) | 5254.4 | 4784.1 | 4456.0 | 4732.6 | 5173.3 | 10112 | 8030.3 |

  6. Impossible Grid Graph (n=10201, e=40400)

     | Threads | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
     |---|---|---|---|---|---|---|---|
     | Sequential | 0.630 | NA | NA | NA | NA | NA | NA |
     | OpenMP BFS (by vertex) | 0.914 | 6.58 | 4.21 | 4.65 | 2.57 | 8.68 | 5.57 |
     | OpenMP BFS (by edge) | 1152.7 | 884.8 | 853.5 | 515.0 | 485.0 | 498.8 | 326.3 |
     | OpenMP DFS (forward lock) | 2.17 | 3.97 | 5.87 | 6.31 | 5.79 | 7.42 | 8.00 |
     | OpenMP DFS (reverse lock) | 6.80 | 10.5 | 15.1 | 20.0 | 21.4 | 21.6 | 13.45 |

Performance Graphs

[Performance Graphs 1–15]

Performance Analysis

From the benchmarking results, we observed that:

  1. In large, dense networks with many edges, parallelizing BFS with OpenMP shows measurable performance improvements that grow with the number of threads up to 16. Beyond 16 threads the speedup begins to diminish, likely because the amount of parallelizable work per BFS level is limited, so the added thread-management and synchronization overhead outweighs the benefit of additional threads.
  2. In most of the test cases, parallelizing by vertex performs better than parallelizing by edge. However, in most of the Erdős–Rényi networks the by-edge approach outperforms the by-vertex approach, likely because these dense graphs concentrate very many edges on a relatively small number of vertices, so assigning whole vertices to threads creates load imbalance (see the by-edge sketch after this list).
  3. Although the speedup of the parallel DFS phase over the sequential version was minimal in some test cases, likely because of the high overhead of managing a large number of locks, we do see noticeable improvements at 8 threads versus a single thread on the large, dense networks.
  4. For most of the test cases, especially those with a large number of edges, the reverse-lock approach performs worse than the forward-lock approach. This may be because some threads do work that never contributes to the overall solution: they explore paths that do not become valid augmenting paths, or their updates are later blocked and invalidated by other threads. These unproductive explorations waste computational resources and add synchronization overhead.
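
To make the granularity difference in observation 2 concrete, the sketch below (with illustrative names, not our exact code) flattens every usable outgoing edge of the frontier into a single work array and parallelizes over edges, so a handful of very high-degree vertices can no longer pin most of the work to a few threads. The flattening pass shown here is sequential, which may partly explain the by-edge overhead on the sparse grid graphs; it could also be parallelized with a prefix sum.

```cpp
#include <omp.h>
#include <vector>
#include <utility>

// Residual edge: destination and remaining capacity.
struct Edge { int to; int cap; };

// Expand one BFS level at edge granularity: parallelize over the flattened
// list of usable frontier edges rather than over frontier vertices.
void expand_by_edge(const std::vector<std::vector<Edge>>& adj,
                    const std::vector<int>& frontier,
                    std::vector<int>& level, int depth,
                    std::vector<int>& next_frontier) {
    // Flatten: one (vertex, edge index) pair per frontier edge with capacity left.
    std::vector<std::pair<int, int>> work;
    for (int u : frontier)
        for (int i = 0; i < (int)adj[u].size(); ++i)
            if (adj[u][i].cap > 0) work.emplace_back(u, i);

    #pragma omp parallel
    {
        std::vector<int> local;
        #pragma omp for nowait
        for (size_t k = 0; k < work.size(); ++k) {
            int v = adj[work[k].first][work[k].second].to;
            int expected = -1;            // claim v at most once across all threads
            if (__atomic_compare_exchange_n(&level[v], &expected, depth + 1,
                                            false, __ATOMIC_RELAXED, __ATOMIC_RELAXED))
                local.push_back(v);
        }
        #pragma omp critical
        next_frontier.insert(next_frontier.end(), local.begin(), local.end());
    }
}
```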

Goals and Deliverables

Original Goals and Deliverables

Our primary goal was to parallelize both the BFS and DFS phases of Dinic's algorithm using OpenMP for shared memory and MPI for distributed memory systems. We aimed to implement efficient parallel strategies for each phase, optimizing for both performance and scalability.

The original deliverables included:

Stretch goals included exploring extensive MPI optimizations (e.g., non-blocking communication and load balancing) and implementing Dinic's algorithm on a GPU using CUDA to explore performance gains on massively parallel architectures.

Progress and Adjusted Goals

We have made significant progress toward our original goals, particularly on the BFS phase. However, the DFS phase has proven challenging to parallelize effectively because of the data dependencies introduced by capacity updates during flow computation, which limit the opportunities for parallel execution.

Achievements So Far:

Updated Plan:

Issues and Concerns

Key Concerns:

Remaining Unknowns:

At this stage, our main concerns center on parallelizing the DFS phase effectively and evaluating MPI strategies. We aim to address these challenges while continuing progress on BFS optimizations and other deliverables.

Stretch Goals

Plan for Poster Session

For the poster session, we plan to showcase:

By adjusting our focus and emphasizing BFS optimizations, we aim to deliver a robust analysis of parallelization strategies for Dinic's algorithm while addressing the challenges encountered in DFS parallelization.

Adjusted Schedule

Completed Work (Weeks 1-3)

Week 1 (Nov 10 - Nov 16)

Week 2 (Nov 17 - Nov 23)

Week 3 (Nov 24 - Nov 30)

Upcoming Work (Weeks 4-5)

Week 4 (Dec 1 - Dec 7)

Week 5 (Dec 8 - Dec 13)

Final Week (Dec 14 - Dec 15)