For our project, we are simulating GPUs in CADSS.
So far, we have focused on putting together a simple streaming multiprocessor with basic warp scheduling.
Of the deliverables we originally proposed, we have completed the basic streaming multiprocessor (Deliverable 2), which supports CADSS’ built-in trace file format (Deliverable 1).
We have not yet implemented support for multiple streaming multiprocessors (Deliverable 3). Beyond that, the only remaining deliverable we proposed was integration with other CADSS components (caches, etc.). After reviewing GPU architecture further, we have decided to implement only the GPU and “memory” components, with variable delays based on whether accesses go to shared or global memory; this should cut down on implementation time.
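To make the delay model concrete, below is a minimal sketch of how such a memory component could decide how many cycles a request costs. The names (mem_request_t, mem_access_cycles) and the latency constants are placeholders of our own, not part of CADSS, and real values would need to be tuned against measurements.

```c
// Sketch of a variable-delay memory model. Names and latency values
// are illustrative placeholders, not CADSS interfaces.
#include <stdint.h>

typedef enum { MEM_SHARED, MEM_GLOBAL } mem_space_t;

typedef struct {
    mem_space_t space;      // which memory space the warp accessed
    uint64_t    addr;       // base address of the access
    uint8_t     coalesced;  // 1 if the warp's accesses fall in one segment
} mem_request_t;

// Example latencies (in cycles); placeholders to be tuned later.
#define SHARED_MEM_CYCLES   30
#define GLOBAL_MEM_CYCLES  400

// Return how many cycles this request should stall the issuing warp.
static int mem_access_cycles(const mem_request_t *req)
{
    if (req->space == MEM_SHARED)
        return SHARED_MEM_CYCLES;

    // Model uncoalesced global accesses as serialized transactions;
    // the factor of 2 is an arbitrary placeholder.
    return req->coalesced ? GLOBAL_MEM_CYCLES : 2 * GLOBAL_MEM_CYCLES;
}
```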
After implementing some pieces of our project, we have found that the most interesting parts pertain to scheduling: both of warps onto SMs, and of instructions within individual warps.
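As one illustration of within-SM scheduling, here is a minimal sketch of a loose round-robin warp scheduler, one simple policy a basic SM model could use; the structures and names are simplified placeholders of our own rather than CADSS or real-hardware interfaces.

```c
// Minimal sketch of a round-robin warp scheduler for one SM.
// All names here are illustrative placeholders, not CADSS interfaces.
#include <stdbool.h>
#include <stdint.h>

#define MAX_WARPS 48   // arbitrary per-SM warp limit for this sketch

typedef struct {
    bool     active;       // warp still has instructions to execute
    uint64_t stall_until;  // cycle at which this warp is ready again
} warp_state_t;

typedef struct {
    warp_state_t warps[MAX_WARPS];
    int          last_issued;  // index of the warp issued last cycle
} sm_state_t;

// Pick the next ready warp in round-robin order, or -1 if every warp
// is finished or stalled (e.g. waiting on memory) this cycle.
static int select_warp(sm_state_t *sm, uint64_t cycle)
{
    for (int i = 1; i <= MAX_WARPS; i++) {
        int w = (sm->last_issued + i) % MAX_WARPS;
        if (sm->warps[w].active && sm->warps[w].stall_until <= cycle) {
            sm->last_issued = w;
            return w;
        }
    }
    return -1;  // nothing to issue this cycle
}
```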
A current limitation of our simulator is that threads within a warp never diverge: all threads in a warp always execute the same instruction. This limits our ability to perform interesting simulations, for example ones that model workload imbalance or thread divergence. We have thought of two approaches to tackle this problem:
Approach 1 is much simpler to implement, but has the potential to produce massive trace files for complex programs.
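The record layout for Approach 1 is not pinned down yet; assuming it amounts to baking divergence into the trace itself (for example, one record per warp instruction carrying an explicit per-thread active mask), a purely illustrative layout might look like the following. This is not the CADSS trace format.

```c
// Purely illustrative record for an "expanded" trace (Approach 1),
// where divergence is encoded in the trace rather than modeled by the
// simulator. This is NOT the CADSS trace format.
#include <stdint.h>

typedef struct {
    uint32_t warp_id;      // which warp issued this instruction
    uint64_t pc;           // instruction address
    uint32_t active_mask;  // one bit per thread that executed it
    uint8_t  op_class;     // simplified opcode class (ALU, LD, ST, ...)
} expanded_trace_rec_t;

// A kernel running W warps for N dynamic instructions each yields
// roughly W * N records, which is why such traces grow quickly.
```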
Approach 2 would require significant implementation effort, but would allow us to perform simulations using much shorter trace files (e.g. on a transpiled version of the original NVIDIA PTX) and to explicitly model the techniques used in modern GPUs to handle thread divergence. We hope this will make our project more interesting.
In addition to modelling computation, this approach would require us to implement algorithms similar to the ones proposed by Aamodt et al. (p. 25) in their book if we want to simulate loops without explicitly unrolling them in the trace files. The basic approach would involve building a control flow graph from the assembly at initialization time, then maintaining a simple runtime data structure to ensure that diverged threads properly reconverge. This approach also seems to be the one used by Ali et al. in their paper.
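To give a sense of the runtime data structure we have in mind, below is a minimal sketch of a per-warp reconvergence (SIMT) stack in the spirit of the stack-based scheme described by Aamodt et al. The types and function names are placeholders of our own, and a real implementation would obtain the reconvergence PCs from the control flow graph built at initialization time.

```c
// Minimal sketch of a per-warp SIMT reconvergence stack.
// Names are placeholders; reconvergence PCs would come from the
// control flow graph built at initialization time.
#include <stdint.h>

#define MAX_STACK_DEPTH 32

typedef struct {
    uint32_t pc;           // next PC for the threads in this entry
    uint32_t reconv_pc;    // PC at which this entry's threads rejoin
    uint32_t active_mask;  // one bit per thread in the warp
} simt_entry_t;

typedef struct {
    simt_entry_t stack[MAX_STACK_DEPTH];
    int top;               // index of the current top-of-stack entry
} simt_stack_t;

// On a divergent branch: the current entry waits at the reconvergence
// point, and one new entry is pushed per branch path with live threads.
static void simt_branch(simt_stack_t *s, uint32_t taken_pc,
                        uint32_t fallthrough_pc, uint32_t reconv_pc,
                        uint32_t taken_mask)
{
    simt_entry_t *cur = &s->stack[s->top];
    uint32_t taken     = taken_mask & cur->active_mask;
    uint32_t not_taken = cur->active_mask & ~taken_mask;

    cur->pc = reconv_pc;   // current entry now waits at reconvergence

    if (not_taken)
        s->stack[++s->top] = (simt_entry_t){
            .pc = fallthrough_pc, .reconv_pc = reconv_pc,
            .active_mask = not_taken };
    if (taken)
        s->stack[++s->top] = (simt_entry_t){
            .pc = taken_pc, .reconv_pc = reconv_pc,
            .active_mask = taken };
}

// When the executing (top) entry reaches its reconvergence PC, pop it
// so the threads recorded below resume together with a wider mask.
static void simt_maybe_pop(simt_stack_t *s)
{
    if (s->top > 0 && s->stack[s->top].pc == s->stack[s->top].reconv_pc)
        s->top--;
}
```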
The main result we hope to show at the poster session is a set of charts demonstrating that, as we increase the input size, the runtime curve of our simulated GPU roughly matches that of a real GPU.
If we implement the more advanced execution model, we could also put together a demo of complex programs supported by our simulator.
A significant concern is that modern GPUs may be too difficult to simulate using the simplistic model we have so far. Many implementation details are proprietary and can, at best, only be speculated about using the CUDA specification, white papers, and patents [3]. Additionally, microarchitectural details vary from one GPU to another. This may limit our ability to produce performance curves that accurately track those of real GPUs. For example, while experimenting with SAXPY, we found that code that ought to exhibit more thread divergence actually performed better in certain cases due to (unknown) optimizations.