For our project, we are simulating GPUs in CADSS.
So far, we have focused on putting together a simple streaming multiprocessor with basic warp scheduling.
Of the deliverables we originally proposed, we have completed the basic streaming multiprocessor (Deliverable 2), which supports CADSS’ built-in trace file format (Deliverable 1).
We have not yet implemented support for multiple streaming multiprocessors (Deliverable 3). Beyond that, the only remaining deliverable we proposed was integration with other CADSS components (caches, etc.). After reviewing GPU architecture further, we have decided to implement only the GPU and “memory” components, with variable delays based on whether accesses go to shared or global memory; this should cut down on implementation time.
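To make the delay model concrete, below is a minimal sketch of how such a memory component could decide how many cycles a request costs. The names (mem_request_t, mem_access_cycles) and the latency constants are placeholders of our own, not part of CADSS, and real values would need to be tuned against measurements.

```c
// Sketch of a variable-delay memory model. Names and latency values
// are illustrative placeholders, not CADSS interfaces.
#include <stdint.h>

typedef enum { MEM_SHARED, MEM_GLOBAL } mem_space_t;

typedef struct {
    mem_space_t space;      // which memory space the warp accessed
    uint64_t    addr;       // base address of the access
    uint8_t     coalesced;  // 1 if the warp's accesses fall in one segment
} mem_request_t;

// Example latencies (in cycles); placeholders to be tuned later.
#define SHARED_MEM_CYCLES   30
#define GLOBAL_MEM_CYCLES  400

// Return how many cycles this request should stall the issuing warp.
static int mem_access_cycles(const mem_request_t *req)
{
    if (req->space == MEM_SHARED)
        return SHARED_MEM_CYCLES;

    // Model uncoalesced global accesses as serialized transactions;
    // the factor of 2 is an arbitrary placeholder.
    return req->coalesced ? GLOBAL_MEM_CYCLES : 2 * GLOBAL_MEM_CYCLES;
}
```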
After implementing some pieces of our project, we have found that the most interesting parts pertain to scheduling: both of warps onto SMs, and of instructions within individual warps.
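As one illustration of within-SM scheduling, here is a minimal sketch of a loose round-robin warp scheduler, one simple policy a basic SM model could use; the structures and names are simplified placeholders of our own rather than CADSS or real-hardware interfaces.

```c
// Minimal sketch of a round-robin warp scheduler for one SM.
// All names here are illustrative placeholders, not CADSS interfaces.
#include <stdbool.h>
#include <stdint.h>

#define MAX_WARPS 48   // arbitrary per-SM warp limit for this sketch

typedef struct {
    bool     active;       // warp still has instructions to execute
    uint64_t stall_until;  // cycle at which this warp is ready again
} warp_state_t;

typedef struct {
    warp_state_t warps[MAX_WARPS];
    int          last_issued;  // index of the warp issued last cycle
} sm_state_t;

// Pick the next ready warp in round-robin order, or -1 if every warp
// is finished or stalled (e.g. waiting on memory) this cycle.
static int select_warp(sm_state_t *sm, uint64_t cycle)
{
    for (int i = 1; i <= MAX_WARPS; i++) {
        int w = (sm->last_issued + i) % MAX_WARPS;
        if (sm->warps[w].active && sm->warps[w].stall_until <= cycle) {
            sm->last_issued = w;
            return w;
        }
    }
    return -1;  // nothing to issue this cycle
}
```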
A current limitation of our simulator is that threads within a warp never diverge: all threads in a warp always execute the same instruction. This limits our ability to perform interesting simulations, for example ones that model workload imbalance or thread divergence. We have thought of two approaches to tackle this problem:
Approach 1 is much simpler to implement, but has the potential to produce massive trace files for complex programs.
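The record layout for Approach 1 is not pinned down yet; assuming it amounts to baking divergence into the trace itself (for example, one record per warp instruction carrying an explicit per-thread active mask), a purely illustrative layout might look like the following. This is not the CADSS trace format.

```c
// Purely illustrative record for an "expanded" trace (Approach 1),
// where divergence is encoded in the trace rather than modeled by the
// simulator. This is NOT the CADSS trace format.
#include <stdint.h>

typedef struct {
    uint32_t warp_id;      // which warp issued this instruction
    uint64_t pc;           // instruction address
    uint32_t active_mask;  // one bit per thread that executed it
    uint8_t  op_class;     // simplified opcode class (ALU, LD, ST, ...)
} expanded_trace_rec_t;

// A kernel running W warps for N dynamic instructions each yields
// roughly W * N records, which is why such traces grow quickly.
```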
Approach 2 would require significant implementation effort, but would allow us to perform simulations using much shorter trace files (e.g. on a transpiled version of the original NVIDIA PTX) and to explicitly model the techniques used in modern GPUs to handle thread divergence. We hope this will make our project more interesting.
In addition to modelling computation, this approach would require us to implement algorithms similar to the ones proposed by Aamodt et al. (p. 25) in their book if we want to simulate loops without explicitly unrolling them in the trace files. The basic approach would involve building a control flow graph from the assembly at initialization time, then maintaining a simple runtime data structure to ensure that diverged threads properly reconverge. This approach also seems to be the one used by Ali et al. in their paper.
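To give a sense of the runtime data structure we have in mind, below is a minimal sketch of a per-warp reconvergence (SIMT) stack in the spirit of the stack-based scheme described by Aamodt et al. The types and function names are placeholders of our own, and a real implementation would obtain the reconvergence PCs from the control flow graph built at initialization time.

```c
// Minimal sketch of a per-warp SIMT reconvergence stack.
// Names are placeholders; reconvergence PCs would come from the
// control flow graph built at initialization time.
#include <stdint.h>

#define MAX_STACK_DEPTH 32

typedef struct {
    uint32_t pc;           // next PC for the threads in this entry
    uint32_t reconv_pc;    // PC at which this entry's threads rejoin
    uint32_t active_mask;  // one bit per thread in the warp
} simt_entry_t;

typedef struct {
    simt_entry_t stack[MAX_STACK_DEPTH];
    int top;               // index of the current top-of-stack entry
} simt_stack_t;

// On a divergent branch: the current entry waits at the reconvergence
// point, and one new entry is pushed per branch path with live threads.
static void simt_branch(simt_stack_t *s, uint32_t taken_pc,
                        uint32_t fallthrough_pc, uint32_t reconv_pc,
                        uint32_t taken_mask)
{
    simt_entry_t *cur = &s->stack[s->top];
    uint32_t taken     = taken_mask & cur->active_mask;
    uint32_t not_taken = cur->active_mask & ~taken_mask;

    cur->pc = reconv_pc;   // current entry now waits at reconvergence

    if (not_taken)
        s->stack[++s->top] = (simt_entry_t){
            .pc = fallthrough_pc, .reconv_pc = reconv_pc,
            .active_mask = not_taken };
    if (taken)
        s->stack[++s->top] = (simt_entry_t){
            .pc = taken_pc, .reconv_pc = reconv_pc,
            .active_mask = taken };
}

// When the executing (top) entry reaches its reconvergence PC, pop it
// so the threads recorded below resume together with a wider mask.
static void simt_maybe_pop(simt_stack_t *s)
{
    if (s->top > 0 && s->stack[s->top].pc == s->stack[s->top].reconv_pc)
        s->top--;
}
```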
The main result we hope to show at the poster session is a set of charts demonstrating that, as we increase the input size, the runtime curve of our simulated GPU roughly matches that of a real GPU.
If we implement the more advanced execution model, we could also put together a demo of complex programs supported by our simulator.
A significant concern is that modern GPUs may be too difficult to simulate using the simplistic model we have so far. Many implementation details are proprietary and can, at best, only be speculated about using the CUDA specification, white papers, and patents [3]. Additionally, microarchitectural details vary from one GPU to another. This may limit our ability to produce performance curves that accurately track those of real GPUs. For example, while experimenting with SAXPY, we found that code that ought to exhibit more thread divergence actually performed better in certain cases due to (unknown) optimizations.