Team Member: Xinyu Li (xinyul3), William Zou (yangzou)
URL: Project Webpage
We are going to implement a parallel version of the differentiable fault simulator for digital circuits by accelerating its major logic value and fault propagation steps with custom CUDA kernels. Our goal is to improve performance over the current Python implementation. We will analyze performance on the NVIDIA GPUs available in the GHC cluster and, if feasible, in the ECE lab.
The differentiable fault simulator is a tool for post-silicon validation of digital circuits. It checks whether a given input pattern can activate a fault and make it observable at the circuit’s output. The circuit is represented as a directed graph with logic gates as nodes and wires as edges. By using smooth approximations of Boolean logic, the simulator can compute logic values and fault observability in a way that supports gradient-based search for valid test patterns.
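As a concrete illustration of what such smooth approximations can look like (this is a generic sketch of our own, not necessarily the exact relaxations used in the existing simulator), Boolean gates are replaced by continuous functions over values in [0, 1]:

    // Generic smooth relaxations of Boolean gates over values in [0, 1].
    // Illustrative only; the research code may use different relaxations.
    __device__ __forceinline__ float soft_not(float a)          { return 1.0f - a; }
    __device__ __forceinline__ float soft_and(float a, float b) { return a * b; }
    __device__ __forceinline__ float soft_or (float a, float b) { return a + b - a * b; }
    __device__ __forceinline__ float soft_xor(float a, float b) { return a + b - 2.0f * a * b; }

Because these functions are differentiable, gradients of an observability objective can flow back to the primary inputs, which is what makes gradient-based search for test patterns possible.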
The computationally intensive components of the simulator are the logic value propagation and fault propagation phases, which are currently implemented sequentially in Python. Several aspects of the problem are well-suited for parallelization: (1) the value and fault propagation steps involve repeated element-wise products, sums, and probabilistic masking operations, all of which are highly data-parallel; (2) the current implementation processes only one type of gate at a time, but different gate types could be propagated concurrently; and (3) although logic gates have data dependencies, many of these dependencies may be resolved earlier than a strict topological layer-by-layer approach would allow, enabling finer-grained parallelism across the graph.
Through this project, we aim to learn how to adapt dependent directed-graph-based workloads to the GPU’s SIMD architecture while balancing correctness and performance.
Digital circuits are modeled as directed graphs where the computation at each node (gate) depends on the results from its predecessor nodes. This dependency means that logic value and fault observability updates must follow the circuit's topological order: a gate cannot be evaluated before all of its predecessors have been.
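A standard way to expose parallelism despite this constraint is to levelize the circuit: gates in the same topological level have no dependencies on one another and can be updated together, one level per kernel launch. A minimal host-side sketch, assuming gates are indexed in topological order and that fanins[g] (a placeholder name of our own) lists the predecessors of gate g:

    #include <vector>
    #include <algorithm>

    // Assign each gate a topological level; gates sharing a level are independent.
    std::vector<int> compute_levels(const std::vector<std::vector<int>>& fanins) {
        std::vector<int> level(fanins.size(), 0);
        for (size_t g = 0; g < fanins.size(); ++g)   // gates assumed in topological order
            for (int p : fanins[g])
                level[g] = std::max(level[g], level[p] + 1);
        return level;
    }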
The simulation involves repeated element-wise operations that are inherently parallelizable. However, the circuit's logic gates are represented as a graph structure, which introduces irregular memory access patterns on top of the critical dependencies between layers; this can lead to inefficient cache utilization and increased global memory traffic, and it makes locality analysis and exploitation difficult.
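One layout we plan to evaluate to mitigate this is a compressed sparse row (CSR) arrangement of gate fanins, so that each gate's predecessor indices occupy a contiguous range and per-gate features live in flat arrays. A sketch of the structure (field names are our own, not from the existing code):

    // CSR-style circuit layout: the fanins of gate g are stored contiguously in
    // fanin_idx[fanin_ptr[g] .. fanin_ptr[g + 1]).
    struct CircuitCSR {
        int*   fanin_ptr;   // [num_gates + 1] offsets into fanin_idx
        int*   fanin_idx;   // [num_edges]     predecessor gate ids
        int*   gate_type;   // [num_gates]     encoded gate type (AND, OR, NOT, ...)
        float* value;       // [num_gates]     relaxed logic value per gate
    };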
Different gate types have different propagation formulas, which means the amount of computation varies from gate to gate. This variability can lead to divergent execution, where some threads finish their work sooner than others, and may also cause load imbalance.
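One mitigation we are considering is to group gates of the same type within each independent batch (for example, within a topological level), so threads in the same warp mostly execute the same propagation formula. A hedged host-side sketch (names are our own):

    #include <vector>
    #include <algorithm>

    // Sort the gate ids of one level by gate type so warps see mostly uniform types,
    // reducing divergence and evening out per-thread work in the propagation kernel.
    void sort_level_by_type(std::vector<int>& level_gates, const std::vector<int>& gate_type) {
        std::sort(level_gates.begin(), level_gates.end(),
                  [&](int a, int b) { return gate_type[a] < gate_type[b]; });
    }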
The simulator has a moderately high communication-to-computation ratio: each logic gate's operation is lightweight but requires pulling values from multiple neighbors and writing back results.
The GPU's SIMD architecture is best suited for uniform, dense computations with regular memory access. Our application involves sparse, topology-driven updates across a directed graph. Because each gate depends on the outputs of its predecessors, we must carefully preserve execution order while maximizing available parallelism. Additionally, gates of different types require different propagation logic, which introduces thread divergence and complicates kernel design. Irregular access to node and edge features also reduces spatial locality and makes coalesced memory access difficult to achieve.
We are starting from an existing Python-based differentiable fault simulator developed by an ECE PhD student (Wei Li) in his ongoing research. The simulator is written in PyTorch and DGL and models logic value and fault propagation across a circuit graph using continuous relaxations of Boolean logic. While this provides a working high-level reference implementation, there is no existing CUDA or C++ codebase for accelerating this graph-based simulator, so we will implement the GPU kernels from scratch. We will primarily use the NVIDIA GPUs available in the GHC cluster. If feasible (subject to the PhD student's feedback), we will benchmark performance on the more advanced GPUs in the ECE lab.
Achieve a 10x speedup over the current Python-based differentiable fault simulator. This performance target would help close the runtime gap between our differentiable, gradient-based approach and the existing CPU-based, heuristic ATPG (Automatic Test Pattern Generation) tools commonly used in industry.
Implement a CUDA kernel to offload the most computationally intensive parts of the simulator: the functions responsible for logic value propagation and fault observability propagation. This would allow many different gates to propagate concurrently. We will integrate the propagation rules of the different gate types directly into our custom CUDA kernel to minimize overhead and better exploit warp-level parallelism (a rough sketch follows these goals).
Implement a CUDA kernel so that gradients can be propagated through primary inputs directly, enabling fully GPU-resident training. Develop and integrate a dependency table that dynamically tracks which gate computations have resolved their input dependencies.
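A rough sketch of how these two goals could fit together (this is our own draft with assumed array names and gate encodings, not the final kernel; the real implementation also needs the fault-observability terms and more gate types). Each kernel launch processes one frontier of gates whose inputs are all resolved; finishing a gate decrements the dependency counters of its fanouts and appends newly ready gates to the next frontier:

    enum { GATE_AND = 0, GATE_OR = 1, GATE_XOR = 2, GATE_NOT = 3 };

    // One thread per ready gate. Frontiers are processed one kernel launch per wave,
    // so the fanin values read here were written in an earlier launch.
    __global__ void propagate_frontier(const int* frontier, int frontier_size,
                                       const int* fanin_ptr, const int* fanin_idx,
                                       const int* fanout_ptr, const int* fanout_idx,
                                       const int* gate_type, float* value,
                                       int* deps_left, int* next_frontier, int* next_size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= frontier_size) return;
        int g = frontier[i];

        // Fold the relaxed gate rule over the fanin values (gates here have >= 1 fanin).
        int begin = fanin_ptr[g], end = fanin_ptr[g + 1];
        float v = value[fanin_idx[begin]];
        for (int e = begin + 1; e < end; ++e) {
            float x = value[fanin_idx[e]];
            switch (gate_type[g]) {                    // divergence only within mixed warps
                case GATE_AND: v = v * x;                break;
                case GATE_OR:  v = v + x - v * x;        break;
                case GATE_XOR: v = v + x - 2.0f * v * x; break;
            }
        }
        if (gate_type[g] == GATE_NOT) v = 1.0f - v;
        value[g] = v;

        // Dependency table: release fanouts whose inputs are now all resolved.
        for (int e = fanout_ptr[g]; e < fanout_ptr[g + 1]; ++e) {
            int s = fanout_idx[e];
            if (atomicSub(&deps_left[s], 1) == 1) {      // last dependency just resolved
                int slot = atomicAdd(next_size, 1);
                next_frontier[slot] = s;
            }
        }
    }

The host loop would swap frontier and next_frontier (and reset next_size) between launches until every gate has been processed.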
We will implement the two propagations as standalone CUDA kernels exposed via custom PyTorch ops, and call them manually after extracting the necessary input features from the graph.
We have chosen C++ and CUDA C++ as the core languages for implementing our parallel fault simulator due to their ability to exploit GPU parallelism. They allow us to tailor custom kernels for the critical gate-level operations, ensuring that the logic value and fault propagation computations execute as efficiently as possible. We will also adjust the Python baseline to use our code via torch.utils.cpp_extension.
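A sketch of the intended integration path (file, module, and function names below are placeholders of our own, not a finalized API):

    // binding.cpp -- exposes the CUDA launcher to Python as a custom op.
    #include <torch/extension.h>

    // Defined in the .cu file; launches the propagation kernel on the current stream
    // and returns the updated per-gate values.
    torch::Tensor propagate_values(torch::Tensor fanin_ptr, torch::Tensor fanin_idx,
                                   torch::Tensor gate_type, torch::Tensor value);

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("propagate_values", &propagate_values,
              "Relaxed logic value propagation (CUDA)");
    }

    // On the Python side, the extension would be built and called roughly as:
    //   from torch.utils.cpp_extension import load
    //   fault_sim = load(name="fault_sim", sources=["binding.cpp", "propagate.cu"])
    //   values = fault_sim.propagate_values(fanin_ptr, fanin_idx, gate_type, values)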
We will utilize GHC machines equipped with RTX 2080 GPUs. These machines offer a balanced combination of robust CUDA support and sufficient computational power. Additionally, if feasible, we plan to extend our experiments for larger circuits on the ECE lab’s machines, which boast GPUs with a greater number of CUDA cores.
Week 1 (03/27-04/01): Understand how pull() and apply_edges() propagate logic and fault values in the baseline. (Done)
Week 2 (04/02-04/08): Set up building and loading custom C++/CUDA extensions from the Python baseline via torch.utils.cpp_extension. (Done)
Week 3 (04/09-04/15): Implement CUDA kernels to replace pull() and apply_edges(), following the rule for each gate. (To be continued)
Week 4-1 (04/16-04/18): Continue the kernel work for apply_edges().
Week 4-2 (04/19-04/21):
Week 5-1 (04/22-04/24):
Week 5-2 (04/25-04/28):