Team Member: Xinyu Li (xinyul3), William Zou (yangzou)
URL: Project Webpage
We are going to implement a parallel version of the differentiable fault simulator for digital circuits by accelerating its major logic value and fault propagation steps with custom CUDA kernels. Our goal is to improve performance over the current Python implementation. We will analyze performance on the NVIDIA GPUs available in the GHC cluster and, if feasible, in the ECE lab.
The differentiable fault simulator is a tool for post-silicon validation of digital circuits. It checks whether a given input pattern can activate a fault and make it observable at the circuit’s output. The circuit is represented as a directed graph with logic gates as nodes and wires as edges. By using smooth approximations of Boolean logic, the simulator can compute logic values and fault observability in a way that supports gradient-based search for valid test patterns.
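As a concrete illustration of what such smooth approximations can look like (this is a generic sketch of our own, not necessarily the exact relaxations used in the existing simulator), Boolean gates are replaced by continuous functions over values in [0, 1]:

    // Generic smooth relaxations of Boolean gates over values in [0, 1].
    // Illustrative only; the research code may use different relaxations.
    __device__ __forceinline__ float soft_not(float a)          { return 1.0f - a; }
    __device__ __forceinline__ float soft_and(float a, float b) { return a * b; }
    __device__ __forceinline__ float soft_or (float a, float b) { return a + b - a * b; }
    __device__ __forceinline__ float soft_xor(float a, float b) { return a + b - 2.0f * a * b; }

Because these functions are differentiable, gradients of an observability objective can flow back to the primary inputs, which is what makes gradient-based search for test patterns possible.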
The computationally intensive components of the simulator are the logic value propagation and fault propagation phases, which are currently implemented sequentially in Python. Several aspects of the problem are well-suited for parallelization: (1) the value and fault propagation steps involve repeated element-wise products, sums, and probabilistic masking operations, all of which are highly data-parallel; (2) the current implementation processes only one type of gate at a time, but different gate types could be propagated concurrently; and (3) although logic gates have data dependencies, many of these dependencies may be resolved earlier than a strict topological layer-by-layer approach would allow, enabling finer-grained parallelism across the graph.
Through this project, we aim to learn how to adapt dependent directed-graph-based workloads to the GPU’s SIMD architecture while balancing correctness and performance.
Digital circuits are modeled as directed graphs where the computation at each node (gate) depends on the results from its predecessor nodes. This dependency means that logic value and fault observability updates must follow the circuit's topological order: a gate cannot be evaluated before all of its predecessors have been.
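A standard way to expose parallelism despite this constraint is to levelize the circuit: gates in the same topological level have no dependencies on one another and can be updated together, one level per kernel launch. A minimal host-side sketch, assuming gates are indexed in topological order and that fanins[g] (a placeholder name of our own) lists the predecessors of gate g:

    #include <vector>
    #include <algorithm>

    // Assign each gate a topological level; gates sharing a level are independent.
    std::vector<int> compute_levels(const std::vector<std::vector<int>>& fanins) {
        std::vector<int> level(fanins.size(), 0);
        for (size_t g = 0; g < fanins.size(); ++g)   // gates assumed in topological order
            for (int p : fanins[g])
                level[g] = std::max(level[g], level[p] + 1);
        return level;
    }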
The simulation involves repeated element-wise operations that are inherently parallelizable. However, the circuit's logic gates are represented as a graph structure, which introduces irregular memory access patterns on top of the critical dependencies between layers; this can lead to inefficient cache utilization and increased global memory traffic, and it makes locality analysis and exploitation difficult.
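One layout we plan to evaluate to mitigate this is a compressed sparse row (CSR) arrangement of gate fanins, so that each gate's predecessor indices occupy a contiguous range and per-gate features live in flat arrays. A sketch of the structure (field names are our own, not from the existing code):

    // CSR-style circuit layout: the fanins of gate g are stored contiguously in
    // fanin_idx[fanin_ptr[g] .. fanin_ptr[g + 1]).
    struct CircuitCSR {
        int*   fanin_ptr;   // [num_gates + 1] offsets into fanin_idx
        int*   fanin_idx;   // [num_edges]     predecessor gate ids
        int*   gate_type;   // [num_gates]     encoded gate type (AND, OR, NOT, ...)
        float* value;       // [num_gates]     relaxed logic value per gate
    };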
Different gate types have different propagation formulas, which means the amount of computation varies from gate to gate. This variability can lead to divergent execution, where some threads finish their work sooner than others, and may also cause load imbalance.
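One mitigation we are considering is to group gates of the same type within each independent batch (for example, within a topological level), so threads in the same warp mostly execute the same propagation formula. A hedged host-side sketch (names are our own):

    #include <vector>
    #include <algorithm>

    // Sort the gate ids of one level by gate type so warps see mostly uniform types,
    // reducing divergence and evening out per-thread work in the propagation kernel.
    void sort_level_by_type(std::vector<int>& level_gates, const std::vector<int>& gate_type) {
        std::sort(level_gates.begin(), level_gates.end(),
                  [&](int a, int b) { return gate_type[a] < gate_type[b]; });
    }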
The simulator has a moderately high communication-to-computation ratio: each logic gate's operation is lightweight but requires pulling values from multiple neighbors and writing back results.
The GPU's SIMD architecture is best suited for uniform, dense computations with regular memory access. Our application involves sparse, topology-driven updates across a directed graph. Because each gate depends on the outputs of its predecessors, we must carefully preserve execution order while maximizing available parallelism. Additionally, gates of different types require different propagation logic, which introduces thread divergence and complicates kernel design. Irregular access to node and edge features also reduces spatial locality and makes coalesced memory access difficult to achieve.
We are starting from an existing Python-based differentiable fault simulator developed by an ECE PhD student (Wei Li) in his ongoing research. The simulator is written in PyTorch and DGL and models logic value and fault propagation across a circuit graph using continuous relaxations of Boolean logic. While this provides a working high-level reference implementation, there is no existing CUDA or C++ codebase for accelerating this graph-based simulator, so we will implement the GPU kernels from scratch. We will primarily use the NVIDIA GPUs available in the GHC cluster. If feasible (subject to the PhD student's feedback), we will benchmark performance on the more advanced GPUs in the ECE lab.
Achieve a 10x speedup over the current Python-based differentiable fault simulator. This performance target would help close the runtime gap between our differentiable, gradient-based approach and the existing CPU-based, heuristic ATPG (Automatic Test Pattern Generation) tools commonly used in industry.
Implement a CUDA kernel to offload the most computationally intensive parts of the simulator: the functions responsible for logic value propagation and fault observability propagation. This would allow many different gates to propagate concurrently. We will integrate the propagation rules of the different gate types directly into our custom CUDA kernel to minimize overhead and better exploit warp-level parallelism (a rough sketch follows these goals).
Implement a CUDA kernel so that gradients can be propagated through primary inputs directly, enabling fully GPU-resident training. Develop and integrate a dependency table that dynamically tracks which gate computations have resolved their input dependencies.
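A rough sketch of how these two goals could fit together (this is our own draft with assumed array names and gate encodings, not the final kernel; the real implementation also needs the fault-observability terms and more gate types). Each kernel launch processes one frontier of gates whose inputs are all resolved; finishing a gate decrements the dependency counters of its fanouts and appends newly ready gates to the next frontier:

    enum { GATE_AND = 0, GATE_OR = 1, GATE_XOR = 2, GATE_NOT = 3 };

    // One thread per ready gate. Frontiers are processed one kernel launch per wave,
    // so the fanin values read here were written in an earlier launch.
    __global__ void propagate_frontier(const int* frontier, int frontier_size,
                                       const int* fanin_ptr, const int* fanin_idx,
                                       const int* fanout_ptr, const int* fanout_idx,
                                       const int* gate_type, float* value,
                                       int* deps_left, int* next_frontier, int* next_size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= frontier_size) return;
        int g = frontier[i];

        // Fold the relaxed gate rule over the fanin values (gates here have >= 1 fanin).
        int begin = fanin_ptr[g], end = fanin_ptr[g + 1];
        float v = value[fanin_idx[begin]];
        for (int e = begin + 1; e < end; ++e) {
            float x = value[fanin_idx[e]];
            switch (gate_type[g]) {                    // divergence only within mixed warps
                case GATE_AND: v = v * x;                break;
                case GATE_OR:  v = v + x - v * x;        break;
                case GATE_XOR: v = v + x - 2.0f * v * x; break;
            }
        }
        if (gate_type[g] == GATE_NOT) v = 1.0f - v;
        value[g] = v;

        // Dependency table: release fanouts whose inputs are now all resolved.
        for (int e = fanout_ptr[g]; e < fanout_ptr[g + 1]; ++e) {
            int s = fanout_idx[e];
            if (atomicSub(&deps_left[s], 1) == 1) {      // last dependency just resolved
                int slot = atomicAdd(next_size, 1);
                next_frontier[slot] = s;
            }
        }
    }

The host loop would swap frontier and next_frontier (and reset next_size) between launches until every gate has been processed.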
We will implement the two propagations as standalone CUDA kernels exposed via custom PyTorch ops, and call them manually after extracting the necessary input features from the graph.
We have chosen C++ and CUDA C++ as the core languages for implementing our parallel fault simulator due to their ability to exploit GPU parallelism. They allow us to tailor custom kernels for the critical gate-level operations, ensuring that the logic value and fault propagation computations execute as efficiently as possible. We will also adjust the Python baseline to use our code via torch.utils.cpp_extension.
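A sketch of the intended integration path (file, module, and function names below are placeholders of our own, not a finalized API):

    // binding.cpp -- exposes the CUDA launcher to Python as a custom op.
    #include <torch/extension.h>

    // Defined in the .cu file; launches the propagation kernel on the current stream
    // and returns the updated per-gate values.
    torch::Tensor propagate_values(torch::Tensor fanin_ptr, torch::Tensor fanin_idx,
                                   torch::Tensor gate_type, torch::Tensor value);

    PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
        m.def("propagate_values", &propagate_values,
              "Relaxed logic value propagation (CUDA)");
    }

    // On the Python side, the extension would be built and called roughly as:
    //   from torch.utils.cpp_extension import load
    //   fault_sim = load(name="fault_sim", sources=["binding.cpp", "propagate.cu"])
    //   values = fault_sim.propagate_values(fanin_ptr, fanin_idx, gate_type, values)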
We will utilize GHC machines equipped with RTX 2080 GPUs. These machines offer a balanced combination of robust CUDA support and sufficient computational power. Additionally, if feasible, we plan to extend our experiments for larger circuits on the ECE lab’s machines, which boast GPUs with a greater number of CUDA cores.
Week 1 (03/27-04/01): Understand how pull() and apply_edges() propagate logic and fault values in the baseline. (Done)
Week 2 (04/02-04/08): Set up building and loading custom C++/CUDA extensions from the Python baseline via torch.utils.cpp_extension. (Done)
Week 3 (04/09-04/15): Implement CUDA kernels to replace pull() and apply_edges(), following the rule for each gate. (To be continued)
Week 4-1 (04/16-04/18): Continue the kernel work for apply_edges().
Week 4-2 (04/19-04/21):
Week 5-1 (04/22-04/24):
Week 5-2 (04/25-04/28):