SynchroTrace: Synchronization-aware Architecture-agnostic Traces for Light-Weight Multicore Simulation of CMP and HPC Workloads

Karthik Sangaiah, Drexel University
Michael Lui, Drexel University
Radhika Jagtap, Arm Ltd.
Stephan Diestelhorst, Arm Ltd.
Siddharth Nilakantan, NVIDIA Corporation
Ankit More, Intel Corporation
Baris Taskin, Drexel University
Mark Hempstead, Tufts University

ACM Transactions on Architecture and Code Optimization (TACO), Vol. 15, No. 1, Article 2. March 2018.


Abstract

Trace-driven simulation of chip multi-processor (CMP) systems offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, portability, and scalability. However, trace-based simulation approaches have difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology. By recording synchronization events relevant to modern threading libraries (e.g., Pthreads and OpenMP) and dependencies in the traces, independent of the host architecture, the methodology is able to accurately model the non-determinism of multi-threaded programs for different hardware platforms and threading paradigms. By capturing high-level instruction categories, the SynchroTrace average-CPI trace replay timing model offers fast and accurate simulation of many-core in-order CMPs. We perform two case studies to validate the SynchroTrace simulation flow against the gem5 full-system simulator: 1) a constraint-based design space exploration with traditional CMP benchmarks and 2) a thread-scalability study with HPC-representative applications. The results from these case studies show that 1) our trace-based approach with trace filtering achieves a peak speedup of 18.7× and an average speedup of 9.6× over gem5 full-system simulation, 2) SynchroTrace maintains the thread-scaling accuracy of gem5 and can efficiently scale up to 64 threads, and 3) SynchroTrace can trace on one platform and model any platform in early stages of design.