Improving Cache Performance in Structured GPGPU Workloads via Specialized Thread Schedules
Summary
Efficient cache utilization is critical for programs with high data throughput. Improving performance in this area often requires niche knowledge of computer architecture, extensive benchmarking, and algorithms that do more than what is intuitively required. Changing the order in which tasks are executed changes the order in which memory is accessed, and thereby how caches are used. This thesis proposes a column iterator that reschedules a 2D workload. By implementing the proposed method in C++ with CUDA and as an extension to the data-parallel DSL Accelerate, we show that performance can be improved compared to the naive schedule.