Reducing the overhead of data transfer in data-parallel programs
Graphics processing units (GPUs) have evolved to support general-purpose many-core programming. Most GPUs have their own separate memory, so input data must be transferred to the GPU before a program runs and the results transferred back to the CPU upon completion. This data transfer imposes significant overhead that we would like to reduce. A possible solution is to split a program into many smaller pieces, called tiles, and then set up a pipeline that overlaps data transfers with program execution on the GPU. This can reduce the overhead of data transfers significantly. We examine the effectiveness of several variations of this tiling/pipelining transformation for a common class of programs. We introduce a model that predicts the run-time performance of these transformations ahead of time, as well as a recipe that guides users in transforming their code. We show that one of these transformations provides a good speed-up: for some problems, the transformed program runs over 2 times faster than versions that do not overlap data transfer and program execution. Finally, we show that our model accurately predicts the run time of programs using this transformation.
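To illustrate why overlapping transfers with execution can help, the following is a minimal sketch of a three-stage pipeline timing estimate. It is not the performance model introduced in this work. The function names, the `overhead` parameter, and the assumptions that costs divide evenly across tiles and that copy-in, kernel execution, and copy-out can each process a different tile concurrently are all hypothetical choices made for illustration.

```python
def serial_time(t_in, t_kernel, t_out):
    """Run time when copy-in, kernel, and copy-out happen one after another."""
    return t_in + t_kernel + t_out


def pipelined_time(t_in, t_kernel, t_out, tiles, overhead=0.0):
    """Estimate run time when the work is split into `tiles` equal tiles.

    Assumptions (illustrative only): each stage's cost divides evenly across
    tiles, every per-tile operation pays a fixed `overhead`, and the three
    stages (copy-in, kernel, copy-out) can each work on a different tile at
    the same time.
    """
    if tiles < 1:
        raise ValueError("tiles must be at least 1")
    stage_costs = [t / tiles + overhead for t in (t_in, t_kernel, t_out)]
    # done[s] is the time at which stage s finished its most recent tile.
    done = [0.0] * len(stage_costs)
    for _ in range(tiles):
        ready = 0.0  # time at which the current tile leaves the previous stage
        for s, cost in enumerate(stage_costs):
            # A stage starts a tile once the tile has arrived and the stage is free.
            start = max(ready, done[s])
            done[s] = start + cost
            ready = done[s]
    return done[-1]


if __name__ == "__main__":
    base = serial_time(4.0, 4.0, 4.0)
    for n in (1, 2, 4, 8):
        t = pipelined_time(4.0, 4.0, 4.0, tiles=n, overhead=0.1)
        print(f"tiles={n}: estimated time {t:.2f}, speed-up {base / t:.2f}x")
```

Under these assumptions, more tiles shorten the pipeline's fill and drain phases, while the fixed per-tile overhead grows with the tile count, so the estimate captures the trade-off involved in choosing a tile size.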