Research presented at HPCA

Associate Professor Mark Hempstead presented two papers at the IEEE International Symposium on High-Performance Computer Architecture (HPCA).

SnackNoC: Processing in the Communication Layer

Associate Professor Mark Hempstead and collaborators from Drexel University reimagined a multiprocessor chip's communication subsystem, or Network-on-Chip (NoC), to serve both as a communications system and as a platform for computation. The team observed that modern multicore systems have high-bandwidth NoCs provisioned for worst-case traffic, so most of the time these NoCs operate inefficiently, with long idle periods and unused resources. Like graduate students snacking on leftover food, these leftover NoC resources can perform useful work. The researchers developed SnackNoC, a platform that repurposes the communication subsystem's idle resources to compute linear algebra kernels. The team's experiments show that SnackNoC can extract additional performance equivalent to between two and six x86 cores. Running these additional software kernels on SnackNoC does not harm the performance of the main workload and requires only a few additional resources compared to a standard NoC and uncore.
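To make the idea concrete, here is a minimal sketch of the concept, not the SnackNoC implementation itself (which is built into the NoC hardware): a toy router loop in which cycles not consumed by packet traffic are spent on steps of a linear-algebra kernel. All names and parameters below are illustrative.

```python
import random

def make_kernel(a, b):
    """Yield one multiply-accumulate step of a dot product per idle cycle."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
        yield total

def simulate(cycles=1000, traffic_prob=0.3, seed=0):
    rng = random.Random(seed)
    a = [rng.random() for _ in range(200)]
    b = [rng.random() for _ in range(200)]
    kernel = make_kernel(a, b)
    forwarded = idle_work = 0
    result = 0.0
    for _ in range(cycles):
        if rng.random() < traffic_prob:
            forwarded += 1              # cycle consumed by real NoC traffic
        else:
            step = next(kernel, None)   # "snack" on the idle cycle
            if step is not None:
                result = step
                idle_work += 1
    print(f"packets forwarded: {forwarded}, "
          f"kernel steps done in idle cycles: {idle_work}, "
          f"partial dot product: {result:.3f}")

if __name__ == "__main__":
    simulate()
```

The point of the sketch is the scheduling policy: the main workload's traffic always wins the cycle, and the kernel only ever consumes cycles that would otherwise be wasted, which is why the extra computation comes at essentially no cost to communication performance.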

Architectural Implications of Facebook’s DNN-based Personalized Recommendation

To recommend content to its users, an online platform collects a user's interactions on the web and feeds that data into a network that models the user's likes and dislikes. These models are deep neural networks (DNNs), yet little attention has been devoted to their computational performance when used for recommendation. In a recent paper, Hempstead and collaborators at Facebook investigated the computational performance of these recommendation systems with a set of real-world, production-scale DNNs for personalized content recommendation. The team observed that inference latency, the time required to process an input through the DNN, varied by nearly 60% across three Intel server generations. The researchers also found that batching inputs can greatly improve throughput under latency constraints, and that the diversity across recommendation models leads to different optimization strategies for each.
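As a rough illustration of the batching trade-off, here is a minimal timing sketch, assuming PyTorch and a toy MLP as a stand-in for a production recommendation model (the production DNNs also include large embedding-table lookups, which this sketch omits); the architecture and sizes are hypothetical.

```python
import time
import torch

# Hypothetical toy stand-in for a recommendation DNN: a small MLP
# over dense features.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
model.eval()

def measure(batch_size, iters=200):
    x = torch.randn(batch_size, 256)
    with torch.no_grad():
        for _ in range(10):                     # warm-up runs
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        elapsed = time.perf_counter() - start
    per_batch = elapsed / iters * 1e3           # ms per batch
    throughput = batch_size * iters / elapsed   # inferences per second
    print(f"batch={batch_size:4d}  latency={per_batch:7.3f} ms/batch  "
          f"throughput={throughput:10.0f} inf/s")

for bs in (1, 8, 64, 256):
    measure(bs)
```

Larger batches typically raise per-batch latency but amortize fixed per-call costs, improving inferences per second, which is why serving systems tune batch size against a latency target rather than always running one input at a time.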