Keynote speech

Title: Memory Errors in Modern Systems

Speaker: Vilas Sridharan, AMD Inc.

Hardware faults are commonplace, especially in memory subsystems built from DRAM and SRAM devices. When deployed in mission-critical or high-reliability environments such as supercomputers or data centers, these subsystems must provide resilience techniques to tolerate such faults. To design resilient memory systems, one must understand which faults are likely to occur. One way to do this is to analyze the faults that actually occur in systems deployed in the field. In this talk, I will focus on lessons about DRAM and SRAM reliability gathered from systems in the field. I will also touch on the issues involved in performing large-scale field studies.

Vilas Sridharan works in the RAS (Reliability, Availability, and Serviceability) Architecture group at AMD Inc., where he is responsible for defining the reliability features of all AMD server products. He received his Ph.D. and M.S.E. from the Department of Electrical and Computer Engineering at Northeastern University, and his B.S.E. in Computer Engineering from Princeton University in 2000. From 2000 to 2004, he worked in the SPARC server division at Sun Microsystems. His research focuses on the modeling of hardware faults and on architectural and micro-architectural approaches to reliability and fault tolerance in high-performance microprocessors.