Keynote Talks @ DFT 2018

Dr. Nirmal R. Saxena, NVIDIA

Redundancy & Testability Hit the Road for Resilient Autonomous Driving

 

According to the 2015 vehicular accident report [www.nhtsa.gov], there were more than 35,000 fatal crashes and more than 6 million non-fatal crashes. Translating the 35K fatal crashes, over 3 trillion miles driven, into a FIT rate (failures in time, where time = one billion hours) gives a fatality FIT rate in the range 250-500. A drive system in a fully autonomous car that replaces the human driver must be at least an order of magnitude more resilient. In fact, the ISO 26262 automotive safety standard stipulates a probabilistic metric for hardware failures (PMHF) of at most 10 FITs. This requirement is almost two orders of magnitude better than the driver-related fatal accident FIT rate, notwithstanding the fact that not all hardware failures result in fatalities. Among other components in a drive system, deep neural networks use the computational power of massively parallel processors. Autonomous driving demands extremely high resiliency and trillions of operations per second of computing performance to process sensor data with extreme accuracy. This keynote examines various approaches to achieving resiliency in autonomous cars and makes the case for redundancy based on design diversity.
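The conversion from crashes per mile to a FIT rate needs an average driving speed, which the abstract does not state. The sketch below is one way to reproduce the arithmetic; the function name `fatality_fit` and the speeds of 20 to 45 mph are illustrative assumptions, not figures from the talk.

```python
# Sketch of the crash-to-FIT conversion described in the abstract.
# ASSUMPTION: the average driving speeds below are illustrative only;
# the abstract does not say which speed was used.

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per billion hours


def fatality_fit(fatal_crashes, miles_driven, avg_speed_mph):
    """Return fatal crashes per billion driving hours."""
    driving_hours = miles_driven / avg_speed_mph
    return fatal_crashes / driving_hours * HOURS_PER_FIT_UNIT


FATAL_CRASHES = 35_000   # from the abstract (2015)
MILES_DRIVEN = 3e12      # from the abstract (3 trillion miles)
PMHF_LIMIT_FIT = 10      # PMHF bound quoted in the abstract

for speed_mph in (20, 30, 45):  # assumed average speeds
    fit = fatality_fit(FATAL_CRASHES, MILES_DRIVEN, speed_mph)
    ratio = fit / PMHF_LIMIT_FIT
    print(f"{speed_mph} mph: {fit:.0f} FIT, {ratio:.0f}x the PMHF limit")
```

Under these assumed speeds the result runs from about 233 to 525 FIT, which brackets the 250-500 range quoted in the abstract and sits 23 to 53 times above the 10 FIT PMHF bound.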

 

Nirmal R. Saxena is currently a distinguished engineer at NVIDIA and is responsible for high-performance and automotive resilient computing. From 2011 through 2015, Nirmal was associated with Inphi Corp as CTO for Storage & Computing and with Samsung Electronics as Sr. Director, working on fault-tolerant DRAM memory and storage array architectures. From 2006 through 2011, Nirmal held roles as Principal Architect, Chief Server Hardware Architect & VP at NVIDIA. From 1991 through 2009, he was also associated with Stanford University’s Center for Reliable Computing and EE Department as Associate Director and Consulting Professor, respectively. During his association with Stanford University, he taught courses in Logic Design, Computer Architecture, and Fault-Tolerant Computing, supervised six PhD students, and was co-investigator with Professor Edward J. McCluskey on DARPA’s ROAR (Reliability Obtained through Adaptive Reconfiguration) project. Nirmal was the Executive VP, CTO, and COO at Alliance Semiconductor, Santa Clara. Prior to Alliance, Nirmal was VP of Architecture at Chip Engines. Nirmal has served in senior technical and management positions at Tiara Networks, Silicon Graphics, HaL Computers, and Hewlett Packard. Nirmal received his BE ECE degree (1982) from Osmania University, India; his MSEE degree (1984) from the University of Iowa; and his Ph.D. EE degree (1991) from Stanford University. He is a Fellow of IEEE (2002) and was cited for his contributions to reliable computing.

Dr. Nathan DeBardeleben, Los Alamos National Laboratory

Supercomputer Reliability - Actionable Insights from Neutrons, Data Analytics, and Field Data

 

This talk discusses research and development efforts at the Ultrascale Systems Research Center (USRC) at Los Alamos National Laboratory (LANL) to characterize supercomputer reliability and fault tolerance. The talk will explore how neutron beam testing, coupled with tens of billions of device hours of field data, can inform production datacenter planning decisions. The talk will also present data from neutron detectors used to characterize the datacenter and environment at LANL, and look at how machine learning and data analytics are playing a major role in studying the extreme volumes of telemetry and system logs that LANL ingests every day.

 

Nathan DeBardeleben is a senior research scientist at Los Alamos National Laboratory (LANL) and the Co-Executive Director of the Ultrascale Systems Research Center (USRC). He received his Ph.D. in Computer Engineering from Clemson University in 2004 and joined LANL the same year. After working on several scientific code teams, he began leading the fault tolerance efforts at LANL with an emphasis on characterization of hardware and software. Nathan leads a research team that develops PFSEFI, a parallel software fault injector that simulates soft errors in parallel applications and that scientists use to evaluate application resiliency to certain types of faults. His team's other major focus is supercomputer reliability characterization through data analytics of system logs and telemetry, neutron beam experiments, and neutron detection for environmental characterization.