A timer bug can cause a core on AMD’s EPYC 7002 “Rome” server chips to hang after about three years of uptime unless the server is rebooted.
A revision guide for AMD’s EPYC 7002 “Rome” server processors reveals that a chip core could hang after 1,044 days of uptime (approximately three years). The revision guide, which was issued in April 2023, states:
A core will fail to exit CC6 after about 1044 days after the last system reset. The time of failure may vary depending upon spread spectrum and REFCLK frequency.
Tom’s Hardware reported the issue this week. It noted that the bug “can cause a core on the chip to hang after 1,044 days of uptime (~2.93 years), after which you’ll have to reset the server for the chip to run correctly”.
The revision guide also notes that AMD will not fix the issue. However, the workaround is simple: administrators can either reboot the server before it reaches 1,044 days of uptime, which restarts the 1,044-day “timer,” or disable the CC6 sleep state.
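For administrators who choose the reboot workaround, the check can be automated. The following is a minimal sketch, not an AMD-provided tool. It assumes a Linux host, where the first field of /proc/uptime is the number of seconds since boot, and it assumes that OS uptime is a reasonable stand-in for the time since the last system reset. The 30-day safety margin and all function names are hypothetical choices made for illustration.

```python
from pathlib import Path

HANG_THRESHOLD_DAYS = 1044  # approximate figure quoted from AMD's revision guide
SAFETY_MARGIN_DAYS = 30     # hypothetical margin, not an AMD recommendation
SECONDS_PER_DAY = 86_400


def uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into days of uptime.

    The first whitespace-separated field is seconds since boot.
    """
    seconds = float(proc_uptime_text.split()[0])
    return seconds / SECONDS_PER_DAY


def days_until_threshold(days_up: float) -> float:
    """Days remaining before the ~1,044-day mark (negative once past it)."""
    return HANG_THRESHOLD_DAYS - days_up


def needs_reboot(days_up: float, margin: float = SAFETY_MARGIN_DAYS) -> bool:
    """True when uptime is within `margin` days of the threshold, or beyond it."""
    return days_until_threshold(days_up) <= margin


def main() -> None:
    path = Path("/proc/uptime")
    if not path.exists():  # not a Linux host; nothing to check
        print("/proc/uptime not found; skipping check")
        return
    days_up = uptime_days(path.read_text())
    remaining = days_until_threshold(days_up)
    if needs_reboot(days_up):
        print(f"Uptime {days_up:.1f} days: schedule a reboot ({remaining:.1f} days left)")
    else:
        print(f"Uptime {days_up:.1f} days: {remaining:.1f} days before the threshold")


if __name__ == "__main__":
    main()
```

Because the revision guide says the time of failure may vary with spread spectrum and REFCLK frequency, a generous margin is the safer choice. A script like this could run from a daily scheduled job and feed whatever alerting is already in place.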
AMD first introduced the Rome CPUs at its Next Horizon event in November 2018. The EPYC Rome chips are based on the Zen 2 core architecture and are some of the most competitive chips that the Red team has introduced for the data center market.
Chip errata are common
“With billions of transistors in play, issues are inevitable”, Tom’s Hardware notes. “It isn’t uncommon for a chip to have a thousand or more errata/bugs that are corrected in newer steppings of the chip or with firmware tweaks before launch”.
As a frame of reference, Intel’s 8th-gen chips, which launched in 2017, still have more than 150 listed errata. “We don’t know how many errata the Rome chips have had because AMD has removed the listings for errata that have been solved”, the article says. “However, we do know that 39 errata remain, which actually doesn’t seem too bad against the Intel backdrop”.