Today I ran into an issue with a Cisco UCS chassis. The error in question was “F0411 Thermal condition on chassis is upper-non-recoverable”. The fault would turn the chassis red under Equipment in Cisco Unified Computing System Manager, then it would drop to yellow, with the fans reported as “inoperable” even though they weren’t. The status flipped back and forth between yellow and red thousands of times. While the alert was concerning to see, it’s important to note that the UCS chassis itself was functioning normally with no issues. Still, it was worth investigating. Upon inspection of the device, everything was working normally; however, the fans were running full blast. To resolve the issue I opened a ticket with Cisco TAC.
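If I wanted to put a number on the flapping before opening a case, a minimal sketch might look like the one below. It assumes the fault history has already been pulled into a plain list of severity strings in time order; that format, the sample data, and the `count_flaps` name are all hypothetical and not something UCS Manager exports as-is.

```python
from collections import Counter


def count_flaps(severities):
    """Count severity changes between consecutive samples.

    `severities` is a hypothetical, time-ordered list such as
    ["red", "yellow", "red"]; it is not a UCS Manager export format.
    """
    transitions = Counter()
    for prev, curr in zip(severities, severities[1:]):
        if prev != curr:  # only count actual state changes
            transitions[(prev, curr)] += 1
    return transitions


# Made-up sample history for illustration only
history = ["red", "yellow", "red", "red", "yellow", "red"]
flaps = count_flaps(history)
print("total flaps:", sum(flaps.values()))
for (prev, curr), n in flaps.items():
    print(f"{prev} -> {curr}: {n}")
```

A count like this is only a convenience for describing the problem to support; the tech support file is still what TAC actually needs.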
First things first, I had to upload logs from the chassis. To get the chassis logs, follow these steps from Cisco’s website:
- Login to Cisco UCS Manager
- Go to the Admin tab in the upper left-hand corner, then select All
- Select Create and Download Tech Support File
- Select the path to place the downloaded files
- Under options, select chassis
- Ensure the appropriate chassis ID is selected
- Click OK and wait for the file to generate
Once this completed, I attached the files to the support case and waited to hear back from TAC.
Cisco’s response times are super fast. Hooray for good customer support! I heard back within the hour, and they indicated they were reviewing the logs. That afternoon, TAC let me know the issue was with the I2C bus. Basically, the I2C bus carries information about the different components (chassis, IOMs, fans, PSUs, etc.) to the Unified Computing System. In a nutshell, the I2C bus got overwhelmed and was therefore throwing the error, even though no thermal event was actually occurring. Support suggested I re-seat the fans, power supplies, IOMs, etc. If that didn’t resolve the issue, then an entire chassis power cycle would be needed!
I performed the steps TAC indicated and re-seated each fan one by one, waiting three minutes after re-inserting a fan before moving on to the next. This all went according to plan, with each fan going from yellow/alerting to not alerting once re-seated. Progress was being made, or so it appeared.
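To keep myself honest about the three-minute wait, the one-by-one procedure can be written down as a simple timed checklist. This is only a sketch: the fan count of eight, the start time, and the `reseat_schedule` helper are my own assumptions for illustration, so adjust them to your chassis and to whatever support tells you.

```python
from datetime import datetime, timedelta


def reseat_schedule(fan_ids, start, settle_minutes=3):
    """Return (fan_id, time) pairs, leaving a settle gap between fans.

    Hypothetical helper: it only plans timings, it does not talk to UCS.
    """
    schedule = []
    current = start
    for fan_id in fan_ids:
        schedule.append((fan_id, current))
        # wait for the fan to settle before touching the next one
        current += timedelta(minutes=settle_minutes)
    return schedule


# Assumed values: eight fans and an arbitrary 09:00 start
start = datetime(2024, 1, 1, 9, 0)
plan = reseat_schedule(range(1, 9), start)
for fan_id, when in plan:
    print(f"{when:%H:%M}  re-seat Fan {fan_id}, then wait 3 minutes")
```

Printing the plan up front makes it easier to note which fan was out when something unexpected happens.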
While I was doing this, I noticed the Performance and Temp fields for each fan said “N/A”. Per support this was normal, so I continued on. Once I got to Fan 5, the fun began. As soon as I pulled Fan 5, several fans started blinking amber at random intervals, which made for quite the excitement. Some of the power supplies had no light for a few minutes. This continued for several minutes, with lights going green/amber/off randomly on multiple fans.
Fortunately, this just meant I had found the component that was causing the error. After several minutes of fans revving up and down, all lights stabilized green and everything returned to normal. The Performance and Temp fields also went green, which was a welcome sight. This resolved my issue, and all alerts cleared within UCS Manager.
In conclusion, this was a quick exercise that helped me learn more about UCS, and how to troubleshoot hardware problems.
***Disclaimer: this blog is my own, and any steps you take to troubleshoot your system are at your own risk. I take no responsibility. Always check with support before performing any maintenance activities.***