Crashes and CPU Errors at Boot since Kernel 5.10

Nah, “LTS” just means it will need support for a long time to fix all the bugs… :wink:


I just had a quick look at it and it seems to have everything zen has and some other things. I will give it a go. This release has been the messiest I’ve seen and I’ve been using Linux for 6 years now. All my friends think I’m a fool for using Linux and with kernel 5.10 I kind of actually look like the fool.

I think it is the build, drive format, GPUs, CPUs, or a combination of things.

I have absolutely zero issues with 5.10 zen (gaming PC, Intel CPU/NVIDIA GPU) and 5.10 standard (laptop, Intel CPU and GPU), both ext4 formatted.

Linux kernel 5.10.8 is now in Arch…hit the update.


It seems worse after updating to kernel 5.10.8; I tried both linux-zen and linux-lqx. I crashed twice during a 24-man raid in FFXIV.



Have you tried disabling iommu?

If that doesn’t work, just use the 5.4 LTS kernel and check each new 5.10 point release to see if it is fixed.


In the AUR there is a logging tool: mcelog.

Got it from this:


I added processor.max_cstate=5 and then set up mcelog as @ringo suggested. Fingers crossed; I will try disabling IOMMU later if I am still crashing. I’m just trying things one by one so I can pinpoint the solution when I find it.
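When testing kernel parameters one at a time like this, it helps to confirm that the parameter actually reached the running kernel before drawing conclusions from a crash. Below is a minimal sketch, assuming a Linux system where the active command line is exposed in /proc/cmdline. The helper names (`parse_cmdline`, `report`, `read_running_cmdline`) are made up for illustration, and the parser deliberately ignores quoted values that contain spaces.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification: quoted values containing spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def report(cmdline: str, wanted: dict) -> dict:
    """Return {name: bool} telling whether each wanted parameter has the wanted value."""
    active = parse_cmdline(cmdline)
    return {name: active.get(name) == value for name, value in wanted.items()}


def read_running_cmdline() -> str:
    """Read the command line of the currently running kernel (Linux only)."""
    return Path("/proc/cmdline").read_text()


# The two parameters discussed in this thread.
WANTED = {"processor.max_cstate": "5", "amd_iommu": "off"}
```

On the affected machine, `report(read_running_cmdline(), WANTED)` would show which of the two parameters are in effect after a reboot.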


So it happened again with processor.max_cstate=5 and kernel 5.10.9. mcelog gives:

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor. Please use the edac_mce_amd module instead.
CPU is unsupported

I also tried rasdaemon and get "No MCE errors."

Have you tried disabling SMT or the uOP cache?

Please see this post in the Manjaro forum:

They get the same MCE error messages as you, but the messages do not cause a hang. They are just cosmetic and can possibly be suppressed with kernel command-line parameters.

So chances are that the MCE messages you see are just coincidental and not the root cause of the crash.

Please also have a look at the wiki page: https://en.wikipedia.org/wiki/Machine-check_exception

Specifically where it says:

Machine checks are a hardware problem, not a software problem.

amd_iommu=off

@ricklinux I have not touched SMT or the uOP cache.

@mbod I changed the kernel parameter to add .rasdaemon and it seems odd that I do not see those messages when rebooting or booting normally; it’s only after crashing.

@otherbarry I added that now so we will wait and see.

A big shoutout and thank you to everyone helping troubleshoot this.

I suggest you focus on a hardware issue.

Have you recently changed HW or BIOS?
I see that you have a new Ryzen 9 3900X. Is that CPU fully supported by your board?
What RAM do you have, and at what speed does it run?
Are you overclocking anything?
What is memtest86 telling you?

No recent hardware changes, though I did update the BIOS multiple times. Looking at rasdaemon, I see the following for MCE errors:
1 2021-01-21 12:28:37 -0500 error: Corrected error, no action required., CPU 2, bank Platform Security Processor (bank=25), mcg mcgstatus=0, mci CECC, mcgcap=0x0000011c, status=0x98004000003e0000, misc=0xd01a000b00000000, walltime=0x6009b9c6, cpuid=0x00870f10, bank=0x00000019

and this for disk errors:
1089 2021-01-21 16:58:56 -0500 error: dev=0:66304, sector=721912480, nr_sector=16, error='unknown block error', rwbs='RA', cmd='',
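The raw `status` and `walltime` fields in the rasdaemon MCE line above can be decoded by hand to cross-check the "Corrected error" summary. This is a rough sketch, not rasdaemon's own decoder: the bit table is an assumption based on the common x86 machine-check status layout (bits 57 to 63) plus the AMD-specific CECC/UECC/Deferred/Poison bits, and the function names are made up for illustration.

```python
from datetime import datetime, timezone

# Status bits, highest first. Bits 57-63 follow the common x86 machine-check
# layout; bits 43-46 are AMD-specific. Treat this table as an assumption.
STATUS_BITS = [
    (63, "VAL"),       # entry is valid
    (62, "OVER"),      # an earlier error was overwritten
    (61, "UC"),        # uncorrected error
    (60, "EN"),        # error reporting enabled
    (59, "MISCV"),     # MISC register holds valid data
    (58, "ADDRV"),     # ADDR register holds valid data
    (57, "PCC"),       # processor context corrupt
    (46, "CECC"),      # corrected by ECC
    (45, "UECC"),      # uncorrected ECC error
    (44, "Deferred"),
    (43, "Poison"),
]


def decode_status(status: int) -> list:
    """Return the names of the flag bits that are set in an MCA status value."""
    return [name for bit, name in STATUS_BITS if (status >> bit) & 1]


def decode_walltime(walltime: int) -> datetime:
    """Interpret the walltime field as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(walltime, tz=timezone.utc)


# Values copied from the rasdaemon MCE line above.
flags = decode_status(0x98004000003E0000)
when = decode_walltime(0x6009B9C6)
print(flags)
print(when.isoformat())
```

For the logged value this yields VAL, EN, MISCV and CECC, with UC and PCC clear, which is consistent with the "Corrected error, no action required" text and the "mci CECC" field in the log line.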

Excuse me if it has already been mentioned, but did you update your BIOS?
Not necessarily the latest version, but it should have AGESA 1.0.8.0 or maybe 1.1.0.0.

I had plenty of RAM-related issues with Ryzen 9, for example very long startup times due to faulty memory training, reboots with MCE errors, etc.
They have been solved with a more recent BIOS version.


I do not experience this phenomenon at the moment.

A lot of people using NVIDIA drivers notice it, BUT:

It seems to be more prevalent among users of the dkms version, which makes sense, since several commands that have been in the kernel forever, and that many, many dkms drivers (NVIDIA and others!) use, have been removed.

Searching online shows that Fedora users with NVIDIA systems are the second hardest hit by the 5.10 kernels, after Arch-based users, and they use a similar system to install the drivers. And again, it also hits users of wifi and network cards that use dkms modules.

Basically, this is not really a kernel problem as much as a driver problem, where hardware manufacturers haven’t adjusted their drivers in time to cope with the changes in 5.10.
(This does not really explain why the latest dkms NVIDIA drivers work very well in the pre-release 5.11.x kernel however; I don’t know if they added those commands back?).

Anyway, there are other problems as well: Intel CPU lockups, hardware-accelerated graphics lockups, wifi that stops working, and so on and so forth. 5.10 is just a huge step backwards in usability, which makes the idea that it should be the next LTS a little… scary, to be honest.


I’m aware of the nVidia problems, but my old Manjaro box with a GTX 1070 works just fine.
I wouldn’t go as far as saying that 5.10 is a step backwards. Maybe it is in terms of nVidia drivers (and the kernel devs do not give a flying f*ck about closed source blobs), but certainly not in general.
