Weird whole-system freezes when IO happens

Hi everyone!

I’m a recent EndeavourOS adopter, and I’ve been having an awesome experience with it. In the last few days, however, my system has been acting a little weird.

It all started while I was programming in Neovim and saved my file. It took way longer than normal: something like 2-3 seconds, when normally it isn’t even noticeable! I tried troubleshooting it, thinking it was related to my SSD, but larger reads and writes seemed to be working just fine.

Eventually I realized that the problem wasn’t just saving a file: the whole system was sluggish - going from the display manager to the desktop took longer than normal, and opening a terminal also took quite some time. Sometimes, when I performed multiple actions, nothing would happen for a while and then everything would happen at once, as if the whole system had frozen except for my mouse. The freezes lasted anywhere from 2 to 20 seconds.

The problem persisted for the whole night and I couldn’t get anything useful done because I was too annoyed with it and kept trying to fix it. I tried trimming my SSD, defragmenting it, and all that.

The next day, the problem was seemingly gone: the system didn’t feel sluggish anymore and saving files was back to the snappy experience. All good now, I thought (I was wrong). I decided to play some Overwatch 2 and, to my surprise, the system freezes came back while playing, except that this time the system didn’t feel sluggish otherwise.

Looking at a system monitor, it seemed like it had an IO spike after the freeze, but I’m not sure.

Then, yesterday, I opened my PC, tried reconnecting the SATA cables of my SSD and switching its SATA port on the motherboard, and booted it back up. The problem seemed to be solved! I could play Overwatch 2 no problems!

Except that I just woke up and tried playing Overwatch 2 again, and the freezes are back. I don’t know what to do anymore, really. Maybe I’m misjudging the problem or something like that, but I can’t really know. I never had this while I was on openSUSE Tumbleweed, and my SSD is a recent buy.

Does anyone know what might be happening? I’d really be grateful for some help. These are my system specs. My EndeavourOS install is KDE, but I mainly live on i3. Thanks in advance!!

Have you tried switching to the LTS kernel?

No, I haven’t. I’ll switch and see how it goes!

I also had the problem once with identical behavior/effects. In my case, the SATA cable was defective…

I usually blame sunspot activity! 😆 🌞

Tried switching to the LTS Kernel, rebooted and have been stuck in the message log screen for like 2 hours.

I really hope it isn’t hardware related! Especially the SSD itself…

Did you remember to install the headers as well?

Could it be that Baloo, KDE’s file indexer, is running in the background? That could explain why it only happens sometimes, especially if you are on an older system. I remember having trouble with that tool on my old system and disabled it to get rid of the problems. On a modern system it’s not a problem (plus, on my new system I only index filenames instead of file contents).

Otherwise, maybe you should create a bootable disk or USB stick and boot from an external distribution. Then check whether you can access the drive without the system freezing. Maybe also check the health of the drive and run a memtest on the RAM. I don’t have specific tools to recommend; maybe look here for more info: https://www.system-rescue.org/System-tools/ But make sure not to play around too much with these tools. Only use them when you know exactly what you are doing, as these rescue tools are very powerful.

Edit: Oh, I missed the link to the specs of your machine. So it is not that old, and Baloo is therefore most likely not the issue here.

But I have no idea why you are stuck on the LTS kernel. Can you still boot into your previous kernel?

I don’t know? All I did was sudo pacman -S linux-lts and sudo grub-mkconfig -o /boot/grub/grub.cfg. Am I missing a step here?

Yeah, booting into the previous Kernel works. Seems like I missed a step, as dalto pointed out.

Yes, you also need to install linux-lts-headers if you have any DKMS packages, which you probably do.
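
In case it helps, here is a rough way to spot a kernel that is missing its headers. It’s only a sketch: `missing_headers` is a name I made up, and the list of kernel package names in it is an assumption, so add whichever kernels you actually use.

```shell
# Hypothetical helper: reads installed package names on stdin (one per
# line) and prints each kernel package that has no "-headers" package.
missing_headers() {
  pkgs=$(cat)
  # Assumed list of kernel packages; extend it for other kernels.
  for kernel in linux linux-lts linux-zen linux-hardened; do
    if printf '%s\n' "$pkgs" | grep -qxF "$kernel" &&
       ! printf '%s\n' "$pkgs" | grep -qxF "$kernel-headers"; then
      printf '%s\n' "$kernel"
    fi
  done
}
```

Run it as `pacman -Qq | missing_headers`. Anything it prints needs the matching headers package, for example `sudo pacman -S linux-lts-headers`.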

Alright, installed the headers and booted up into LTS. Joined a training match and the freezes are still happening (the system freezes so hard I even lose connection to the game). I’ll test other games to see if I can replicate the freezes with them too.

Edit: Yup, Sea of Thieves also freezes - in the first loading screen, even.

I want to say that I remember freezing issues and sometimes complete system lockups (or maybe I did not wait long enough) when playing Overwatch 2 on my old machine (which I do not use anymore). My old machine had 16 GB of RAM, a 10-year-old CPU and a GTX 1070. I had to close all other applications, especially Firefox, when I was playing the game; otherwise there were lockups, because for whatever reason the system RAM got almost full. So maybe you don’t have enough RAM? Try closing everything else while playing and see if it gets more stable.

I still think that you should check the health of your drive (its S.M.A.R.T. status), and maybe run a memory (RAM) test.
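
Since a bad cable was mentioned above: one thing worth looking at in the S.M.A.R.T. table is attribute 199, the interface CRC error counter, which on many drives goes up when the SATA link (cable or port) is flaky. A minimal sketch, assuming smartmontools is installed; `crc_errors` is a made-up helper name, `/dev/sdX` is a placeholder for your device, and not every drive reports this attribute.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints the
# name and raw value of attribute 199 (interface CRC errors), if present.
crc_errors() {
  awk '$1 == "199" { print $2, $10 }'
}

# Usage (replace /dev/sdX with the actual device):
#   sudo smartctl -A /dev/sdX | crc_errors
```

A raw value that keeps growing between checks would usually point at the link rather than at the drive itself. No output just means the drive doesn’t expose that attribute.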

The game is the only thing open, though! And even other games (Sea of Thieves, for example) have freezes, too. And I was able to play Sea of Thieves just fine last week…

My drive seems fine S.M.A.R.T.-wise, or at least that’s what GSmartControl and CrystalDiskInfo (executed from my Windows partition) tell me.

I didn’t think much about RAM because Windows doesn’t seem to exhibit any issues. How would I go about testing RAM, however? Never tested it so not really sure.

Edit: Just a reminder - my Windows partition is on another drive.

I’m out of ideas for now, and a memory test is the only thing that comes to my mind. Even if it turns out not to be the problem, at least you will have ruled out this basic thing. You can try other solutions or tests first, and if nothing seems to work, I think testing the memory would be worth a shot.

The usual way to test memory is to create an external boot drive, for example on a USB stick, and run a typical memtest tool. I don’t know if regular distros such as EndeavourOS include such tools. I think some UEFI firmware comes with memtest tools too? Maybe look into the previously posted SystemRescue link or at https://www.memtest.org/ .

Ran memtest86+, passed no problems. Starting to think it really is a physical problem, maybe with the SATA cable as amarkuni said.

Thing is, I went out to try and buy one and I couldn’t find a single store that had it (somehow). And if I order it online it won’t arrive until the 3rd of January, so I’m kind of at a stalemate here.

Gonna keep searching for clues on what the problem might be (assuming it’s not physical)…

You say you have three drives/cables and one is for the Windows drive which is probably not in use at the same time. Shuffle the cables around and see if the issue moves onto another drive.

Also worth looking into journalctl for error messages (around the time you notice the issue).
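
Something along these lines should narrow the log down. It’s only a sketch: `ata_errors` is a made-up helper, and the pattern is my guess at what SATA link trouble usually looks like in kernel messages, so adjust it as needed.

```shell
# Hypothetical helper: keeps only lines that look like ATA/SATA or block
# I/O errors. The pattern is an assumption, not an exhaustive list.
ata_errors() {
  grep -E 'ata[0-9]+(\.[0-9]+)?: |I/O error'
}

# Usage: kernel messages from the current boot
#   journalctl -k -b --no-pager | ata_errors
```

Here `-k` limits the output to kernel messages and `-b` to the current boot; compare the timestamps with the moments the freezes happened.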

What file system is it? Did you check it for errors?
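
If it happens to be Btrfs, the per-device error counters are a low-risk thing to look at, since reading them doesn’t modify anything. A rough sketch; `nonzero_stats` is a made-up helper and `/` is assumed to be the mount point in question.

```shell
# Hypothetical helper: reads `btrfs device stats` output on stdin and
# prints only the counters that are not zero.
nonzero_stats() {
  awk '$2 != 0'
}

# Usage (assumes the filesystem is mounted at /):
#   sudo btrfs device stats / | nonzero_stats
```

No output means all counters are zero. As far as I know, `sudo btrfs scrub start -B /` goes further and re-reads all data to verify checksums while the filesystem stays mounted, which is a different thing from `btrfs check`.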

I don’t have much time to switch the SATA cables right now, but I’ll try it once I do. Good idea.

journalctl does show a bunch of errors regarding IO, but I don’t know what to make of them besides “IO problems beep boop”: https://pastebin.com/1twgdvqn

It’s a Btrfs system, and I’m not very sure how to check for errors besides S.M.A.R.T. (which says everything is ok on both short and extended self-tests) - especially because a lot of online information about btrfs check says it has a bunch of problems and I don’t wanna risk my system any further.

Some guidance would be very appreciated, if it’s not a nuisance.

I searched the web a bit for some of the error messages, such as “{ DRDY ERR }” and “failed command: READ FPDMA QUEUED”. From what I could gather, these indicate a hardware problem, either with the drive or maybe the cable. I’m not asserting this myself; it’s just a warning based on what some random people on the internet say.

New drives can be broken on arrival or fail after a short period of time. I myself had a brand new SSD that showed some alarming SMART values after a few weeks, and I immediately replaced it. It does happen from time to time in real life. Be ready to make backups and eventually replace the drive. Just prepare yourself mentally.