NVMe dropouts lead to data corruption under high load

I'm having problems with my NVMe drive on Linux.

It is fine under low load, but when downloading or otherwise stressing it, it appears to momentarily drop out. BTRFS then reports checksum errors, and my test case, downloading from Steam (1 Gb internet), reports that files are corrupt.

Running a Rampage IV Extreme motherboard
3930k CPU
Sabrent SB-ROCKET-2TB (no new firmware available)

The mobo BIOS has had an NVMe module injected into it so it can boot from NVMe drives.

  • I have tried turning off all overclocks: no change, although hyperthreading seems to make things worse
  • I have tried applying “nvme_core.default_ps_max_latency_us=0”, with no improvement
  • I have tried “iommu=soft”, with no improvement
  • The drive operates normally under Windows
  • The drive behaved badly under ext4 too initially, so I do not think it’s a clash with BTRFS
  • Running the Xanmod custom kernel, but I have tried normal, LTS and Zen with the same results
  • Tried running PCI-E in gen2 mode rather than gen3 in the BIOS, with the same results

Any suggestions?

Pretty annoying, as I have this setup just how I like it now, so I don’t want to switch my desktop back to Windows.

If all else fails I can try getting another NVMe or SATA SSD, I suppose.

Edit: reported drive and system temps are all normal. It has a heatsink, and the errors occur rather quickly, which is not indicative of any “heat soak”.

Hmm… I was going to suggest ext4, but you already have tried that.
For me it sounds like a firmware or hardware issue.
But I’m no expert on those. Hopefully someone more familiar with those can help.

My immediate thought is that Sabrent drives are VERY suspiciously low priced for such huge capacity, so logically I would imagine it’s possible that they’re made of faulty crap…

That being said, if you’re 100% positive it happens only under heavy load, it probably needs more powerful cooling (plate)

BTW, Welcome to the forum! :smile:


Disagree on the cooling part, as it does have a small heatsink and this problem only ever happens within Linux.

I’m getting the same full speed in Linux as in Windows (apart from the drops).

I agree it’s a cheap drive, and it’s highly possible they have done something janky which happens to work OK in Windows by chance. I bought it from Amazon, so if it hadn’t worked right I’d have sent it back when I got it.

I’m just hoping someone comes in with a little thing I can try, or maybe has experienced something similar. If I can’t fix it I’ll have to decide on my exit strategy and what hardware to replace it with.

Linux is more sensitive when it comes to detecting errors on drives, as far as my experience goes, so it’s no wonder (sadly it doesn’t mean you don’t have corrupted data on Windoze as well)

Have you tested it with SMART?

Since personally I don’t know what else to advise for that particular case, I can advise on a replacement: Samsung Evo or Pro. Yes, it costs, but for a good reason: they’re the most reliable SSDs you can possibly find. :upside_down_face:

P.S. Oh yeah, Welcome aboard! :partying_face:

What do you mean with “behaved badly under ext4”? btrfs is reporting checksum errors but ext4 has no checksums. What is going bad with ext4?

And how do you know that Windows is all fine? Windows does not report an error but are you sure the data is not corrupt? Can you try downloading or copying a big ISO file for which you know the sha256 checksum and check if the data is corrupt or not?
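In case it helps with that check, here is a minimal Python sketch for hashing a large file and comparing it against a published sha256. The function names are just made up for illustration. One caveat: right after a copy or download, the read may be served from RAM cache rather than the drive, so it is probably worth re-checking after a reboot as well.

```python
import hashlib

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex sha256 of a file, read in chunks so a big ISO fits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected_hex):
    """Compare the file's sha256 with the checksum published alongside the download."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Running `matches_published_hash` on the same ISO under both Windows and Linux should show whether the copy on disk is really intact in each case.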

Steam does verify checksums once files are downloaded, and no errors are ever present on Windows.

The reason I said bad behaviour was seen on ext4 is from when I tried to set up a virtual machine while distro hopping.

During the Windows Update downloads in the VM’s initial setup, it was complaining of bad downloads and data errors repeatedly. In hindsight these are the same symptoms: data corruption under the high load of updating or downloading files. At the time I didn’t give it enough thought to check the kernel logs at all.

What do you mean by this?

Edit: BIOS is 4901 from 2014/06/18

Does the drive have a heatsink? I have heard of problems with NVMe drives under high temps… I am running 2 of them (Samsung 970 & WD SN850), both with heatsinks & no problems.

I’m on 4901, but modified with https://github.com/LongSoft/UEFITool to insert the NvmExpressDxe_4 module so it can boot from NVMe.

I did this a year ago but don’t remember the exact guide I followed. I did, however, do it myself rather than using a prebuilt BIOS.

It appeared to work fine. This guide roughly outlines what I did: https://forums.overclockers.com.au/threads/guide-nvme-boot-x79-via-bios-mod.1310754/ but I don’t think I followed that guide precisely.

The heatsink is on the drive, yes, and I can monitor the temperature: even when it’s faulting it only reaches the high 30s Celsius.


I see what you are talking about. It’s a modified BIOS. So is it MBR, not UEFI?

UEFI. It was a top-spec board at the time, so it had a lot in it that was ahead of its time, like gen3 etc.

Nowadays it’s 10 years old and I’ve got it hobbling along, still doing pretty well.

I’m surprised it didn’t support booting NVMe drives, as that was 2011.

No, not by default. I think it was at the cusp of it.

Are you sure the issue isn’t related to a faulty power supply under load?

It might be worth trying a new NVMe drive.

I can see the power supply temps and they are running normally as are the fans.
It is a high-quality PSU, as I have been bitten in the past by bad power supplies, and since I run a decent-sized graphics card it has plenty of headroom and isn’t under any real load, but it’s a good point.

The fact that it occurs more frequently with hyperthreading on is interesting and is probably related to the nature of the intermittent failure, maybe when threads are switching, etc.

Well, I personally wouldn’t go by the temps of the power supply but by actual testing of the power supply unit itself, for current draw, output, etc. Any chance of trying a new NVMe drive to see if it’s the culprit?

The plan tonight is to run the same procedures against my spinning-disk data drive, just to make sure it’s NVMe related and not something else in the system.
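A rough sketch of how that comparison could be scripted in Python, in case it is useful: it writes files of random data, hashes them while writing, then reads them back and reports any that no longer match. The function names and file naming are made up for illustration. Note that the read-back can be served from RAM cache, so the total written should probably be well above installed memory (or the check repeated after a reboot) to be sure the drive itself is being read.

```python
import hashlib
import os
from pathlib import Path

CHUNK = 1024 * 1024  # write and read in 1 MiB blocks

def write_random_file(path, size_bytes):
    """Write size_bytes of random data to path and return its sha256 hex digest."""
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            f.write(block)
            digest.update(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    return digest.hexdigest()

def hash_file(path):
    """Return the sha256 hex digest of the file as it reads back now."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()

def write_and_verify(target_dir, file_count=8, size_bytes=256 * 1024 * 1024):
    """Write file_count random files into target_dir, return paths that read back differently."""
    target = Path(target_dir)
    expected = {}
    for i in range(file_count):
        path = target / f"verify_{i:03d}.bin"
        expected[path] = write_random_file(path, size_bytes)
    return [str(p) for p, want in expected.items() if hash_file(p) != want]
```

Calling `write_and_verify` once on a directory on the NVMe drive and once on the spinning disk, with the same settings, should return an empty list on a healthy drive; any paths returned are files whose contents changed between write and read.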

Then some sort of swap is in order. I might be able to convince Amazon, as it’s just under 12 months old, which shows how bad my memory is about when I modded the BIOS.

Ideally I’ll dd the drives, but for that I need to look at how many slots are left on the mobo for another PCI-E card, and cross my fingers it’s not a system problem unique to my frankenbios.
