What’s the best way to monitor NVMe health?

Hi everyone,

First of all, apologies if this isn’t the right category — admins, please feel free to move it if needed.

Recently, I’ve been reading about how NVMe drives can sometimes fail without much warning, which got me thinking about health monitoring. What are the best tools or methods to keep an eye on NVMe drive health?

I came across a tool called nvme-cli and installed it — it lets me view the current state of my drive. Is this the right tool for monitoring NVMe health, or are there other (perhaps better) tools you’d recommend?

Also, if any early warning signs or bad symptoms show up, is there any way to fix or mitigate them?

Output from - sudo nvme smart-log /dev/nvme0
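
For anyone who wants to script this instead of reading the log by eye, here is a minimal sketch that pulls a few wear-related fields out of the smart-log text. It assumes the plain-text "name : value" layout that nvme-cli prints, which can differ between versions; the function names smart_field and smart_summary are made up for illustration.

```shell
#!/bin/sh
# Sketch: extract fields from `nvme smart-log` plain-text output.
# Assumption: each line looks like "field_name   : value".

# smart_field NAME: read smart-log text on stdin, print the value of NAME.
smart_field() {
    awk -F: -v key="$1" '
        {
            name = $1
            gsub(/[ \t]+/, "", name)               # strip padding around the field name
            if (name == key) {
                val = $2
                gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
                print val
                exit
            }
        }'
}

# smart_summary: read smart-log text on stdin, print a one-line verdict.
smart_summary() {
    log=$(cat)
    warn=$(printf '%s\n' "$log" | smart_field critical_warning)
    used=$(printf '%s\n' "$log" | smart_field percentage_used)
    spare=$(printf '%s\n' "$log" | smart_field available_spare)
    if [ "$warn" = "0" ]; then
        echo "OK: critical_warning=0 used=$used spare=$spare"
    else
        echo "CHECK DRIVE: critical_warning=$warn used=$used spare=$spare"
    fi
}

# Real use (needs root and nvme-cli installed):
#   sudo nvme smart-log /dev/nvme0 | smart_summary
```

A non-zero critical_warning, a climbing percentage_used, or a falling available_spare are the fields most worth watching over time.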


To my knowledge, how long the drive lasts depends largely on the TBW (terabytes written) rating specified by the manufacturer.
That is normally somewhere between 150 and 600 TB.


https://archlinux.org/packages/?name=smartmontools
https://wiki.archlinux.org/title/S.M.A.R.T.#smartctl

Another tool.


This might help…
I use smartctl.
It’s a tool for checking SMART data on all kinds of drives.

Install smartmontools if you don’t have it.
Find your device name with lsblk if you aren’t sure of it.
Check drive health: sudo smartctl -a /dev/nvme0n1 (replace /dev/nvme0n1 with your device name).
Look for the “SMART overall-health self-assessment test result” line (or a similar one) and check that it reports “PASSED”.
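
If you’d rather not scan the full report every time, the check above can be sketched as a small shell function. It relies on smartctl -H (the health-only report) printing the same “self-assessment test result” line; health_verdict is a hypothetical name, and the matching is only as reliable as that line’s wording.

```shell
#!/bin/sh
# Sketch: turn the smartctl health line into a simple verdict.
# Assumption: the report contains "self-assessment test result: PASSED"
# when the drive considers itself healthy.

# health_verdict: read `smartctl -H` output on stdin; return 0 only on PASSED.
health_verdict() {
    if grep -q 'self-assessment test result: PASSED'; then
        echo "healthy"
    else
        echo "needs attention"
        return 1
    fi
}

# Real use (needs root and smartmontools installed):
#   sudo smartctl -H /dev/nvme0n1 | health_verdict
```

Because the function returns non-zero on anything other than PASSED, it can be dropped into a cron job or a login script that only speaks up when something looks off.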


I need to check the manufacturer’s specifications to see the drive’s rated TBW. Based on what I can see, the current usage is only 1%.

@cactux / @Noodly, I’ve installed smartmontools — the output seems much more user-friendly compared to nvme-cli. Thanks.

My next question: if an error occurs, is it possible to fix it?


I am far from knowledgeable in this field but if by error you mean a failure at hardware level, I doubt that there will be any fix for it. I may be wrong though.

Errors at filesystem/software level might be fixable. Depending on the filesystem there are some tools to be used to check the health and fix some issues.

If you get filesystem errors often, that may also be indicative of hardware failure.

The number of units written on your disk seems to indicate that it is still in its “infancy”.


That would depend very much on what the “error” is.
For physical damage, the chance of a fix is probably zero for the average person, and it will cost a fair bit of money for those willing to try to get data back.

Filesystem damage is quite common and, fortunately, easily fixed with a proper disk checker.
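
As a rough illustration of what a “proper disk checker” means in practice, here is a sketch that prints a read-only check command for a few common filesystems. The fs_check_cmd helper and the /dev/nvme0n1p2 partition name are made up for the example; the commands it prints only inspect and change nothing, and they should be run as root against an unmounted filesystem.

```shell
#!/bin/sh
# Sketch: suggest a read-only filesystem check command by filesystem type.
# Assumption: only a few common filesystems are covered here.

# fs_check_cmd FSTYPE DEVICE: print a check command that makes no changes.
fs_check_cmd() {
    case "$1" in
        ext2|ext3|ext4) echo "e2fsck -fn $2" ;;              # -f force check, -n answer "no" to all fixes
        xfs)            echo "xfs_repair -n $2" ;;           # -n no-modify mode
        btrfs)          echo "btrfs check --readonly $2" ;;  # inspect without writing
        *)              echo "no suggestion for $1" ;;
    esac
}

# Real use (the partition name is a placeholder):
#   fs_check_cmd "$(lsblk -no FSTYPE /dev/nvme0n1p2)" /dev/nvme0n1p2
```

Printing the command instead of running it keeps the sketch harmless; an actual repair means dropping the read-only flags, ideally after a backup.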


Yes, it’s still considered brand new. I bought it along with my mini PC.

Got it. Thanks for the answer @thefrog .

@swh, I’ve checked the drive’s specifications — the TBW is 220 TB.

Interestingly, it seems the product has already been marked as EOL :grinning_face_with_smiling_eyes:

I’ll mark @cactux’s answer as the solution :slightly_smiling_face:

Thanks, everyone — really appreciate it!


So no worry then. You have 2.65 read and 5.43 written. It should last for a while longer.
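
To put figures like these against the 220 TB rating mentioned earlier, here is a small arithmetic sketch. It assumes the 5.43 written is in terabytes (the thread doesn’t say) and uses the NVMe convention that one “data unit” is 1000 blocks of 512 bytes; units_to_tb and tbw_percent are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: relate data written to a drive's rated TBW.
# Assumption: one NVMe data unit = 1000 * 512 bytes = 512,000 bytes,
# and TB means decimal terabytes (10^12 bytes).

# units_to_tb UNITS: convert a raw "data units" counter to terabytes.
units_to_tb() {
    awk -v u="$1" 'BEGIN { printf "%.2f\n", u * 512000 / 1e12 }'
}

# tbw_percent WRITTEN_TB RATED_TBW: share of the rated endurance consumed.
tbw_percent() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", w / r * 100 }'
}

# Example with the thread's figures (assuming 5.43 means 5.43 TB):
#   tbw_percent 5.43 220
```

With those assumptions, 5.43 TB against a 220 TB rating works out to roughly 2.5% of the rated endurance.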


I like Scrutiny for monitoring drive health. It also uses smartctl.
Scrutiny also has a nice web UI :slight_smile:


ooh, I’ll have to check that out. Thanks mate.

Get the Western Digital Black. I’ve beaten the shit out of my NVMe drives and never had an issue. Lots of wiping, reinstalls, bad shutdowns … you name it. Different filesystems, etc.

