BTRFS: Is my Samsung 970 EVO SSD dying?

TryZee · May 23, 2025, 5:53pm

Hello everyone,

I tried backing up my valuable data to my external Samsung 970 EVO SSD, but BTRFS keeps rejecting it.

Ran a scrub on it and got this:

Scrub started:    Fri May 23 14:13:45 2025
Status:           finished
Duration:         0:00:48 
Total to scrub:   91.79GiB
Rate:             1.91GiB/s
Error summary:    read=608
  Corrected:      591
  Uncorrectable:  17 
  Unverified:     0

But smartctl says the drive PASSED the test.

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0KB07410E
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2.000.398.934.016 [2,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2.000.398.934.016 [2,00 TB]
Namespace 1 Utilization:            1.248.927.186.944 [1,24 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5b81b20711
Local Time is:                      Fri May 23 18:33:39 2025 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        37 Celsius
Available Spare:                    82%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    152.177.799 [77,9 TB]
Data Units Written:                 122.103.840 [62,5 TB]
Host Read Commands:                 1.084.768.751
Host Write Commands:                1.192.797.948
Controller Busy Time:               4.329
Power Cycles:                       8.498
Power On Hours:                     3.347
Unsafe Shutdowns:                   388
Media and Data Integrity Errors:    404
Error Information Log Entries:      14.303
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               37 Celsius
Temperature Sensor 2:               45 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      14303     6  0x5100  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  1      14302     6  0xf135  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  2      14301     1  0x6382  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  3      14300     2  0xc275  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  4      14299     9  0x0202  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  5      14298     5  0x8335  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  6      14297     5  0x8008  0xc502  0x000   3272416916     1     -  Unrecovered Read Error
  7      14296     5  0x4005  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  8      14295    12  0x2175  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
  9      14294     9  0x6132  0xc502  0x000   3384972122     1     -  Unrecovered Read Error
 10      14293     6  0xf342  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
 11      14292     6  0x4375  0xc502  0x000   3272213906     1     -  Unrecovered Read Error
 12      14291     1  0x338d  0xc502  0x000   3384972122     1     -  Unrecovered Read Error
 13      14290     1  0x3394  0xc502  0x000   3384972122     1     -  Unrecovered Read Error
 14      14289     2  0xe057  0xc502  0x000   3379521660     1     -  Unrecovered Read Error
 15      14288     2  0x328d  0xc502  0x000   3272420466     1     -  Unrecovered Read Error
... (48 entries not read)

Self-test Log (NVMe Log 0x06, NSID 0xffffffff)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed: failed segments             2799      8591828     1   7   -

I can read and access my data on the drive just fine.
So what should I believe? Is BTRFS reporting false errors, or is my SSD's health really an issue?

Schlaefer · May 23, 2025, 6:06pm

In my experience, errors reported by BTRFS are a sign of a hardware issue.

The overall parameters reported by smartctl are within the expected range, hence the “PASSED”, but the short self-test also shows Completed: failed segments.

I would consider that drive as failing and treat it with caution until proven otherwise. Maybe run a long self-test and see how it goes.

If anything important is on that drive, back it up as soon as possible.
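
To illustrate why “PASSED” and a failing drive can coexist, here is a minimal Python sketch that looks beyond the overall verdict. As far as I know, smartctl derives the NVMe overall-health result from the Critical Warning byte alone, so the other counters need a separate look. The dictionary keys and the `health_concerns` helper are hypothetical names made up for this example; the values are copied from the smartctl output posted above, and the checks are my own rule of thumb, not an official threshold.

```python
def health_concerns(health):
    """Return a list of human-readable concerns for an NVMe health snapshot.

    The keys used here are illustrative names, not a smartctl API.
    """
    concerns = []
    if health["critical_warning"] != 0:
        concerns.append("critical warning flag set")
    if health["available_spare"] <= health["available_spare_threshold"]:
        concerns.append("available spare at or below threshold")
    if health["media_errors"] > 0:
        concerns.append(
            f"{health['media_errors']} media and data integrity errors"
        )
    if health["self_test_failed"]:
        concerns.append("self-test completed with failed segments")
    return concerns


# Values taken from the smartctl output in this thread.
drive = {
    "critical_warning": 0x00,
    "available_spare": 82,
    "available_spare_threshold": 10,
    "media_errors": 404,
    "self_test_failed": True,
}

concerns = health_concerns(drive)
for item in concerns:
    print("concern:", item)
```

With the numbers from this drive, the critical warning and spare checks stay quiet, while the media error count and the failed self-test are both flagged.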

Cphusion · May 23, 2025, 6:07pm

Same here. I had my Proxmox system on a BTRFS raid1 setup and started seeing I/O errors. I replaced the NVMe SSD and all was good again.

TryZee · May 23, 2025, 6:38pm

I looked at dmesg and saw that BTRFS is seriously complaining, with a bunch of errors and warnings. Several of them indicate that only 4 of my files are bad.

[ 2598.646563] nvme1n1: Read(0x2) @ LBA 3272213760, 256 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR 
[ 2598.646570] critical medium error, dev nvme1n1, sector 3272213760 op 0x0:(READ) flags 0x4000 phys_seg 2 prio class 3
[ 2598.730528] nvme1n1: Read(0x2) @ LBA 3272213904, 8 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR 
[ 2598.730534] critical medium error, dev nvme1n1, sector 3272213904 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 2598.731743] scrub_stripe_report_errors: 53 callbacks suppressed
[ 2598.731745] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731750] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731753] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731755] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731757] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731759] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731762] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731764] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731766] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731768] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216
[ 2598.731790] BTRFS error (device nvme1n1p8): unable to fixup (regular) error at logical 5332633583616 on dev /dev/nvme1n1p8 physical 37781962752
and more
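
Since the log repeats the same few addresses many times, a short script can condense it. The following is a minimal Python sketch, assuming the message formats shown in the dmesg excerpt above; the regular expressions and the `summarize` helper are hypothetical and only match these specific lines, so other kernel versions may word them differently.

```python
import re

# Patterns written against the exact dmesg lines quoted in this thread.
NVME_READ_ERR = re.compile(
    r"Read\(0x2\) @ LBA (\d+), (\d+) blocks, Unrecovered Read Error"
)
BTRFS_FIXED = re.compile(r"fixed up error at logical (\d+)")
BTRFS_UNFIXED = re.compile(r"unable to fixup \(regular\) error at logical (\d+)")


def summarize(lines):
    """Collect unique failing LBA ranges and BTRFS logical addresses."""
    failed_ranges, fixed, unfixable = set(), set(), set()
    for line in lines:
        if m := NVME_READ_ERR.search(line):
            start, count = int(m.group(1)), int(m.group(2))
            failed_ranges.add((start, start + count - 1))  # inclusive range
        elif m := BTRFS_UNFIXED.search(line):
            unfixable.add(int(m.group(1)))
        elif m := BTRFS_FIXED.search(line):
            fixed.add(int(m.group(1)))
    return {
        "failed_lba_ranges": sorted(failed_ranges),
        "fixed_logical": sorted(fixed),
        "unfixable_logical": sorted(unfixable),
    }


# A few lines copied from the dmesg output above.
sample = [
    "[ 2598.646563] nvme1n1: Read(0x2) @ LBA 3272213760, 256 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR",
    "[ 2598.730528] nvme1n1: Read(0x2) @ LBA 3272213904, 8 blocks, Unrecovered Read Error (sct 0x2 / sc 0x81) MORE DNR",
    "[ 2598.731745] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216",
    "[ 2598.731750] BTRFS error (device nvme1n1p8): fixed up error at logical 5332633518080 on dev /dev/nvme1n1p8 physical 37781897216",
    "[ 2598.731790] BTRFS error (device nvme1n1p8): unable to fixup (regular) error at logical 5332633583616 on dev /dev/nvme1n1p8 physical 37781962752",
]

summary = summarize(sample)
print(summary)
```

The unfixable logical addresses are the interesting ones: `btrfs inspect-internal logical-resolve` can map such an address back to a file path on the mounted filesystem, which is one way to find out which files are affected.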

Welp, time to grab a new SSD for backups before digging a grave for my dead SSD. Rest in pieces.

I have to give BTRFS credit, it did its job by stopping the backup.

Appreciate everyone’s help!