How to detect SMART errors on an NVMe disk?
Hi all,
I need help working out from SMART data whether an NVMe disk is failing.
My Hetrix monitoring tool said that my RAID is not healthy, and I got this:
cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
1047552 blocks super 1.2 [2/2] [UU]
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
498925888 blocks super 1.2 [2/2] [UU]
[===========>.........] check = 59.1% (295085568/498925888) finish=16.9min speed=200056K/sec
bitmap: 4/4 pages [16KB], 65536KB chunk
unused devices: <none>
But for the NVMe disks I don't see the SMART values I usually see for an SSD:
smartctl -a /dev/nvme0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB512HAJQ-00000
Serial Number: S3W8NPSE48888
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 418,515,906,560 [418 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8491b74abf
Local Time is: Tue Aug 18 03:13:21 2020 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.02W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 19,867,513 [10.1 TB]
Data Units Written: 48,641,973 [24.9 TB]
Host Read Commands: 206,530,808
Host Write Commands: 1,943,029,925
Controller Busy Time: 9,376
Power Cycles: 4
Power On Hours: 8,754
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 39 Celsius
Temperature Sensor 2: 56 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
smartctl -a /dev/nvme1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.0.15-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB512HAJQ-00000
Serial Number: S3W8NZAC525555
Firmware Version: EXA7301Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 512,110,190,592 [512 GB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 511,991,951,360 [511 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 85910c8f1c
Local Time is: Tue Aug 18 03:13:39 2020 CEST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 81 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.02W - - 0 0 0 0 0 0
1 + 6.30W - - 1 1 1 1 0 0
2 + 3.50W - - 2 2 2 2 0 0
3 - 0.0760W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 9%
Data Units Read: 14,375,193 [7.36 TB]
Data Units Written: 52,642,220 [26.9 TB]
Host Read Commands: 164,570,410
Host Write Commands: 1,949,236,315
Controller Busy Time: 9,720
Power Cycles: 4
Power On Hours: 8,755
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 42 Celsius
Temperature Sensor 2: 62 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Is it Media and Data Integrity Errors? Or where do I need to look?
Thank you
Comments
Try
smartctl -a /dev/nvme0n1
Also, looking at the kernel log might help you understand what happened. To me it looks like a simple scheduled RAID check.
Yes, indeed, that looks like a check, not a rebuild.
By default there's a cron job in /etc/cron/* that issues a check every couple of weeks.
Francisco
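For anyone who wants to tell a check from a rebuild without eyeballing it, here is a rough Python sketch that reads the /proc/mdstat text posted above. The function name summarize_mdstat is made up for this example, and the list of operation names (check, resync, recovery, reshape) is my assumption of what md prints on the progress line, so treat it as a starting point rather than a complete parser.

```python
import re

# Sample text copied from the original post (indentation is not significant here).
MDSTAT = """\
md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
      1047552 blocks super 1.2 [2/2] [UU]
md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
      498925888 blocks super 1.2 [2/2] [UU]
      [===========>.........]  check = 59.1% (295085568/498925888) finish=16.9min speed=200056K/sec
"""


def summarize_mdstat(text):
    """Return {array: {"degraded": bool, "action": str or None}} (hypothetical helper)."""
    arrays, current = {}, None
    for line in text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            arrays[current] = {"degraded": False, "action": None}
            continue
        if current is None:
            continue
        # "[2/2] [UU]": an underscore in the second bracket marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status:
            arrays[current]["degraded"] = "_" in status.group(3)
        # The progress line names the running operation.
        action = re.search(r"\b(check|resync|recovery|reshape)\s*=", line)
        if action:
            arrays[current]["action"] = action.group(1)
    return arrays


summary = summarize_mdstat(MDSTAT)
for name, info in summary.items():
    print(name, info)
```

With the output from the first post, both arrays come back as not degraded and md1 reports "check" as the running operation, which matches what was said above.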
I wouldn't use a bitmap on an NVMe RAID anyway.
It is extremely rare that you re-add an NVMe device.
Tried it, and the result is similar. I don't know what the difference is between nvme0 and nvme0n1. What is the meaning of n(X)?
Yeah, it seems to be only a check. The scheduler is in /etc/cron.d/mdadm. Thank you for the answer!!
Do you mind explaining this further? Are there any other "methods" than bitmap?
This setup was done by Hetzner installimage. Usually I don't change the RAID setup made by the provider.
Thank you!!
nvme-cli might be able to show more information: https://nvmexpress.org/open-source-nvme-management-utility-nvme-command-line-interface-nvme-cli/
Write-intent bitmaps are used by mdadm to speed up resyncs if you remove and then re-add a disk, or if the system crashes. They come at the cost of a bit of runtime performance.
For most SSDs, a full resync is usually fast enough that you don't need to use the write-intent bitmap.
https://raid.wiki.kernel.org/index.php/Write-intent_bitmap
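To put rough numbers on that trade-off, here is a small back-of-the-envelope calculation in Python using only figures from the /proc/mdstat output in the first post. Two assumptions to be aware of: the block count is taken as 1 KiB units, and a full resync is assumed to run at the same speed as the check that was in progress, which may not hold on a busy system.

```python
import math

ARRAY_KIB = 498_925_888      # md1 size from /proc/mdstat (assumed to be 1 KiB blocks)
CHUNK_KIB = 65_536           # "65536KB chunk" from the bitmap line
CHECK_SPEED_KIB_S = 200_056  # "speed=200056K/sec" from the check line

# Each bitmap chunk is a region that can be marked dirty and resynced on its own.
chunks = math.ceil(ARRAY_KIB / CHUNK_KIB)

# Rough full-resync time, assuming a resync runs at the observed check speed.
full_resync_min = ARRAY_KIB / CHECK_SPEED_KIB_S / 60

print(f"{chunks} bitmap chunks of {CHUNK_KIB // 1024} MiB each")
print(f"full resync at check speed: about {full_resync_min:.0f} minutes")
```

So on this array the bitmap would save at most a resync that is well under an hour, which is the reasoning behind saying it is not really needed on fast SSDs.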
n(X) is the namespace, which you can google for a more detailed explanation than I can ever give. In short, it is another storage abstraction at the SSD controller level, which is not (fully) supported by most common NVMe SSDs (they simply have one namespace).
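As a quick illustration of the naming, here is a Python sketch that splits a Linux NVMe device name into its parts: nvme0 is the controller, nvme0n1 is namespace 1 on that controller, and nvme0n1p2 is partition 2 of that namespace. The function name parse_nvme_name and the regular expression are mine, written for this example only.

```python
import re

# nvme<controller>[n<namespace>[p<partition>]], with an optional /dev/ prefix.
NVME_NAME = re.compile(r"^(?:/dev/)?nvme(\d+)(?:n(\d+)(?:p(\d+))?)?$")


def parse_nvme_name(name):
    """Split an NVMe device name into controller, namespace and partition numbers."""
    m = NVME_NAME.match(name)
    if not m:
        raise ValueError(f"not an NVMe device name: {name!r}")
    ctrl, ns, part = (int(g) if g is not None else None for g in m.groups())
    return {"controller": ctrl, "namespace": ns, "partition": part}


for dev in ("/dev/nvme0", "/dev/nvme0n1", "nvme1n1p2"):
    print(dev, parse_nvme_name(dev))
```

On a drive with a single namespace, querying the controller or its only namespace reaches the same drive, which would explain why both smartctl calls gave similar output.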
I guess how it works depends on the smartmontools version and the specific SSD.
I second trying out nvme-cli. It's been a while since I used it, but I remember it being very good for gathering information on your NVMe drives.
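On the original question of which fields to watch: NVMe drives don't expose the numbered attribute table that SATA SSDs do, so the usual places to look are Critical Warning, Available Spare against its threshold, Percentage Used, and Media and Data Integrity Errors. Below is a rough Python sketch that checks those fields in smartctl text like the output posted above. The helper names field and health_findings are made up for this example, the bit meanings are my reading of the NVMe SMART / Health Information log, and the cut-offs are illustrative assumptions, not vendor guidance.

```python
import re

# A few lines copied from the nvme1 output in the original post.
SMART_TEXT = """\
Critical Warning:                   0x00
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    9%
Media and Data Integrity Errors:    0
"""

# Critical Warning bits (assumed from the NVMe SMART / Health Information log).
CRITICAL_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "reliability degraded (media or internal errors)",
    3: "media placed in read-only mode",
    4: "volatile memory backup failed",
}


def field(text, label):
    """Return the first number after 'label:' as an int (hex or decimal)."""
    m = re.search(rf"^{re.escape(label)}:\s*(0x[0-9a-fA-F]+|[\d,]+)", text, re.M)
    if not m:
        raise KeyError(label)
    raw = m.group(1).replace(",", "")
    return int(raw, 16) if raw.lower().startswith("0x") else int(raw)


def health_findings(text):
    """Collect human-readable warnings; an empty list means nothing flagged."""
    warning = field(text, "Critical Warning")
    findings = [msg for bit, msg in CRITICAL_BITS.items() if warning & (1 << bit)]
    if field(text, "Available Spare") < field(text, "Available Spare Threshold"):
        findings.append("available spare has dropped below its threshold")
    if field(text, "Media and Data Integrity Errors") > 0:
        findings.append("media and data integrity errors reported")
    if field(text, "Percentage Used") >= 100:
        findings.append("endurance estimate (Percentage Used) reached 100%")
    return findings


print(health_findings(SMART_TEXT) or "no failure indicators in these fields")
```

For both drives in the first post, none of these checks trigger, which fits with the RAID alert being caused by the scheduled check rather than by a failing disk.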