I suspected one of my magnetic hard disks was failing when it repeatedly failed to install an OS, so I ran a smartctl
self-test and viewed the result, and had no idea what I was looking at. To address this gap in my knowledge, I read through several guides and posts, as well as the official documentation, to understand each section of the result.
Here’s my condensed TL;DR guide to understanding a smartctl result report, written for busy people looking for a straightforward, no-frills guide in English.
The command used to view the SMART information for my SATA disk:
$ sudo smartctl -a /dev/sda
Note: install the smartmontools package if the command is not found.
Section #1 – License Information
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-43-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
The first few lines of the output show the version, copyright, license, home page and SVN revision information for the smartctl executable in use. They are also the first few lines printed when smartctl is run with the -V, --version, --copyright, or --license flags.
Section #2 – General Information
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA
Device Model: WDC WD1600AAJS-65WAA0
Serial Number: WD-WCAS23509303
LU WWN Device Id: 5 0014ee 156557a28
Firmware Version: 58.01D58
User Capacity: 160,041,885,696 bytes [160 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 2.5, 3.0 Gb/s
Local Time is: Sun Jun 25 17:17:10 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
The Information section can be printed on its own with the -i/--info flag. The smartctl man page states:
Prints the device model number, serial number, firmware version, and ATA Standard version/revision information. Says if the device supports SMART, and if so, whether SMART support is currently enabled or disabled.
If the device supports Logical Block Address mode (LBA mode) print current user drive capacity in bytes. (If drive is has a user protected area reserved, or is “clipped”, this may be smaller than the potential maximum drive capacity.)
Indicates if the drive is in the smartmontools database (see ‘-v’ options below). If so, the drive model family may also be printed. If ‘-n’ (see below) is specified, the power mode of the drive is printed.
https://linux.die.net/man/8/smartctl
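Since every line of the Information section is a "Key: value" pair, it is easy to pick apart in a script. The following is a minimal sketch, not an official parser: the function name `parse_info_section` is hypothetical, and it assumes the text was captured from `smartctl -i` with one field per line.

```python
def parse_info_section(text):
    """Turn the 'Key: value' lines of the information section into a dict."""
    info = {}
    for line in text.splitlines():
        # Skip the '=== START OF ... ===' banner and lines without a colon.
        if line.startswith("===") or ":" not in line:
            continue
        # Split at the first colon only, so values such as timestamps survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        # 'SMART support is' appears twice (Available, then Enabled); keep both.
        if key in info:
            info[key] = info[key] + "; " + value
        else:
            info[key] = value
    return info


sample = """\
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1600AAJS-65WAA0
User Capacity:    160,041,885,696 bytes [160 GB]
Local Time is:    Sun Jun 25 17:17:10 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""

info = parse_info_section(sample)
# The capacity in bytes is the first token, with thousands separators removed.
capacity_bytes = int(info["User Capacity"].split()[0].replace(",", ""))
print(info["Device Model"], capacity_bytes)
```

Recent smartmontools releases can also emit JSON with the `-j`/`--json` flag, which is usually the sturdier choice for scripts than scraping text.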
Section #3 – SMART Data: Health Status of Device
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Generated using the -H/--health flag. As per the man pages:
If the device reports failing health status, this means either that the device has already failed, or that it is predicting its own failure within the next 24 hours.
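If you want to check the verdict from a script, a small sketch like the one below is enough. It assumes the ATA wording of the verdict line shown above (SCSI devices word it differently), and the helper name `health_passed` is made up for illustration.

```python
import re

# Matches the ATA verdict line and captures the word after the colon.
HEALTH_RE = re.compile(
    r"SMART overall-health self-assessment test result:\s*(\S+)"
)


def health_passed(text):
    """Return True for PASSED, False for any other verdict, None if absent."""
    match = HEALTH_RE.search(text)
    if match is None:
        return None
    return match.group(1) == "PASSED"


report = "SMART overall-health self-assessment test result: PASSED"
print(health_passed(report))
```

Treat a PASSED verdict as necessary rather than sufficient: my disk reported PASSED while its attribute table listed 1301 pending sectors and its self-tests were failing with read errors.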
Section #4 – SMART Data: Capabilities
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 4380) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 55) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
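For ATA devices, this section (printed on its own with the -c/--capabilities flag) lists the generic SMART capabilities, showing which SMART features the disk implements. Most of it is self-explanatory, so reading it is straightforward.
For NVMe devices, the man page says this section instead prints device capabilities obtained from the Identify Controller and Identify Namespace data structures. The “Critical Warning” byte of the SMART/Health Information log is what the health check in Section #3 reads on NVMe.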
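The hexadecimal number in parentheses next to each capability group is a bitmask, and the text beside it is simply that bitmask spelled out. As an illustration, here is a sketch that decodes the "Offline data collection capabilities" byte. The bit assignments are my assumption, inferred from how smartctl words its output, so verify them against the ATA specification before relying on them. Real smartctl output also prints a "No ..." line for unsupported features, which this sketch skips.

```python
# (mask, text when the bit is set, text when the bit is clear)
# Assumed bit layout; not taken from the article's sources.
OFFLINE_CAPABILITY_BITS = [
    (0x01, "SMART execute Offline immediate.", None),
    (0x02, "Auto Offline data collection on/off support.", None),
    (0x04, "Abort Offline collection upon new command.",
           "Suspend Offline collection upon new command."),
    (0x08, "Offline surface scan supported.", None),
    (0x10, "Self-test supported.", None),
    (0x20, "Conveyance Self-test supported.", None),
    (0x40, "Selective Self-test supported.", None),
]


def decode_offline_capabilities(value):
    """List the capability descriptions encoded in the bitmask."""
    lines = []
    for mask, if_set, if_clear in OFFLINE_CAPABILITY_BITS:
        text = if_set if value & mask else if_clear
        if text is not None:
            lines.append(text)
    return lines


for line in decode_offline_capabilities(0x7B):
    print(line)
```

Decoding 0x7b this way gives the same seven lines, in the same order, as the output above, which is at least consistent with the assumed layout.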
Section #5 – SMART Data: Attributes
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 4
3 Spin_Up_Time 0x0003 153 152 021 Pre-fail Always - 3350
4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 1979
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000e 200 200 051 Old_age Always - 0
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 10003
10 Spin_Retry_Count 0x0012 100 100 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 100 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 1975
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 203
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2067
194 Temperature_Celsius 0x0022 098 093 000 Old_age Always - 45
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 151 151 000 Old_age Always - 1301
198 Offline_Uncorrectable 0x0010 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 051 Old_age Offline - 0
This section prints only the vendor-specific SMART attributes for the given device (see this link for generic and vendor specific attributes). It can be printed on its own with the -A/--attributes flag.
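Each row of the table has ten whitespace-separated columns, so it can be split into fields fairly easily. The sketch below is one hedged way to do it for the default (non-brief) ATA layout; `Attribute` and `parse_attribute_row` are hypothetical names, not part of smartmontools.

```python
from typing import NamedTuple, Optional


class Attribute(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int
    attr_type: str
    updated: str
    when_failed: str
    raw_value: str


def parse_attribute_row(line: str) -> Optional[Attribute]:
    """Parse one row of the attribute table; return None for other lines."""
    # Split into at most 10 fields because RAW_VALUE may contain spaces.
    fields = line.split(None, 9)
    if len(fields) != 10 or not fields[0].isdigit():
        return None  # header, blank line, or unrelated text
    try:
        return Attribute(
            attr_id=int(fields[0]),
            name=fields[1],
            flag=int(fields[2], 16),  # FLAG is printed in hexadecimal
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            attr_type=fields[6],
            updated=fields[7],
            when_failed=fields[8],
            raw_value=fields[9].strip(),
        )
    except ValueError:
        return None  # a line that only looked like an attribute row


row = "197 Current_Pending_Sector 0x0012 151 151 000 Old_age Always - 1301"
attr = parse_attribute_row(row)
print(attr.name, attr.raw_value)
```

On my disk, this row is the one worth watching: the raw value of Current_Pending_Sector is 1301.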
The following list explains what each column means:
- ID# – The SMART attribute numbered from 1 to 253
- ATTRIBUTE_NAME – the SMART attribute name
- FLAG – An integer whose bits define which categories the SMART attribute belongs to. By default it is printed in hexadecimal. Under the “brief” format it is printed as a string of letters, one per set bit, with each letter denoting a category:
$ sudo smartctl -A /dev/sdb -f brief
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
...
198 Offline_Uncorrectable ----C- 100 253 000 - 0
199 UDMA_CRC_Error_Count -OSRCK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 100 253 051 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
- VALUE – normalized value within the range of 1 to 254
  - Uses a vendor-specific algorithm to convert from RAW_VALUE
- WORST – the lowest (closest to failure) VALUE that the disk has recorded in its lifetime while SMART was enabled
- THRESH – the threshold for failure within the range of 0 to 255. The attribute is considered failed when VALUE <= THRESH.
- TYPE – describes the kind of SMART attribute:
  - “Pre-fail” – indicates pending disk failure when VALUE <= THRESH
  - “Old_age” – indicates the device reaching end of lifespan due to normal wear and tear when VALUE <= THRESH
- UPDATED – two possible values:
  - “Always” – attribute values updated during both normal operation and offline testing
  - “Offline” – attribute values updated only during offline testing
- WHEN_FAILED – three possible values:
  - “FAILING_NOW” – when current VALUE <= THRESH
  - “In_the_past” – when current VALUE > THRESH, but WORST <= THRESH
  - “-” – when both VALUE and WORST are greater than THRESH (i.e. no entry)
- RAW_VALUE – six-byte raw value that is vendor specific, with no standard defined across vendors
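Two of the columns can be reproduced from the others, which is a handy way to check your understanding. The sketch below converts a hexadecimal FLAG into the letter form used by the brief format, and derives WHEN_FAILED from VALUE, WORST and THRESH. It follows the man page wording, which treats an attribute as failed when its value is less than or equal to the threshold. The function names are my own, and the sketch ignores any flag bits beyond the six lettered ones.

```python
# Letters of the brief format, in bit order: bit 0 = P ... bit 5 = K.
FLAG_LETTERS = "POSRCK"


def flags_to_brief(flag):
    """Render the low six bits of FLAG the way '-f brief' displays them."""
    return "".join(
        letter if flag & (1 << bit) else "-"
        for bit, letter in enumerate(FLAG_LETTERS)
    )


def when_failed(value, worst, thresh):
    """Derive the WHEN_FAILED column from the normalized values."""
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


print(flags_to_brief(0x003E))     # UDMA_CRC_Error_Count in the table above
print(when_failed(151, 151, 0))   # Current_Pending_Sector in the table above
```

Note that the second call returns "-" for my disk's Current_Pending_Sector even with 1301 pending sectors, because its threshold is 000. The normalized columns alone can hide a problem that the raw value makes obvious.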
Section #6 – SMART Data – Error Logs
When using the -a flag, this section prints the Summary SMART error log (-l error).
A device without errors will return something like this:
SMART Error Log Version: 1
No Errors Logged
A device with errors (like the one I had) will return something like this:
SMART Error Log Version: 1
ATA Error Count: 1045 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 1045 occurred at disk power-on lifetime: 9993 hours (416 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b1 86 44 40 Error: UNC at LBA = 0x004486b1 = 4490929
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 88 b0 86 44 08 08 01:24:39.802 READ FPDMA QUEUED
ef 10 02 00 00 00 00 08 01:24:39.796 SET FEATURES [Enable SATA feature]
ec 00 00 00 00 00 00 08 01:24:39.796 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 01:24:39.795 SET FEATURES [Set transfer mode]
ef 10 02 00 00 00 00 08 01:24:39.795 SET FEATURES [Enable SATA feature]
Error 1044 occurred at disk power-on lifetime: 9993 hours (416 days + 9 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 b1 86 44 40 Error: UNC at LBA = 0x004486b1 = 4490929
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 d8 b0 86 44 08 08 01:24:36.383 READ FPDMA QUEUED
ef 10 02 00 00 00 00 08 01:24:36.380 SET FEATURES [Enable SATA feature]
ec 00 00 00 00 00 00 08 01:24:36.380 IDENTIFY DEVICE
ef 03 46 00 00 00 00 08 01:24:36.380 SET FEATURES [Set transfer mode]
ef 10 02 00 00 00 00 08 01:24:36.380 SET FEATURES [Enable SATA feature]
...
The first part of the error log explains what each two-letter heading means. It is followed by the five most recent non-trivial errors, each with the last five commands that led up to it.
Depending on your disk, you may get a different error type.
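The LBA printed next to each error is not extra information; it is assembled from the register bytes on the same line. Here is a small worked sketch, assuming the classic 28-bit LBA layout in which SN, CL and CH hold the low three bytes and the low nibble of DH holds the top four bits. The function name is hypothetical.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH register bytes into a 28-bit LBA."""
    # Only the low four bits of DH belong to the address.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values from "40 51 00 b1 86 44 40" (ER ST SC SN CL CH DH).
lba = lba28_from_registers(sn=0xB1, cl=0x86, ch=0x44, dh=0x40)
print(hex(lba), lba)  # 0x4486b1 4490929
```

The result matches the "LBA = 0x004486b1 = 4490929" that smartctl printed for errors 1045 and 1044.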
Section #7 – SMART Data – Self-test Results
Depending on the type of disk, this section shows the recent self-test results of various types:
- ATA – selftest, selective logs
- SCSI – selftest
- NVMe – none
As my disk is a Serial ATA disk, it shows the following:
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 9995 78900536
# 2 Extended offline Completed: read failure 90% 9994 78900536
# 3 Short offline Completed: read failure 90% 9994 78900536
# 4 Short offline Completed without error 00% 5751 -
# 5 Short offline Completed without error 00% 2758 -
# 6 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
For the first set of self-test results (i.e. the selftest target type), it shows the following data for the most recent twenty-one self-tests:
- type of test (e.g. short, extended, offline, captive)
- final status (if completed)
- percentage of test remaining
- age of disk in power-on hours when the test was done (note: time wraps after 2^16 hours, likely due to the short data type)
- the Logical Block Address (LBA) of the first error (if applicable)
There are slight differences for SCSI devices; refer to the man pages.
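For ATA disks, the rows of the self-test log can be pulled apart with a regular expression. This is only a sketch: it assumes the column spacing of real terminal output, where the test description and the status are separated by at least two spaces (copies of the output, including the one in this post, may have that spacing collapsed). `parse_selftest_row` is a name I made up.

```python
import re

# Assumption: description and status are separated by two or more spaces.
ROW_RE = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_row(line):
    """Parse one '# N ...' row of the ATA self-test log, else return None."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    lba = match.group("lba")
    return {
        "num": int(match.group("num")),
        "test": match.group("test"),
        "status": match.group("status"),
        "remaining_percent": int(match.group("remaining")),
        "lifetime_hours": int(match.group("hours")),
        # A dash means the test found no failing LBA.
        "lba_of_first_error": None if lba == "-" else int(lba),
    }


failed = parse_selftest_row(
    "# 1  Extended offline    Completed: read failure       90%      9995         78900536"
)
passed = parse_selftest_row(
    "# 4  Short offline       Completed without error       00%      5751         -"
)
print(failed["status"], failed["lba_of_first_error"])
print(passed["status"], passed["lba_of_first_error"])
```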
The second set of self-test results (i.e. the selective target type) shows:
- the start and end LBAs being tested
- current test status
This is the end of the smartctl -a result sections.
Understanding My Disk Error Report
From my disk’s self-test results, we can see that there was a long operational gap (4000+ hrs!) between the last successful offline test and the first failing one. Thus it is likely that the errors are due to normal wear and tear from usage over time.
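The size of that gap is simple arithmetic on the LifeTime(hours) column. A rough sketch, using a hypothetical helper and the six entries from my self-test log, and ignoring the 2^16-hour wrap mentioned earlier:

```python
PASS_STATUS = "Completed without error"


def hours_between_last_pass_and_first_failure(tests):
    """tests: (status, lifetime_hours) pairs in any order.

    Returns the hours between the latest passing test and the earliest
    failing test at or after it, or None if there is no such pair.
    """
    passed = [hours for status, hours in tests if status == PASS_STATUS]
    if not passed:
        return None
    last_pass = max(passed)
    later_failures = [
        hours for status, hours in tests
        if status != PASS_STATUS and hours >= last_pass
    ]
    if not later_failures:
        return None
    return min(later_failures) - last_pass


my_log = [
    ("Completed: read failure", 9995),
    ("Completed: read failure", 9994),
    ("Completed: read failure", 9994),
    ("Completed without error", 5751),
    ("Completed without error", 2758),
    ("Completed without error", 0),
]
gap = hours_between_last_pass_and_first_failure(my_log)
print(gap)  # 4243
```

That is 9994 minus 5751, or 4243 power-on hours during which no self-test was run.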
Additionally, I ran hw-probe and uploaded the results to linux-hardware.org, a project that anonymously collects hardware details of Linux computers to enable collaboration in debugging hardware and checking compatibility. You can find the HDD report here. What is neat about the report is that it actually detected that the disk is failing based on the smartctl results, and came up with the following warning:
—
Further Reading
Here are some useful links I referenced to better understand my disk’s SMART information.