TL;DR Guide to Understand smartctl Results for SATA Disks

I suspected one of my magnetic hard disks was failing when it repeatedly failed to install an OS, so I ran a smartctl self-test and viewed the result – and had no idea what I was looking at. To address this gap in my knowledge, I read through several guides and posts, as well as the official documentation, to understand each section of the result.

Here’s my condensed TL;DR guide to understanding a smartctl result report – for busy people looking for a straightforward, no-frills guide in English.

The command used to view the SMART information for my SATA disk:

$ sudo smartctl -a /dev/sda

Note: install the smartmontools package if the command is not found.

Section #1 – License Information

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-43-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

The first few lines of the output show the version, copyright, license, home page and SVN revision information for the smartctl executable in use. These are the same lines that smartctl prints when run with the -V, --version, --copyright, or --license flags.

Section #2 – General Information

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD1600AAJS-65WAA0
Serial Number:    WD-WCAS23509303
LU WWN Device Id: 5 0014ee 156557a28
Firmware Version: 58.01D58
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.5, 3.0 Gb/s
Local Time is:    Sun Jun 25 17:17:10 2023 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The Information section can be triggered with the -i/--info flags. The smartctl man page states:

Prints the device model number, serial number, firmware version, and ATA Standard version/revision information. Says if the device supports SMART, and if so, whether SMART support is currently enabled or disabled.

If the device supports Logical Block Address mode (LBA mode) print current user drive capacity in bytes. (If drive is has a user protected area reserved, or is “clipped”, this may be smaller than the potential maximum drive capacity.)

Indicates if the drive is in the smartmontools database (see ‘-v’ options below). If so, the drive model family may also be printed. If ‘-n’ (see below) is specified, the power mode of the drive is printed.
https://linux.die.net/man/8/smartctl
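If you want to pull individual fields out of this section in a script, the "Key: value" layout is easy to split. Here is a minimal Python sketch, assuming the text has already been captured from smartctl -i; the function names parse_info and capacity_bytes are my own, not part of smartmontools. Note that "SMART support is" appears twice, so each key maps to a list of values.

```python
# Sample lines copied from the Information section above.
SAMPLE = """\
Device Model:     WDC WD1600AAJS-65WAA0
Serial Number:    WD-WCAS23509303
User Capacity:    160,041,885,696 bytes [160 GB]
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
"""


def parse_info(text):
    """Map each 'Key: value' label to the list of values seen for it."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Key: value' line (e.g. the section banner)
        info.setdefault(key.strip(), []).append(value.strip())
    return info


def capacity_bytes(info):
    """Extract the integer byte count from the 'User Capacity' field."""
    first_token = info["User Capacity"][0].split()[0]  # '160,041,885,696'
    return int(first_token.replace(",", ""))


info = parse_info(SAMPLE)
print(info["Device Model"][0])
print(capacity_bytes(info))
print(info["SMART support is"])
```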

Section #3 – SMART Data: Health Status of Device

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Generated using the -H/--health flag. As per the man pages:

If the device reports failing health status, this means either that the device has already failed, or that it is predicting its own failure within the next 24 hours.
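Besides the PASSED/FAILED line, smartctl also reports problems through its exit status, which the man page describes as a bit mask of eight flags. The sketch below decodes such a status in Python. The wording of each message is my paraphrase of the man page, and the value 192 is a made-up example, not the status my disk returned.

```python
# Paraphrased from the exit status section of the smartctl man page;
# list index = bit number.
EXIT_BITS = [
    "command line did not parse",
    "device open failed or device is in a low-power mode",
    "a SMART/ATA command failed or a SMART data checksum was wrong",
    "SMART status check returned DISK FAILING",
    "prefail attributes found at or below threshold",
    "attributes were at or below threshold at some time in the past",
    "device error log contains records of errors",
    "device self-test log contains records of errors",
]


def decode_exit_status(status):
    """Return the message for every bit set in smartctl's exit status."""
    return [msg for bit, msg in enumerate(EXIT_BITS) if status & (1 << bit)]


# Hypothetical status with bits 6 and 7 set (64 + 128).
for message in decode_exit_status(192):
    print(message)
```

In a shell, the status of the last command is available as $?, so this could be fed with the value left behind by the smartctl run.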

Section #4 – SMART Data: Capabilities

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                ( 4380) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  55) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

For ATA devices, this section contains the generic SMART capabilities that show which SMART features are implemented and supported. It corresponds to the -c/--capabilities flag. Most of it is self-explanatory, so reading it is straightforward.

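For NVMe devices, this section instead prints various device capabilities obtained from the Identify Controller and Identify Namespace data structures. (The “Critical Warning” byte from the SMART/Health Information log is what the NVMe health status in Section #3 is based on.)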
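The hexadecimal numbers in parentheses are bit fields, and the text next to them is smartctl's decoding of the set bits. As a rough illustration, here is a Python sketch that decodes the "Offline data collection capabilities" byte (0x7b on my disk). The bit assignments are my understanding of the ATA SMART layout and should be treated as an assumption rather than a reference; they do reproduce the seven lines shown in the output above.

```python
# Assumed bit layout of the offline data collection capabilities byte.
OFFLINE_CAPABILITY_BITS = {
    0x01: "SMART execute Offline immediate",
    0x02: "Auto Offline data collection on/off support",
    0x08: "Offline surface scan supported",
    0x10: "Self-test supported",
    0x20: "Conveyance Self-test supported",
    0x40: "Selective Self-test supported",
}
# Bit 2 is a two-way switch: set = abort, clear = suspend.
ABORT_ON_NEW_COMMAND = 0x04


def decode_offline_capabilities(byte):
    """List the capabilities encoded in the capabilities byte."""
    features = [name for mask, name in OFFLINE_CAPABILITY_BITS.items()
                if byte & mask]
    if byte & ABORT_ON_NEW_COMMAND:
        features.append("Abort Offline collection upon new command")
    else:
        features.append("Suspend Offline collection upon new command")
    return features


for feature in decode_offline_capabilities(0x7B):
    print(feature)
```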

Section #5 – SMART Data: Attributes

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       4
  3 Spin_Up_Time            0x0003   153   152   021    Pre-fail  Always       -       3350
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1979
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       10003
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1975
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       203
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2067
194 Temperature_Celsius     0x0022   098   093   000    Old_age   Always       -       45
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   151   151   000    Old_age   Always       -       1301
198 Offline_Uncorrectable   0x0010   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   051    Old_age   Offline      -       0

This section prints only the vendor specific SMART attributes for the given device (see this link for generic and vendor specific attributes). This is from using the -A/--attributes flag.

The following list explains what each column means:

ID# – The SMART attribute numbered from 1 to 253
ATTRIBUTE_NAME – the SMART attribute name
FLAG – A bit field that defines which categories the SMART attribute belongs to. By default, it is printed in hexadecimal. Under the “brief” format, it is printed as a string of letters, one per set bit, each denoting a category:

$ sudo smartctl -A /dev/sdb -f brief
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
...
198 Offline_Uncorrectable   ----C-   100   253   000    -    0
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   100   253   051    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

VALUE – normalized value within the range of 1 to 254
- Uses vendor specific algorithm to convert from RAW_VALUE
WORST – The closest to failure value that the disk has recorded in its lifetime when SMART was enabled
THRESH – the threshold for failure, within the range of 0 to 255. The attribute is considered failed when VALUE <= THRESH.
TYPE – describes the kind of SMART attribute:
- “Pre-fail” – indicates imminent disk failure when VALUE <= THRESH
- “Old_age” – indicates the device is reaching the end of its lifespan due to normal wear and tear when VALUE <= THRESH
UPDATED – Two possible values:
- “Always” – attribute values updated during both normal operation and offline testing
- “Offline” – attribute values updated only during offline testing
WHEN_FAILED
- “FAILING_NOW” – when current VALUE <= THRESH
- “In_the_past” – when current VALUE > THRESH, but WORST <= THRESH
- “-” – when both VALUE and WORST are greater than THRESH (i.e. no entry)
RAW_VALUE – six-byte raw value that is vendor specific, with no standard defined across them
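To tie the columns together, here is a small Python sketch that parses a few attribute rows, converts the hexadecimal FLAG into the letter form used by the brief format, and recomputes WHEN_FAILED from VALUE, WORST and THRESH. The function names are my own, and the comparison follows the man page's "less than or equal to the threshold" wording; smartctl itself may treat edge cases (such as a zero threshold) differently.

```python
# A few rows copied from the attribute table above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       4
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   151   151   000    Old_age   Always       -       1301
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

FLAG_LETTERS = "POSRCK"  # bit 0 = P (prefailure) ... bit 5 = K (auto-keep)


def flag_string(flag):
    """Render the FLAG bit field the way the brief format prints it."""
    return "".join(letter if flag & (1 << bit) else "-"
                   for bit, letter in enumerate(FLAG_LETTERS))


def when_failed(value, worst, thresh):
    """Recompute the WHEN_FAILED column (simplified, see lead-in)."""
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


def parse_attributes(text):
    """Turn attribute rows into dicts; non-attribute lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)  # RAW_VALUE keeps any trailing text
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "flag": int(fields[2], 16),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9],
        })
    return rows


for row in parse_attributes(SAMPLE):
    status = when_failed(row["value"], row["worst"], row["thresh"])
    print(row["id"], row["name"], flag_string(row["flag"]), status, row["raw"])
```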

Section #6 – SMART Data – Error Logs

When using the -a flag, this section prints the Summary SMART error log (-l error).

A device without errors will return something like this:

SMART Error Log Version: 1
No Errors Logged

A device with errors (like the one I had) will return something like this:

SMART Error Log Version: 1
ATA Error Count: 1045 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1045 occurred at disk power-on lifetime: 9993 hours (416 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b1 86 44 40  Error: UNC at LBA = 0x004486b1 = 4490929

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 88 b0 86 44 08 08      01:24:39.802  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      01:24:39.796  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 00 08      01:24:39.796  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      01:24:39.795  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 00 08      01:24:39.795  SET FEATURES [Enable SATA feature]

Error 1044 occurred at disk power-on lifetime: 9993 hours (416 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b1 86 44 40  Error: UNC at LBA = 0x004486b1 = 4490929

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 d8 b0 86 44 08 08      01:24:36.383  READ FPDMA QUEUED
  ef 10 02 00 00 00 00 08      01:24:36.380  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 00 08      01:24:36.380  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      01:24:36.380  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 00 08      01:24:36.380  SET FEATURES [Enable SATA feature]

...

The first part of the error log explains what each two-letter heading means. It is followed by the most recent five non-trivial errors, each with the five commands that led up to the error.

Depending on your disk, you may get a different error type.
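The "Error: UNC at LBA = ..." text is derived from the register dump on the same line. As a sanity check, here is a Python sketch that rebuilds the LBA from the SN, CL, CH and DH registers of Error 1045, assuming the classic 28-bit LBA layout (the low four bits of DH hold the top bits of the address). It also redoes the hours-to-days arithmetic from the "power-on lifetime" line. The function name is my own.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the ATA registers into a 28-bit LBA (assumed layout)."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# ER ST SC SN CL CH DH, copied from Error 1045 above.
registers = "40 51 00 b1 86 44 40"
er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in registers.split())

lba = lba28_from_registers(sn, cl, ch, dh)
print(hex(lba), lba)  # 0x4486b1 4490929

# 9993 power-on hours expressed as days and hours.
days, hours = divmod(9993, 24)
print(days, "days +", hours, "hours")  # 416 days + 9 hours
```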

Section #7 – SMART Data – Self-test Results

Depending on the type of disk, this section shows the recent self-test results of various types:

ATA – selftest, selective logs
SCSI – selftest
NVMe – none

As my disk is a Serial ATA disk, it shows the following:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9995         78900536
# 2  Extended offline    Completed: read failure       90%      9994         78900536
# 3  Short offline       Completed: read failure       90%      9994         78900536
# 4  Short offline       Completed without error       00%      5751         -
# 5  Short offline       Completed without error       00%      2758         -
# 6  Short offline       Completed without error       00%         0         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

For the first set of self-test results (i.e. the selftest target type), it shows the following data of the most recent twenty-one self-tests:

type of test (i.e. short, extended, offline, captive)
final status (if completed)
percentage of test remaining
age of disk in power-on hours when the test was done (note: this value wraps after 2^16 = 65,536 hours, likely because it is stored as a 16-bit value)
the Logical Block Address (LBA) of the first error (if applicable)

There are slight differences for SCSI devices – refer to the man pages.

The second set of self-test results (i.e. the selective target type) shows:

the start and end LBAs being tested
current test status

This is the end of the smartctl -a result sections.

Understanding My Disk Error Report

From my disk’s self-test results, we can see that there was a long operational gap (4000+ hours!) between the last successful offline test and the first failed one. Thus the errors are likely due to normal wear and tear from usage over time.
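The gap can be worked out from the LifeTime(hours) column of the self-test log. Here is a Python sketch that parses the six log rows shown earlier and subtracts the hour of the last passing test from the hour of the first failing one. The regular expression is my own guess at the column layout, and the calculation assumes the 16-bit hour counter has not wrapped.

```python
import re

# Rows copied from the self-test log above.
SAMPLE = """\
# 1  Extended offline    Completed: read failure       90%      9995         78900536
# 2  Extended offline    Completed: read failure       90%      9994         78900536
# 3  Short offline       Completed: read failure       90%      9994         78900536
# 4  Short offline       Completed without error       00%      5751         -
# 5  Short offline       Completed without error       00%      2758         -
# 6  Short offline       Completed without error       00%         0         -
"""

# num, description, status, remaining %, lifetime hours, first-error LBA
ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def parse_selftests(text):
    """Return one dict per self-test log row that matches the layout."""
    tests = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            num, desc, status, remaining, hours, lba = match.groups()
            tests.append({"num": int(num), "description": desc,
                          "status": status, "hours": int(hours), "lba": lba})
    return tests


tests = parse_selftests(SAMPLE)
passed = [t["hours"] for t in tests if t["status"] == "Completed without error"]
failed = [t["hours"] for t in tests if t["status"] != "Completed without error"]

# Hours between the last passing test and the first failing test.
gap = min(failed) - max(passed)
print(gap)  # 4243
```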

Additionally, I ran hw-probe and uploaded the results to linux-hardware.org – a project that anonymously collects hardware details of Linux computers to enable collaboration in debugging hardware and checking compatibility. You can find the HDD report here. What is neat about the report is that it actually detected that the disk is failing based on the smartctl results, and flagged it with a warning.
