Checking for disk faults (the Error counters in iostat -En, etc.)
Date: 2017/02/28 | Tags: Soft Errors, Hard Errors, Transport Errors, iostat, zfs, Solaris
Roughly speaking, the disk-related stack looks like this.
Example: when a SAS expander is used,
zfs |
SD(SCSI Disk Driver) |
mpt_sas(LSI HostBus Adaptor's Driver) |
RAID CARD |
SAS Expander |
Disks |
Example: when AHCI or similar is used,
zfs |
SD(SCSI Disk Driver) |
AHCI |
Disks |
Errors from these layers can be checked with a command like the following.
iostat -En
Example output
c0t5001B44F1C7C0C93d0 Soft Errors: 39022 Hard Errors: 1 Transport Errors: 9
Vendor: ATA      Product: SanDisk SDSSDXPS Revision: 00RL Serial No: 154902401171
Size: 480.10GB <480103981056 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 39022
Illegal Request: 0 Predictive Failure Analysis: 0 Non-Aligned Writes: 0
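Of these counters, Transport Errors are the ones that climb rapidly when, for example, SATA disks are in use and the load rises to the point where a disk cannot respond in time. In the end the disk is dropped. If this happens frequently you should really use faster disks, but disabling the disk's Native Command Queuing (NCQ) or its write-back cache can improve things slightly. Why? These settings generally make SATA faster, but they can also lead to larger commands, or to requests being thrown at the SATA disk without waiting for completion, and the resulting load makes delays more likely in later processing. SAS Command Tag Queuing implements Disconnect, so this does not apply to SAS.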
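Hard Errors are passed up from the storage driver layer, and the reasons behind them presumably differ from driver to driver. It is reasonable to think of them as errors reported by the "device side".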
Soft Errors are errors at the topmost layer, so if this counter keeps growing rapidly, the bus may be saturated or the signal lines may be unstable.
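The three counters discussed above sit on the first line of each device block in the `iostat -En` output. As a minimal sketch (the function name `parse_iostat_en` and the regular expression are illustrative assumptions, and the sample text simply mirrors the example output shown earlier), they could be pulled out like this:

```python
import re

# Header line of each device block in `iostat -En` output, e.g.
# "c0t...d0 Soft Errors: 39022 Hard Errors: 1 Transport Errors: 9"
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)",
    re.MULTILINE,
)


def parse_iostat_en(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    result = {}
    for m in HEADER.finditer(text):
        result[m.group("dev")] = {
            "soft": int(m.group("soft")),
            "hard": int(m.group("hard")),
            "transport": int(m.group("transport")),
        }
    return result


# Sample text copied from the example output in this article.
SAMPLE = """\
c0t5001B44F1C7C0C93d0 Soft Errors: 39022 Hard Errors: 1 Transport Errors: 9
Vendor: ATA Product: SanDisk SDSSDXPS Revision: 00RL Serial No: 154902401171
Size: 480.10GB <480103981056 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 39022
Illegal Request: 0 Predictive Failure Analysis: 0 Non-Aligned Writes: 0
"""

if __name__ == "__main__":
    for dev, counters in parse_iostat_en(SAMPLE).items():
        print(dev, counters)
```

On a real host the text would come from running `iostat -En` (for example via `subprocess.run`); the sketch only handles the three header counters and ignores the remaining fields.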
It is a good idea to keep measuring these values.
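What matters when measuring is usually how fast a counter grows, not its absolute value. The following is a rough sketch that compares two snapshots of the counters; the snapshots are plain dictionaries, and the name `counter_deltas` and the sample numbers are hypothetical. It assumes counters only grow between samples, so a decrease (for example after a reboot) is simply ignored.

```python
def counter_deltas(previous, current):
    """Return only the counters that increased between two snapshots.

    Both arguments map device name -> {counter name: value}.
    Devices missing from `previous` are treated as starting from zero.
    """
    increased = {}
    for dev, counters in current.items():
        before = previous.get(dev, {})
        changed = {
            name: value - before.get(name, 0)
            for name, value in counters.items()
            if value > before.get(name, 0)
        }
        if changed:
            increased[dev] = changed
    return increased


if __name__ == "__main__":
    # Hypothetical snapshots taken some interval apart.
    earlier = {"c0t5001B44F1C7C0C93d0": {"soft": 39000, "hard": 1, "transport": 9}}
    later = {"c0t5001B44F1C7C0C93d0": {"soft": 39022, "hard": 1, "transport": 9}}
    print(counter_deltas(earlier, later))
```

A steadily growing Transport Errors delta on one device would be the pattern described above for an overloaded SATA disk.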
Finally, here are the SMART values. These are worth measuring as well.
jpc@dp1-storage8% sudo /usr/sbin/smartctl -a /dev/rdsk/c0t5001B44F1C7C0C93d0 -d sat
smartctl 6.5 2016-05-07 r4318 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Marvell based SanDisk SSDs
Device Model:     SanDisk SDSSDXPS480G
Serial Number:    154902401171
LU WWN Device Id: 5 001b44 f1c7c0c93
Firmware Version: X21200RL
User Capacity:    480,103,981,056 bytes [480 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb 28 15:50:32 2017 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (    0) seconds.
Offline data collection
capabilities:                    (0x11) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  10) minutes.

SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   253   100   ---    Old_age   Always       -       7751
 12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       4
166 Min_W/E_Cycle           0x0032   100   100   ---    Old_age   Always       -       1
167 Min_Bad_Block/Die       0x0032   100   100   ---    Old_age   Always       -       55
168 Maximum_Erase_Cycle     0x0032   100   100   ---    Old_age   Always       -       511
169 Total_Bad_Block         0x0032   100   100   ---    Old_age   Always       -       923
171 Program_Fail_Count      0x0032   100   100   ---    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   ---    Old_age   Always       -       0
173 Avg_Write/Erase_Count   0x0032   100   100   ---    Old_age   Always       -       407
174 Unexpect_Power_Loss_Ct  0x0032   100   100   ---    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   071   030   ---    Old_age   Always       -       29 (Min/Max 22/30)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
212 SATA_PHY_Error          0x0032   100   100   ---    Old_age   Always       -       0
230 Perc_Write/Erase_Count  0x0032   100   100   ---    Old_age   Always       -       3384
232 Perc_Avail_Resrvd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
233 Total_NAND_Writes_GiB   0x0032   100   100   ---    Old_age   Always       -       213703
241 Total_Writes_GiB        0x0030   253   253   ---    Old_age   Offline      -       27536
242 Total_Reads_GiB         0x0030   253   253   ---    Old_age   Offline      -       3085
244 Thermal_Throttle        0x0032   000   100   ---    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
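To measure the SMART values over time, the attribute table in the smartctl output above has to be turned into numbers. Below is a minimal sketch that extracts the raw value of each attribute from text in that layout. The function name `parse_smart_attributes` and the regular expression are assumptions made for illustration, and the sample rows are a subset of the output above.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table from `smartctl -a`:
# ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+|---)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)",
    re.MULTILINE,
)


def parse_smart_attributes(text):
    """Return {attribute name: raw value as int}.

    Only the leading integer of RAW_VALUE is kept, so
    "29 (Min/Max 22/30)" becomes 29.
    """
    return {m.group("name"): int(m.group("raw")) for m in ROW.finditer(text)}


# A few rows copied from the smartctl output in this article.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   071   030   ---    Old_age   Always       -       29 (Min/Max 22/30)
199 SATA_CRC_Error          0x0032   100   100   ---    Old_age   Always       -       0
232 Perc_Avail_Resrvd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
"""

if __name__ == "__main__":
    for name, raw in parse_smart_attributes(SAMPLE).items():
        print(name, raw)
```

Attributes such as Command_Timeout, SATA_CRC_Error and SATA_PHY_Error are natural candidates to watch alongside the iostat counters, though which ones matter depends on the drive.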