Checking for disk faults (the Error counters in iostat -En, etc.)
Date: 2017/02/28
Tags: Soft Errors, Hard Errors, Transport Errors, iostat, zfs, Solaris
Roughly speaking, the disk-related stack looks like this.
Example: when a SAS expander is used,
| zfs |
| SD(SCSI Disk Driver) |
| mpt_sas(LSI HostBus Adaptor's Driver) |
| RAID CARD |
| SAS Expander |
| Disks |
Example: when AHCI or similar is used,
| zfs |
| SD(SCSI Disk Driver) |
| AHCI |
| Disks |
Errors arising in these layers can be checked with the following command.
iostat -En
Example output
c0t5001B44F1C7C0C93d0 Soft Errors: 39022 Hard Errors: 1 Transport Errors: 9
Vendor: ATA Product: SanDisk SDSSDXPS Revision: 00RL Serial No: 154902401171
Size: 480.10GB <480103981056 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 39022
Illegal Request: 0 Predictive Failure Analysis: 0 Non-Aligned Writes: 0
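As a rough sketch of how the three top-level counters could be pulled out of this output for later comparison, the Python below parses `iostat -En` text into a dictionary. The function names (`read_iostat_en`, `parse_iostat_en`) are hypothetical and not part of any Solaris tool, and the parser only assumes the "Soft Errors: N Hard Errors: N Transport Errors: N" layout shown above.

```python
import re
import subprocess

# Only the three summary counters are captured; fields such as
# "Media Error" or "Recoverable" are deliberately ignored here.
COUNTER_RE = re.compile(r"(Soft Errors|Hard Errors|Transport Errors): (\d+)")


def read_iostat_en():
    """Run `iostat -En` and return its text output (Solaris/illumos only)."""
    return subprocess.run(
        ["iostat", "-En"], capture_output=True, text=True, check=True
    ).stdout


def parse_iostat_en(text):
    """Return {device: {counter_name: int}} from `iostat -En` text."""
    devices = {}
    for line in text.splitlines():
        found = COUNTER_RE.findall(line)
        if not found:
            continue  # continuation lines (Vendor, Size, ...) have no summary counters
        device = line.split()[0]  # the device name is the first token on the line
        devices[device] = {name: int(value) for name, value in found}
    return devices


# Sample taken from the output shown in this article.
SAMPLE = (
    "c0t5001B44F1C7C0C93d0 Soft Errors: 39022 Hard Errors: 1 Transport Errors: 9\n"
    "Vendor: ATA Product: SanDisk SDSSDXPS Revision: 00RL Serial No: 154902401171\n"
    "Size: 480.10GB <480103981056 bytes>\n"
    "Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 39022\n"
    "Illegal Request: 0 Predictive Failure Analysis: 0 Non-Aligned Writes: 0\n"
)

if __name__ == "__main__":
    for dev, counters in parse_iostat_en(SAMPLE).items():
        print(dev, counters)
```

On a real host, `parse_iostat_en(read_iostat_en())` would be the natural call; the sample string is used here so the sketch runs anywhere.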
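Of these counters, take Transport Errors first. For example, when SATA disks are in use and the load rises to the point where a disk cannot respond in time, Transport Errors climb rapidly, and in the end the disk is taken offline. If this happens frequently you should really be using faster disks, but disabling the disk's Native Command Queuing (NCQ) or its write-back cache sometimes improves things slightly.
Why? These settings generally make SATA faster, but they can also produce larger commands, or leave work handed to the SATA disk outstanding, so that later requests are more likely to be delayed by that load. SAS Command Tag Queuing implements disconnect, so this does not apply to it.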
Hard Errors are reported up from the storage driver layer, and the cause presumably differs from driver to driver. It is reasonable to think of them as errors raised by the "device side".
Soft Errors are errors at the topmost layer, so if the count keeps climbing, the bus may be saturated or the signal lines may be unstable.
It is a good idea to record these values over time.
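One way to make such measurements useful is to compare two snapshots and look only at what grew. The sketch below assumes each snapshot is a dictionary of the form {device: {counter name: count}}; the function name `counter_deltas` and the "later" sample values are hypothetical, made up purely for illustration.

```python
def counter_deltas(previous, current):
    """Return {device: {counter: increase}} for counters that grew.

    A device or counter missing from `previous` is treated as starting
    from 0, so a newly seen device reports its full non-zero counts.
    Counters that shrank (for example after a reboot) are ignored.
    """
    changes = {}
    for device, counters in current.items():
        before = previous.get(device, {})
        grown = {
            name: value - before.get(name, 0)
            for name, value in counters.items()
            if value > before.get(name, 0)
        }
        if grown:
            changes[device] = grown
    return changes


# First snapshot: the counters from the example output in this article.
EARLIER = {
    "c0t5001B44F1C7C0C93d0": {
        "Soft Errors": 39022, "Hard Errors": 1, "Transport Errors": 9,
    }
}
# Second snapshot: hypothetical values, invented for this example.
LATER = {
    "c0t5001B44F1C7C0C93d0": {
        "Soft Errors": 39022, "Hard Errors": 1, "Transport Errors": 14,
    }
}

if __name__ == "__main__":
    print(counter_deltas(EARLIER, LATER))
```

Reporting only the increase keeps a long-lived, already large count (such as the Soft Errors value above) from drowning out a counter that has just started to move.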
Finally, here are the SMART values for the same disk. These are worth tracking as well.
jpc@dp1-storage8% sudo /usr/sbin/smartctl -a /dev/rdsk/c0t5001B44F1C7C0C93d0 -d sat
smartctl 6.5 2016-05-07 r4318 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Marvell based SanDisk SSDs
Device Model: SanDisk SDSSDXPS480G
Serial Number: 154902401171
LU WWN Device Id: 5 001b44 f1c7c0c93
Firmware Version: X21200RL
User Capacity: 480,103,981,056 bytes [480 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Feb 28 15:50:32 2017 JST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x11) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
SMART Attributes Data Structure revision number: 4
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
9 Power_On_Hours 0x0032 253 100 --- Old_age Always - 7751
12 Power_Cycle_Count 0x0032 100 100 --- Old_age Always - 4
166 Min_W/E_Cycle 0x0032 100 100 --- Old_age Always - 1
167 Min_Bad_Block/Die 0x0032 100 100 --- Old_age Always - 55
168 Maximum_Erase_Cycle 0x0032 100 100 --- Old_age Always - 511
169 Total_Bad_Block 0x0032 100 100 --- Old_age Always - 923
171 Program_Fail_Count 0x0032 100 100 --- Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 --- Old_age Always - 0
173 Avg_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 407
174 Unexpect_Power_Loss_Ct 0x0032 100 100 --- Old_age Always - 2
184 End-to-End_Error 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 2
194 Temperature_Celsius 0x0022 071 030 --- Old_age Always - 29 (Min/Max 22/30)
199 SATA_CRC_Error 0x0032 100 100 --- Old_age Always - 0
212 SATA_PHY_Error 0x0032 100 100 --- Old_age Always - 0
230 Perc_Write/Erase_Count 0x0032 100 100 --- Old_age Always - 3384
232 Perc_Avail_Resrvd_Space 0x0033 100 100 004 Pre-fail Always - 100
233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 213703
241 Total_Writes_GiB 0x0030 253 253 --- Old_age Offline - 27536
242 Total_Reads_GiB 0x0030 253 253 --- Old_age Offline - 3085
244 Thermal_Throttle 0x0032 000 100 --- Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
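To track these SMART values alongside the iostat counters, the attribute table can be reduced to name and raw value pairs. The following is a minimal sketch that assumes the ten-column attribute layout printed above; `parse_smart_attributes` and the `WATCHED` list are hypothetical choices for illustration, not a recommendation from smartmontools.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, TYPE, UPDATED,
# WHEN_FAILED, then RAW_VALUE (only its leading integer is captured).
ATTR_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\S+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -a` text."""
    attrs = {}
    for line in text.splitlines():
        match = ATTR_RE.match(line)
        if match:
            attrs[match.group("name")] = int(match.group("raw"))
    return attrs


# A few rows copied from the smartctl output in this article.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 --- Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 --- Old_age Always - 0
188 Command_Timeout 0x0032 100 100 --- Old_age Always - 2
194 Temperature_Celsius 0x0022 071 030 --- Old_age Always - 29 (Min/Max 22/30)
199 SATA_CRC_Error 0x0032 100 100 --- Old_age Always - 0
"""

# Attributes picked for this example only.
WATCHED = ["Reallocated_Sector_Ct", "Reported_Uncorrect",
           "Command_Timeout", "SATA_CRC_Error"]

if __name__ == "__main__":
    attrs = parse_smart_attributes(SAMPLE)
    for name in WATCHED:
        if attrs.get(name, 0) > 0:
            print(name, attrs[name])
```

Raw values are vendor specific, so a non-zero number is a prompt to watch the trend rather than proof of a fault.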