We have a storage server with a bunch of marginal disks in it. They aren’t bad enough to return read errors, and of course the SMART data claims they’re just fine, but the built-in error-recovery takes much too long (sometimes on the order of seconds), so I’ve been writing some DTrace scripts to try to identify the problem drives and get them replaced. This is my first real foray into DTrace, so it’s been a learning process.
From one of the DTrace gurus (I’ve forgotten which one), I got the idea of postprocessing DTrace output with awk to find drives whose maximum response time is more than six standard deviations away from the mean. I had to write another awk script to parse the output of gmultipath list and generate some code for the first script, translating disk units into the names we actually use. (I wouldn’t have had to do this if I had instrumented the gmultipath layer, but I understood the CAM peripheral driver for SCSI disks, /sys/cam/scsi/scsi_da.c, much better than I understood how requests are passed around GEOM, so that’s where I wrote my DTrace instrumentation.) Finally, it became clear that the SSDs in our storage servers were pulling the mean and standard deviation down too much, so I had to kluge the DTrace script to exclude those devices by hand. (I don’t think there’s a way to identify SSDs in pure DTrace code; if anyone has an idea how to do it, please let me know.)
I am by no means a DTrace expert. I’m not even an awk expert; I would probably have done better to write this code in Perl or Ruby, which would have made it much simpler and avoided the whole business of scripts-running-scripts-to-generate-scripts. (Or better yet, I could learn R and DTrace at the same time, joy!) But what I have seems to work, so I’m sharing it here.
The driver shell script, findslowdisks:
#!/bin/sh
# Build a temporary awk script: a generated BEGIN block mapping daN
# consumer names to their gmultipath provider names, followed by the
# six-sigma filter in sixsigma.awk.
temp=$(mktemp /tmp/sixsigma.awk.XXXXXX)
trap "rm -f $temp" 0 1 3 15
{
	echo 'BEGIN {'
	gmultipath list | awk '
		/Providers:/ { state = 1 }
		/Consumers:/ { state = 2 }
		/Name:/ && state == 1 { providername = $NF }
		/Name:/ && state == 2 {
			printf("providers[\"%s\"] = \"%s\";\n", $NF, providername);
		}'
	echo '}'
	cat sixsigma.awk
} > $temp
# Post-process the DTrace output; the inner trap keeps ^C from killing
# awk before dtrace has printed its final aggregations.
dtrace -s daslow2.d | { trap '' INT; awk -f $temp; }
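To make the scripts-generating-scripts business a little more concrete: the inner awk program walks the Providers and Consumers sections of each geom in the gmultipath list output and emits one providers[] assignment per consumer. The excerpt below is heavily trimmed, and the da unit numbers are invented for illustration. Given something like

Geom name: s29d17
Providers:
1. Name: multipath/s29d17
   [...]
Consumers:
1. Name: da17
   [...]
2. Name: da117
   [...]

the generated BEGIN block would contain entries along the lines of

providers["da17"] = "multipath/s29d17";
providers["da117"] = "multipath/s29d17";

so that every path to a disk reports under the single name we actually use.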
The DTrace script, daslow2.d:
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=2m

dtrace:::BEGIN
{
	/* ZeusRAM */
	ssd[6] = 1; ssd[54] = 1; ssd[102] = 1; ssd[126] = 1;
	/* flash */
	ssd[31] = 1; ssd[79] = 1; ssd[151] = 1; ssd[175] = 1;
}

/* Record the start time of each request, skipping the SSDs. */
fbt::dastrategy:entry
/ssd[((struct cam_periph *)args[0]->bio_disk->d_drv1)->unit_number] == 0/
{
	start_time[args[0]] = timestamp;
}

/* Recover the original bio from the completed CCB and compute the
   response time in microseconds. */
fbt::dadone:entry
/(this->bp = (struct bio *)args[1]->ccb_h.periph_priv.entries[1].ptr) &&
    start_time[this->bp]/
{
	this->delta = (timestamp - start_time[this->bp]) / 1000;
	@sigma = stddev(this->delta);
	@mu = avg(this->delta);
	@tmax[args[0]->unit_number] = max(this->delta);
	start_time[this->bp] = 0;
}

tick-5s
{
	printa("mu %@d\n", @mu);
	printa("sigma %@d\n", @sigma);
	printa("da%d tmax %@d\n", @tmax);
	printf("\n");
	clear(@tmax);
}

tick-60s
{
	clear(@mu);
	clear(@sigma);
}

dtrace:::END
{
	printa("mu %@d\n", @mu);
	printa("sigma %@d\n", @sigma);
	printa("da%d tmax %@d\n", @tmax);
}
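One way to avoid hand-maintaining that ssd[] table would be to generate those lines the same way the providers[] table is generated, by pattern-matching known SSD model strings in camcontrol devlist output. A sketch along these lines might do it; the model patterns and the exact layout of the devlist lines are assumptions to check against your own hardware:

#!/bin/sh
# Sketch: emit ssd[unit] = 1; lines for the DTrace BEGIN block by matching
# SSD model strings in camcontrol devlist output.  The patterns (ZeusRAM,
# SSD) and the layout of the parenthesized device list are assumptions.
camcontrol devlist | awk '
	/ZeusRAM|SSD/ {
		if (match($0, /\(.*\)/)) {
			devs = substr($0, RSTART + 1, RLENGTH - 2);
			n = split(devs, name, ",");
			for (i = 1; i <= n; i++)
				if (name[i] ~ /^da[0-9]+$/)
					printf("ssd[%s] = 1;\n", substr(name[i], 3));
		}
	}'

The unit numbers this prints are the same CAM peripheral unit numbers the dastrategy predicate tests, so the output could be spliced into the BEGIN block in place of the hand-written list.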
This DTrace script generates output that looks like this:
mu 21655
sigma 59125
da0 tmax 0
da1 tmax 0
da73 tmax 72163
da7 tmax 75229
da72 tmax 80966
da98 tmax 84600
da96 tmax 92696
da80 tmax 92742
da23 tmax 95349
da83 tmax 95807
da27 tmax 97571
da95 tmax 99413
da76 tmax 101202
da8 tmax 104040
da22 tmax 104586
da59 tmax 106235
da50 tmax 106523
da99 tmax 107958
da53 tmax 108361
da81 tmax 108938
da49 tmax 109653
da36 tmax 110585
da52 tmax 111933
da97 tmax 111972
da94 tmax 113516
da100 tmax 115035
da11 tmax 116034
da82 tmax 116527
da56 tmax 117776
da29 tmax 118982
da34 tmax 119512
da26 tmax 120884
da74 tmax 122808
da84 tmax 123039
da58 tmax 123882
da87 tmax 124629
da25 tmax 125528
da47 tmax 127538
da70 tmax 128523
da55 tmax 134356
da101 tmax 137618
da35 tmax 137632
da10 tmax 139364
da60 tmax 141884
da33 tmax 144913
da32 tmax 153962
da48 tmax 162202
da13 tmax 170598
da86 tmax 173577
da9 tmax 178848
da90 tmax 183577
da61 tmax 185236
da85 tmax 190790
da14 tmax 192199
da38 tmax 204740
da68 tmax 217665
da69 tmax 218239
da12 tmax 230249
da64 tmax 233869
da92 tmax 235744
da44 tmax 243507
da20 tmax 250953
da63 tmax 267525
da42 tmax 267868
da19 tmax 292027
da18 tmax 293504
da37 tmax 301723
da45 tmax 304621
da28 tmax 308943
da91 tmax 309112
da39 tmax 324018
da62 tmax 329701
da93 tmax 339661
da16 tmax 341316
da17 tmax 345690
da15 tmax 349360
da41 tmax 351064
da67 tmax 355273
da40 tmax 376592
da66 tmax 386287
da88 tmax 426114
da89 tmax 426324
da21 tmax 500642
da30 tmax 502026
da75 tmax 734421
da65 tmax 1319797
da71 tmax 1347838
da46 tmax 1354558
da57 tmax 1525634
da51 tmax 2403745
…which is obviously not very human-readable, but you can look at it and see that it’s generating data that looks like it might be correct. sixsigma.awk is the tail end of the script (the head being a BEGIN block that is constructed by the shell script above):
$1 == "mu" { mu = $2 * 1.0; } $1 == "sigma" { sigma = $2 * 1.0; } $2 == "tmax" { tmax = $3 * 1.0; if (tmax > (mu + 6*sigma)) { printf("%s: %.1f ms > %.1f ms\n", providers[$1], tmax / 1000.0, (mu + 6*sigma) / 1000.0); } } /^$/ { print }
The output, updated every five seconds, looks like this:
multipath/s29d17: 1688.0 ms > 547.1 ms
multipath/s29d22: 2847.9 ms > 547.1 ms
multipath/s25d12: 3381.1 ms > 547.1 ms

multipath/s25d22: 986.1 ms > 585.7 ms
multipath/s29d17: 1498.7 ms > 585.7 ms
multipath/s25d12: 2728.8 ms > 585.7 ms
multipath/s25d4: 2894.7 ms > 585.7 ms
multipath/s29d22: 3099.7 ms > 585.7 ms

multipath/s29d17: 719.5 ms > 651.1 ms
multipath/s25d18: 895.4 ms > 651.1 ms
multipath/s25d4: 1749.2 ms > 651.1 ms
multipath/s29d22: 1995.8 ms > 651.1 ms
multipath/s25d12: 3047.1 ms > 651.1 ms

multipath/s25d18: 580.4 ms > 579.1 ms
multipath/s25d4: 734.8 ms > 579.1 ms
multipath/s29d22: 1379.6 ms > 579.1 ms
multipath/s25d12: 2586.5 ms > 579.1 ms

multipath/s29d22: 3021.9 ms > 645.4 ms
multipath/s25d12: 3492.1 ms > 645.4 ms
This gives me a pretty good idea of which disks still need to be replaced.