My first (well, ok, third) DTrace hack

We have a storage server with a bunch of marginal disks in it. They aren’t bad enough to return read errors, and of course the SMART data claims they’re just fine, but the built-in error-recovery takes much too long (sometimes on the order of seconds), so I’ve been writing some DTrace scripts to try to identify the problem drives and get them replaced. This is my first real foray into DTrace, so it’s been a learning process.

From one of the DTrace gurus (I’ve forgotten which one), I got the idea of postprocessing DTrace output with awk to find drives whose maximum response time is more than six standard deviations away from the mean. I had to write another awk script to parse the output of gmultipath list and generate some code for the first script, translating disk units into the names we actually use. (I wouldn’t have had to do this if I had instrumented the gmultipath layer, but I understood the CAM peripheral driver for SCSI disks, /sys/cam/scsi/scsi_da.c, much better than I understood how requests are passed around GEOM, so that’s where I wrote my DTrace instrumentation.) Finally, it became clear that the SSDs in our storage servers were pulling the mean and standard deviation down too much, so I had to kluge the DTrace script to exclude those devices by hand. (I don’t think there’s a way to identify SSDs in pure DTrace code; if anyone has an idea how to do it, please let me know.)

I am by no means a DTrace expert. I’m not even an awk expert — I would probably have done better to write this code in Perl or Ruby, which would have made it much simpler and avoided the whole business of scripts-running-scripts-to-generate-scripts. (Or better yet, I could learn R and DTrace at the same time, joy!) But what I had seems to work, so I’m sharing it here.

The driver shell script, findslowdisks:

#!/bin/sh

temp=$(mktemp /tmp/sixsigma.awk.XXXXXX)
trap "rm -f $temp" 0 1 3 15

{
        echo 'BEGIN {'
        gmultipath list | awk '
/Providers:/ { state = 1 }
/Consumers:/ { state = 2 }
/Name:/ && state == 1 { providername = $NF }
/Name:/ && state == 2 { printf("providers[\"%s\"] = \"%s\";\n", $NF, providername); }'
        echo '}'
        cat sixsigma.awk
} > $temp
dtrace -s daslow2.d | { trap '' INT; awk -f $temp; }
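
To see what that inner awk program produces, here’s a sketch run against made-up gmultipath list output (the device and provider names below are hypothetical, and real output carries many more fields, which the program ignores):

```shell
# Hypothetical `gmultipath list` output; real output has many more fields.
cat > /tmp/gmultipath-sample.txt <<'EOF'
Geom name: s25d4
State: OPTIMAL
Providers:
1. Name: multipath/s25d4
   Mediasize: 4000787030016 (3.6T)
Consumers:
1. Name: da40
   Mediasize: 4000787030016 (3.6T)
   State: ACTIVE
EOF

# The same program as in findslowdisks: remember the provider name,
# then emit one awk assignment per consumer (da) device.
awk '
/Providers:/ { state = 1 }
/Consumers:/ { state = 2 }
/Name:/ && state == 1 { providername = $NF }
/Name:/ && state == 2 { printf("providers[\"%s\"] = \"%s\";\n", $NF, providername); }' \
    /tmp/gmultipath-sample.txt
# providers["da40"] = "multipath/s25d4";
```

Wrapped in the echoed BEGIN { … } lines, those assignments become the head of the generated awk script.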

The DTrace script, daslow2.d:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option dynvarsize=2m

dtrace:::BEGIN
{
        /* ZeusRAM */
        ssd[6] = 1;
        ssd[54] = 1;
        ssd[102] = 1;
        ssd[126] = 1;
        /* flash */
        ssd[31] = 1;
        ssd[79] = 1;
        ssd[151] = 1;
        ssd[175] = 1;
}

fbt::dastrategy:entry
/ssd[((struct cam_periph *)args[0]->bio_disk->d_drv1)->unit_number] == 0/
{
        start_time[args[0]] = timestamp;
}

fbt::dadone:entry
/(this->bp = (struct bio *)args[1]->ccb_h.periph_priv.entries[1].ptr) && start_time[this->bp] /
{
        this->delta = (timestamp - start_time[this->bp]) / 1000;
        @sigma = stddev(this->delta);
        @mu = avg(this->delta);
        @tmax[args[0]->unit_number] = max(this->delta);
        start_time[this->bp] = 0;
}

tick-5s
{
        printa("mu %@d\n", @mu);
        printa("sigma %@d\n", @sigma);
        printa("da%d tmax %@d\n", @tmax);
        printf("\n");
        clear(@tmax);
}

tick-60s
{
        clear(@mu);
        clear(@sigma);
}

dtrace:::END
{
        printa("mu %@d\n", @mu);
        printa("sigma %@d\n", @sigma);
        printa("da%d tmax %@d\n", @tmax);
}
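
The ssd[] table in the BEGIN block is maintained by hand. One way to generate those entries outside DTrace might be to grep camcontrol devlist for the known solid-state models — a sketch, assuming devlist output of roughly this shape (the inquiry strings and unit numbers here are made up):

```shell
# Hypothetical `camcontrol devlist` lines; real output varies by hardware.
cat > /tmp/devlist-sample.txt <<'EOF'
<STEC ZeusRAM C025>                at scbus0 target 6 lun 0 (pass6,da6)
<SEAGATE ST4000NM0023 0004>        at scbus0 target 7 lun 0 (pass7,da7)
EOF

# Match models known to be solid-state and emit ssd[] entries
# suitable for pasting into the DTrace BEGIN block.
awk '/ZeusRAM/ {
        n = split($0, f, /[,)]/);   # pick the daN name out of "(passN,daN)"
        sub(/^da/, "", f[n-1]);
        printf("ssd[%s] = 1;\n", f[n-1]);
}' /tmp/devlist-sample.txt
# ssd[6] = 1;
```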

This DTrace script generates output that looks like this:

mu 21655
sigma 59125
da0 tmax 0
da1 tmax 0
da73 tmax 72163
da7 tmax 75229
da72 tmax 80966
da98 tmax 84600
da96 tmax 92696
da80 tmax 92742
da23 tmax 95349
da83 tmax 95807
da27 tmax 97571
da95 tmax 99413
da76 tmax 101202
da8 tmax 104040
da22 tmax 104586
da59 tmax 106235
da50 tmax 106523
da99 tmax 107958
da53 tmax 108361
da81 tmax 108938
da49 tmax 109653
da36 tmax 110585
da52 tmax 111933
da97 tmax 111972
da94 tmax 113516
da100 tmax 115035
da11 tmax 116034
da82 tmax 116527
da56 tmax 117776
da29 tmax 118982
da34 tmax 119512
da26 tmax 120884
da74 tmax 122808
da84 tmax 123039
da58 tmax 123882
da87 tmax 124629
da25 tmax 125528
da47 tmax 127538
da70 tmax 128523
da55 tmax 134356
da101 tmax 137618
da35 tmax 137632
da10 tmax 139364
da60 tmax 141884
da33 tmax 144913
da32 tmax 153962
da48 tmax 162202
da13 tmax 170598
da86 tmax 173577
da9 tmax 178848
da90 tmax 183577
da61 tmax 185236
da85 tmax 190790
da14 tmax 192199
da38 tmax 204740
da68 tmax 217665
da69 tmax 218239
da12 tmax 230249
da64 tmax 233869
da92 tmax 235744
da44 tmax 243507
da20 tmax 250953
da63 tmax 267525
da42 tmax 267868
da19 tmax 292027
da18 tmax 293504
da37 tmax 301723
da45 tmax 304621
da28 tmax 308943
da91 tmax 309112
da39 tmax 324018
da62 tmax 329701
da93 tmax 339661
da16 tmax 341316
da17 tmax 345690
da15 tmax 349360
da41 tmax 351064
da67 tmax 355273
da40 tmax 376592
da66 tmax 386287
da88 tmax 426114
da89 tmax 426324
da21 tmax 500642
da30 tmax 502026
da75 tmax 734421
da65 tmax 1319797
da71 tmax 1347838
da46 tmax 1354558
da57 tmax 1525634
da51 tmax 2403745

…which is obviously not very human-readable, but a glance is enough to see that it’s generating plausible data. sixsigma.awk is the tail end of the postprocessing script (the head being the BEGIN block constructed by the shell script above):

$1 == "mu" { mu = $2 * 1.0; }
$1 == "sigma" { sigma = $2 * 1.0; }
$2 == "tmax" {
        tmax = $3 * 1.0;
        if (tmax > (mu + 6*sigma)) {
                printf("%s: %.1f ms > %.1f ms\n", providers[$1], tmax / 1000.0, 
                        (mu + 6*sigma) / 1000.0);
        }
}
/^$/    { print }
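
As a sanity check, you can run the filter by hand with the mu and sigma from the sample output above and a small stand-in providers[] mapping (the mapping below is hypothetical; the real one is generated by findslowdisks):

```shell
# Stand-in for the generated script: hand-written BEGIN block plus sixsigma.awk.
cat > /tmp/sixsigma-test.awk <<'EOF'
BEGIN {
        providers["da39"] = "multipath/s25d3";
        providers["da40"] = "multipath/s25d4";
}
$1 == "mu" { mu = $2 * 1.0; }
$1 == "sigma" { sigma = $2 * 1.0; }
$2 == "tmax" {
        tmax = $3 * 1.0;
        if (tmax > (mu + 6*sigma)) {
                printf("%s: %.1f ms > %.1f ms\n", providers[$1], tmax / 1000.0,
                        (mu + 6*sigma) / 1000.0);
        }
}
/^$/    { print }
EOF

# mu + 6*sigma = 21655 + 6*59125 = 376405 us, so only da40 clears the bar:
printf 'mu 21655\nsigma 59125\nda39 tmax 324018\nda40 tmax 376592\n' |
    awk -f /tmp/sixsigma-test.awk
# multipath/s25d4: 376.6 ms > 376.4 ms
```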

The output, updated every five seconds, looks like this:

multipath/s29d17: 1688.0 ms > 547.1 ms
multipath/s29d22: 2847.9 ms > 547.1 ms
multipath/s25d12: 3381.1 ms > 547.1 ms

multipath/s25d22: 986.1 ms > 585.7 ms
multipath/s29d17: 1498.7 ms > 585.7 ms
multipath/s25d12: 2728.8 ms > 585.7 ms
multipath/s25d4: 2894.7 ms > 585.7 ms
multipath/s29d22: 3099.7 ms > 585.7 ms

multipath/s29d17: 719.5 ms > 651.1 ms
multipath/s25d18: 895.4 ms > 651.1 ms
multipath/s25d4: 1749.2 ms > 651.1 ms
multipath/s29d22: 1995.8 ms > 651.1 ms
multipath/s25d12: 3047.1 ms > 651.1 ms

multipath/s25d18: 580.4 ms > 579.1 ms
multipath/s25d4: 734.8 ms > 579.1 ms
multipath/s29d22: 1379.6 ms > 579.1 ms
multipath/s25d12: 2586.5 ms > 579.1 ms

multipath/s29d22: 3021.9 ms > 645.4 ms
multipath/s25d12: 3492.1 ms > 645.4 ms

This gives me a pretty good idea of which disks still need to be replaced.
