When Bad Things Happen To Good Disks - aka Disks Don’t Have File Descriptors

A presentation at LinuxCon + CloudOpen + ContainerCon North America 2015 in August 2015 in Seattle, WA, USA by erik riedel

revision 5 from flickr/purplemattfish, Broken hard drive? When Bad Things Happen to Good Disks Erik Riedel, EMC August 2015 aka Disks Don’t Have File Descriptors

from flickr/Blude, floppy disks for breakfast

¨ Load It Into A Rack from flickr/erinhillaw, Math/Physics Bike Rack and flickr/csavage31, Bike Racks

¨ ¨ ¨ Gen1 (2008) 1TB (2010) Promo Code 1Gen2 Front (tray pulled out) 2,3TB Gen3 (2012) 3,4,6TB high capacity drives (as many as possible) x86 servers/controllers (as few as possible) SAS backplanes/cables (not too many, not too few) Gen5 (2015) 8TB Gen4 (2014) 6TB

Scale Out RU 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 NILE DENSE GbE 10 GbE 10 GbE Empty Rinjin 4 Blade Blank Blank Blank Blank Voyager 15 Disk Voyager 15 Disk Voyager 15 Disk Voyager 15Disk Not Used 480 TB/4n60d U400 RU 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 NILE DENSE GbE 10 GbE 10 GbE Empty Rinjin 4 Blade Blank Blank Blank Blank Voyager 30 Disk Voyager 30 Disk Voyager 30 Disk Voyager 30 Disk Not Used 960 TB/4n120d U900 RU 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 NILE DENSE GbE 10 GbE 10 GbE Empty Rinjin 4 Blade Blank Blank Blank Blank Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Not Used 1.9 PB/4n240d U1900 RU 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 NILE DENSE GbE 10 GbE 10 GbE Rinjin 4 Blade Rinjin 4 Blade Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Voyager 60 Disk Not Used 3.8PB/8n480d U4000 17PB/48n3840d

Problem Overview ¨ ¨ ¨ ¨ set up a collection of 50-node to 500-node Linux clusters at 100s of sites worldwide deployed, managed, monitored, serviced by a diverse group of Ops + Service folks (our “customers”) when something goes (really) wrong, they call your (cell) phone approach: keep it simple; make it easy; be proactive; turn off your (cell) phone

It’s 4am, the clock is ticking, you have 52* minutes to solve a problem, can you debug it? *52 minutes is the allowed yearly downtime at “4x 9s” availability Support calls you at 4am, how many minutes will it take for you to explain what the system is supposed to do, before they can begin to debug and fix it. If it takes 20 minutes to explain the design, you’re down to 30 minutes left to fix whatever is wrong. And then nothing else can go wrong until next year. Marvin Theimer, Amazon (2009 LADDIS workshop talk)

logan-pink:/var/tmp # cs_hal list disks Disks(s): SCSI Device —————/dev/sg6 /dev/sg7 /dev/sg8 /dev/sg12 /dev/sg13 /dev/sg19 /dev/sg20 /dev/sg123 /dev/sg33 /dev/sg34 /dev/sg41 /dev/sg42 /dev/sg43 /dev/sg44 /dev/sg51 /dev/sg89 /dev/sg64 /dev/sg65 /dev/sg68 /dev/sg69 /dev/sg71 /dev/sg72 /dev/sg73 Block Device —————-/dev/sdf /dev/sdg /dev/sdh /dev/sdl /dev/sdm /dev/sds /dev/sdt /dev/sddn /dev/sdaf /dev/sdag /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdax /dev/sdch /dev/sdbi /dev/sdbj /dev/sdbm /dev/sdbn /dev/sdbp /dev/sdbq /dev/sdbr RAID array: 1 external: 116 total disks: 117 Enclosure —————/dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg1 /dev/sg62 /dev/sg62 /dev/sg62 /dev/sg62 /dev/sg62 /dev/sg62 /dev/sg62 /dev/sg62 Slot —-C01 A03 A00 D01 D00 E02 B03 E06 A06 A07 B06 A11 B10 B11 E11 E04 A01 A02 A03 A00 A05 A04 D01 Serial Number —————————WMC130020055 Z1F21NXT 1EV0XEJB PAGHLBYV PAGHL9TV YHGPHUVC S1UYJ1LZ109607 AR31021EG513RC Z1F1CHC8 Z1F1CJ5X 9XW0BALH 9XW094H2 PK133WPAG03L9J WD-WMC190009962 Z4D01HKQ 9WM2DQ0K YHH6EJ8A YHGVVE8A YHH5N6MA YHH5GWMA YHH6EWJA YHH682RA Z1F1CGH7 SMART ——-GOOD FAILED FAILED GOOD GOOD FAILED GOOD GOOD GOOD GOOD FAILED GOOD FAILED GOOD GOOD FAILED FAILED SUSPECT FAILED FAILED FAILED SUSPECT SUSPECT Note that /dev/sd* is essentially useless Where are all my disks? Original smartd config: /dev/sd[a-z] shows only 26 drives. What is this, Windows? A: /dev/sd[a-z]+

Disks(s): SCSI Device Status —————n/a /dev/sg0 /dev/sg1 /dev/sg3 /dev/sg4 /dev/sg5 /dev/sg6 /dev/sg7 /dev/sg8 /dev/sg9 /dev/sg10 … … /dev/sg63 Disks(s): SCSI Device Status —————n/a /dev/sg0 /dev/sg1 /dev/sg4 /dev/sg5 /dev/sg6 /dev/sg7 /dev/sg8 /dev/sg9 /dev/sg10 /dev/sg11 … … /dev/sg66 ONE NODE Block Device Enclosure Slot Serial Number SMART —————-/dev/md126 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj ————-RAID vol intl/sys intl/sys /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 —-n/a 0 1 C00 A01 A02 B00 C01 A03 A00 B01 —————————not supported PWHHBZ7F PWHGVT6F YVHSKHWA YVHRUYEA YVHSSHXA YVHRL21A YVHSB98A YVHSJRRA YVHSMK7A YVHLVEND ——n/a GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD /dev/sdbj /dev/sg2 E07 YVHSB4BA GOOD ANOTHER NODE Block Device Enclosure Slot Serial Number SMART —————-/dev/md126 /dev/sda /dev/sdb /dev/sdu /dev/sdx /dev/sdbk /dev/sdbl /dev/sde /dev/sdbm /dev/sdbn /dev/sdbo ————-RAID vol intl/sys intl/sys /dev/sg3 /dev/sg3 /dev/sg3 /dev/sg3 /dev/sg3 /dev/sg3 /dev/sg3 /dev/sg3 —-n/a 0 1 C00 A01 A02 B00 C01 A03 A00 B01 —————————not supported PWJMRV8D PWJLVH2F YVK2EWWA YVJWLP3D YVK078ED YVK2V6SA YVJWB5KD YVK2V9BA YVK1S2RA YVK2V68A ——n/a GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD /dev/sddl /dev/sg3 E07 YVK3487A GOOD “Need to replace a failed disk on the 3rd node in the cluster, the bad disk is sdf”.

Density 2012 Disks (raw) @ 3TB Disks (protected) Racks @ 480 disks 5 PB 1,700 disks 2,700 disks 6 racks 20 PB 6,700 disks 11,000 disks 23 racks 50 PB 17,000 disks 27,000 disks 56 racks

Updated from “Long-Term Storage”, presented at Library of Congress Workshop in September 2012 Density 2012 Disks (raw) @ 3TB Disks (protected) Racks @ 480 disks 5 PB 1,700 disks 2,700 disks 6 racks 20 PB 6,700 disks 11,000 disks 23 racks 50 PB 17,000 disks 27,000 disks 56 racks 2014 5 PB Disks (raw) @ 6TB 830 disks Disks (protected) Racks @ 480 disks 1,300 disks 3 racks 20 PB 3,300 disks 5,300 disks 12 racks 50 PB 8,300 disks 13,000 disks 28 racks 2016 5 PB Disks (raw) @ 10TB 500 disks Disks (protected) Racks @ 780 disks 700 disks 1 rack 20 PB 2,000 disks 2,800 disks 4 racks 50 PB 5,000 disks 7,000 disks 9 racks

from flickr/purplemattfish, Broken hard drive?

dino-black:~ % cs_hal list disks Disks(s): SCSI Device Block Device Enclosure —————- —————— —————n/a /dev/sda RAID vol /dev/sg0 n/a RAID array /dev/sg1 n/a RAID array /dev/sg3 /dev/sdb /dev/sg18 /dev/sg4 /dev/sdc /dev/sg18 /dev/sg5 /dev/sdd /dev/sg18 /dev/sg6 /dev/sde /dev/sg18 Slot —-n/a 0 1 0 1 2 3 /dev/sg7 /dev/sg8 /dev/sg9 /dev/sg10 /dev/sg11 /dev/sg12 /dev/sg13 /dev/sg14 /dev/sg15 /dev/sg16 /dev/sg17 4 5 6 7 8 9 10 11 12 13 14 /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp RAID array: 2 external: 15 total disks: 17 disks /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 /dev/sg18 Serial Number SMART ————————— ——not supported n/a 9QE801ME GOOD 9QE834TG GOOD 9WM0R49P GOOD 9WM0R48T GOOD 9WM0R3Z4 GOOD 9WM0R4VK SUSPECT: Reallocated(5)=19 9WM0RF21 GOOD 9WM0R44B GOOD 9WM0R3E0 GOOD 9WM0RF2X GOOD 9WM0R4TX GOOD 9WM0REHK GOOD 9WM0R3EW GOOD 9WM0R4GY GOOD 9WM0R4NZ GOOD 9WM0RF42 GOOD 9WM0R3AS GOOD Basic cross-layer mapping of disks, and RAIDs, and enclosures.

dino-black:~ % cs_hal list fs Volume(s): SCSI Device Block Device FS UUID —————- —————— ——————————————————/dev/sg2 /dev/sda 0ddb9635-ff27-4cd3-8c2f-58a6f5226d30 /dev/sg2 /dev/sda 2192b3ef-2a44-4450-9b04-327c00215454 /dev/sg2 /dev/sda ffa9607a-4b6f-4218-9266-c083fb1989a1 /dev/sg2 /dev/sda 746b09d4-f07a-49dc-8b40-86220dfc7edc /dev/sg2 /dev/sda f7c37c92-4bc5-4abf-95a5-efa51c46f6bc /dev/sg3 /dev/sdb 90a52650-e0f3-49e4-810b-a505cdcadb51 /dev/sg4 /dev/sdc 173aef8b-80e9-4be2-a510-3b88d3343f8a /dev/sg5 /dev/sdd bcfb1897-152b-482b-bde6-de9665ad7c51 /dev/sg6 /dev/sde bc6946ae-770f-4621-9ea5-f2d1e5ec0f28 /dev/sg7 /dev/sdf 52446742-a566-4036-8b0c-5cd7901474f0 /dev/sg8 /dev/sdg c9ee0971-d8dc-4621-8958-d79890d0f590 /dev/sg9 /dev/sdh 294bcd25-ab19-40ee-8c03-cd71e94e9e06 /dev/sg10 /dev/sdi cb5cac6c-1cdf-49ec-8754-a475db3d4afd /dev/sg11 /dev/sdj 91739495-2a46-47d2-8676-d8b4b3f8fd76 /dev/sg12 /dev/sdk 9f2a0ae1-d97b-4fb1-873e-6a9bfb2c3254 /dev/sg13 /dev/sdl 404a8c5a-19c0-4949-bd33-edd83ca4ee8f /dev/sg14 /dev/sdm da36046f-41f7-46d4-bcaa-af183002b792 /dev/sg15 /dev/sdn a71b6937-8ae5-4a37-96d0-78feeb0e62c4 /dev/sg16 /dev/sdo 34d6f5c5-1f5d-4cea-af5a-af157324aee8 /dev/sg17 /dev/sdp 9cc59415-cab5-4456-881f-a0c533e1823d Type ————ext3 xfs xfs xfs swap v1 xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs Slot Label SMART Mount Point ——- ——- ————————BOOT GOOD /boot GOOD /root2 GOOD /var GOOD / GOOD 0 GOOD /data-disks/ss-90a52650-e0f3-49e4-810b-a505cdcadb51 1 GOOD /data-disks/ss-173aef8b-80e9-4be2-a510-3b88d3343f8a 2 GOOD /data-disks/ss-bcfb1897-152b-482b-bde6-de9665ad7c51 3 SUSPECT /data-disks/ss-bc6946ae-770f-4621-9ea5-f2d1e5ec0f28 4 GOOD /data-disks/ss-52446742-a566-4036-8b0c-5cd7901474f0 5 GOOD /data-disks/ss-c9ee0971-d8dc-4621-8958-d79890d0f590 6 GOOD /meta/294bcd25-ab19-40ee-8c03-cd71e94e9e06 7 GOOD /data-disks/ss-cb5cac6c-1cdf-49ec-8754-a475db3d4afd 8 GOOD /data-disks/ss-91739495-2a46-47d2-8676-d8b4b3f8fd76 9 GOOD /data-disks/ss-9f2a0ae1-d97b-4fb1-873e-6a9bfb2c3254 10 GOOD /meta/404a8c5a-19c0-4949-bd33-edd83ca4ee8f 11 GOOD /data-disks/ss-da36046f-41f7-46d4-bcaa-af183002b792 12 GOOD /data-disks/ss-a71b6937-8ae5-4a37-96d0-78feeb0e62c4 13 GOOD /meta/34d6f5c5-1f5d-4cea-af5a-af157324aee8 14 GOOD /data-disks/ss-9cc59415-cab5-4456-881f-a0c533e1823d dino-black:~ % cs_hal list fs Volume(s): SCSI Device Block Device FS UUID —————- —————— —————————————————-/dev/sg2 /dev/sda 0ddb9635-ff27-4cd3-8c2f-58a6f5226d30 /dev/sg2 /dev/sda 2192b3ef-2a44-4450-9b04-327c00215454 /dev/sg2 /dev/sda ffa9607a-4b6f-4218-9266-c083fb1989a1 /dev/sg2 /dev/sda 746b09d4-f07a-49dc-8b40-86220dfc7edc /dev/sg2 /dev/sda f7c37c92-4bc5-4abf-95a5-efa51c46f6bc volumes (file systems) dino-black:~ % cs_hal list disks Disks(s): SCSI Device Block Device Enclosure —————- —————— —————n/a /dev/sda RAID vol /dev/sg0 n/a RAID array /dev/sg1 n/a RAID array /dev/sg3 /dev/sdb /dev/sg18 Type Slot Label SMART ———- —— ——- ——-ext3 BOOT GOOD xfs GOOD xfs GOOD xfs GOOD swap v1 GOOD Slot —-n/a 0 1 0 Serial Number ————————-not supported 9QE801ME 9QE834TG 9WM0R49P Mount ——-/boot /root2 /var / - SMART ——n/a GOOD GOOD GOOD

silver-is1-004:~ % cs_hal list disks Disks(s): SCSI Device Block Device Enclosure —————- —————— —————n/a /dev/md126 RAID vol /dev/sg1 n/a RAID array /dev/sg0 n/a RAID array /dev/sg27 /dev/sdy /dev/sg2 /dev/sg28 /dev/sdz /dev/sg2 /dev/sg29 /dev/sdaa /dev/sg2 /dev/sg30 /dev/sdab /dev/sg2 /dev/sg31 /dev/sdac /dev/sg2 … … … /dev/sg47 /dev/sdas /dev/sg2 /dev/sg48 /dev/sdat /dev/sg2 /dev/sg49 /dev/sdau /dev/sg2 /dev/sg50 /dev/sdav /dev/sg2 /dev/sg51 /dev/sdaw /dev/sg2 /dev/sg52 /dev/sdax /dev/sg2 /dev/sg53 /dev/sday /dev/sg2 /dev/sg54 /dev/sdaz /dev/sg2 /dev/sg55 /dev/sdba /dev/sg2 /dev/sg56 /dev/sdbb /dev/sg2 RAID array: 2 external: 60 total disks: 62 silver-is1-004:~ SCSI enclosure bsg id S/N expander count zoned zoning supported zone saving disk slot count disk count LED vendor model firmware SCSI id SAS address state HBA Slot —-n/a 1 0 B04 C04 D04 E05 E04 Serial Number —————————not supported KLH6DNZJ KLH6DL7J Z1Z0EVBF Z1Z0EKFZ Z1Z0ETMY Z1Z0EVLG Z1Z0EVH9 SMART Status —————-n/a GOOD GOOD GOOD GOOD GOOD GOOD GOOD C11 D11 C10 D10 C09 D09 E11 E10 E09 C08 Z1Z0ETTT Z1Z0EVAM Z1Z0ETFN Z1Z0EVC4 Z1Z0EVCR Z1Z0ETEP Z1Z0EKG3 Z1Z0ETLV Z1Z0EV1A Z1Z0EV90 GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD % : : : : : : : : : : : : : : : : : : cs_hal info sg2 /dev/sg2 /dev/bsg/expander-1:0 50060480e01b09be 50060480e01b09be 2 no yes yes 60 60 OFF EMC ESES Enclosure 0001 1:0:0:0 50060480e01b09be awake and running 0000:02:00.0 HAL – details abstracted object model embodied as a library silver-is1-004:~ % cs_hal info sg27 SCSI disk : /dev/sg27 block device : /dev/sdy size (via SCSI) : 3726.02 GB size (via blk) : 3726.02 GB vendor : ATA model : ST4000NM0033-9ZM firmware : GT00 SCSI id : 1:0:25:0 S/N : Z1Z0EVBF SAS address : 50060480e832bc16 state : awake and running RAID : no internal : no system disk : no type : rotational volume count : 1 volume : /dev/sdy1 volume size : 3726.02 GB filesystem : 285b59d3-xxxx (xfs; mounted) slot name : B04 parent enc : sg2 parent exp : sg3 parent HBA : 0000:02:00.0 LED : OFF SMART : GOOD

“Try a reboot, it might fix those disks [without sending out spares]” “There are three bad disks in the cluster, they are all out of file descriptors.”

Serial Number: Serial Number: K85ETA1266AF K85ET9B260PB FEBRUARY Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Serial Number: Serial Number: K85ETA1266AF K85ET9B260PB OCTOBER WCAVY2775469 WCAVY2775477 WCAVY2766210 WCAVY2570279 WCAVY2568648 WCAVY2568648 WCAVY2768780 WCAVY2768776 WCAVY2631983 WCAVY2764736 WCAVY2766146 WCAVY2768581 WCAVY2775457 WCAVY2775980 WCAVY2567467 Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Serial Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: Number: WCAVY2775469 WCAVY2775477 WCAVY2766210 WCAVY2570279 WCAVY2568648 WCAVY2768844 WCAVY2768780 WCAVY2768776 WCAVY2631983 WCAVY2764736 WCAVY2766146 WCAVY2768581 WCAVY2775457 WCAVY2775980 WCAVY2567467 smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen === START OF INFORMATION SECTION === Device Model: WD2002FYPS-12 Serial Number: WCAVY2568648 Firmware Version: 02.S0500 User Capacity: 2,000,398,934,016 bytes ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always 73 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always 0 9 Power_On_Hours 0x0032 086 086 000 Old_age Always 10762 smartctl version 5.37 [x86_64-unknown-linux-gnu] Copyright (C) 2002-6 Bruce Allen === START OF INFORMATION SECTION === Device Model: WD2002FYPS-12 Serial Number: WCAVY2568648 Firmware Version: 02.S0500 User Capacity: 2,000,398,934,016 bytes ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always 73 5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always 0 9 Power_On_Hours 0x0032 086 086 000 Old_age Always 10762 “The disk with the duplicate serial number needs to be replaced.”

logan-pink:/var/tmp # cs_hal list disks Disks(s): SCSI Device Block Device Enclosure Slot Serial Number SMART —————- —————— —————- —— —————————- —————-/dev/sg6 /dev/sdf /dev/sg1 C01 WMC130020055 GOOD /dev/sg7 /dev/sdg /dev/sg1 A03 Z1F21NXT FAILED: Reallocated_Sector_Count(5)=1288 /dev/sg8 /dev/sdh /dev/sg1 A00 1EV0XEJB FAILED: self-test fail; read element; /dev/sg12 /dev/sdl /dev/sg1 D01 PAGHLBYV GOOD /dev/sg13 /dev/sdm /dev/sg1 D00 PAGHL9TV GOOD /dev/sg19 /dev/sds /dev/sg1 E02 YHGPHUVC FAILED: Reallocated_Sector_Count(5)=1814 /dev/sg20 /dev/sdt /dev/sg1 B03 S1UYJ1LZ109607 GOOD: self-test in progress; 40% done; /dev/sg123 /dev/sddn /dev/sg1 E06 AR31021EG513RC GOOD /dev/sg33 /dev/sdaf /dev/sg1 A06 Z1F1CHC8 GOOD /dev/sg34 /dev/sdag /dev/sg1 A07 Z1F1CJ5X GOOD /dev/sg41 /dev/sdan /dev/sg1 B06 9XW0BALH FAILED: Reallocated_Sector_Count(5)=168 /dev/sg42 /dev/sdao /dev/sg1 A11 9XW094H2 GOOD /dev/sg43 /dev/sdap /dev/sg1 B10 PK133WPAG03L9J FAILED: Offline_Uncorrectable(198)=202 /dev/sg44 /dev/sdaq /dev/sg1 B11 WD-WMC190009962 GOOD /dev/sg51 /dev/sdax /dev/sg1 E11 Z4D01HKQ GOOD /dev/sg89 /dev/sdch /dev/sg62 E04 9WM2DQ0K FAILED: Offline_Uncorrectable(198)=435 /dev/sg64 /dev/sdbi /dev/sg62 A01 YHH6EJ8A FAILED: Reallocated_Sector_Count(5)=1552 /dev/sg65 /dev/sdbj /dev/sg62 A02 YHGVVE8A SUSPECT: Reallocated_Sector_Count(5)=17 /dev/sg68 /dev/sdbm /dev/sg62 A03 YHH5N6MA FAILED: Reallocated_Sector_Count(5)=2005 /dev/sg69 /dev/sdbn /dev/sg62 A00 YHH5GWMA FAILED: Reallocated_Sector_Count(5)=797 /dev/sg71 /dev/sdbp /dev/sg62 A05 YHH6EWJA FAILED: Reallocated_Sector_Count(5)=906 /dev/sg72 /dev/sdbq /dev/sg62 A04 YHH682RA SUSPECT: Reallocated_Sector_Count(5)=15 /dev/sg73 /dev/sdbr /dev/sg62 D01 Z1F1CGH7 SUSPECT: Reallocated_Sector_Count(5)=16 SMART is not that smart, but you can work with it.

Example – Proactive Smarts erik-riedels-macbook-pro:logs er1p$ /dev/sg4 /dev/sdc /dev/sg3 /dev/sg49 /dev/sdav /dev/sg2 /dev/sg45 /dev/sdaq /dev/sg3 /dev/sg6 /dev/sde /dev/sg3 /dev/sg21 /dev/sdt /dev/sg3 /dev/sg32 /dev/sdae /dev/sg3 /dev/sg35 /dev/sdag /dev/sg3 /dev/sg15 /dev/sdn /dev/sg3 /dev/sg58 /dev/sdbd /dev/sg3 cat 2014-/halreport | grep SUSP C00 YVJZ8XRK SUSPECT: Reallocated(5)=99 D10 YVK6378A SUSPECT: Reallocated(5)=35 B10 YVJZW8EA SUSPECT: Reallocated(5)=19 A02 YVK4UJ5A SUSPECT: Reallocated(5)=10 E02 YVJG6X4D SUSPECT: Reallocated(5)=66 C05 YVK25MKA SUSPECT: Reallocated(5)=78 A06 YVJYBDSA SUSPECT: Reallocated(5)=43 D00 YVJB5TAA SUSPECT: Reallocated(5)=42 C07 YVJYRKYA SUSPECT: Reallocated(5)=59 erik-riedels-macbook-pro:logs er1p$ /dev/sg12 /dev/sdl /dev/sg2 /dev/sg60 /dev/sdbk /dev/sg3 /dev/sg37 /dev/sdai /dev/sg2 /dev/sg41 /dev/sdam /dev/sg3 cat 2014-/halreport | grep FAIL A04 YVJZMN3K FAILED: Reallocated(5)=110 E08 YVK2GNRA FAILED: Reallocated(5)=1577 B09 YVJYR8KA FAILED: Reallocated(5)=101 B08 YVJEZT7A FAILED: Reallocated(5)=682 erik-riedels-macbook-pro:logs er1p$ cat 2014-*/halreport | grep GOOD | wc -l 12227 GOOD 12,227 SUSPECT 9 BAD 4

smartctl 5.40 2010-10-16 r3189 [x86_64-unknown-linux-gnu] (local build) === START OF INFORMATION SECTION === Model Family: Hitachi Ultrastar 7K1000 Device Model: HUA721010KLA330 Serial Number: PBHBL6AF User Capacity: 1,000,204,886,016 bytes === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: FAILED! Drive failure expected in less than 24 hours. SAVE ALL DATA. Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0033 Pre-fail Always FAILING_NOW 9 9 Power_On_Hours 0x0012 Old_age Always 13073 197 Current_Pending_Sector 0x0022 Old_age Always 1890 198 Offline_Uncorrectable 0x0008 Old_age Offline 9390 FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: FAILED: Offline_Uncorrectable(198)=435 Offline_Uncorrectable(198)=796 Offline_Uncorrectable(198)=1385 Offline_Uncorrectable(198)=2336 Offline_Uncorrectable(198)=80961 Reallocated_Sector_Count(5)=52472 Reallocated_Sector_Count(5)=797 Reallocated_Sector_Count(5)=906 Reallocated_Sector_Count(5)=1552 Reallocated_Sector_Count(5)=1814 Reallocated_Sector_Count(5)=1818 Reallocated_Sector_Count(5)=1886 Reallocated_Sector_Count(5)=1944 Reallocated_Sector_Count(5)=1999 Reallocated_Sector_Count(5)=2005 Reallocated_Sector_Count(5)=3270 Reallocated_Sector_Count(5)=3809 Reallocated_Sector_Count(5)=4094 Reallocated_Sector_Count(5)=4095 Reallocated_Sector_Count(5)=4281 Reallocated_Sector_Count(5)=5119 keep on running on Even from this “very bad” disk with over 9,000 sector errors; we were able to recover >99% of the data with ddrescue; 9.5 MB out of 1 TB of data was permanently lost, with some difficulty reconstructing directories.

“We moved the cable from eth2 to eth3.” provo-vegas:~ # ip add ls 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: slave-0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master public state UP link/ether 00:1e:67:9f:1b:3e 3: private: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000 link/ether 00:1e:67:69:6d:06 inet 192.168.219.1/24 brd 192.168.219.255 scope global private inet 192.168.219.254/24 brd 192.168.219.255 scope global secondary private 4: unused-0: <BROADCmtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether 00:1e:67:69:6d:07 5: slave-1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master public state UP link/ether 00:1e:67:9f:1b:3e 6: public: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 00:1e:67:9f:1b:3e inet 10.249.250.131/21 brd 10.249.255.255 scope global public 7: private.4@private: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group link/ether 00:1e:67:69:6d:06 inet 169.254.19.1/16 brd 169.254.255.255 scope global private.4 8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default link/ether 56:84:7a:fe:97:99 inet 172.17.42.1/16 scope global docker0

hare turtle (mgmt) rabbit turtle (diag) “We moved the cable from eth2 to eth3.” provo-vegas:~ # ip add ls 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: slave-0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master public state UP link/ether 00:1e:67:9f:1b:3e 3: private: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP group default qlen 1000 link/ether 00:1e:67:69:6d:06 inet 192.168.219.1/24 brd 192.168.219.255 scope global private inet 192.168.219.254/24 brd 192.168.219.255 scope global secondary private 4: unused-0: <BROADCmtu 1500 qdisc mq state DOWN group default qlen 1000 link/ether 00:1e:67:69:6d:07 5: slave-1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master public state UP link/ether 00:1e:67:9f:1b:3e 6: public: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether 00:1e:67:9f:1b:3e inet 10.249.250.131/21 brd 10.249.255.255 scope global public 7: private.4@private: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group link/ether 00:1e:67:69:6d:06 inet 169.254.19.1/16 brd 169.254.255.255 scope global private.4 8: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default link/ether 56:84:7a:fe:97:99 inet 172.17.42.1/16 scope global docker0

provo-vegas:~ # ifconfig docker0 Link encap:Ethernet HWaddr 56:84:7A:FE:97:99 inet addr:172.17.42.1 Bcast:0.0.0.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 “We moved the cable from eth2 to eth3.” lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:65536 Metric:1 private Link encap:Ethernet HWaddr 00:1E:67:69:6D:06 inet addr:192.168.219.1 Bcast:192.168.219.255 Mask:255.255.255.0 inet6 addr: fe80::21e:67ff:fe69:6d06/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1 private.4 Link encap:Ethernet HWaddr 00:1E:67:69:6D:06 inet addr:169.254.19.1 Bcast:169.254.255.255 Mask:255.255.0.0 inet6 addr: fe80::21e:67ff:fe69:6d06/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 public Link encap:Ethernet HWaddr 00:1E:67:9F:1B:3E inet addr:10.249.250.131 Bcast:10.249.255.255 Mask:255.255.248.0 inet6 addr: fe80::21e:67ff:fe9f:1b3e/64 Scope:Link UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1 slave-0 Link encap:Ethernet HWaddr 00:1E:67:9F:1B:3E UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 Link encap:Ethernet HWaddr 00:1E:67:9F:1B:3E UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 slave-1

provo-vegas:~ # lsnet 1 lo 00:00:00:00:00:00 link unknown loopback virtual 2 slave-0 00:1e:67:9f:1b:3e link up ether ixgbe 3 private 00:1e:67:69:6d:06 link up ether igb 4 unused-0 00:1e:67:69:6d:07 nolink down ether igb 5 slave-1 00:1e:67:9f:1b:3e link up ether ixgbe 6 public 00:1e:67:9f:1b:3e link up ether virtual 7 private.4 00:1e:67:69:6d:06 link up ether virtual 8 docker0 56:84:7a:fe:97:99 nolink down ether virtual provo-vegas:~ # ethtool -P slave-0 Permanent address: 00:1e:67:9f:1b:3e provo-vegas:~ # ethtool -P slave-1 Permanent address: 00:1e:67:9f:1b:3f provo-vegas:~ # ipmitool lan print IP Address Source : Static IP Address : 0.0.0.0 MAC Address : 00:1e:67:69:6d:08 provo-vegas:~ # ipmitool lan print 2 IP Address Source : Static IP Address : 0.0.0.0 MAC Address : 00:1e:67:69:6d:09 provo-vegas:~ # ipmitool lan print 3 IP Address Source : DHCP Address IP Address : 10.249.250.121 MAC Address : 00:1e:67:69:6d:0a SNMP Community String : public “We moved the cable from eth2 to eth3.”

Topology Map – getrackinfo provo-apricot:/home/emc # getrackinfo -a Node private Node Public Ip Address Id Status Mac =============== ====== ====== ================= 192.168.219.1 N/A P N/A 192.168.219.2 2 MA 00:1e:67:b5:af:84 192.168.219.3 3 SA 00:1e:67:b5:9f:78 192.168.219.4 N/A P N/A 192.168.219.5 5 SA 00:1e:67:b5:9e:80 192.168.219.6 6 SA 00:1e:67:b5:ad:40 192.168.219.7 N/A P N/A 192.168.219.8 8 SA 00:1e:67:b5:ad:6c 192.168.219.9 9 SA 00:1e:67:b5:b1:e0 192.168.219.10 10 SA 00:1e:67:b5:b2:a8 192.168.219.11 11 SA 00:1e:67:b5:b9:48 192.168.219.12 12 SA 00:1e:67:b5:b2:d0 192.168.219.13 13 SA 00:1e:67:b5:b1:34 192.168.219.14 14 SA 00:1e:67:b5:b1:44 192.168.219.15 15 SA 00:1e:67:b5:b1:30 192.168.219.16 16 SA 00:1e:67:b5:a1:b8 192.168.219.17 17 SA 00:1e:67:b5:b5:a8 192.168.219.18 18 SA 00:1e:67:b5:a1:e8 192.168.219.19 N/A P N/A 192.168.219.20 N/A P N/A 192.168.219.21 N/A P N/A 192.168.219.22 22 SA 00:1e:67:b5:ac:90 192.168.219.23 23 SA 00:1e:67:b5:9f:e4 192.168.219.24 24 SA 00:1e:67:b5:a2:80 Status: M - Master, S - Slave E - Epoxy I - Initializing, U - Updating, A - Active P - On, O - Off Ip Address ================ N/A 10.249.249.72 10.249.249.73 N/A 10.249.249.75 10.249.249.76 N/A 10.249.249.78 10.249.249.91 10.249.249.92 10.249.249.93 10.249.249.94 10.249.249.95 10.249.249.96 10.249.249.97 10.249.249.98 10.249.249.111 10.249.249.112 N/A N/A N/A 10.249.249.116 10.249.249.117 10.249.249.118 RMM Mac ================= N/A 00:1e:67:a3:ad:a4 00:1e:67:a3:af:43 N/A 00:1e:67:6a:1c:e1 00:1e:67:6a:23:fd N/A 00:1e:67:6a:1a:a7 00:1e:67:6a:24:3e 00:1e:67:6a:1c:28 00:1e:67:6a:18:b3 00:1e:67:6a:1b:b5 00:1e:67:6a:04:a4 00:1e:67:6a:0e:6d 00:1e:67:a3:b5:3d 00:1e:67:a3:ae:7b 00:1e:67:a3:b1:05 00:1e:67:a3:ba:0b N/A N/A N/A 00:1e:67:69:ed:b1 00:1e:67:69:ef:cd 00:1e:67:6a:20:2e Ip Address ================ N/A 10.249.249.62 10.249.249.63 N/A 10.249.249.65 10.249.249.66 N/A 10.249.249.68 10.249.249.81 10.249.249.82 10.249.249.83 10.249.249.84 10.249.249.85 10.249.249.86 10.249.249.87 10.249.249.88 10.249.249.101 10.249.249.102 N/A N/A N/A 10.249.249.106 10.249.249.107 10.249.249.108 Node Name ========= N/A sandy-apricot orem-apricot N/A layton-apricot logan-apricot N/A murray-apricot boston-apricot chicago-apricot houston-apricot phoenix-apricot dallas-apricot detroit-apricot columbus-apricot austin-apricot memphis-apricot seattle-apricot N/A N/A N/A atlanta-apricot fresno-apricot mesa-apricot

Topology Map – getrackinfo – details provo-vanilla:~ # getrackinfo -a Node private Node Ip Address Id Status =============== ====== ====== 192.168.219.1 1 MA 192.168.219.2 2 SA 192.168.219.3 3 SA 192.168.219.4 N/A noLink 192.168.219.5 N/A noLink 192.168.219.6 N/A noLink 192.168.219.7 N/A noLink 192.168.219.8 N/A noLink 192.168.219.9 N/A O 192.168.219.10 N/A O 192.168.219.11 N/A O 192.168.219.12 N/A O 192.168.219.13 N/A O 192.168.219.14 N/A O 192.168.219.15 N/A O 192.168.219.16 N/A O 192.168.219.17 N/A noLink 192.168.219.18 N/A noLink 192.168.219.19 N/A noLink 192.168.219.20 N/A noLink 192.168.219.21 N/A noLink 192.168.219.22 N/A noLink 192.168.219.23 N/A noLink 192.168.219.24 N/A noLink Status: M - Master, S - Slave E – Epoxy I - Initializing, U - Updating, A P - On, O - Off Public RMM Mac Ip Address Mac Ip Address Node Name ================= ================= ================= ================= ========= 00:1e:67:9f:01:96 10.249.248.111 00:1e:67:69:29:8f 10.249.248.101 provo-vanilla 00:1e:67:9f:01:a2 10.249.248.112 00:1e:67:69:28:72 10.249.248.102 sandy-vanilla 00:1e:67:9e:ff:9e 10.249.248.113 00:1e:67:69:29:99 10.249.248.103 orem-vanilla N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A provo-vanilla:~ # getrackinfo -v N/A N/A N/A N/A N/A N/A N/A N/A N/A =====================================================N/A N/A N/A N/A N/A N/A NodeNameN/A : N/A provo-vanilla N/A N/A N/A Node Id N/A : N/A 1 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A Interfaces(MAC & IP) N/A N/A N/A N/A N/A —————————-N/A N/A N/A N/A N/A N/A N/A N/A public N/A : N/A 00:1e:67:9f:01:96 10.249.248.111/21 N/A N/A N/A N/A N/A private N/A : N/A 00:1e:67:69:29:8b 192.168.219.1/24 N/A N/A N/A private N/A ipmi : N/A 00:1e:67:69:29:8d 192.168.219.101/24 N/A N/A N/A N/A N/A N/A N/A private.4(NAN) : N/A 00:1e:67:69:29:8b 169.254.186.1/16 N/A N/A N/A N/A N/A remote ipmi : N/A 00:1e:67:69:29:8f 10.249.248.101/21 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A Network Services Active ———————-NTP Configuration: server: 10.254.140.21 10.254.140.22 DNS Configuration: domain: search: sea.lab.emc.com corp.emc.com emc.com server: 192.168.219.254 10.6.149.11

silver-is1-004:~ % cs_hal info node Node : silver-is1-004 BIOS date : 06/20/2012 BIOS version : SE5C600.86B.01.03.0002.062020121504 Board model : S2600JF Board S/N : QSJP23007313 Board vendor : Intel Corporation Board version : G28033-506 Chassis S/N : FC6ND131900019 Chassis vendor : ………………………… Chassis model : S2600JF System S/N : FC6AT131900005 Processor count : 8 Total memory : 23.0433GB Available memory : 17.7322GB Total swap : 2GB Available swap : 2GB Shared memory : 0GB Host adapter count : 2 Net interface count : 4 Enclosure count : 1 External disk count : 60 But wait, there’s MORE – BIOS, BIOS settings, BMC firmware, BMC settings, power supply firmware, fan firmware, HBA firmware, HBA NVDATA, enclosure firmware, enclosure power supply firmware, enclosure fan firmware, … Release Notes: “This BIOS update fixes a problem where BIOS update fails 20%-40% of the time.”

What We Did ¨ kept it simple, took control ¤ no hardware RAID; no databases; no events (poll) ¤ sg, sd, md, dm, lvm, fs (ext3, ext4, xfs, btrfs) ¨ built a library – HAL – hardware abstraction layer ¤ common ¨ library for our app-level services to use built some tools – cs-hal ¤ cs-hal ¤ cs-hal ¤ cs-hal ¤ cs-hal ¤ cs-hal (for Support to use) list disks list fs info sg27 led Z1Z0EVBF blink led sg27 blink

Biggest Take-Aways ¨ ¨ when you design a solution for a single machine … … think about the poor sap who has to ¤ diagnose 200 nodes/machines – real machines ¤ … 12,000 drives – real drives ¤ … 12,000 file systems (or more) – w/ real customer data ¤ … from 5,000 miles away – real miles ¤ … in the middle of the night – really dark ¤ … all week long ¤ … especially on Friday nights

Other Conclusions ¨ ¨ most Linux + tools developers don’t have 50+ disks on their systems where did /dev/sddh come from? device briefly offline => new dev!! ¤ hardware comes & goes, the software stays the same ¤ ¨ disks don’t have file descriptors ¤ ¨ ¨ ¨ sg, sd, md, dm, lvm, fs (ext3, ext4, xfs, btrfs) SATA disks are big & cheap and all, but can be a bit “unruly”… temporary disconnects hardware RAID is yucky databases are often stale ¤ trust, but verify => don’t trust, skip right to verify

Containers ¨ http://www.spreadshirt.com/ouat-quote-all-magic-comes-with-a-price All magic comes with a price

Build on 20 Years of Storage Research ¨ APIs vs. mount points – “no slashes required” ¤ blocks ¨ vs. files vs. objects vs. “APIs” App-driven and policy-automated ¤ self-configuring, ¨ self-organizing, self-tuning, self-* Built in data services ¤ self-healing ¨ RAID Unlimited namespace, dynamic ¤ billions ¨ GUI and billions of objects, large and small Native multi-tenancy ¤ security/auth, monitoring, resource isolation /

Example – DAE reconnects Jul 1 21:37:37 localhost kernel: mptbase ioc0 LogInfo(0x31130000) Code={IO Not Yet Executed}, SubCode(0x0000) Jul 1 23:50:06 localhost kernel: mptbase ioc1 LogInfo(0x31112000) Code={Reset}, SubCode(0x2000) Jul 1 23:50:09 localhost kernel: mptbase ioc1 LogInfo(0x31112000) Code={Reset}, SubCode(0x2000) Jul 1 23:50:12 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY4897042 Jul 1 23:50:12 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY5192630 Jul 1 23:50:13 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY5186052 Jul 1 23:50:14 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY3550485 Jul 1 23:50:14 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY360702 (…all 60 disks…) Jul 1 23:50:15 20xx : ERROR : DAE Event : DAE (device path: /dev/sg66) lost. : Serial NO: , Device path: /dev/sg66, Device ID: 5000097a780747be Jul 1 23:50:15 20xx : WARNING : Disk Event : Disk is moved to DAE: Slot ID: 0 : Serial NO: WCAVY5349410 Jul 1 23:51:14 20xx : INFO : DAE Event : New DAE (device path: /dev/sg66) is added. : Serial NO: , Device path: /dev/sg66, Device ID: 5000097a780747be Jul 1 23:51:14 20xx : WARNING : Disk Event : Disk is moved to DAE: 5f4ad992-724e-48af-8cac-a68b7d859593 Slot ID: 11 : Serial NO: WCAVY5182031 , Device path: /dev/sdaq, Slot ID: Jul 1 23:51:14 20xx : WARNING : Disk Event : Disk is moved to DAE: 5f4ad992-724e-48af-8cac-a68b7d859593 Slot ID: 13 : Serial NO: WCAVY5186052 , Device path: /dev/sdas, Slot ID: (…all 60 disks…) Jul 1 23:51:16 20xx : WARNING : Disk Event : Disk is moved to DAE: e70905ad-5736-48d9-8a1b-a15a2d116825 Slot ID: 4 : Serial NO: WCAVY5349410 , Device path: /dev/sday, Slot ID: (outage ends, log ends) Reset on the SAS/SATA bus (expander), enclosure identifiers reassigned to “<NULL>”; enclosure returns after 68 seconds, disks are assigned back where they belong. Entire episode lasts 70 seconds. BUT system management database remembers this “event” for weeks.

HAL – disk view (15 drive node) dino-black:~ % cs_hal list disks Disks(s): SCSI Device Block Device Enclosure —————- —————— —————n/a /dev/sda RAID vol /dev/sg0 n/a RAID array /dev/sg1 n/a RAID array /dev/sg3 /dev/sdb /dev/sg18 /dev/sg4 /dev/sdc /dev/sg18 /dev/sg5 /dev/sdd /dev/sg18 /dev/sg6 /dev/sde /dev/sg18 /dev/sg7 /dev/sdf /dev/sg18 /dev/sg8 /dev/sdg /dev/sg18 /dev/sg9 /dev/sdh /dev/sg18 /dev/sg10 /dev/sdi /dev/sg18 /dev/sg11 /dev/sdj /dev/sg18 /dev/sg12 /dev/sdk /dev/sg18 /dev/sg13 /dev/sdl /dev/sg18 /dev/sg14 /dev/sdm /dev/sg18 /dev/sg15 /dev/sdn /dev/sg18 /dev/sg16 /dev/sdo /dev/sg18 /dev/sg17 /dev/sdp /dev/sg18 RAID array: 2 external: 15 total disks: 17 Slot —-n/a 0 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Serial Number —————————not supported 9QE801ME 9QE834TG 9WM0R49P 9WM0R48T 9WM0R3Z4 9WM0R4VK 9WM0RF21 9WM0R44B 9WM0R3E0 9WM0RF2X 9WM0R4TX 9WM0REHK 9WM0R3EW 9WM0R4GY 9WM0R4NZ 9WM0RF42 9WM0R3AS SMART Status —————-n/a GOOD GOOD GOOD GOOD GOOD SUSPECT: GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD Reallocated(5)=19

HAL – filesystem view (15 drive node) dino-black:~ % cs_hal list fs Volume(s): SCSI Device Block Device FS UUID —————- —————— ——————————————————/dev/sg2 /dev/sda 0ddb9635-ff27-4cd3-8c2f-58a6f5226d30 /dev/sg2 /dev/sda 2192b3ef-2a44-4450-9b04-327c00215454 /dev/sg2 /dev/sda ffa9607a-4b6f-4218-9266-c083fb1989a1 /dev/sg2 /dev/sda 746b09d4-f07a-49dc-8b40-86220dfc7edc /dev/sg2 /dev/sda f7c37c92-4bc5-4abf-95a5-efa51c46f6bc /dev/sg3 /dev/sdb 90a52650-e0f3-49e4-810b-a505cdcadb51 /dev/sg4 /dev/sdc 173aef8b-80e9-4be2-a510-3b88d3343f8a /dev/sg5 /dev/sdd bcfb1897-152b-482b-bde6-de9665ad7c51 /dev/sg6 /dev/sde bc6946ae-770f-4621-9ea5-f2d1e5ec0f28 /dev/sg7 /dev/sdf 52446742-a566-4036-8b0c-5cd7901474f0 /dev/sg8 /dev/sdg c9ee0971-d8dc-4621-8958-d79890d0f590 /dev/sg9 /dev/sdh 294bcd25-ab19-40ee-8c03-cd71e94e9e06 /dev/sg10 /dev/sdi cb5cac6c-1cdf-49ec-8754-a475db3d4afd /dev/sg11 /dev/sdj 91739495-2a46-47d2-8676-d8b4b3f8fd76 /dev/sg12 /dev/sdk 9f2a0ae1-d97b-4fb1-873e-6a9bfb2c3254 /dev/sg13 /dev/sdl 404a8c5a-19c0-4949-bd33-edd83ca4ee8f /dev/sg14 /dev/sdm da36046f-41f7-46d4-bcaa-af183002b792 /dev/sg15 /dev/sdn a71b6937-8ae5-4a37-96d0-78feeb0e62c4 /dev/sg16 /dev/sdo 34d6f5c5-1f5d-4cea-af5a-af157324aee8 /dev/sg17 /dev/sdp 9cc59415-cab5-4456-881f-a0c533e1823d total: 21 Type ————ext3 xfs xfs xfs swap v1 xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs Slot Label SMART Mount Point ——- ——- ————————BOOT GOOD /boot GOOD /root2 GOOD /var GOOD / GOOD 0 GOOD /data-disks/ss-90a52650-e0f3-49e4-810b-a505cdcadb51 1 GOOD /data-disks/ss-173aef8b-80e9-4be2-a510-3b88d3343f8a 2 GOOD /data-disks/ss-bcfb1897-152b-482b-bde6-de9665ad7c51 3 SUSPECT /data-disks/ss-bc6946ae-770f-4621-9ea5-f2d1e5ec0f28 4 GOOD /data-disks/ss-52446742-a566-4036-8b0c-5cd7901474f0 5 GOOD /data-disks/ss-c9ee0971-d8dc-4621-8958-d79890d0f590 6 GOOD /meta/294bcd25-ab19-40ee-8c03-cd71e94e9e06 7 GOOD /data-disks/ss-cb5cac6c-1cdf-49ec-8754-a475db3d4afd 8 GOOD /data-disks/ss-91739495-2a46-47d2-8676-d8b4b3f8fd76 9 GOOD /data-disks/ss-9f2a0ae1-d97b-4fb1-873e-6a9bfb2c3254 10 GOOD /meta/404a8c5a-19c0-4949-bd33-edd83ca4ee8f 11 GOOD /data-disks/ss-da36046f-41f7-46d4-bcaa-af183002b792 12 GOOD /data-disks/ss-a71b6937-8ae5-4a37-96d0-78feeb0e62c4 13 GOOD /meta/34d6f5c5-1f5d-4cea-af5a-af157324aee8 14 GOOD /data-disks/ss-9cc59415-cab5-4456-881f-a0c533e1823d

list disks Enclosure —————RAID vol RAID array RAID array /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 /dev/sg2 Slot —-n/a 1 0 C04 D04 E05 E04 B05 C05 D05 C00 A01 A02 B00 C01 A03 A00 B01 A05 A04 D01 D00 C02 D02 E00 B02 E01 Serial Number —————————not supported PQKJGZNB PQKHYT9B WMAW30330711 WMAW30130282 WMAW30331465 WMAW30400512 WMAW30330840 WMAW30283365 WMAW30331280 WMAW30330725 WMAW30330535 WMAW30330800 WMAW30331330 WMAW30128826 WMAW30199450 WMAW30103257 WMAW30331487 WMAW30327185 WMAW30327102 WMAW30330859 WMAW30331130 WMAW30331192 WMAW30307529 WMAW30196937 WMAW30331240 WCAW32612222 SMART Status —————-n/a GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD HAL – disk view (60 drive node) layton-copper:~ % cs_hal Disks(s): SCSI Device Block Device —————- —————-n/a /dev/md126 /dev/sg1 n/a /dev/sg0 n/a /dev/sg26 /dev/sdz /dev/sg27 /dev/sdaa /dev/sg28 /dev/sdab /dev/sg29 /dev/sdac /dev/sg30 /dev/sdad /dev/sg31 /dev/sdae /dev/sg32 /dev/sdaf /dev/sg3 /dev/sdc /dev/sg4 /dev/sdd /dev/sg5 /dev/sde /dev/sg6 /dev/sdf /dev/sg7 /dev/sdg /dev/sg8 /dev/sdh /dev/sg9 /dev/sdi /dev/sg10 /dev/sdj /dev/sg11 /dev/sdk /dev/sg12 /dev/sdl /dev/sg13 /dev/sdm /dev/sg14 /dev/sdn /dev/sg15 /dev/sdo /dev/sg16 /dev/sdp /dev/sg17 /dev/sdq /dev/sg18 /dev/sdr /dev/sg19 /dev/sds

list fs FS UUID ——————————————————6cf8c9cb-c0c9-498c-ab3f-28140dd66f09 6cf8c9cb-c0c9-498c-ab3f-28140dd66f09 c198e38d-41a1-4263-b46a-39bbdc8ed89c 3429b68b-f599-4679-991a-5b98549b2431 1fccea68-439f-4a8e-be55-a81fd17774bf e520b436-35ef-40d1-bd3b-d6d42957bc41 12c13240-2957-4b7b-b628-df870a6fbd3b 7e00293c-1069-45c0-bc4e-2f7c7cd52a7b 7dec91ad-4985-4ce5-898c-fe491d5818af 05705250-0a35-4618-95da-64d0632395fc 05b98c0c-c77e-4a90-bcec-e5874cf89988 42d87a05-4f8e-4375-8547-909f597fdaf5 eb8657cc-b681-4698-805c-86fbd82fbccc 1c15a217-418e-48e6-85a2-cb058c63a26f cd762f32-19c6-46f0-919d-bdde85261d98 f29d89c8-c0c7-4ec3-9645-de1d58b2a1cd bc18fc92-9676-48e4-817c-47b10df3ee7a d6f8f279-fc48-466c-9db0-ec41064e0b9e 8a38f4b7-bf8c-47fe-a99c-d31fe53b6d1e 55ceca7a-8df1-4eb5-a5b3-003a4fa68c36 40d95e6d-b410-4f3b-bbcb-15f163b63486 a865b961-4406-4bd8-91ab-4be9d446712e 04e94a2a-c01a-4e06-bbe9-41da0ef1a293 1d9051a7-fe09-4b98-bae1-4385bb1ee08c 9a9f43d7-920b-4197-b388-e9a85b953f4b 4b00c0fb-5bb7-4bfe-af6d-c4fba1721db6 ff2d72f8-49aa-4983-a666-b8702fee6916 e04bf3c3-cff8-4316-af77-d1e49a0b26cd d92bca38-296b-45c1-8291-256eebe2b764 852bf5d8-a06a-4df8-804e-635364abb7d9 c19c43d2-f084-4d65-8a63-ec40c90f6e54 4af383d9-71a6-4324-84d6-d2e854900a71 c8343213-f695-4e9b-92c0-106787ea0f40 afc73d9c-1a89-4a62-8536-4410899818ec 99fb488c-7689-4adc-aa13-7af8d5cd91ba 27b3025b-c3f2-4016-8094-c7eeb355f7d4 6660e770-c8fb-46fd-a628-6c485e20ebc0 80ddb764-8337-4ef1-9a0d-e6f66405537f e0614cdd-0662-4845-9c31-ebd93121117e c45cf761-4630-4076-99f5-fe5bbc1eb664 ad9157f0-6382-46fa-899c-5439d84ac64d 5b1d8019-afae-4cdc-9d6c-ccc66c764cc8 0a73ec0d-087d-413f-9cfd-adaf952467a8 abb4d427-f891-4af4-a79a-5795a5c2f1d1 ff4a6afd-12f2-42cd-8efb-e49d691c0b9d 69a19693-609e-4d5e-8482-6de57fa5946e 442e5f89-c528-46fe-8b5a-6a6b01ccf359 8d9052ab-0d4c-4fc5-92ea-e128318d0c21 04bed093-5748-44d6-a9a0-6e9efee05dac a3554dea-8043-43cc-804d-4460860a69f7 a5eab0f3-4780-46fb-a0e2-f363f0f842f3 af4815c2-3ae8-4787-bb90-abc9a8cac8a9 2ffee3bd-866d-432e-ae7d-d7e4b264fea7 87beea7d-0d01-4418-b120-0b83b6edac81 5614a615-fca3-4ab8-8e1f-7e7ddfa9fe0a f1148778-f1bd-45c0-9dd1-bafd6c5ffcad 31aa9f31-c6af-4370-be8c-4726b31341ac 555804d0-4a2a-488e-a92f-be55aa61da37 9d1fe14a-9b03-4918-ab80-febbc960cf9e Type ————ext3 ext3 xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs xfs Slot ——0 1 C04 D04 E05 E04 B05 C05 D05 C00 A01 A02 B00 C01 A03 A00 B01 A05 A04 D01 D00 C02 D02 E00 B02 E01 E02 B03 D03 C03 E03 B04 C07 E06 E08 D06 C06 D07 E07 A06 A07 B09 A08 A09 A10 B08 B07 B06 A11 B10 B11 C11 D11 C10 D10 C09 D09 E11 E10 Label ——————-BOOT BOOT HAL – filesystem view (60 drive node) layton-copper:~ % cs_hal Volume(s): SCSI Device Block Device —————- —————-/dev/sg0 /dev/sda /dev/sg1 /dev/sdb /dev/sg26 /dev/sdz /dev/sg27 /dev/sdaa /dev/sg28 /dev/sdab /dev/sg29 /dev/sdac /dev/sg30 /dev/sdad /dev/sg31 /dev/sdae /dev/sg32 /dev/sdaf /dev/sg3 /dev/sdc /dev/sg4 /dev/sdd /dev/sg5 /dev/sde /dev/sg6 /dev/sdf /dev/sg7 /dev/sdg /dev/sg8 /dev/sdh /dev/sg9 /dev/sdi /dev/sg10 /dev/sdj /dev/sg11 /dev/sdk /dev/sg12 /dev/sdl /dev/sg13 /dev/sdm /dev/sg14 /dev/sdn /dev/sg15 /dev/sdo /dev/sg16 /dev/sdp /dev/sg17 /dev/sdq /dev/sg18 /dev/sdr /dev/sg19 /dev/sds /dev/sg20 /dev/sdt /dev/sg21 /dev/sdu /dev/sg22 /dev/sdv /dev/sg23 /dev/sdw /dev/sg24 /dev/sdx /dev/sg25 /dev/sdy /dev/sg57 /dev/sdbd /dev/sg58 /dev/sdbe /dev/sg59 /dev/sdbf /dev/sg60 /dev/sdbg /dev/sg61 /dev/sdbh /dev/sg62 /dev/sdbi /dev/sg63 /dev/sdbj /dev/sg34 /dev/sdag /dev/sg35 /dev/sdah /dev/sg36 /dev/sdai /dev/sg37 /dev/sdaj /dev/sg38 /dev/sdak /dev/sg39 /dev/sdal /dev/sg40 /dev/sdam /dev/sg41 /dev/sdan /dev/sg42 /dev/sdao /dev/sg43 /dev/sdap /dev/sg44 /dev/sdaq /dev/sg45 /dev/sdar /dev/sg46 /dev/sdas /dev/sg47 /dev/sdat /dev/sg48 /dev/sdau /dev/sg49 /dev/sdav /dev/sg50 /dev/sdaw /dev/sg51 /dev/sdax /dev/sg52 /dev/sday /dev/sg53 /dev/sdaz SMART ———GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD GOOD Mount Point —————-/data-disks/ss-c198e38d-41a1-4263-b46a-39bbdc8ed89c /meta/3429b68b-f599-4679-991a-5b98549b2431 /meta/1fccea68-439f-4a8e-be55-a81fd17774bf /data-disks/ss-e520b436-35ef-40d1-bd3b-d6d42957bc4 /meta/12c13240-2957-4b7b-b628-df870a6fbd3b /meta/7e00293c-1069-45c0-bc4e-2f7c7cd52a7b /meta/7dec91ad-4985-4ce5-898c-fe491d5818af /data-disks/ss-05705250-0a35-4618-95da-64d0632395fc /data-disks/ss-05b98c0c-c77e-4a90-bcec-e5874cf89988 /data-disks/ss-42d87a05-4f8e-4375-8547-909f597fdaf5 /data-disks/ss-eb8657cc-b681-4698-805c-86fbd82fbccc /data-disks/ss-1c15a217-418e-48e6-85a2-cb058c63a26f /data-disks/ss-cd762f32-19c6-46f0-919d-bdde85261d98 /data-disks/ss-f29d89c8-c0c7-4ec3-9645-de1d58b2a1cd /data-disks/ss-bc18fc92-9676-48e4-817c-47b10df3ee7a /data-disks/ss-d6f8f279-fc48-466c-9db0-ec41064e0b9e /data-disks/ss-8a38f4b7-bf8c-47fe-a99c-d31fe53b6d1e /data-disks/ss-55ceca7a-8df1-4eb5-a5b3-003a4fa68c36 /data-disks/ss-40d95e6d-b410-4f3b-bbcb-15f163b63486 /data-disks/ss-a865b961-4406-4bd8-91ab-4be9d446712e /data-disks/ss-04e94a2a-c01a-4e06-bbe9-41da0ef1a293 /data-disks/ss-1d9051a7-fe09-4b98-bae1-4385bb1ee08c /data-disks/ss-9a9f43d7-920b-4197-b388-e9a85b953f4b /data-disks/ss-4b00c0fb-5bb7-4bfe-af6d-c4fba1721db6 /data-disks/ss-ff2d72f8-49aa-4983-a666-b8702fee6916 /data-disks/ss-e04bf3c3-cff8-4316-af77-d1e49a0b26cd /data-disks/ss-d92bca38-296b-45c1-8291-256eebe2b764 /data-disks/ss-852bf5d8-a06a-4df8-804e-635364abb7d9 /data-disks/ss-c19c43d2-f084-4d65-8a63-ec40c90f6e54 /data-disks/ss-4af383d9-71a6-4324-84d6-d2e854900a71 /data-disks/ss-c8343213-f695-4e9b-92c0-106787ea0f40 /data-disks/ss-afc73d9c-1a89-4a62-8536-4410899818ec /data-disks/ss-99fb488c-7689-4adc-aa13-7af8d5cd91ba /data-disks/ss-27b3025b-c3f2-4016-8094-c7eeb355f7d4 /data-disks/ss-6660e770-c8fb-46fd-a628-6c485e20ebc0 /data-disks/ss-80ddb764-8337-4ef1-9a0d-e6f66405537f /data-disks/ss-e0614cdd-0662-4845-9c31-ebd93121117e /data-disks/ss-c45cf761-4630-4076-99f5-fe5bbc1eb664 /data-disks/ss-ad9157f0-6382-46fa-899c-5439d84ac64d /meta/5b1d8019-afae-4cdc-9d6c-ccc66c764cc8 /meta/0a73ec0d-087d-413f-9cfd-adaf952467a8 /data-disks/ss-abb4d427-f891-4af4-a79a-5795a5c2f1d1 /meta/ff4a6afd-12f2-42cd-8efb-e49d691c0b9d /meta/69a19693-609e-4d5e-8482-6de57fa5946e /meta/442e5f89-c528-46fe-8b5a-6a6b01ccf359 /meta/8d9052ab-0d4c-4fc5-92ea-e128318d0c21 /data-disks/ss-04bed093-5748-44d6-a9a0-6e9efee05dac /data-disks/ss-a3554dea-8043-43cc-804d-4460860a69f7 /data-disks/ss-a5eab0f3-4780-46fb-a0e2-f363f0f842f3 /data-disks/ss-af4815c2-3ae8-4787-bb90-abc9a8cac8a9 /data-disks/ss-2ffee3bd-866d-432e-ae7d-d7e4b264fea7 /data-disks/ss-87beea7d-0d01-4418-b120-0b83b6edac81 /data-disks/ss-5614a615-fca3-4ab8-8e1f-7e7ddfa9fe0a /data-disks/ss-f1148778-f1bd-45c0-9dd1-bafd6c5ffcad /meta/31aa9f31-c6af-4370-be8c-4726b31341ac /data-disks/ss-555804d0-4a2a-488e-a92f-be55aa61da37 /data-disks/ss-9d1fe14a-9b03-4918-ab80-febbc960cf9e

HAL – sensors silver-is1-004:~ % cs_hal sensors all Entity Type ————Power Dist Power Unit Power Dist Power Unit System Chassis Chassis Intrusion System Board SEL Disabled System Board System Event System Board Button/Switch I/O Module Module/Board System Board Mgmt Subsys Health System Chassis Other Units-based System Board Temperature System Board Temperature System Board Temperature System Board Temperature System Board Temperature System Board Temperature Front Panel Temperature Drive Backplane Temperature Front Panel Temperature Cooling Device Fan Cooling Device Fan Cooling Device Fan Cooling Device Fan Cooling Device Fan Cooling Device Fan Power Supply PSU Power Supply PSU Power Supply Other Units-based Power Supply Other Units-based Power Supply Current Power Supply Current Power Supply Temperature Power Supply Temperature Processor Processor Processor Processor Label ——Pwr Unit Status Pwr Unit Redund Physical Scrty System Event Log System Event Button IO Mod Presence BMC Health System Airflow BB Inlet Temp SSB Temp BB BMC Temp P1 VR Temp IB QDR Temp Exit Air Temp IOM Temp HSBP PSOC LAN NIC Temp Sys Fan 1A Sys Fan 1B Sys Fan 2A Sys Fan 2B Sys Fan 3A Sys Fan 3B PS1 Status PS2 Status PS1 Input Power PS2 Input Power PS1 Curr Out % PS2 Curr Out % PS1 Temperature PS2 Temperature P1 Status P2 Status Status ——OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK OK Info ——OK; extra info unimplemented; actual: [c0 00 00] fully redundant; OK; extra info unimplemented; OK; extra info unimplemented; OK; extra info unimplemented; OK; extra info unimplemented; OK; extra info unimplemented; 12 CFM 33 Degrees Celsius 63 Degrees Celsius 53 Degrees Celsius 39 Degrees Celsius 48 Degrees Celsius 53 Degrees Celsius 40 Degrees Celsius 40 Degrees Celsius 67 Degrees Celsius 7387 RPM 7482 RPM 7387 RPM 7654 RPM 7387 RPM 7396 RPM actual: actual: actual: actual: actual: [c0 [c0 [c0 [c0 [c0 04 00 00 02 00 00] 00] 00] 00] 00] 224 Watts 196 Watts 17 Unspecified 14 Unspecified 35 Degrees Celsius 36 Degrees Celsius OK; extra info unimplemented; actual: [c0 80 00] OK; extra info unimplemented; actual: [c0 80 00]

References – Failures ¨ “Are Disks the Dominant Contributor for Storage Failures?” ¤ System-level failures http://www.usenix.org/events/fast08/tech/jiang.html Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou (UIUC), Arkady Kenevsky (NetApp) ¤ Additional related studies ¤ n n ¨ http://www.usenix.org/events/fast08/tech/bairavasundaram.html http://www.usenix.org/events/fast08/tech/krioukov.html Google & CMU field reliability studies ¤ ¤ http://www.usenix.org/events/fast07/tech/pinheiro.html http://www.usenix.org/event /fast07/tech/schroeder/schroeder.pdf

References – Designing for Failure @ Scale ¨ Advice (LADIS 2009 workshop) ¤ ¤ ¤ ¤ advice from Amazon - http://bit.ly/iDebZX experience sharing from Google - http://bit.ly/mcvppe from Microsoft - http://bit.ly/ixCh8i - and a number of others http://bit.ly/jJ2VgW The key take-away from Marvin’s Amazon talk was the call for simplicity: n n “It’s 4AM, the clock is ticking, you have 52 minutes to solve problem, can you debug it?” (52 minutes is the allowed yearly downtime at “4 9s” availability – Support calls you at 4am, how many minutes will it take for you to explain what the system is supposed to do, before they can begin to debug and fix it. If it takes 20 minutes to explain the design, you’re down to 30 minutes left to fix what’s wrong. And then nothing else can go wrong until next year.)

erik riedel
@er1p

1 / 57

This talk outlines some of the complexity challenges faced by devs (at their desks) and ops personnel (in the data centers, 6 months later) when trying to design for and then diagnose a widely distributed storage system subject to the slings & arrows of outrageous fortune. A modest sized system with 50 disks per node and 500 nodes has 25,000 disk drives; 30,000 file systems (when everything is working fine); 100 billion files; 1 million open file descriptors (when fine); 10 million hourly log messages (when fine, 1 billion when not). The layering in the Linux storage stack (sata, sas, ses, sg, sd, dm, lvm, fs, etc) is great when trying to find a creative solution to a single-node storage setup, but can be a real pain when trying to diagnose what is going wrong at these scales. We’ll outline how we’ve attacked the problem so far, and where we still daily feel the pain.

When Bad Things Happen To Good Disks - aka Disks Don’t Have File Descriptors

Link for this presentation:

HTML code for embedding:

Share on social media: