A presentation at DevOpsDays Boston 2017 in Boston, MA, USA by Leon Fayer
lost art of troubleshooting Leon Fayer http://fayerplay.com @papa_fire
{me} 20+ years breaking & fixing dev, architect, [devops] vp @ OmniTI fix other people’s @papa_fire
MID ATLANTIC DEV CON Cloud — Mobile — Web — Dev Thanks to Our Sponsors @papa_fire
questions & comments are welcome! https://joind.in/talk/94a2c @papa_fire
why troubleshooting? @papa_fire
cloud ruined everything it really did @papa_fire
"when in doubt - reboot" (1998): most reliable way to fix Windows problems
"destroy and rebuild" (2018): DevOps mantra for managing cloud-based systems
old McDonald had a farm
old McDonald lost a farm due to mad cow disease
troubleshooting - a form of problem solving @papa_fire
problem solving - ability to fix things that you know nothing about @papa_fire
why is problem solving important? @papa_fire
… because systems are complex @papa_fire
… because of Murphy’s law @papa_fire
… because someone is always watching @papa_fire
your green field coding skills are about as useful as your Renaissance Art degree @mipsytipsy @papa_fire
{disclaimer} @papa_fire
wishful thinking @papa_fire
reality @papa_fire
where to begin? @papa_fire
replicate @papa_fire
bob: I fixed it
me: how do you know?
bob: it ran fine after the fix
me: did it run before the fix?
bob: … @papa_fire
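The exchange above boils down to a two-sided check: the failure has to be replicated before the fix and be gone after it. A minimal sketch of that idea follows; the two `parse_quantity_*` functions and the input `"1,000"` are hypothetical stand-ins, not anything from the talk.

```python
def parse_quantity_before(text: str) -> int:
    # Hypothetical buggy version: chokes on thousands separators.
    return int(text)


def parse_quantity_after(text: str) -> int:
    # Hypothetical fix: drop the separators before converting.
    return int(text.replace(",", ""))


def fails(func, failing_input: str) -> bool:
    """Return True if func raises ValueError on the reported input."""
    try:
        func(failing_input)
    except ValueError:
        return True
    return False


def fix_is_verified(before, after, failing_input: str) -> bool:
    """A fix counts only if the failure was replicated before it and is gone after it."""
    return fails(before, failing_input) and not fails(after, failing_input)


print(fix_is_verified(parse_quantity_before, parse_quantity_after, "1,000"))
```

If the "before" version never fails on the reported input, the check returns False: the problem was not replicated, so a clean run after the change proves nothing.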
isolate @papa_fire
Logins aren’t working 100% of the time, alerts are going off periodically, and today the system didn’t send the scheduled emails @papa_fire
fix? @papa_fire
what’s the problem? it’s broken! @papa_fire
understanding
understand problem @papa_fire
“ we can’t support 100s req/min we need to scale better! @papa_fire
“ we can’t support 100s req/min we need to scale better! improve performance @papa_fire
performance problem @papa_fire
perceived problem @papa_fire
actual problem @papa_fire
understand business @papa_fire
“ I don’t give a **** if the datacenter is on fire as long as I am still making money @papa_fire
what does it mean to you? @papa_fire
sales @papa_fire
content @papa_fire
ad revenue content @papa_fire
every technical decision powers a business need @papa_fire
i don’t get paid for this @papa_fire
i get paid for this @papa_fire
understand impact @papa_fire
is there a lesser of two evils?
sometimes breaking = fixing @papa_fire
time is money @papa_fire
MTTR @papa_fire
80% now > 100% tomorrow @papa_fire
incremental improvements @papa_fire
anatomy of a problem @papa_fire
[diagram: norm, problem, acceptable norm, fix] @papa_fire
MTTR @papa_fire
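MTTR is commonly read as mean time to repair (or recovery): the total time spent restoring service divided by the number of incidents. A minimal sketch, assuming each incident is recorded as a (started, restored) pair; the incident times below are hypothetical, and whether "restored" means back to norm or back to an acceptable norm is a choice the team has to make.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """incidents: list of (started, restored) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incidents, not taken from the talk.
INCIDENTS = [
    (datetime(2017, 9, 1, 10, 0), datetime(2017, 9, 1, 10, 30)),
    (datetime(2017, 9, 8, 14, 0), datetime(2017, 9, 8, 15, 30)),
]

print(mean_time_to_repair(INCIDENTS))
```

Stopping the clock at "acceptable" rather than "perfect" shortens every interval in the list, which is one way to read "80% now > 100% tomorrow".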
what have we learned? understanding of what’s important, cause and effect, largest impact, acceptable risk @papa_fire
what not to do @papa_fire
don’t assume @papa_fire
Spurious Correlations: http://www.tylervigen.com/spurious-correlations @papa_fire
don’t trust errors @papa_fire
your birthdate has changed @papa_fire
it’s not documented @papa_fire
I didn’t build it @papa_fire
it passed all the tests @papa_fire
everything looks right @papa_fire
don’t give up @papa_fire
solve the problem don’t feed your ego @papa_fire
ask for help @papa_fire
tools @papa_fire
logging monitoring profiling @papa_fire
logging: actionable, concise, parsable @papa_fire
log levels: trace, debug, info, notice, warn, error, fatal @papa_fire
production log levels: trace, debug, info, notice, warn, error, fatal @papa_fire
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 16:46:31] AbandonedReservation successfully enqueued.
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Parsed args
[2017-02-01 18:57:02] Posting to API
[2017-02-01 18:57:02] Initializing args
[2017-02-01 18:57:02] Loading reservation_form_data
[2017-02-01 18:57:03] Reservation Form Data loaded successfully
[2017-02-01 18:57:03] Appending campaign info
[2017-02-01 18:57:03] Reservation name: [Some very very very long name]
[2017-02-01 18:57:03] Using code = SUPERHERO:20171007 for this instance.
[2017-02-01 18:57:03] Setting currency to US Dollar
[2017-02-01 18:57:03] Appending marketing info
[2017-02-01 18:57:03] Have a non-sku source_code
[2017-02-01 18:57:03] Marketing: setting campaignid = GOOGLE
[2017-02-01 18:57:03] Appending match rule = Match Rule
[2017-02-01 18:57:03] Appending user info
[2017-02-01 18:57:03] Appending order info
[2017-02-01 18:57:03] Fetching cost range for item_id = 975, sku_id = 4871
[2017-02-01 18:57:03] Determining actual cost table
[2017-02-01 18:57:03] Appending comment notes
[2017-02-01 18:57:03] Appending abandoned flag
[2017-02-01 18:57:03] API GET data: {
    token = gEcre26reWrAdEnufE3HesVupRepahuDumapHuyap2evufreWrufraBebre7u4a6
    contact.address1_city = New York
    contact.address1_country = USA
    contact.address1_line1 = 123 test lane
    contact.address1_line2 =
    contact.address1_postalcode = 12345
    contact.address1_stateorprovince =
    contact.emailaddress1 = joe.smith@gmail.com
    contact.firstname = joe
    contact.lastname = smith
    contact.mobilephone = 1234567890
    …
}
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

useful information
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:03] API GET data:
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0

information I need
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
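One payoff of parsable logs is that questions like "how long did this job wait, and how long did it run?" can be answered mechanically. The sketch below uses a handful of lines copied from the log above; the regular expression and the helper names are assumptions made for illustration, not part of the original system.

```python
import re
from datetime import datetime

# Lines copied from the log shown on the slide.
SAMPLE_LOG = """\
[2017-02-01 16:46:31] Queuing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 18:57:02] Processing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
[2017-02-01 19:04:03] Post complete, took 420 seconds
[2017-02-01 19:04:03] ERROR: Curl returned unsuccessfully with return code 28 (Timeout was reached)
[2017-02-01 19:04:04] Finishing UUID: FC0470D4-E19D-11E6-8FB4-CB1814EF18C0
"""

# "[YYYY-MM-DD HH:MM:SS] message"
LINE_PATTERN = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (.*)$")


def parse_entries(text):
    """Turn raw log text into a list of (timestamp, message) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            entries.append((when, match.group(2)))
    return entries


def seconds_between(entries, start_prefix, end_prefix):
    """Seconds from the first message starting with start_prefix to the first starting with end_prefix."""
    start = next(when for when, message in entries if message.startswith(start_prefix))
    end = next(when for when, message in entries if message.startswith(end_prefix))
    return (end - start).total_seconds()


entries = parse_entries(SAMPLE_LOG)
queue_wait = seconds_between(entries, "Queuing UUID", "Processing UUID")
processing = seconds_between(entries, "Processing UUID", "Finishing UUID")
errors = [message for _, message in entries if message.startswith("ERROR")]
print(queue_wait, processing, errors)
```

Run against these lines, the numbers show the job sitting in the queue for over two hours before a roughly seven-minute API call timed out, which is the same story the two "information I need" lines tell.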
verbosity is expensive @papa_fire 2k log/req * 100 req/sec * 60 sec/min * 2 webservers
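The multiplication on the slide can be worked through directly. The sketch assumes "2k log/req" means roughly 2 kilobytes of log output per request and uses decimal units (1 kB = 1000 bytes); both are assumptions, and the per-day figure is an extrapolation that is not on the slide.

```python
BYTES_PER_REQUEST = 2_000   # assumption: "2k log/req" read as ~2 kB per request
REQUESTS_PER_SECOND = 100   # from the slide
SECONDS_PER_MINUTE = 60     # from the slide
WEBSERVERS = 2              # from the slide


def log_bytes_per_minute() -> int:
    """Log volume per minute across all webservers."""
    return BYTES_PER_REQUEST * REQUESTS_PER_SECOND * SECONDS_PER_MINUTE * WEBSERVERS


def log_bytes_per_day() -> int:
    """Extrapolation to a full day at a constant request rate."""
    return log_bytes_per_minute() * 60 * 24


print(log_bytes_per_minute() / 1_000_000, "MB per minute")
print(log_bytes_per_day() / 1_000_000_000, "GB per day")
```

Under those assumptions the slide's product comes to 24 MB of logs per minute, or about 34.6 GB per day at a constant rate.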
monitoring: all inclusive, business-first, correlatable @papa_fire
“ in God we trust, all others we monitor @papa_fire
“ in God we trust, all others* we monitor * systems, code, business, marketing, users, databases, performance … @papa_fire
why monitor? (we have tests and logs) @papa_fire
because things change in production, outside of our control @papa_fire
‣ online marketing company
‣ major e-commerce component
‣ ~100 million users
‣ 1 billion emails/month
‣ 300,000 lines of code
‣ 5600 metrics collected
@papa_fire
what’s the problem? it’s broken! @papa_fire
[graphs: revenue, user performance, database load, email bounce rate] @papa_fire
don’t underestimate correlation @papa_fire
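Collecting business and system metrics side by side makes it possible to test a hunch such as "revenue moves with email bounce rate" numerically. The sketch below computes a Pearson correlation coefficient by hand; the two series are hypothetical illustration data, not numbers from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical daily samples: bounce rate in percent, revenue in arbitrary units.
BOUNCE_RATE = [1.0, 1.2, 4.5, 6.0, 5.5, 1.1]
REVENUE = [100, 98, 71, 60, 64, 99]

print(round(pearson(BOUNCE_RATE, REVENUE), 3))
```

A coefficient close to -1 or +1 is a lead worth chasing, not a proof of cause, which is the point of the earlier "don't assume" slide.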
profiling @papa_fire
when you have the “what” but still have no idea “why” @papa_fire
#!/usr/sbin/dtrace -s
#pragma D option quiet

::ap_process_request:process-request-entry
/zonename == "www4"/
{
    self->uri = copyinstr(arg1);
    self->runtime = 0;
    self->waittime = 0;
    self->oncpu = timestamp;
}

sched:::off-cpu
/self->uri != 0/
{
    self->runtime += timestamp - self->oncpu;
    self->offcpu = timestamp;
}

sched:::on-cpu
/self->uri != 0/
{
    self->oncpu = timestamp;
    self->waittime += timestamp - self->offcpu;
}

::ap_process_request:process-request-return
/self->uri != 0/
{
    @duration[self->uri] = sum(self->runtime);
    @waiting[self->uri] = sum(self->waittime);
    @count[self->uri] = count();
}

:::tick-5min
{
    printf("\n%Y\n", walltimestamp);
    printf("\nTOTAL TIME SPENT ON CPU BY ALL HITS ON THIS URL\n");
    trunc(@duration,10); printa(@duration); trunc(@duration);
    printf("\n\nNUMBER OF HITS\n");
    trunc(@count,10); printa(@count); trunc(@count);
    printf("\n\nTOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL\n");
    trunc(@waiting,10); printa(@waiting); trunc(@waiting);
}

TOTAL TIME SPENT OFF-CPU BY ALL HITS TO THIS URL
/directory 6850049
/api/map_search 7341249
/api/mobile/get_all_items 7980925
/m/ 9124747
/m/directory 9175345
/api/mobile/get_profile 11729556
/api/holiday_feed 12603853
/api/mobile/get_all_widgets 15043481
/api/get_item/60693 19773404
/m/events/all 26165132
/api/all_items 27362330
/api/mobile/get_all_events 368584344
down the rabbit hole @papa_fire
unreserve.php 81.60%, headerdr.inc, header.inc @papa_fire
// ----- Find items on reserve past cutoff -----
# TODO: Use an exists() or something to collapse the two SELECTs.
$cutoff = date('Y-m-d H:i:s', (time() - (12 * 3600)));
$sql = "SELECT id, CID AS cid FROM cart WHERE timestamp < :time";
$returns = $dbh->fetchAll($sql, array('time' => $cutoff));
foreach($returns as $item) {
    // ----- Check for completed order -----
    $row = $dbh->fetchRow("SELECT COUNT(*) AS count FROM orders WHERE sid = :sid", array('sid' => $item['cid']));
    if($row['count'] == 0) {
        // ----- Return items to inventory -----
        $dbh->execute("UPDATE items SET reserved = '' WHERE id = :id", array('id' => $item['id']));
        $dbh->execute("DELETE FROM items WHERE id = :id", array('id' => $item['id']));
    }
}
mail('foo@bar.com','Unreserve Cron ','Cron has running successfully on production.<\n>');
$log_unreserve = "logs/unreserve_chk.log";
file_put_contents($log_unreserve,"\n".date("m-d-Y h:i:s").' : Unreserve file ran successfully'."\n\n",FILE_APPEND);

highlighted, slide by slide:
1. SELECT id, CID AS cid FROM cart WHERE timestamp < :time
2. foreach($returns as $item) { SELECT COUNT(*) AS count FROM orders WHERE sid = :sid
3. UPDATE items SET reserved = '' WHERE id = :id
4. DELETE FROM items WHERE id = :id
5. TODO: Use an exists() or something to collapse the two SELECTs.
6. mail('foo@bar.com','Unreserve Cron', 'Cron is running successfully on production.<\n>');
7. file_put_contents($log_unreserve,"\n".date("m-d-Y h:i:s").' : Unreserve file ran successfully'."\n\n",FILE_APPEND);
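The TODO in the script hints at the cheaper shape: ask the database once for carts past the cutoff that have no matching order, instead of issuing one COUNT query per cart row. The sketch below illustrates that with Python's built-in sqlite3 module. The table and column names follow the slide, but the column types and sample rows are assumptions, and it covers only the two SELECTs, not the UPDATE and DELETE that follow.

```python
import sqlite3

# Hypothetical schema and rows, modelled on the table names used on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cart (id INTEGER, CID TEXT, timestamp TEXT);
CREATE TABLE orders (sid TEXT);
INSERT INTO cart VALUES
    (1, 'a', '2017-01-01 00:00:00'),
    (2, 'b', '2017-01-01 00:00:00'),
    (3, 'c', '2017-02-01 12:00:00');
INSERT INTO orders VALUES ('b');
""")

# One query replaces the outer SELECT plus the per-row COUNT(*) lookup.
ABANDONED_SQL = """
SELECT c.id, c.CID AS cid
FROM cart AS c
WHERE c.timestamp < :time
  AND NOT EXISTS (SELECT 1 FROM orders AS o WHERE o.sid = c.CID)
ORDER BY c.id
"""


def abandoned_cart_items(connection, cutoff):
    """Carts older than cutoff that never turned into an order."""
    return connection.execute(ABANDONED_SQL, {"time": cutoff}).fetchall()


print(abandoned_cart_items(conn, "2017-02-01 00:00:00"))
```

With the sample rows, cart 2 is skipped because it has an order and cart 3 because it is newer than the cutoff, so only cart 1 comes back, and the number of queries no longer grows with the number of stale carts.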
troubleshooting is … educational, iterative, frustrating, rewarding, a required skill @papa_fire
questions? https://joind.in/talk/94a2c @papa_fire
There are a lot of great things about the cloud, but the “destroy and rebuild” philosophy, which is really good for building a continuous delivery pipeline, really sucks when applied to troubleshooting production problems. When your application goes haywire, the most valuable engineering skill is not the ability to bring up a copy of your system or even the knowledge of your technology stack (although it doesn’t hurt). It is the skill of understanding and solving problems.
Finding the root cause of an issue and mitigating it with minimal disruption in production is a must-have skill for engineers responsible for managing and maintaining production systems, which nowadays includes ops, DBAs and devs alike. In this talk I will discuss the skills required to troubleshoot complex systems, the traits that prevent engineers from being successful at troubleshooting, and some techniques, tips and tricks for troubleshooting complex systems in production.
Here’s what was said about this presentation on social media.
I loved @papa_fire's example of using Dtrace to get CPU usage per URL path. Wish I could run on Illumos or FreeBSD. #devopsdaysbos
— Joe Mulloy (@jdmulloy) September 19, 2017
"It passed all the tests" is right up there with "Works on my machine". @papa_fire #devopsdaysbos pic.twitter.com/u4XxpvBTtG
— Joe Mulloy (@jdmulloy) September 19, 2017
Good example of why you should collect data on everything. Turns out email blacklisting reduced revenue. @papa_fire #devopsdaysbos pic.twitter.com/jKVitHP4xk
— Joe Mulloy (@jdmulloy) September 19, 2017
Yup! @papa_fire pic.twitter.com/svszO22rl5
— David Fredricks (@dfreddy76) September 19, 2017
@papa_fire speaking @devopsdaysbos! Monitor everything. pic.twitter.com/azrbg7hIof
— David Fredricks (@dfreddy76) September 19, 2017
"Don't cling to a mistake just because you spent a lot of time making it." ~Unknown - It's perfect, @papa_fire! #devops #devopsdaysBOS
— Corey Rastetter (@oystahs) September 19, 2017
#troubleshoot like you are on fire and the water buckets are full of gasoline -- @papa_fire #devopsdays #devopsdaysbos
— Andrew Thompson (@devkmsg) September 19, 2017
Just because you poured your heart into making your mistakes doesn't mean they're good. @papa_fire #devopsdaysbos pic.twitter.com/t2tu0BFMpQ
— aaron aldrich @devopsdays Austin (@crayzeigh) September 19, 2017
@papa_fire has the best slides so far. Love the memes, gifs, stock photos and quotes. #devopsdaysBOS
— Joe Mulloy (@jdmulloy) September 19, 2017
Understanding the acceptable threshold is important in getting business back to normal as quickly as possible. @papa_fire #devopsdaysbos pic.twitter.com/YEcZmiyU5L
— aaron aldrich @devopsdays Austin (@crayzeigh) September 19, 2017
Replicating a problem is critical to troubleshooting the problem. - @papa_fire pic.twitter.com/XzxOhznCdV
— matthew boeckman (@matthewboeckman) September 19, 2017
#devopsdaysBOS @papa_fire First time I’ve seen the “pets vs cattle” analogy extended to mad cow disease.
— Corey Quinn (@QuinnyPig) September 19, 2017
People often forget that back to normal is different than acceptable @papa_fire #madc
— Nikunj Shah (@nikunjshah086) July 14, 2018
All technical folks need to hear @papa_fire ’s fantastic session on troubleshooting. Great advice and examples on becoming a better troubleshooter. Plus, his graphics game is on point! #madc
— Nikunj Shah (@nikunjshah086) July 14, 2018