Production Emergency Protocol

This document is mainly intended for back-end developers.

There can be many reasons why the production site https://www.interaction-design.org is down. This document lists the places to look to find the likely cause. The headers are designed to give you the information in the shortest time: skim them first, then go deeper into the details under the relevant header.

Check service statuses

We collect the statuses of all the services we use in New Relic. This helps you work out whether the problem is on our side or with one of the third-party services we use.

Alert the whole team

Open Slack. Open #dev channel. Shout by copying and pasting the text below:

@channel: RED ALERT: Production is down! https://www.interaction-design.org Please help!

Then try to coordinate your roles based on the availability of your teammates. E.g.:

  • a front-end developer can help you with checking the state of 3rd party services we use
  • a back-end developer can check logs

Determine the cause

Check the development team email box

Some services, such as DigitalOcean, may warn us about outages by sending emails to development@interaction-design.org. These emails are also posted to the #dev channel on Slack.

Check DigitalOcean

Check Amazon Web Services

  1. We are using S3 on US-EAST-1.
  2. We are using CloudFront CDN on all possible edges.

Check Image Services

Check ImageKit status - https://statuspage.imagekit.io/

Check swap

Swap memory or swap space is the on-disk component of the virtual memory system.

Sometimes the server is up and running and the process is responding, but it is behaving much slower than usual. In cases like that, checking New Relic may give you a clue. Be aware that this data can be misleading if it is not analysed properly.

When the server is slow, resources that rely more on disk may be affected disproportionately more than resources that use only memory or some sort of cache.

  • MySQL, for instance, will have its cache working just fine for simple or high-frequency queries, while queries that need a full or partial table scan will be affected in a way that can make the system unusable.
  • For Redis, swap usage will mean the server cannot write its dump to disk and will (depending on the config) reject new writes but keep serving data from memory.
  • For nginx, requests will be very slow or connections will be dropped by the server.
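
As a rough way to put a number on swap pressure, the sketch below computes the percentage of swap in use. It assumes a Linux host that exposes SwapTotal and SwapFree in /proc/meminfo; the function name swap_used_percent is made up for this example and is not part of our tooling.

```shell
# Print swap usage as a whole-number percentage.
# Reads /proc/meminfo-style text on stdin and prints 0 when no swap is configured.
# swap_used_percent is a hypothetical helper name used only for illustration.
swap_used_percent() {
  awk '
    /^SwapTotal:/ { total = $2 }   # value is in kB
    /^SwapFree:/  { free = $2 }    # value is in kB
    END {
      if (total > 0) printf "%d\n", (total - free) * 100 / total
      else print 0
    }
  '
}
```

On a server you would run swap_used_percent < /proc/meminfo. A value that stays well above zero while the site is slow is a hint, not proof, that swapping is involved.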

Check System Processes

All servers have htop installed. It lets you view system processes, kill them, and monitor memory and CPU usage from the CLI. Run the htop command (available globally) to start it.

Guide: check a quick introduction to htop

Usually you will want to sort (F6) by PERCENT_CPU or PERCENT_MEM. Additionally, you can filter (F4) or search (F3) if you are targeting a specific process like nginx, MySQL, etc.

Killing a Process

If you are not sure, please contact @ops-team for further instructions.

Find the process in htop using a filter, press (F9), and choose the appropriate signal:

  • SIGINT is the interrupt signal. The terminal sends it to the foreground process when the user presses ctrl-c. The default behavior is to terminate the process, but it can be caught or ignored. The intention is to provide a mechanism for an orderly, graceful shutdown.
  • SIGQUIT is the dump core signal. The terminal sends it to the foreground process when the user presses ctrl-\. The default behavior is to terminate the process and dump core, but it can be caught or ignored. The intention is to provide a mechanism for the user to abort the process. You can look at SIGINT as "user-initiated happy termination" and SIGQUIT as "user-initiated unhappy termination."
  • SIGTERM is the termination signal. The default behavior is to terminate the process, but it also can be caught or ignored. The intention is to kill the process, gracefully or not, but to first allow it a chance to clean up.
  • SIGKILL is the kill signal. The only behavior is to kill the process, immediately. As the process cannot catch the signal, it cannot clean up, and thus this is a signal of last resort.
  • SIGSTOP is the pause signal. The only behavior is to pause the process; the signal cannot be caught or ignored. The shell uses pausing (and its counterpart, resuming via SIGCONT) to implement job control.
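
One common way to apply the signals above from the command line is to escalate: send SIGTERM first, give the process time to clean up, and only then fall back to SIGKILL. The sketch below assumes a POSIX shell. The name stop_process and the 10-second default grace period are illustrative choices, not part of our tooling.

```shell
# Stop a process gracefully, escalating to SIGKILL only if needed.
# Usage: stop_process PID [GRACE_SECONDS]
# stop_process is a hypothetical helper; the default grace period is an assumption.
stop_process() {
  pid=$1
  grace=${2:-10}

  # Ask politely first. If the signal cannot be sent (process already gone,
  # or no permission), there is nothing more this helper can do.
  kill -TERM "$pid" 2>/dev/null || return 0

  while [ "$grace" -gt 0 ]; do
    # kill -0 sends no signal; it only checks that the process still exists.
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
    grace=$((grace - 1))
  done

  # Last resort: the process was still there after the whole grace period.
  kill -KILL "$pid" 2>/dev/null
  return 0
}
```

For example, stop_process 12345 30 would give PID 12345 up to 30 seconds to shut down. If you are not sure whether a process is safe to kill, the advice above still applies: ask @ops-team first.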

Check New Relic

Check logs

You can copy all the latest main production logs to your machine by running the scripts/maintenance/copy-production-logs.sh script, or you can use ssh and vim from the command line.

Check nginx access logs

In some cases you also need to enable access logging for all requests on the nginx side. To do this, open the nginx config (from https://forge.laravel.com/servers/796468/sites/2374219#/application or in an SSH session), find the access_log off; line (near line 88), and comment it out. Then save the updated config and reload or restart nginx (Forge does this automatically on save). After that, you can analyze the data in the access.log file or in its GUI: http://logs.interaction-design.org

Check Laravel logs

Simple

In most cases you might get the information you need by checking the Laravel logs on https://www.interaction-design.org/admin/logs.

Old School

Connect to the production server via ssh. If you don’t know how, check here.

Super easy vim or vi (a *nix based text editor) tips:

  1. Type G to go to the end of the file.
  2. Type / and then some keyword to search inside the file. Press Enter and it will find the first instance. Type n for the next instance. Type N for the previous instance.
  3. Hit Esc and type :q to quit the file if you haven’t changed anything. :q! to quit without saving the changes. :wq to save the changes and then quit.
  • Get into the Laravel logs directory by running cd /home/forge/www.interaction-design.org/storage/logs
  • Run ls -al to see all log files generated by Laravel. The example output will be like the following:
text
user@host:~/interaction-design.org/current/storage/logs$ ls -al
total 1489380
drwxrwxr-x 2       4096 Mar  1 00:00 .
drwxrwxr-x 8       4096 Dec 30  2015 ..
-rw-rw-r-- 1         14 Sep 17  2015 .gitignore
-rw-r--r-- 1 1149189651 Mar  1 10:33 geoip.log
-rw-r--r-- 1   71195757 Feb 25 23:59 laravel-2017-02-25.log
-rw-r--r-- 1   75873367 Feb 26 23:59 laravel-2017-02-26.log
-rw-r--r-- 1   89430592 Feb 27 23:59 laravel-2017-02-27.log
-rw-r--r-- 1  103652807 Feb 28 23:59 laravel-2017-02-28.log
-rw-r--r-- 1   30965998 Mar  1 10:36 laravel-2017-03-01.log
-rw-r--r-- 1  971195757 Feb 25 23:59 laravel--debug-2017-02-25.log
-rw-r--r-- 1  975873367 Feb 26 23:59 laravel--debug-2017-02-26.log
-rw-r--r-- 1  989430592 Feb 27 23:59 laravel--debug-2017-02-27.log
-rw-r--r-- 1  193652807 Feb 28 23:59 laravel--debug-2017-02-28.log
-rw-r--r-- 1  930965998 Mar  1 10:36 laravel--debug-2017-03-01.log
-rw-rw-r-- 1     397353 Apr  4 01:00 schedule-aa9d2b40dacd0f587b8120b19297d71f2224fc2b.log
  • You will want to look at the newest log files. Note that laravel-{date}.log contains only important information, while laravel--debug-{date}.log also contains some debug information and 404 error reports. Run vim laravel-20xx-yy-zz.log to open the log file in a text editor, for example: vim laravel-2017-03-01.log.
  • Move to the end of the file to check the latest errors.
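
If you prefer not to scroll through a large file in vim, the latest errors can also be pulled out with grep. This is a minimal sketch that assumes Laravel's default log line format, [YYYY-MM-DD HH:MM:SS] env.LEVEL: message. The helper name recent_errors is hypothetical.

```shell
# Print the last N error-level entries from a Laravel log file.
# Assumes the default line format: [YYYY-MM-DD HH:MM:SS] env.LEVEL: message
# Only the first line of each entry is printed; stack trace lines are skipped.
# Usage: recent_errors FILE [N]   (hypothetical helper; N defaults to 20)
recent_errors() {
  grep -E '^\[[0-9-]+ [0-9:]+\] [A-Za-z_]+\.(ERROR|CRITICAL|ALERT|EMERGENCY):' "$1" \
    | tail -n "${2:-20}"
}
```

For example, recent_errors laravel-2017-03-01.log 50 would show the 50 most recent error-level entries from that file. Open the file in vim afterwards if you need the full stack trace for one of them.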

Check cron logs

Connect to the production server via ssh. If you don’t know how, check here.

  • Run vim /home/forge/logs/cron.log to open the file containing the cron logs.
  • A possible cause of the downtime is a stuck cron job. Starting from the end of the file, look for possible causes.
  • You can also check whether a cron job is still running by looking in the framework directory. Run cd /home/forge/www.interaction-design.org/storage/framework, then run ls -al to see the contents of the directory. If there is a file named like schedule-A_UNIQUE_HASH, a cron job is in progress. Check again a bit later to see if the file is still there: whenever a cron job finishes, Laravel removes that file automatically.
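
Instead of checking the directory twice by hand, you can list only the schedule files that have been around for a while. The sketch below uses the file modification time as a rough proxy for when the job started, which is an assumption. It also relies on the -maxdepth and -mmin options of find, which GNU and BSD find provide but POSIX does not require. The name stale_schedule_locks and the 30-minute default are illustrative.

```shell
# List schedule-* files in a directory that were last modified more than
# MINUTES minutes ago. A long-lived file may point to a stuck cron job.
# Usage: stale_schedule_locks DIR [MINUTES]   (hypothetical helper; default 30)
stale_schedule_locks() {
  dir=$1
  minutes=${2:-30}
  find "$dir" -maxdepth 1 -type f -name 'schedule-*' -mmin +"$minutes"
}
```

For example, stale_schedule_locks /home/forge/www.interaction-design.org/storage/framework 60 would print any schedule file older than an hour. No output means nothing matched.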

Check nginx

Connect to the production server via ssh. If you don’t know how, check here.

  • Look at the access logs: vim /var/log/nginx/access.log
  • Look at the error logs: vim /var/log/nginx/error.log
  • Look at the site-specific error logs: vim /var/log/nginx/interaction-design.org-error.log

You can generally ignore errors containing the keyword wp-login. If you can’t find anything else that explains the downtime, though, there is a chance we’re under a DDoS attack.
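
If you suspect a flood of requests, counting requests per client address is a quick first look. The sketch below assumes the access log uses nginx's default format, where the client address is the first whitespace-separated field. The name top_clients is hypothetical.

```shell
# Print the N most frequent client addresses from an nginx access log on stdin,
# one per line as "COUNT ADDRESS", busiest first.
# Assumes the client address is the first whitespace-separated field.
# Usage: top_clients [N] < access.log   (hypothetical helper; N defaults to 10)
top_clients() {
  awk '{ count[$1]++ } END { for (ip in count) print count[ip], ip }' \
    | sort -k1,1nr -k2,2 \
    | head -n "${1:-10}"
}
```

For example, top_clients 20 < /var/log/nginx/access.log lists the 20 busiest addresses. A handful of addresses with far more requests than the rest is worth a closer look, but it does not prove an attack on its own.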

Check PHP8.X-FPM

We are using PHP 8, so you can ignore PHP5-FPM/PHP7-FPM.

Connect to the production server via ssh. If you don’t know how, check here.

Check MySQL

Connect to the production server via ssh. If you don’t know how, check here.

Check if all required services are running

Start by looking for nginx, php8.4-fpm, and mysql. You might need to look for other services too.

Connect to the production server via ssh. If you don’t know how, check here.

  • Run service --status-all to see the statuses of all services on the machine. Running ones will be marked as [ + ].
  • If you see [ ? ], the status of the service is not certain. Run service SERVICE_NAME status to see the status of a specific service.
  • If you want to start a service, run sudo service SERVICE_NAME start. The command might return an error which might be the problem causing the downtime.
  • Find the sudo password in the Deploy script at https://forge.laravel.com/servers/796468/sites/2374219
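
To check several services in one go, you can loop over their names and rely on the exit status of service NAME status, which is conventionally zero when the service is running. This is a sketch under that assumption. The name check_services and the SERVICE_CMD override are hypothetical and exist only to keep the example easy to try out.

```shell
# Print "ok" or "DOWN" for each service name given as an argument.
# Relies on `service NAME status` exiting with 0 when the service is running.
# SERVICE_CMD is a hypothetical override (handy for dry runs); default: service.
check_services() {
  for name in "$@"; do
    if "${SERVICE_CMD:-service}" "$name" status >/dev/null 2>&1; then
      echo "ok   $name"
    else
      echo "DOWN $name"
    fi
  done
}
```

For example, check_services nginx php8.4-fpm mysql prints one line per service. For anything reported as DOWN, run service SERVICE_NAME status by hand to see the actual message.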

Check available disk space

TL;DR

Run these commands to get an overview of disk usage:

sh
du -sh ~/*
sudo du -sh /*
sudo du -sh /tmp/*

Our server takes daily database backups, and each backup is several GB in size. Every backup file is automatically uploaded to AWS, to the ixdf--backups bucket, so backup files should not take up much space on our server. Still, you might want to check them in case there is a problem uploading backups or deleting older ones. Note that only the last day’s backup should be present on the server.

Connect to the production server via ssh. If you don’t know how, check how-to-ssh-into-server.md.

  • Run df -h to see the available disk space. The example output will be like the following:
text
user@host:~$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
udev                        16G  4.0K   16G   1% /dev
tmpfs                      3.2G  452K  3.2G   1% /run
/dev/disk/by-label/DOROOT  315G  154G  146G  52% /
none                       4.0K     0  4.0K   0% /sys/fs/cgroup
none                       5.0M     0  5.0M   0% /run/lock
none                        16G  4.0K   16G   1% /run/shm
none                       100M     0  100M   0% /run/user
  • First look at the row with / under Mounted on. If its Use% is above 90%, there is a possibility that this is causing the downtime.
  • Run cd /home/forge/backups and then run ls -al to see the contents of the backup directory. You should see something like this:
text
user@host:~/backups$ ls -al
-rw-rw-r--  1 forge forge 1383120298 Apr 14 00:49 lastbackup--dev.sql.gz
-rw-rw-r--  1 forge forge         15 Apr 14 00:49 lastbackup--dev.sql.gz.version

Logs often take up too much space as well. You can find big and old ones in ~/interaction-design.org/shared/storage/logs using ls -la --time-style="+%Y-%m-%d".

Another helpful snippet displays the size of each subdirectory (often run from /var/log and ~):

sh
du -h --max-depth=1 | sort -hr
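
The 90% check on the / filesystem described earlier in this section can also be sketched as a small filter over df output. It assumes the POSIX output format of df -P, where the fifth column is the usage percentage and the sixth is the mount point, and it does not handle mount points that contain spaces. The name full_filesystems is hypothetical.

```shell
# Read `df -P` output on stdin and print every mount point whose usage is
# above the given percentage, as "MOUNTPOINT USE%".
# Usage: df -P | full_filesystems [PERCENT]   (hypothetical helper; default 90)
full_filesystems() {
  awk -v limit="${1:-90}" '
    NR > 1 {                 # skip the header line
      use = $5
      sub(/%/, "", use)      # "95%" becomes "95"
      if (use + 0 > limit + 0) print $6, use "%"
    }
  '
}
```

For example, df -P | full_filesystems 90 prints nothing when every filesystem is at or below 90% and prints a line such as / 95% otherwise.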

Restarting the server

Connect to the production server via ssh. If you don’t know how, check here.

Additional resources

  1. Stripe status
  2. PayPal status
  3. GitHub status
  4. AWS status
  5. ImageKit status