Production Emergency Protocol
The main audience of this document is back-end developers.
There can be a ton of reasons why the production site https://www.interaction-design.org is down. This document lists the places you need to look at to find the possible causes. The headings are designed to give you the information in the shortest time: quickly scan the headings first before going deeper into the details under them.
Check service statuses
We collect the statuses of all services we use in New Relic. It will help you understand whether the problem is on our side or with the 3rd-party services we use.
Alert the whole team
Open Slack. Open the #dev channel. Shout by copying and pasting the text below:
@channel: RED ALERT: Production is down! https://www.interaction-design.org Please help!
Then try to coordinate your roles based on the availability of your teammates. E.g.:
- a front-end developer can help you with checking the state of 3rd party services we use
- a back-end developer can check logs
Determine the cause
Check the development team email box
Some services like DigitalOcean may warn us about outages by sending emails to development@interaction-design.org, and these emails are posted to the #dev channel on Slack.
Check DigitalOcean
- Look if there are known issues on DigitalOcean Status (our production server is on NYC3): https://status.digitalocean.com
- Check Twitter if people are complaining: https://x.com/search?q=digitalocean+down
- Look if there are spikes on the graphs on the overview pages: production-web, production-mysql
Check Amazon Web Services
- We are using S3 on US-EAST-1.
- We are using the CloudFront CDN on all possible edge locations.
- Look if there are known issues on AWS Service Health Dashboard
- Check X if people are complaining: https://x.com/search?q=aws+is+down
Check Image Services
Check ImageKit status - https://statuspage.imagekit.io/
Check swap
```text
Swap memory or swap space is the on-disk component of the virtual memory system.
```
Sometimes the server is up and running and the process is responding, but it's behaving much slower than usual. In cases like that, checking New Relic may give you some clue. This data can be misleading as well if not properly analysed.
When the server is slow, resources that rely more on disk may be affected disproportionately compared to resources just using memory or any sort of cache.
- MySQL, for instance, will have its cache working just fine for simple or high-frequency queries, while queries that need a full/partial table scan will be affected in a way that can make the system unusable.
- For Redis, swap usage means the server cannot write its dump to disk and will (depending on the config) reject new writes but keep serving the in-memory data.
- For nginx, requests will be very slow or connections will be dropped by the server.
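A quick way to confirm swapping from an SSH session. This is a minimal sketch using standard Linux tools (present on stock Ubuntu):

```sh
# Overall memory and swap usage; an almost-full Swap row is a red flag
free -h

# Swap activity, sampled every second for 5 samples; non-zero si/so columns mean active swapping
vmstat 1 5
```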
Check System Processes
All servers have htop installed to view system processes, kill them, and monitor memory/CPU usage from the CLI. You can use the htop command (available globally).
Guide: check a quick introduction to htop
Usually, sorting (F6) by PERCENT_CPU or PERCENT_MEM is enough to spot the offending process. Additionally, you can filter/search by pressing (F4) if you are targeting a specific process like nginx, MySQL, etc.
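If you prefer a non-interactive snapshot (e.g. to paste into Slack), plain `ps` works too; this is standard Linux tooling, not anything specific to our setup:

```sh
# Top 10 processes by CPU usage (header line + 10 rows)
ps aux --sort=-%cpu | head -n 11

# Top 10 processes by memory usage
ps aux --sort=-%mem | head -n 11
```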
Killing a Process
If you are not sure, please contact @ops-team for further instructions.
Search for the process in htop using the filter, press (F9), and use the proper signal:
- `SIGINT` is the interrupt signal. The terminal sends it to the foreground process when the user presses Ctrl-C. The default behavior is to terminate the process, but it can be caught or ignored. The intention is to provide a mechanism for an orderly, graceful shutdown.
- `SIGQUIT` is the dump-core signal. The terminal sends it to the foreground process when the user presses Ctrl-\. The default behavior is to terminate the process and dump core, but it can be caught or ignored. The intention is to provide a mechanism for the user to abort the process. You can look at `SIGINT` as "user-initiated happy termination" and `SIGQUIT` as "user-initiated unhappy termination."
- `SIGTERM` is the termination signal. The default behavior is to terminate the process, but it can also be caught or ignored. The intention is to kill the process, gracefully or not, but to first allow it a chance to clean up.
- `SIGKILL` is the kill signal. The only behavior is to kill the process, immediately. As the process cannot catch the signal, it cannot clean up, and thus this is a signal of last resort.
- `SIGSTOP` is the pause signal. The only behavior is to pause the process; the signal cannot be caught or ignored. The shell uses pausing (and its counterpart, resuming via `SIGCONT`) to implement job control.
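If you need to send one of these signals outside of htop, `kill` does the same job. A minimal sketch (the PID is a placeholder you need to look up first):

```sh
# Find the PID(s) of the misbehaving process (nginx used as an example)
pgrep -a nginx

# Ask the process to terminate gracefully first
kill -TERM <PID>

# Only if it ignores SIGTERM: force-kill it (no cleanup is possible)
kill -KILL <PID>
```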
Check New Relic
- Look if there are spikes on the graphs on the overview page:
- Look for unusual behavior in the error logs
- Look if there is an open issue on the previously set alert conditions (violations)
Check logs
You can copy all the latest main production logs locally by running the scripts/maintenance/copy-production-logs.sh script, or use ssh and vim from the command line.
Check nginx access logs
In some cases you also need to enable access logging for all requests on the nginx side. To do it, open the nginx config (from https://forge.laravel.com/servers/796468/sites/2374219#/application or in an SSH session), find the `access_log off;` line (near line #88) and comment it out. Then save the updated config and reload/restart nginx (Forge does it automatically on saving). After that you can analyze the data from the access.log file or from its GUI: http://logs.interaction-design.org
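A sketch of the same change done entirely over SSH; the config path below is the usual Forge layout and is an assumption, so double-check it on the server:

```sh
# Comment out the "access_log off;" line (around line #88) in the site config
sudo vim /etc/nginx/sites-available/www.interaction-design.org

# Validate the config and reload nginx without dropping connections
sudo nginx -t && sudo service nginx reload

# Watch requests coming in
sudo tail -f /var/log/nginx/access.log
```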
Check Laravel logs
Simple
In most cases you might get the information you need by checking the Laravel logs on https://www.interaction-design.org/admin/logs.
Old School
Connect to the production server via ssh. If you don’t know how, check here.
Super easy vim or vi (a *nix based text editor) tips:
- Type `G` to go to the end of the file.
- Type `/` and then some keyword to search inside the file. Press `Enter` and it will find the first instance. Type `n` for the next instance. Type `N` for the previous instance.
- Hit `Esc` and type `:q` to quit the file if you haven't changed anything, `:q!` to quit without saving the changes, or `:wq` to save the changes and then quit.
- Get into the Laravel logs directory by running `cd /home/forge/www.interaction-design.org/storage/logs`
- Run `ls -al` to see all log files generated by Laravel. The example output will be like the following:
```text
user@host:~/interaction-design.org/current/storage/logs$ ls -al
total 1489380
drwxrwxr-x 2 4096 Mar 1 00:00 .
drwxrwxr-x 8 4096 Dec 30 2015 ..
-rw-rw-r-- 1 14 Sep 17 2015 .gitignore
-rw-r--r-- 1 1149189651 Mar 1 10:33 geoip.log
-rw-r--r-- 1 71195757 Feb 25 23:59 laravel-2017-02-25.log
-rw-r--r-- 1 75873367 Feb 26 23:59 laravel-2017-02-26.log
-rw-r--r-- 1 89430592 Feb 27 23:59 laravel-2017-02-27.log
-rw-r--r-- 1 103652807 Feb 28 23:59 laravel-2017-02-28.log
-rw-r--r-- 1 30965998 Mar 1 10:36 laravel-2017-03-01.log
-rw-r--r-- 1 971195757 Feb 25 23:59 laravel--debug-2017-02-25.log
-rw-r--r-- 1 975873367 Feb 26 23:59 laravel--debug-2017-02-26.log
-rw-r--r-- 1 989430592 Feb 27 23:59 laravel--debug-2017-02-27.log
-rw-r--r-- 1 193652807 Feb 28 23:59 laravel--debug-2017-02-28.log
-rw-r--r-- 1 930965998 Mar 1 10:36 laravel--debug-2017-03-01.log
-rw-rw-r-- 1 397353 Apr 4 01:00 schedule-aa9d2b40dacd0f587b8120b19297d71f2224fc2b.log
```
- You will want to look at the newest log files. Note that `laravel-{date}.log` contains only important information; `laravel--debug-{date}.log` ALSO contains some debug information and 404 error reports. Run `vim laravel-20xx-yy-zz.log` to open the log file in a text editor. An example: `vim laravel-2017-03-01.log`.
- Move to the end of the file to check the latest errors.
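Instead of scrolling through the file in vim, you can also pull the most recent errors straight out of today's log. A minimal sketch (the grep pattern is just "ERROR"; adjust it to what you are hunting for):

```sh
cd /home/forge/www.interaction-design.org/storage/logs

# Last 20 lines mentioning ERROR in today's log; $(date +%Y-%m-%d) fills in today's date
grep -n "ERROR" "laravel-$(date +%Y-%m-%d).log" | tail -n 20
```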
Check cron logs
Connect to the production server via ssh. If you don’t know how, check here.
Super easy vim or vi (a *nix based text editor) tips:
- Type `G` to go to the end of the file.
- Type `/` and then some keyword to search inside the file. Press `Enter` and it will find the first instance. Type `n` for the next instance. Type `N` for the previous instance.
- Hit `Esc` and type `:q` to quit the file if you haven't changed anything, `:q!` to quit without saving the changes, or `:wq` to save the changes and then quit.
- Run `vim /home/forge/logs/cron.log` to open the file containing the cron logs.
- A possible reason for the downtime can be a stuck cronjob. Starting from the end of the file, look for possible causes.
- You can also check if there is an ongoing cronjob by checking the framework directory. Run `cd /home/forge/www.interaction-design.org/storage/framework`, then run `ls -al` to see the contents of the directory. If there is a file named like `schedule-A_UNIQUE_HASH`, then there is an ongoing cronjob. You might want to check a bit later to see if it's still there. Whenever a cron job finishes its job, Laravel removes that file automatically.
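The same check as a one-liner (nothing new here, just the steps above combined):

```sh
# Any schedule-* file here means a scheduled job is still running (or died without cleaning up)
ls -al /home/forge/www.interaction-design.org/storage/framework | grep "schedule-"
```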
Check nginx
Connect to the production server via ssh. If you don’t know how, check here.
Super easy vim or vi (a *nix based text editor) tips:
- Type `G` to go to the end of the file.
- Type `/` and then some keyword to search inside the file. Press `Enter` and it will find the first instance. Type `n` for the next instance. Type `N` for the previous instance.
- Hit `Esc` and type `:q` to quit the file if you haven't changed anything, `:q!` to quit without saving the changes, or `:wq` to save the changes and then quit.
- Look at the access logs: `vim /var/log/nginx/access.log`
- Look at the error logs: `vim /var/log/nginx/error.log`
- Look at the site-specific error logs: `vim /var/log/nginx/interaction-design.org-error.log`
You can generally ignore errors containing the keyword wp-login. If you can't find anything causing the downtime though, there is a chance we're under a DDoS attack (see the sketch at the end of this section).
- Find the sudo password in the Deploy script on https://forge.laravel.com/servers/796468/sites/2374219
- If you want to restart nginx: `sudo service nginx restart`
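To get a rough idea of whether a handful of IPs are hammering us, you can aggregate the access log. A sketch that assumes the default log format where the client IP is the first field:

```sh
# Top 20 client IPs by request count in the current access log
sudo awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -n 20
```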
Check PHP8.X-FPM
We are using PHP 8, so you can ignore PHP5-FPM/PHP7-FPM.
Connect to the production server via ssh. If you don’t know how, check here.
Super easy vim or vi (a *nix based text editor) tips:
- Type `G` to go to the end of the file.
- Type `/` and then some keyword to search inside the file. Press `Enter` and it will find the first instance. Type `n` for the next instance. Type `N` for the previous instance.
- Hit `Esc` and type `:q` to quit the file if you haven't changed anything, `:q!` to quit without saving the changes, or `:wq` to save the changes and then quit.
- Find the sudo password in the Deployment script on https://forge.laravel.com/servers/796468/sites/2374219
- Look at the logs: `sudo vim /var/log/php8.4-fpm.log`
- If you want to restart PHP8.4-FPM: `sudo service php8.4-fpm restart`
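Two quick checks that often explain PHP-FPM trouble; both only read status/logs, so they are safe to run. The max_children grep is an assumption about how pool exhaustion shows up in the log:

```sh
# Is the service up at all?
sudo service php8.4-fpm status

# Pool exhaustion usually shows up as "max_children" warnings in the FPM log
sudo grep -i "max_children" /var/log/php8.4-fpm.log | tail -n 5
```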
Check MySQL
Connect to the production server via ssh. If you don’t know how, check here.
Super easy vim or vi (a *nix based text editor) tips:
- Type `G` to go to the end of the file.
- Type `/` and then some keyword to search inside the file. Press `Enter` and it will find the first instance. Type `n` for the next instance. Type `N` for the previous instance.
- Hit `Esc` and type `:q` to quit the file if you haven't changed anything, `:q!` to quit without saving the changes, or `:wq` to save the changes and then quit.
- Find the sudo password in the Deploy script on https://forge.laravel.com/servers/796468/sites/2374219
- Look at the error logs: `sudo vim /var/log/mysql/error.log`
- If you want to restart MySQL: `sudo service mysql restart`
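If MySQL is up but slow, looking at the currently running queries often points at the culprit. A sketch only; the forge user is an assumption, so use whatever credentials are in the site's .env:

```sh
# List running queries; long-running statements on big tables are the usual suspects
mysql -u forge -p -e "SHOW FULL PROCESSLIST;"
```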
Check if all required services are running
At a first look, check nginx, php8.4-fpm, and mysql. You might need to look for other services too.
Connect to the production server via ssh. If you don’t know how, check here.
- Run `service --status-all` to see the statuses of all services on the machine. Running ones will be marked as `[ + ]`.
- If you see `[ ? ]`, the status of the service is not certain. Run `service SERVICE_NAME status` to see the status of a specific service.
- If you want to start a service, run `sudo service SERVICE_NAME start`. The command might return an error, which might be the problem causing the downtime.
- Find the sudo password in the Deploy script on https://forge.laravel.com/servers/796468/sites/2374219
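A compact way to check just the services we care about (the same service commands as above, in a loop):

```sh
# Show the status of each core service; any failure here is worth investigating first
for s in nginx php8.4-fpm mysql; do
    sudo service "$s" status
done
```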
Check available disk space
TL;DR
Run these commands to get an overview of the disk space:
```sh
du -sh ~/*
sudo du -sh /*
sudo du -sh /tmp/*
```
Our server gets daily database backups and every backup has GBs of data. It automatically uploads every backup file to AWS under the ixdf--backups bucket, so backup files should not take a lot of space on our server. Still, you might want to check them if there is a problem uploading/deleting older backups. Note that only the last day's backup should be present on the server.
Connect to the production server via ssh. If you don’t know how, check how-to-ssh-into-server.md.
- Run `df -h` to see the available disk space. The example output will be like the following:
```text
user@host:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 16G 4.0K 16G 1% /dev
tmpfs 3.2G 452K 3.2G 1% /run
/dev/disk/by-label/DOROOT 315G 154G 146G 52% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
none 5.0M 0 5.0M 0% /run/lock
none 16G 4.0K 16G 1% /run/shm
none 100M 0 100M 0% /run/user
```
- You first need to look at the one with `/` under `Mounted on`. If it's more than 90%, there is a possibility that this is causing the downtime.
- Run `cd /home/forge/backups` and then run `ls -al` to see the contents of the backup directory. You should see something like this:
```text
user@host:~/backups$ ls -al
-rw-rw-r-- 1 forge forge 1383120298 Apr 14 00:49 lastbackup--dev.sql.gz
-rw-rw-r-- 1 forge forge 15 Apr 14 00:49 lastbackup--dev.sql.gz.version
```
Often, logs take up too much space as well. You can find big and old ones in ~/interaction-design.org/shared/storage/logs using `ls -la --time-style="+%Y-%m-%d"`.
Another helpful snippet displays the size of each subdirectory (often used from /var/log and ~):
```sh
du -h --max-depth=1 | sort -hr
```
- If you can't find anything to delete, then you might need to increase the disk size of the server. Note that you cannot decrease the disk size later on. Resize the production server here: https://cloud.digitalocean.com/droplets/7036767/resize?i=719225
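If the du output doesn't make the culprit obvious, listing the largest individual files can help. A sketch that stays on the root filesystem and hides permission errors:

```sh
# Files larger than 1 GB on the root filesystem, with human-readable sizes
sudo find / -xdev -type f -size +1G -exec ls -lh {} \; 2>/dev/null
```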
Restarting the server
Connect to the production server via ssh. If you don’t know how, check here.
- Find the sudo password in the Deploy script on https://forge.laravel.com/servers/796468/sites/2374219
- Run `sudo reboot` to reboot the server.
- Your connection to the server will be cut. Wait 10 seconds before you try to connect to the server via ssh again.
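If you'd rather not guess when the server is back, a small loop can poll for SSH. A sketch only; replace PRODUCTION_HOST with the host you normally SSH into (see the SSH how-to):

```sh
# Poll every 5 seconds until SSH accepts connections again
until ssh -o ConnectTimeout=5 forge@PRODUCTION_HOST 'echo "server is back up"'; do
    sleep 5
done
```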