Support engineer
Support engineer is a temporary role in the dev team; at least one engineer is assigned to it at any time. The role is often assigned to developers who have just completed a large feature or another large task and who need some time to keep an eye on things and recover energy.
Why do we have this role?
The main goal is to allow other developers to focus on bigger tasks while support engineers work on ongoing bug reports and help other teams with their tasks.
This role can be challenging as it involves a lot of critical thinking and initiative.
You have done a great job as a support engineer when:
- The dev team can confidently mute error channels and notifications knowing that you are constantly monitoring for errors and performance issues
- The dev team can focus almost entirely on their sprint goals, knowing that you will only involve them if it's really important that you do so
- Other teams feel like their requests are being heard and taken care of in a timely manner
- You reduce the number of recurring errors on error channels (fewer issues and less noise for the next support engineers)
- You simplify life for the next support engineer: you solve noisy bugs, improve error messages and enhance documentation
Your responsibilities
By priority:
- Monitor errors and solve/delegate them (Slack, logs, NewRelic, etc.)
- Manage requests from other teams (reply to @dev-hotline mentions on Slack)
- Optimize support-engineer work
- Solve technical issues and improve code coverage (if you have time)
⚠️ The outcome of any new issue should be a fix, a GitHub issue or a comment on Slack. The team should know that you have handled the issue or are working on it (just add a comment and/or a Slack emoji reaction ⏳👀✔︎).
Monitor errors
By “error channel” we mean any type of channel (Slack, NewRelic, logs, etc.), but Slack is the most important one. However, support engineers should also check the other error channels, because our error handler that routes important exceptions to Slack may miss something:
- NewRelic Errors Inbox every 3 days
- Laravel logs once a week (hint: use the scripts/maintenance/copy-production-logs.sh script to download the logs and check them locally)
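Once the logs are on your machine, a small script can make the weekly check faster by grouping repeated errors. The following is a minimal sketch, not part of our tooling. It assumes the log lines follow Laravel's default format, for example [2024-05-01 12:34:56] production.ERROR: Message {"context":...}, and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Assumed line shape (Laravel's default log format):
# [2024-05-01 12:34:56] production.ERROR: Something broke {"exception":"..."}
ENTRY = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<env>[\w-]+)\.(?P<level>[A-Z]+): (?P<msg>.*)$"
)

# Only these levels are counted; lower levels are ignored.
LEVELS = {"ERROR", "CRITICAL", "ALERT", "EMERGENCY"}


def summarize(lines, top=10):
    """Return the most frequent (level, message) pairs with their counts."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line.rstrip("\n"))
        if match is None or match["level"] not in LEVELS:
            # Stack-trace continuation lines and lower-severity entries.
            continue
        # Drop the trailing JSON context so identical messages group together.
        message = match["msg"].split(" {", 1)[0]
        counts[(match["level"], message)] += 1
    return counts.most_common(top)
```

To use it, open a downloaded log file and pass the file object to summarize. The entries at the top of the result are good candidates for the "solve noisy issues" work described later.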
Slack channels
There are a number of channels on Slack which display errors. You should unmute all of them once you are the support engineer:
- #errors--production (highest priority)
- #js-errors--production (highest priority)
- #errors--staging
Other errors--* channels may have a lot of noise if the codebase is outdated while using a new DB dump. To fix it, deploy fresh code to the server.
Setup Slack
Create a new sidebar section on Slack and add these channels to that section. Slack's help center explains how to organize your sidebar with custom sections.
The highest priority channel is #errors--production as errors shown here are likely affecting a guest, member or admin user on interaction-design.org. In addition to unmuting this channel, you should change the notification settings to alert you whenever a message is sent in this channel. You might want to do the same for #js-errors--production, but JS errors there generally don't break critical site functionality.
The other error channels can sometimes be quite 'noisy' and the errors aren't directly impacting users, so you don't need to be alerted whenever a message is sent to them. It's sufficient to look through these channels once every few hours.
Organize support engineer channels
When your support engineer shift is over at the end of the sprint, you should mute the error channels again. If you added the channels to a custom section, you can hover over the section name on Slack, click on the 3 dots, and select 'mute all'.
You might be actively monitoring error channels, but you are not necessarily going to solve the errors as soon as they happen. There are several actions you could take for any given error message; each is described below.
Reacting to error messages
It may take you a while to investigate an error message. To let other team members know you are on it (and to remind yourself what message you were looking into), you should react to the message with the 👀 or ⏳ icons.
Once you have resolved the issue via one of the actions described below, mark the message with a ✔️ so that it's clear to yourself and your teammates that it has been dealt with.
Create an issue on GitHub
This is generally the most helpful action you could take when a new error message appears. Open a new GitHub issue on the IxDF-web repository using the 'Bug Report' template and fill out as much detail as you can about the error. Take a look at the system owners list to determine who to assign and add 'urgency', 'importance' and 'system' labels to ensure it is dealt with at the appropriate time.
WARNING
Before creating the issue, check that one doesn't already exist. GitHub's search functionality can help here; see GitHub's documentation on searching issues and pull requests.
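As an illustration of such a duplicate check, here is a minimal sketch that builds a GitHub search URL from a phrase taken from the error message. It uses GitHub's standard search qualifiers (repo:, is:issue, in:title,body). The function name issue_search_url and the repository owner in the usage note are assumptions made for the example.

```python
from urllib.parse import urlencode


def issue_search_url(repo, phrase, state=None):
    """Build a GitHub search URL for issues mentioning an exact phrase.

    repo   -- "owner/name" of the repository to search
    phrase -- text from the error message (should not contain double quotes)
    state  -- optional "open" or "closed" to narrow the results
    """
    terms = [f"repo:{repo}", "is:issue"]
    if state:
        terms.append(f"is:{state}")
    # Quote the phrase so GitHub matches it exactly, in titles and bodies.
    terms += ["in:title,body", f'"{phrase}"']
    return "https://github.com/search?" + urlencode(
        {"q": " ".join(terms), "type": "issues"}
    )
```

For example, issue_search_url("example-org/IxDF-web", "Undefined array key") returns a link you can open in a browser; replace example-org with the real owner of the repository. Leaving state unset also shows closed issues, which helps when a previously fixed error has come back.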
Fix the cause of the error immediately
If the bug is currently affecting live users, and it can be resolved quickly (less than 1 hour of effort), then it might be worth jumping in and fixing it yourself. Before you take this action, you will need to consider the severity of the issue and how much it will affect your ability to work on other important support engineer tasks.
Notify another developer
If the issue is severe enough to warrant immediate action, but too large or complex for you to resolve, determine who the system owners are for the related system and ping one of them on Slack. Before you get them involved, try to gather as much information as you can about the problem, for example:
- Who does the issue affect?
- How can the issue be reproduced?
- Is there any additional context that you know which could help the developer?
The higher the quality of the information you provide to the developer, the sooner they can resolve the issue and get back to working on their sprint goals.
Announce long planned maintenance
If the issue affects Members and Guests, and it would be helpful for them to be aware of it, you can add an announcement to the site by navigating to the Application announcements panel and creating a 'site status' announcement.
Keep an eye on it
Some errors are not urgent enough to be looked into immediately and don't yet make sense to add as a GitHub issue. With these errors, it's best to just keep an eye on them and continuously re-evaluate if another action needs to be taken.
Manage requests from other teams
You will be added to the @dev-hotline tag on Slack when you are assigned the support engineer role. The team knows to use this tag whenever they require developer help. Just like with error messages on the error channels, you are not necessarily going to be the person who handles the request directly. You are the 'traffic manager' who decides what to do with the request, who needs to handle it and when it needs to be handled.
Keep an eye on performance and error monitoring tools
Your next responsibility is to ensure that our infrastructure and applications are running as smoothly as possible. Not every issue leads to an actual error being recorded. If, for example, interaction-design.org is serving requests much slower than it usually does, you won't get an error in any of the channels, but action may still be needed to sort it out.
We use New Relic for most of our monitoring needs. You'll need to get familiar with this platform and the various dashboards and monitoring tools it provides.
In addition, we use BugSnag for monitoring JavaScript errors, so keeping an eye on the overview dashboard there is also important.
Account access
You should be able to find the login details for the IxDF NewRelic and BugSnag accounts on LastPass. If not, please ping the dev team lead for assistance.
How to spend the rest of your time
As mentioned previously, you probably won't have any product tasks during your support engineer stint. There is still plenty to do though!
Optimize support-engineer work
To spend support-engineer time more efficiently, you should strive to optimize how you and your colleagues spend your time. Some good examples:
- Solve noisy issues
Solve technical issues
This is a great opportunity to solve some technical issues. They are marked with the issue-type:technical ⚙️ label.
Improve test coverage
If you run our test runner in code coverage mode, you will be able to identify parts of our codebase which have a less than ideal number of tests. While writing tests just for the sake of having tests is not generally a good idea, you may be able to improve the stability and reliability of some systems by writing some additional tests.
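To find those weakly tested areas quickly, you can sort files by their statement coverage. The sketch below assumes the test runner can write a Clover-format XML coverage report (PHPUnit, for instance, has an option for this). The function name least_covered is hypothetical, and the script is not part of our tooling.

```python
import xml.etree.ElementTree as ET


def least_covered(clover_xml, limit=10):
    """Return (file name, percent of statements covered), lowest first."""
    root = ET.fromstring(clover_xml)
    rows = []
    # <file> elements may sit directly under <project> or inside <package>.
    for file_el in root.iter("file"):
        metrics = file_el.find("metrics")
        if metrics is None:
            continue
        total = int(metrics.get("statements", 0))
        if total == 0:
            # Nothing executable in this file (e.g. an interface); skip it.
            continue
        covered = int(metrics.get("coveredstatements", 0))
        rows.append((covered / total, file_el.get("name")))
    rows.sort()
    return [(name, round(ratio * 100, 1)) for ratio, name in rows[:limit]]
```

Read the report file into a string and pass it to least_covered. Treat the result as a list of places to look at, not as a target: a low percentage on rarely changed, low-risk code may not be worth the effort.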