As someone responsible for security at a startup, you need to prepare for black swan events: things that almost never happen but are a very big deal when they do.
Security breaches, major outages, and natural disasters fall into this category. Due to their low frequency, it takes a long time for both teams and individuals to accumulate real-world experience handling them.
One tactic that is a favorite of security and reliability teams alike is the tabletop exercise.
At Vanta, we think of a tabletop exercise like a game of Dungeons and Dragons. You get a bunch of people together at a table (physically or in the Zoom metaverse), and they all pretend to go on an adventure. In this case, the adventure is responding to a black swan event.
When done right, you'll learn things about your capabilities, your knowledge gaps, and your team that you didn't know before. Then you can go and fix those things before the real black swan event. It also helps your team develop some experience in handling these low-frequency events before they've spent decades in the trenches together.
Business continuity, disaster recovery, and security incident response
You'll see a lot of references to BC/DR (business continuity / disaster recovery) and security incident response when learning about tabletop exercises. Some compliance frameworks specifically call for tabletop exercises in either or both of these. There is a lot of overlap between these concepts, but there are some distinctions as well.
Business continuity refers to your ability to continue business operations during a major event. For example, you may have invested in highly available infrastructure to ensure that your systems aren't taken offline by an outage in one region.
Disaster recovery is about coming back online after a major disaster. For example, if your database server is destroyed, how effectively can you recover backups and restore service?
Security incident response typically focuses on your ability to respond to a malicious adversary who has compromised some system that is important to your security.
Generally speaking, you'll find the first two grouped together as BC/DR.
There are scenarios that would cover all three of these concepts. If an attacker leverages an exposed AWS credential to delete your database, you would require all of these capabilities to respond effectively.
Anatomy of a tabletop
A tabletop exercise typically involves one or more people who run the scenario, a team of responders, a note-taker, and potentially some observers.
The people responsible for devising a scenario will generally plan ahead of time and map it out. It can be as simple or complex as they'd like. At a high level though, it involves some prompt which mirrors how a real world incident would come to light.
For example, "An attacker gets malware on an employee's laptop" is not a very compelling scenario. It gives too much away from the start. As a responding team, you're never going to have such a high confidence first signal without making way too many assumptions.
"The on-call Security Engineer received a Slack notification from our anti-malware tool showing that 5 different endpoints are reporting malware infections" is a much better place to start. These alerts may be false positives; the scenario could have nothing to do with malware. It could even be a hyper targeted red herring. Either way your responders have some questions to answer.
Now that you've got a scenario in mind, you can schedule a meeting and get everyone together at the table. The group running the exercise is responsible for acting out the part of "reality" for the responders.
You read out the prompt. Rather than actually digging into logs and other systems, the responders tell you what they want to do, and you share the result.
For example, they may say "I message each of the employees in question on Slack to ask them if they recently downloaded any new software" and you might respond "One responds no, three are not online, and one says they just joined so they downloaded all of the stuff recommended in the wiki."
Another responder might ask, "What are the roles of these employees?" You shouldn't answer this. A responder can't just ask for facts; they need to describe an action and ask for its result. If they don't know how they would normally find this information, you shouldn't provide it.
A better question might be, "I look up each employee in the HR system (or Vanta) to find their role, what does it say?" Even here you can be skeptical. You can ask, "Do you have access to that information?" The goal isn't to be difficult; it's to probe the questions that will actually matter in a real incident. If your responders don't have access to information they find important, that's exactly the type of thing a tabletop exercise can help you discover.
After your responders discover the root cause (or don't), deploy a fix (or don't), and recover (or don't), you can end the exercise. Now you should spend some time talking about what happened, why you responded this way or that, and what lessons came out of the exercise.
You should leave the meeting with a list of action items that will make the next incident response more effective, whether it's real or fake.
Running an effective exercise
Keep it simple
When Google runs these exercises, I'm confident they make them as realistic as possible and coordinate across many teams and time zones. You don't need to go that far at a startup. The goal here is to find the low-hanging fruit in your response capabilities, fix it, and iterate.
For your first exercise, come up with a scenario that is simple and plausible. You can also timebox the exercise to a portion of an hour-long meeting. This way you ensure that everyone is fully engaged, and you leave some time for discussion at the end.
Take it seriously, and have fun doing it
It's a small thing, but take seriously your role as "dungeon master" and hold your responding team to that same expectation. Whoever is running the scenario should keep it to themselves until the exercise has begun. Still, make sure it's fun. Be creative with your answers, and don't be afraid to joke around a little.
Make sure that someone present, ideally not any of the participants or the person running the scenario, is taking detailed notes. Every question asked by the responding team, any insights that come up in the moment, and ideas for after the exercise should be noted in raw form. After the exercise you can review them together to pull out lessons learned and specific action items.
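One way a note-taker might structure the raw log so it is easy to review afterwards is sketched below. This is only an illustration: the `Note` and `NoteKind` names, the three categories, and the sample entries are assumptions made for the example, not a prescribed format.

```python
from dataclasses import dataclass
from enum import Enum


class NoteKind(Enum):
    QUESTION = "question"  # something the responders asked
    INSIGHT = "insight"    # something realized in the moment
    IDEA = "idea"          # something to try after the exercise


@dataclass
class Note:
    minute: int     # minutes since the exercise started
    kind: NoteKind
    text: str       # recorded in raw form, not cleaned up


def group_by_kind(notes: list[Note]) -> dict[NoteKind, list[str]]:
    """Group raw notes by kind, in chronological order, for the review."""
    grouped: dict[NoteKind, list[str]] = {kind: [] for kind in NoteKind}
    for note in sorted(notes, key=lambda n: n.minute):
        grouped[note.kind].append(note.text)
    return grouped


# Hypothetical sample entries, for illustration only.
notes = [
    Note(12, NoteKind.INSIGHT, "Responders are not sure they can access the HR system"),
    Note(3, NoteKind.QUESTION, "What are the roles of these employees?"),
    Note(20, NoteKind.IDEA, "Document how on-call looks up an employee's role"),
]

for kind, texts in group_by_kind(notes).items():
    print(kind.value, texts)
```

Grouping by kind is just one possible cut. The point is that the raw notes stay untouched and the action items are derived from them during the review.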
Have a plan
If you're the one running the scenario, you should anticipate some of the most likely paths your responders will take. Throw in a few twists and turns, and potentially red herrings. If your scenario involves an AWS outage, check which of your vendors would also be affected. You don't have to do all the research in depth, but be prepared so that your responders need to overcome some plausible challenges.
If there are easy paths to end the exercise, make sure that you have a plausible reason for those paths to be blocked, e.g. "You try to roll back the bad code change, but the deployment platform is down."
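If it helps to write the plan down in a structured way, a minimal sketch might look like the following. The `Scenario` class, its field names, and the exact-match lookup are all assumptions made for illustration, and the sample entries simply reuse examples from this post. In a real exercise the facilitator matches actions by judgment, not by string comparison.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """A facilitator's plan for one tabletop exercise (hypothetical structure)."""

    opening_prompt: str
    # Anticipated responder actions mapped to the result the facilitator reads out.
    scripted_results: dict[str, str] = field(default_factory=dict)
    # Easy exits mapped to the plausible reason each one is blocked.
    blocked_paths: dict[str, str] = field(default_factory=dict)

    def resolve(self, action: str) -> str:
        """Return what the facilitator should say for a responder's action."""
        if action in self.blocked_paths:
            return self.blocked_paths[action]
        if action in self.scripted_results:
            return self.scripted_results[action]
        # Unplanned action: the facilitator improvises and notes it for next time.
        return "IMPROVISE"


scenario = Scenario(
    opening_prompt=(
        "The on-call Security Engineer received a Slack notification from our "
        "anti-malware tool showing that 5 different endpoints are reporting "
        "malware infections."
    ),
    scripted_results={
        "message affected employees on Slack": (
            "One responds no, three are not online, and one says they just "
            "joined so they downloaded all of the stuff recommended in the wiki."
        ),
    },
    blocked_paths={
        "roll back the bad code change": (
            "You try to roll back the bad code change, but the deployment "
            "platform is down."
        ),
    },
)

print(scenario.resolve("roll back the bad code change"))
```

Writing blocked paths down next to their reasons ahead of time makes it easier to keep the answers plausible when responders reach for the obvious fix.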
BC/DR or IR
It's possible to craft your scenario such that it covers both BC/DR and IR. This is especially expedient if compliance with a standard depends on you covering both of those topics in annual exercises.
For the best results though, it helps to focus. If you're running a BC/DR exercise, you can spend more of your time focusing on the difficulties in maintaining and recovering services during an incident. If you're running a security incident response exercise, you can spend more time on dealing with the presence of an active adversary.
Don't forget about comms
Consider who you'll need to communicate with. Customers, legal, IR firms, law enforcement, etc. are all fair game. Just don't dial 911 for real.
Try it and iterate
At the end of the day, the easiest way to get good at these is to schedule the meeting, get everyone in the room, and start. The first couple of times you do this, you'll learn a lot about the process itself, and you'll be able to identify some of the low-hanging fruit in your response process too.
Here are some example scenarios to get you started:
BC/DR
- A major earthquake hits the city in which your office is located. Power is lost, commuting to work is not an option.
- An entire region of your cloud provider goes down for an indefinite amount of time.
- An engineer accidentally deletes all of your database tables.
Security incident response
- Your CEO downloads malware by accident, and you learn of it an hour later when they realize the email wasn't legit.
- An engineer discovers some EC2 instances that they didn't expect to see. You could include twists like some more instances hiding in random regions your company doesn't normally use and a root cause of an engineer publishing AWS credentials to a public personal repository by mistake.
- Somebody mails an infected USB drive to your office. Your team learns of it because the recipient posted "Wow, I got a free thumb drive!" on Slack.