Incident Response
Coordinated procedures for detecting, responding to, and recovering from security incidents and service disruptions, designed to minimize customer impact and meet regulatory notification obligations.
Version: 1.0
Effective Date: March 2026
Incident Commander: Security Team
Response Process Overview
Confirmed personal data breaches are reported to the supervisory authority within 72 hours of discovery, per GDPR Article 33. All incidents undergo post-mortem review.
1. Incident Classification
All incidents are assigned a severity level (P1–P4) at the time of triage. Severity determines response speed and escalation path.
| Severity | Name | Definition | Response Time |
|---|---|---|---|
| P1 | Critical | Complete service outage, confirmed data breach, or unauthorized access to customer data | Within 15 minutes |
| P2 | High | Partial service degradation affecting multiple customers, security control failure, or significant data integrity risk | Within 30 minutes |
| P3 | Medium | Elevated error rates, performance degradation, or individual service failure affecting some customers | Within 2 hours |
| P4 | Low | Minor issues, non-customer-impacting errors, or potential (unconfirmed) vulnerabilities | Next business day |
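The severity table above maps directly to a small triage helper. The sketch below mirrors the response windows from the table; the function and constant names are illustrative, not part of any production codebase:

```python
from datetime import timedelta

# Acknowledgment windows per severity level, mirroring the P1-P4 table.
# P4's "next business day" is approximated here as 24 hours.
SEVERITY_RESPONSE_WINDOW = {
    "P1": timedelta(minutes=15),  # Critical: outage, breach, unauthorized access
    "P2": timedelta(minutes=30),  # High: multi-customer degradation, control failure
    "P3": timedelta(hours=2),     # Medium: elevated errors, partial degradation
    "P4": timedelta(days=1),      # Low: minor, non-customer-impacting
}

def response_deadline(severity, triage_time):
    """Return the latest acceptable first-response time for an incident."""
    return triage_time + SEVERITY_RESPONSE_WINDOW[severity]
```

Encoding the windows once in code keeps alerting tooling and this document from drifting apart.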
2. Incident Response Team
| Role | Responsibility | Assigned To |
|---|---|---|
| Incident Commander | Overall incident coordination; severity assessment; final decisions on containment and communication; breach notification | Security Officer |
| Technical Lead | Technical investigation; implementing containment and remediation; infrastructure access (Azure, MongoDB Atlas, Cloudflare) | Engineering Lead |
| Internal Communications | Engineering team coordination; Slack incident channel; status updates to team | Senior Developer |
| External Communications | Customer and enterprise notifications; regulatory notifications where required | Incident Commander |
P1/P2 Contact Protocol
- Sentry or automated monitoring sends Slack alert to the #alerts channel
- Engineering Lead acknowledges within 15 minutes
- Engineering Lead notifies the Incident Commander immediately for P1 incidents
- Incident Commander decides on external communication
3. Detection and Identification
| Detection Source | What It Detects | Notification |
|---|---|---|
| Sentry | Application errors, exceptions, crash rates in real-time | Slack #alerts |
| Health-check endpoints | MongoDB Atlas and Redis connectivity loss | Slack #alerts |
| Latency monitoring | API response times exceeding the 5-second threshold | Slack #alerts |
| Cloudflare analytics | Traffic spikes, WAF rule triggers, DDoS attempts, bot activity | Cloudflare dashboard and alerts |
| Event tracking | Usage anomalies and unusual patterns across 16+ event types | Sentry and internal logs |
| Manual reporting | Customer reports and team member observations | Email or Slack |
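A health-check endpoint of the kind referenced above typically aggregates per-dependency probes into one overall status. This is a minimal sketch under assumptions: the probe callables and the response shape are illustrative, not the platform's actual implementation:

```python
def check_health(probes):
    """Run each dependency probe (e.g. MongoDB Atlas, Redis) and report status.

    `probes` maps a dependency name to a zero-argument callable that raises
    on connectivity loss. A "degraded" result is what would trigger the
    Slack #alerts notification described in the table above.
    """
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    overall = "healthy" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}
```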
Identification Checklist
When an alert is received, the Engineering Lead performs initial identification:
- What is the scope? (single customer, multiple customers, entire platform)
- Which services are affected? (API, Dashboard, Review Display Service, Widget CDN)
- Is customer data potentially exposed?
- Is the issue ongoing or resolved?
- What is the likely cause? (infrastructure failure, application bug, security incident, external attack)
- Assign severity level P1–P4
- Open dedicated Slack thread for incident tracking
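The checklist above translates naturally into a structured triage record. A hedged sketch of what that might look like (all field and method names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRecord:
    """Initial identification captured at triage, per the checklist above."""
    scope: str                    # "single customer" | "multiple customers" | "entire platform"
    affected_services: list       # e.g. ["API", "Widget CDN"]
    data_exposure_suspected: bool
    ongoing: bool
    likely_cause: str             # infrastructure | bug | security incident | external attack
    severity: str                 # "P1".."P4"
    slack_thread: Optional[str] = None  # dedicated tracking thread, set when opened

    def requires_commander_page(self) -> bool:
        # Per the contact protocol: P1 incidents (including any potential
        # data exposure, which classifies as P1) page the Incident Commander.
        return self.severity == "P1" or self.data_exposure_suspected
```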
4. Containment
The goal of short-term containment is to stop the spread of impact as quickly as possible, even before the root cause is understood.
| Scenario | Containment Action |
|---|---|
| DDoS or traffic attack | Enable Cloudflare 'Under Attack' mode; tighten WAF rules; block attacking IP ranges |
| Compromised credentials | Immediately rotate compromised credentials in GitHub secrets vault; invalidate active sessions if possible |
| Unauthorized API access | Identify and block source at Cloudflare WAF level; tighten rate limiting |
| Service crash or instability | Azure App Services health checks trigger auto-restart; manual restart via Azure portal if needed |
| Database connectivity loss | MongoDB Atlas replica set automatic failover; verify Atlas cluster health via portal |
| Data integrity issue | Disable the affected write path at the API level while investigating |
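For the DDoS row, enabling Cloudflare's 'Under Attack' mode amounts to raising the zone's security level via the Cloudflare v4 API. The sketch below only builds the request rather than sending it; the zone ID and token are placeholders, and the exact endpoint and payload should be verified against current Cloudflare documentation before use in a runbook:

```python
def under_attack_request(zone_id, api_token):
    """Build the HTTP request that raises a Cloudflare zone's security level
    to 'under_attack' (v4 API, zone settings endpoint)."""
    return {
        "method": "PATCH",
        "url": f"https://api.cloudflare.com/client/v4/zones/{zone_id}/settings/security_level",
        "headers": {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        "json": {"value": "under_attack"},
    }
```

Keeping the request builder separate from the sender makes the containment step easy to dry-run and test.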
5. Eradication and Recovery
| Issue Type | Eradication Action |
|---|---|
| Application vulnerability | Fix developed, code reviewed, tested in staging, deployed via GitHub Actions CI/CD |
| Compromised credentials | All potentially compromised credentials rotated; access logs reviewed for misuse |
| Malicious traffic | Permanent WAF rules applied; IP ranges blocked; bot signatures updated |
| Infrastructure misconfiguration | Configuration corrected via Azure portal or MongoDB Atlas; change documented |
Recovery Procedure
- Deploy fix via GitHub Actions CI/CD: code review, Docker build, push to Azure Container Registry, staging verification, production promotion
- Validate health: confirm health-check endpoints are healthy; Sentry error rates return to baseline; response times normal
- Restore data if needed: MongoDB Atlas point-in-time recovery to any point within the backup window
- Purge CDN cache: Cloudflare cache purge if stale data was served via CDN
- Declare recovery: Engineering Lead closes the incident in the Slack thread; Incident Commander notified
| System | Recovery Mechanism |
|---|---|
| Application | Docker image rebuild and redeployment via GitHub Actions, typically under 30 minutes |
| Database | MongoDB Atlas point-in-time recovery; replica set automatic failover with zero data loss |
| Message queue | Azure Service Bus dead-letter queue; no messages lost; reprocessed after recovery |
| CDN | Cloudflare cache purge on demand; widget delivery from 300+ edge nodes continues even during origin issues |
6. Communication Plan
Internal Communication
| Severity | Communication Steps |
|---|---|
| P1 | Immediate Slack alert → Engineering Lead acknowledges within 15 minutes → Incident Commander notified immediately → active incident thread maintained throughout |
| P2 | Slack alert → Engineering Lead acknowledges within 30 minutes → Incident Commander notified → Slack thread maintained |
| P3 | Slack alert → Engineering team handles → updates in thread |
| P4 | Logged in issue tracker; addressed in normal development cycle |
Customer and External Communication
| Scenario | Notification Commitment |
|---|---|
| Confirmed personal data breach | Supervisory authority notified within 72 hours of discovery (GDPR Article 33); affected merchants notified within the same 72-hour window (exceeding the Article 34 "without undue delay" standard) |
| Extended service outage (P1) | Customer notification within 4 hours if outage persists beyond 1 hour |
| Partial service degradation (P2) | Customer notification at Incident Commander's discretion based on scope and duration |
| Security vulnerability (no breach) | Disclosed to enterprise customers upon request; patched before disclosure where possible |
| Regulatory notification | Relevant supervisory authority notified where required by GDPR or applicable law |
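The 72-hour notification window can be tracked mechanically from the discovery timestamp. A minimal sketch (function names are illustrative):

```python
from datetime import datetime, timedelta

# GDPR Article 33: notification within 72 hours of discovery.
BREACH_NOTIFICATION_WINDOW = timedelta(hours=72)

def notification_deadline(discovered_at):
    """Latest time a breach notification may be sent."""
    return discovered_at + BREACH_NOTIFICATION_WINDOW

def is_notification_overdue(discovered_at, now):
    """True once the 72-hour window has elapsed without notification."""
    return now > notification_deadline(discovered_at)
```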
Breach Notification Content
A data breach notification includes:
- Nature of the incident (what happened)
- Categories and approximate number of data subjects affected
- Categories of personal data involved
- Likely consequences of the breach
- Measures taken and proposed to address the breach
- Contact point for further information
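The six required elements above can be enforced with a simple completeness check, so a notification never goes out with a field missing. A sketch under assumptions (field names and the plain-text rendering are illustrative):

```python
# The six required elements, in the order listed above.
REQUIRED_FIELDS = (
    "nature",             # what happened
    "subjects_affected",  # categories and approximate number of data subjects
    "data_categories",    # categories of personal data involved
    "consequences",       # likely consequences of the breach
    "measures",           # measures taken and proposed
    "contact",            # contact point for further information
)

def render_breach_notification(fields):
    """Render a breach notification, refusing if any required element is missing."""
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        raise ValueError(f"notification incomplete, missing: {missing}")
    return "\n".join(f"{name}: {fields[name]}" for name in REQUIRED_FIELDS)
```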
7. Evidence Collection
Evidence is preserved to support root cause analysis, customer communication, and regulatory compliance. Before making remediation changes to a compromised system, the Engineering Lead must preserve relevant logs and configuration snapshots. Evidence must not be overwritten until the post-incident review is complete.
| Evidence Type | Source | Retention |
|---|---|---|
| Application error logs | Sentry (full error context, timestamps, stack traces, user context) | 90 days |
| API request logs | Application-level logging (endpoints, response times, status codes) | 90 days |
| Cloudflare logs | WAF events, traffic analytics, blocked requests | Per Cloudflare plan |
| MongoDB Atlas audit logs | Database operation logs (where enabled) | Per Atlas configuration |
| GitHub Actions logs | Deployment history and build logs | GitHub retention policy |
8. Metrics and KPIs
Baseline values will be established through 2026 incident tracking.
| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean time from incident start to detection | To be baselined |
| MTTA | Mean time from alert to Engineering Lead acknowledgment | P1: ≤15 min; P2: ≤30 min |
| MTTR | Mean time from detection to full resolution | P1: ≤4 hours; P2: ≤8 hours |
| Notification compliance | Percentage of breach notifications sent within 72 hours | 100% |
| Post-incident review | Percentage of P1/P2 incidents with completed post-mortem | 100% |
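Once incidents are tracked with timestamps, MTTA and MTTR reduce to averages over (start, end) pairs. A sketch of how the baselining might be computed; the incident record keys are an assumption:

```python
from datetime import timedelta

def mean_delta(pairs):
    """Average of (start, end) timestamp pairs, as a timedelta."""
    deltas = [end - start for start, end in pairs]
    return sum(deltas, timedelta()) / len(deltas)

def compute_kpis(incidents):
    """incidents: dicts with 'alerted', 'acknowledged', 'detected', 'resolved'
    timestamps, matching the MTTA and MTTR definitions in the table above."""
    return {
        "MTTA": mean_delta([(i["alerted"], i["acknowledged"]) for i in incidents]),
        "MTTR": mean_delta([(i["detected"], i["resolved"]) for i in incidents]),
    }
```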
9. Post-Incident Review
A post-incident review is conducted within 5 business days of resolution for every P1 and P2 incident, and for selected P3 incidents.
Review Agenda
- Timeline reconstruction: what happened and when (detection, containment, resolution)
- Root cause: what was the underlying cause
- Detection effectiveness: did monitoring catch this promptly, and if not, why not
- Response effectiveness: was the response fast and well-coordinated
- Customer impact: how many customers were affected and what was the impact
- Preventive measures: what changes will prevent recurrence
- Action items: specific, assigned, time-bound improvements
10. Incident Response Testing
Current State
- Incident response procedures documented in this plan
- Engineering team familiar with monitoring tools: Sentry, Cloudflare, Azure portal, MongoDB Atlas
- Informal testing occurs through real incident handling
Planned (2027)
- Formal tabletop exercises and incident simulation as part of SOC 2 audit preparation
- Automated runbook documentation as the team scales