Incident Response
Coordinated procedures for detecting, responding to, and recovering from security incidents and service disruptions, designed to minimize customer impact and meet regulatory notification obligations.
Version: 1.0
Effective Date: March 2026
Incident Commander: Security Team
Response Process Overview
Confirmed personal data breaches are reported to the supervisory authority within 72 hours of discovery, per GDPR Article 33. All incidents undergo post-mortem review.
1. Incident Classification
All incidents are assigned a severity level (P1–P4) at the time of triage. Severity determines response speed and escalation path.
| Severity | Name | Definition | Response Time |
|---|---|---|---|
| P1 | Critical | Complete service outage, confirmed data breach, or unauthorized access to customer data | Within 15 minutes |
| P2 | High | Partial service degradation affecting multiple customers, security control failure, or significant data integrity risk | Within 30 minutes |
| P3 | Medium | Elevated error rates, performance degradation, or individual service failure affecting some customers | Within 2 hours |
| P4 | Low | Minor issues, non-customer-impacting errors, or potential (unconfirmed) vulnerabilities | Next business day |
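The severity table above maps directly to a small triage helper. The sketch below mirrors the response windows from the table; the function and constant names are illustrative, not part of any production codebase:

```python
from datetime import timedelta

# Acknowledgment windows per severity level, mirroring the P1-P4 table.
# P4's "next business day" is approximated here as 24 hours.
SEVERITY_RESPONSE_WINDOW = {
    "P1": timedelta(minutes=15),  # Critical: outage, breach, unauthorized access
    "P2": timedelta(minutes=30),  # High: multi-customer degradation, control failure
    "P3": timedelta(hours=2),     # Medium: elevated errors, partial degradation
    "P4": timedelta(days=1),      # Low: minor, non-customer-impacting
}

def response_deadline(severity, triage_time):
    """Return the latest acceptable first-response time for an incident."""
    return triage_time + SEVERITY_RESPONSE_WINDOW[severity]
```

Encoding the windows once in code keeps alerting tooling and this document from drifting apart.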
2. Incident Response Team
| Role | Responsibility | Assigned To |
|---|---|---|
| Incident Commander | Overall incident coordination; severity assessment; final decisions on containment and communication; breach notification | Security Officer |
| Technical Lead | Technical investigation; implementing containment and remediation; infrastructure access (Azure, MongoDB Atlas, Cloudflare) | Engineering Lead |
| Internal Communications | Engineering team coordination; Slack incident channel; status updates to team | Senior Developer |
| External Communications | Customer and enterprise notifications; regulatory notifications where required | Incident Commander |
P1/P2 Contact Protocol
- Sentry or automated monitoring sends Slack alert to the #alerts channel
- Engineering Lead acknowledges within 15 minutes
- Engineering Lead notifies the Incident Commander immediately for P1 incidents
- Incident Commander decides on external communication
3. Detection and Identification
| Detection Source | What It Detects | Notification |
|---|---|---|
| Sentry | Application errors, exceptions, crash rates in real-time | Slack #alerts |
| Health-check endpoints | MongoDB Atlas and Redis connectivity loss | Slack #alerts |
| Latency monitoring | API response times exceeding the 5-second threshold | Slack #alerts |
| Cloudflare analytics | Traffic spikes, WAF rule triggers, DDoS attempts, bot activity | Cloudflare dashboard and alerts |
| Event tracking | Usage anomalies and unusual patterns across 16+ event types | Sentry and internal logs |
| Manual reporting | Customer reports and team member observations | Email or Slack |
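A health-check endpoint of the kind referenced above typically aggregates per-dependency probes into one overall status. This is a minimal sketch under assumptions: the probe callables and the response shape are illustrative, not the platform's actual implementation:

```python
def check_health(probes):
    """Run each dependency probe (e.g. MongoDB Atlas, Redis) and report status.

    `probes` maps a dependency name to a zero-argument callable that raises
    on connectivity loss. A "degraded" result is what would trigger the
    Slack #alerts notification described in the table above.
    """
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"error: {exc}"
    overall = "healthy" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "dependencies": results}
```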
Identification Checklist
When an alert is received, the Engineering Lead performs initial identification:
- What is the scope? (single customer, multiple customers, entire platform)
- Which services are affected? (API, Dashboard, Review Display Service, Widget CDN)
- Is customer data potentially exposed?
- Is the issue ongoing or resolved?
- What is the likely cause? (infrastructure failure, application bug, security incident, external attack)
- Assign severity level P1–P4
- Open dedicated Slack thread for incident tracking
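The checklist above translates naturally into a structured triage record. A hedged sketch of what that might look like (all field and method names are illustrative assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRecord:
    """Initial identification captured at triage, per the checklist above."""
    scope: str                    # "single customer" | "multiple customers" | "entire platform"
    affected_services: list       # e.g. ["API", "Widget CDN"]
    data_exposure_suspected: bool
    ongoing: bool
    likely_cause: str             # infrastructure | bug | security incident | external attack
    severity: str                 # "P1".."P4"
    slack_thread: Optional[str] = None  # dedicated tracking thread, set when opened

    def requires_commander_page(self) -> bool:
        # Per the contact protocol: P1 incidents (including any potential
        # data exposure, which classifies as P1) page the Incident Commander.
        return self.severity == "P1" or self.data_exposure_suspected
```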
4. Containment
The goal of short-term containment is to stop the spread of impact as quickly as possible, even before the root cause is understood.
| Scenario | Containment Action |
|---|---|
| DDoS or traffic attack | Enable Cloudflare 'Under Attack' mode; tighten WAF rules; block attacking IP ranges |
| Compromised credentials | Immediately rotate compromised credentials in GitHub secrets vault; invalidate active sessions if possible |
| Unauthorized API access | Identify and block source at Cloudflare WAF level; tighten rate limiting |
| Service crash or instability | Azure App Services health checks trigger auto-restart; manual restart via Azure portal if needed |
| Database connectivity loss | MongoDB Atlas replica set automatic failover; verify Atlas cluster health via portal |
| Data integrity issue | Disable the affected write path at the API level while investigating |
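For the DDoS row, enabling Cloudflare's 'Under Attack' mode amounts to raising the zone's security level via the Cloudflare v4 API. The sketch below only builds the request rather than sending it; the zone ID and token are placeholders, and the exact endpoint and payload should be verified against current Cloudflare documentation before use in a runbook:

```python
def under_attack_request(zone_id, api_token):
    """Build the HTTP request that raises a Cloudflare zone's security level
    to 'under_attack' (v4 API, zone settings endpoint)."""
    return {
        "method": "PATCH",
        "url": f"https://api.cloudflare.com/client/v4/zones/{zone_id}/settings/security_level",
        "headers": {
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        "json": {"value": "under_attack"},
    }
```

Keeping the request builder separate from the sender makes the containment step easy to dry-run and test.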
5. Eradication and Recovery
| Issue Type | Eradication Action |
|---|---|
| Application vulnerability | Fix developed, code reviewed, tested in staging, deployed via GitHub Actions CI/CD |
| Compromised credentials | All potentially compromised credentials rotated; access logs reviewed for misuse |
| Malicious traffic | Permanent WAF rules applied; IP ranges blocked; bot signatures updated |
| Infrastructure misconfiguration | Configuration corrected via Azure portal or MongoDB Atlas; change documented |
Recovery Procedure
- Deploy fix via GitHub Actions CI/CD: code review, Docker build, push to Azure Container Registry, staging verification, production promotion
- Validate health: confirm health-check endpoints are healthy; Sentry error rates return to baseline; response times normal
- Restore data if needed: MongoDB Atlas point-in-time recovery to any point within the backup window
- Purge CDN cache: Cloudflare cache purge if stale data was served via CDN
- Declare recovery: Engineering Lead closes the incident in the Slack thread; Incident Commander notified
| System | Recovery Mechanism |
|---|---|
| Application | Docker image rebuild and redeployment via GitHub Actions, typically under 30 minutes |
| Database | MongoDB Atlas point-in-time recovery; replica set automatic failover with zero data loss |
| Message queue | Azure Service Bus dead-letter queue; no messages lost; reprocessed after recovery |
| CDN | Cloudflare cache purge on demand; widget delivery from 300+ edge nodes continues even during origin issues |
6. Communication Plan
Internal Communication
| Severity | Communication Steps |
|---|---|
| P1 | Immediate Slack alert → Engineering Lead acknowledges within 15 minutes → Incident Commander notified immediately → active incident thread maintained throughout |
| P2 | Slack alert → Engineering Lead acknowledges within 30 minutes → Incident Commander notified → Slack thread maintained |
| P3 | Slack alert → Engineering team handles → updates in thread |
| P4 | Logged in issue tracker; addressed in normal development cycle |
Customer and External Communication
| Scenario | Notification Commitment |
|---|---|
| Confirmed personal data breach | Supervisory authority notified within 72 hours of discovery (GDPR Article 33); affected merchants notified within the same 72-hour window (exceeding the Article 34 "without undue delay" standard) |
| Extended service outage (P1) | Customer notification within 4 hours if outage persists beyond 1 hour |
| Partial service degradation (P2) | Customer notification at Incident Commander's discretion based on scope and duration |
| Security vulnerability (no breach) | Disclosed to enterprise customers upon request; patched before disclosure where possible |
| Regulatory notification | Relevant supervisory authority notified where required by GDPR or applicable law |
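The 72-hour notification window can be tracked mechanically from the discovery timestamp. A minimal sketch (function names are illustrative):

```python
from datetime import datetime, timedelta

# GDPR Article 33: notification within 72 hours of discovery.
BREACH_NOTIFICATION_WINDOW = timedelta(hours=72)

def notification_deadline(discovered_at):
    """Latest time a breach notification may be sent."""
    return discovered_at + BREACH_NOTIFICATION_WINDOW

def is_notification_overdue(discovered_at, now):
    """True once the 72-hour window has elapsed without notification."""
    return now > notification_deadline(discovered_at)
```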
Breach Notification Content
A data breach notification includes:
- Nature of the incident (what happened)
- Categories and approximate number of data subjects affected
- Categories of personal data involved
- Likely consequences of the breach
- Measures taken and proposed to address the breach
- Contact point for further information
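The six required elements above can be enforced with a simple completeness check, so a notification never goes out with a field missing. A sketch under assumptions (field names and the plain-text rendering are illustrative):

```python
# The six required elements, in the order listed above.
REQUIRED_FIELDS = (
    "nature",             # what happened
    "subjects_affected",  # categories and approximate number of data subjects
    "data_categories",    # categories of personal data involved
    "consequences",       # likely consequences of the breach
    "measures",           # measures taken and proposed
    "contact",            # contact point for further information
)

def render_breach_notification(fields):
    """Render a breach notification, refusing if any required element is missing."""
    missing = [f for f in REQUIRED_FIELDS if not fields.get(f)]
    if missing:
        raise ValueError(f"notification incomplete, missing: {missing}")
    return "\n".join(f"{name}: {fields[name]}" for name in REQUIRED_FIELDS)
```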
7. Evidence Collection
Evidence is preserved to support root cause analysis, customer communication, and regulatory compliance. Before making remediation changes to a compromised system, the Engineering Lead must preserve relevant logs and configuration snapshots. Evidence must not be overwritten until the post-incident review is complete.
| Evidence Type | Source | Retention |
|---|---|---|
| Application error logs | Sentry (full error context, timestamps, stack traces, user context) | 90 days |
| API request logs | Application-level logging (endpoints, response times, status codes) | 90 days |
| Cloudflare logs | WAF events, traffic analytics, blocked requests | Per Cloudflare plan |
| MongoDB Atlas audit logs | Database operation logs (where enabled) | Per Atlas configuration |
| GitHub Actions logs | Deployment history and build logs | GitHub retention policy |
8. Metrics and KPIs
Baseline values will be established through 2026 incident tracking.
| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean time from incident start to detection | To be baselined |
| MTTA | Mean time from alert to Engineering Lead acknowledgment | P1: ≤15 min; P2: ≤30 min |
| MTTR | Mean time from detection to full resolution | P1: ≤4 hours; P2: ≤8 hours |
| Notification compliance | Percentage of breach notifications sent within 72 hours | 100% |
| Post-incident review | Percentage of P1/P2 incidents with completed post-mortem | 100% |
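Once incidents are tracked with timestamps, MTTA and MTTR reduce to averages over (start, end) pairs. A sketch of how the baselining might be computed; the incident record keys are an assumption:

```python
from datetime import timedelta

def mean_delta(pairs):
    """Average of (start, end) timestamp pairs, as a timedelta."""
    deltas = [end - start for start, end in pairs]
    return sum(deltas, timedelta()) / len(deltas)

def compute_kpis(incidents):
    """incidents: dicts with 'alerted', 'acknowledged', 'detected', 'resolved'
    timestamps, matching the MTTA and MTTR definitions in the table above."""
    return {
        "MTTA": mean_delta([(i["alerted"], i["acknowledged"]) for i in incidents]),
        "MTTR": mean_delta([(i["detected"], i["resolved"]) for i in incidents]),
    }
```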
9. Post-Incident Review
A post-incident review is conducted within 5 business days of resolution for every P1 and P2 incident, and for selected P3 incidents.
Review Agenda
- Timeline reconstruction: what happened and when (detection, containment, resolution)
- Root cause: what was the underlying cause
- Detection effectiveness: did monitoring catch this promptly, and if not, why not
- Response effectiveness: was the response fast and well-coordinated
- Customer impact: how many customers were affected and what was the impact
- Preventive measures: what changes will prevent recurrence
- Action items: specific, assigned, time-bound improvements
10. Incident Response Testing
Current State
- Incident response procedures documented in this plan
- Engineering team familiar with monitoring tools: Sentry, Cloudflare, Azure portal, MongoDB Atlas
- Informal testing occurs through real incident handling
Planned (2027)
- Formal tabletop exercises and incident simulation as part of SOC 2 audit preparation
- Automated runbook documentation as the team scales