Incident Response Plan

Coordinated procedures for detecting, responding to, and recovering from security incidents and service disruptions, designed to minimize customer impact and meet regulatory notification obligations.

Version: 1.0
Effective Date: March 2026
Incident Commander: Security Team

Response Process Overview

Automated detection runs continuously (24/7) via Sentry error tracking, database and cache health checks, latency alerts (>5 s threshold), and Cloudflare analytics. Alerts post in real time to the #alerts Slack channel, where the Engineering Lead triages the incident, assesses scope and impact, and assigns a severity:

  • P1 / P2 (Critical / High): the CEO and Engineering Lead are engaged, with an Incident Commander appointed. Containment (isolate the service, reroute via Cloudflare, revoke credentials) is followed by eradication and fix (root cause analysis, fix deployed via GitHub Actions CI/CD) and customer notification within 72 hours (GDPR Art. 33).
  • P3 / P4 (Medium / Low): handled by the engineering team in the standard development cycle; the fix is deployed through the GitHub Actions pipeline and the issue is logged and tracked internally.

Data breaches are reported within 72 hours per GDPR Article 33. All incidents undergo post-mortem review.

1. Incident Classification

All incidents are assigned a severity level (P1–P4) at the time of triage. Severity determines response speed and escalation path.

| Severity | Name | Definition | Response Time |
|---|---|---|---|
| P1 | Critical | Complete service outage, confirmed data breach, or unauthorized access to customer data | Within 15 minutes |
| P2 | High | Partial service degradation affecting multiple customers, security control failure, or significant data integrity risk | Within 30 minutes |
| P3 | Medium | Elevated error rates, performance degradation, or individual service failure affecting some customers | Within 2 hours |
| P4 | Low | Minor issues, non-customer-impacting errors, or potential (unconfirmed) vulnerabilities | Next business day |
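The classification table can be encoded as a simple lookup for use in alert-routing automation. A minimal sketch (the constant and function names are illustrative, not part of this plan; "next business day" is approximated as 24 hours):

```python
from datetime import timedelta

# Acknowledgment deadlines per severity, taken from the classification table.
RESPONSE_DEADLINES = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": timedelta(hours=24),  # "next business day", approximated for illustration
}

def response_deadline(severity: str) -> timedelta:
    """Return the maximum time allowed before acknowledgment for a severity."""
    try:
        return RESPONSE_DEADLINES[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}")
```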

2. Incident Response Team

| Role | Responsibility | Assigned To |
|---|---|---|
| Incident Commander | Overall incident coordination; severity assessment; final decisions on containment and communication; breach notification | Security Officer |
| Technical Lead | Technical investigation; implementing containment and remediation; infrastructure access (Azure, MongoDB Atlas, Cloudflare) | Engineering Lead |
| Internal Communications | Engineering team coordination; Slack incident channel; status updates to team | Senior Developer |
| External Communications | Customer and enterprise notifications; regulatory notifications where required | Incident Commander |

P1/P2 Contact Protocol

  1. Sentry or automated monitoring sends Slack alert to the #alerts channel
  2. Engineering Lead acknowledges within 15 minutes
  3. Engineering Lead notifies the Incident Commander immediately for P1 incidents
  4. Incident Commander decides on external communication
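The contact protocol above can be expressed as a small escalation rule, e.g. inside alert-routing tooling. A hedged sketch; the role handles are illustrative placeholders, not real Slack identities:

```python
def escalation_targets(severity: str) -> list:
    """Who must be notified when an alert fires, per the P1/P2 contact protocol.

    The Engineering Lead always acknowledges first; the Incident Commander is
    additionally paged for P1 and P2 incidents.
    """
    targets = ["engineering-lead"]  # acknowledges within 15 minutes
    if severity in ("P1", "P2"):
        targets.append("incident-commander")
    return targets
```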

3. Detection and Identification

| Detection Source | What It Detects | Notification |
|---|---|---|
| Sentry | Application errors, exceptions, crash rates in real-time | Slack #alerts |
| Health-check endpoints | MongoDB Atlas and Redis connectivity loss | Slack #alerts |
| Latency monitoring | API response times exceeding 5-second threshold | Slack #alerts |
| Cloudflare analytics | Traffic spikes, WAF rule triggers, DDoS attempts, bot activity | Cloudflare dashboard and alerts |
| Event tracking | Usage anomalies and unusual patterns across 16+ event types | Sentry and internal logs |
| Manual reporting | Customer reports and team member observations | Email or Slack |
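A health-check endpoint of the kind listed above typically probes each dependency and aggregates the result. A minimal, driver-agnostic sketch (the probe callables are injected by the caller, e.g. wrapping a MongoDB ping and a Redis PING; names are illustrative):

```python
def run_health_checks(checks: dict) -> dict:
    """Run each dependency probe and summarize overall health.

    `checks` maps a dependency name (e.g. "mongodb", "redis") to a
    zero-argument probe returning True when the dependency is reachable.
    Probes are injected so this sketch stays independent of any driver.
    """
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False  # a raising probe counts as unhealthy
    return {"healthy": all(results.values()), "dependencies": results}
```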

Identification Checklist

When an alert is received, the Engineering Lead performs initial identification:

  • What is the scope? (single customer, multiple customers, entire platform)
  • Which services are affected? (API, Dashboard, Review Display Service, Widget CDN)
  • Is customer data potentially exposed?
  • Is the issue ongoing or resolved?
  • What is the likely cause? (infrastructure failure, application bug, security incident, external attack)
  • Assign severity level P1–P4
  • Open dedicated Slack thread for incident tracking
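The identification checklist maps naturally onto a structured incident record captured at triage. A sketch, assuming the field names and allowed values shown in the comments (they mirror the checklist, not an existing schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentRecord:
    """Initial identification captured at triage; fields mirror the checklist."""
    scope: str                    # "single customer" | "multiple customers" | "platform"
    affected_services: list       # e.g. ["API", "Dashboard", "Widget CDN"]
    data_exposure_suspected: bool
    ongoing: bool
    likely_cause: str             # infrastructure failure | application bug | security incident | external attack
    severity: str                 # P1–P4, assigned by the Engineering Lead
    slack_thread: str = ""        # link to the dedicated incident thread
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```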

4. Containment

The goal of short-term containment is to stop the spread of impact as quickly as possible, even before the root cause is understood.

| Scenario | Containment Action |
|---|---|
| DDoS or traffic attack | Enable Cloudflare 'Under Attack' mode; tighten WAF rules; block attacking IP ranges |
| Compromised credentials | Immediately rotate compromised credentials in GitHub secrets vault; invalidate active sessions if possible |
| Unauthorized API access | Identify and block source at Cloudflare WAF level; tighten rate limiting |
| Service crash or instability | Azure App Services health checks trigger auto-restart; manual restart via Azure portal if needed |
| Database connectivity loss | MongoDB Atlas replica set automatic failover; verify Atlas cluster health via portal |
| Data integrity issue | Disable the affected write path at the API level while investigating |
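Enabling 'Under Attack' mode can also be done programmatically via the Cloudflare API (a `PATCH` to the zone's `security_level` setting). A side-effect-free sketch that only builds the request; the zone ID and token are placeholders, and sending is left to the operator:

```python
import json
import urllib.request

def under_attack_request(zone_id: str, api_token: str) -> urllib.request.Request:
    """Build (but do not send) the Cloudflare API call that switches a zone's
    security level to 'under_attack'. zone_id and api_token are placeholders."""
    url = f"https://api.cloudflare.com/client/v4/zones/{zone_id}/settings/security_level"
    body = json.dumps({"value": "under_attack"}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```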

5. Eradication and Recovery

| Issue Type | Eradication Action |
|---|---|
| Application vulnerability | Fix developed, code reviewed, tested in staging, deployed via GitHub Actions CI/CD |
| Compromised credentials | All potentially compromised credentials rotated; access logs reviewed for misuse |
| Malicious traffic | Permanent WAF rules applied; IP ranges blocked; bot signatures updated |
| Infrastructure misconfiguration | Configuration corrected via Azure portal or MongoDB Atlas; change documented |

Recovery Procedure

  1. Deploy fix via GitHub Actions CI/CD: code review, Docker build, push to Azure Container Registry, staging verification, production promotion
  2. Validate health: confirm health-check endpoints are healthy; Sentry error rates return to baseline; response times normal
  3. Restore data if needed: MongoDB Atlas point-in-time recovery to any point within the backup window
  4. Purge CDN cache: Cloudflare cache purge if stale data was served via CDN
  5. Declare recovery: Engineering Lead closes the incident in the Slack thread; Incident Commander notified

| System | Recovery Mechanism |
|---|---|
| Application | Docker image rebuild and redeployment via GitHub Actions, typically under 30 minutes |
| Database | MongoDB Atlas point-in-time recovery; replica set automatic failover with zero data loss |
| Message queue | Azure Service Bus dead-letter queue; no messages lost; reprocessed after recovery |
| CDN | Cloudflare cache purge on demand; widget delivery from 300+ edge nodes continues even during origin issues |

6. Communication Plan

Internal Communication

| Severity | Communication Steps |
|---|---|
| P1 | Immediate Slack alert → Engineering Lead acknowledges within 15 minutes → Incident Commander notified immediately → active incident thread maintained throughout |
| P2 | Slack alert → Engineering Lead acknowledges within 30 minutes → Incident Commander notified → Slack thread maintained |
| P3 | Slack alert → Engineering team handles → updates in thread |
| P4 | Logged in issue tracker; addressed in normal development cycle |

Customer and External Communication

| Scenario | Notification Commitment |
|---|---|
| Confirmed personal data breach | Affected merchants notified within 72 hours of discovery (GDPR Article 33) |
| Extended service outage (P1) | Customer notification within 4 hours if outage persists beyond 1 hour |
| Partial service degradation (P2) | Customer notification at Incident Commander's discretion based on scope and duration |
| Security vulnerability (no breach) | Disclosed to enterprise customers upon request; patched before disclosure where possible |
| Regulatory notification | Relevant supervisory authority notified where required by GDPR or applicable law |

Breach Notification Content

A data breach notification includes:

  • Nature of the incident (what happened)
  • Categories and approximate number of data subjects affected
  • Categories of personal data involved
  • Likely consequences of the breach
  • Measures taken and proposed to address the breach
  • Contact point for further information
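The required content can be captured in a small template object so no field is omitted under time pressure. A sketch, assuming the field names shown (they follow the list above, which reflects GDPR Article 33(3); the class is illustrative, not an existing tool):

```python
from dataclasses import dataclass

@dataclass
class BreachNotification:
    """Fields required by the breach notification content list above."""
    nature: str               # what happened
    subjects_affected: str    # categories and approximate number of data subjects
    data_categories: str      # categories of personal data involved
    likely_consequences: str
    measures: str             # measures taken and proposed
    contact_point: str        # where to direct further questions

    def render(self) -> str:
        """Produce the notification body, one labeled line per required field."""
        return "\n".join([
            f"Nature of the incident: {self.nature}",
            f"Data subjects affected: {self.subjects_affected}",
            f"Personal data involved: {self.data_categories}",
            f"Likely consequences: {self.likely_consequences}",
            f"Measures taken and proposed: {self.measures}",
            f"Contact point: {self.contact_point}",
        ])
```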

7. Evidence Collection

Evidence is preserved to support root cause analysis, customer communication, and regulatory compliance. Before making remediation changes to a compromised system, the Engineering Lead must preserve relevant logs and configuration snapshots. Evidence must not be overwritten until the post-incident review is complete.

| Evidence Type | Source | Retention |
|---|---|---|
| Application error logs | Sentry (full error context, timestamps, stack traces, user context) | 90 days |
| API request logs | Application-level logging (endpoints, response times, status codes) | 90 days |
| Cloudflare logs | WAF events, traffic analytics, blocked requests | Per Cloudflare plan |
| MongoDB Atlas audit logs | Database operation logs (where enabled) | Per Atlas configuration |
| GitHub Actions logs | Deployment history and build logs | GitHub retention policy |

8. Metrics and KPIs

Baseline values will be established through 2026 incident tracking.

| Metric | Definition | Target |
|---|---|---|
| MTTD | Mean time from incident start to detection | To be baselined |
| MTTA | Mean time from alert to Engineering Lead acknowledgment | P1: ≤15 min; P2: ≤30 min |
| MTTR | Mean time from detection to full resolution | P1: ≤4 hours; P2: ≤8 hours |
| Notification compliance | Percentage of breach notifications sent within 72 hours | 100% |
| Post-incident review | Percentage of P1/P2 incidents with completed post-mortem | 100% |
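MTTD, MTTA, and MTTR are all means over (start, end) timestamp pairs, differing only in which events bound the interval. A sketch of the shared computation (function name is illustrative):

```python
from datetime import datetime, timedelta

def mean_duration(intervals: list) -> timedelta:
    """Mean of (start, end) datetime pairs.

    Used identically for MTTD (incident start → detection), MTTA
    (alert → acknowledgment), and MTTR (detection → resolution);
    only the events that bound each interval differ.
    """
    if not intervals:
        raise ValueError("no incidents recorded")
    total = sum(((end - start) for start, end in intervals), timedelta())
    return total / len(intervals)
```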

9. Post-Incident Review

A post-incident review is conducted after every P1 and P2 incident, and after selected P3 incidents. Reviews take place within 5 business days of resolution.

Review Agenda

  1. Timeline reconstruction: what happened and when (detection, containment, resolution)
  2. Root cause: what was the underlying cause
  3. Detection effectiveness: did monitoring catch this promptly and if not, why
  4. Response effectiveness: was the response fast and well-coordinated
  5. Customer impact: how many customers were affected and what was the impact
  6. Preventive measures: what changes will prevent recurrence
  7. Action items: specific, assigned, time-bound improvements

10. Incident Response Testing

Current State

  • Incident response procedures documented in this plan
  • Engineering team familiar with monitoring tools: Sentry, Cloudflare, Azure portal, MongoDB Atlas
  • Informal testing occurs through real incident handling

Planned (2027)

  • Formal tabletop exercises and incident simulation as part of SOC 2 audit preparation
  • Automated runbook documentation as the team scales

Incident Reporting and Security Inquiries

Tatvam Cloud Solutions, LLP

Email: [email protected]