Backup and Recovery
This document describes WiserReview's approach to data backup, recovery, and business continuity. Every layer of the platform has defined recovery capabilities designed to protect customer data from loss, corruption, or infrastructure failure.
| Attribute | Value |
|---|---|
| RTO Target | 4 hours |
| RPO Target (Database) | Near-zero |
| Owner | Security Team |
1. Backup Strategy
Primary Database: MongoDB Atlas
All review data, merchant accounts, customer emails, product catalogs, and configuration
| Attribute | Detail |
|---|---|
| Backup type | Continuous automated backups, managed by MongoDB Atlas |
| Point-in-time recovery | Data can be restored to any point within the backup window |
| Backup frequency | Continuous (near real-time) |
| Geographic redundancy | Replica sets distribute data across multiple availability zones |
| Encryption | AES-256, consistent with primary data encryption |
| High availability | Minimum 3-node replica sets; automatic failover within seconds if the primary node fails; no loss of majority-acknowledged writes |
| Access control | Backup access restricted to the Security Officer and Engineering Lead only |
File Storage: Azure Blob Storage and AWS S3
Review photos and videos uploaded by reviewers
| Attribute | Detail |
|---|---|
| Redundancy | Azure Blob Storage: Locally Redundant Storage (LRS) minimum; Geo-Redundant Storage (GRS) where configured. AWS S3: 99.999999999% (11 nines) durability by design. |
| Encryption | Server-side AES-256 encryption at rest |
| Access control | Time-limited signed URLs for file access; no public bucket access |
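The time-limited signed-URL pattern can be sketched with a stdlib-only example. This is a concept illustration only — the HMAC construction, the `SECRET` constant, and the `expires`/`sig` query parameters are assumptions for the sketch; production access uses S3 presigned URLs or Azure SAS tokens, which the storage services generate and validate themselves.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET = b"illustrative-signing-key"  # assumption: real keys live in a secret store


def sign_url(path: str, ttl_seconds: int = 300) -> str:
    """Build a time-limited signed URL (concept sketch, not real S3/Azure SAS)."""
    expires = int(time.time()) + ttl_seconds
    msg = f"{path}:{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(path: str, expires: int, sig: str) -> bool:
    """Reject expired links or links whose signature does not match the path."""
    if time.time() > expires:
        return False
    msg = f"{path}:{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Because the signature binds both the path and the expiry, a leaked URL grants access to exactly one object for a bounded window — the property the access-control row above relies on.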
Message Queue: Azure Service Bus
Asynchronous email campaign delivery and event-driven workflows
| Attribute | Detail |
|---|---|
| Dead-letter queue | Messages that cannot be processed are automatically moved to a dead-letter queue; no messages are silently dropped |
| Retry logic | Automatic retry with configurable intervals before dead-lettering |
| Recovery | Dead-letter messages can be inspected and reprocessed after underlying issues are resolved |
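The retry-then-dead-letter behavior can be mimicked in a few lines. The in-memory queues and the `max_deliveries` parameter are assumptions for illustration — Azure Service Bus applies this logic server-side via its max delivery count setting, so application code never implements it directly.

```python
from collections import deque
from typing import Any, Callable, Iterable


def process_queue(messages: Iterable[Any],
                  handler: Callable[[Any], None],
                  max_deliveries: int = 3) -> list:
    """Deliver each message up to max_deliveries times, then dead-letter it.

    Mirrors, in miniature, what Azure Service Bus does server-side: failed
    deliveries are retried, and exhausted messages are moved to a dead-letter
    queue rather than dropped.
    """
    queue = deque((msg, 0) for msg in messages)
    dead_letter = []
    while queue:
        msg, attempts = queue.popleft()
        try:
            handler(msg)
        except Exception:
            attempts += 1
            if attempts >= max_deliveries:
                dead_letter.append(msg)        # preserved for later inspection
            else:
                queue.append((msg, attempts))  # re-queued for another attempt
    return dead_letter
```

The key property for recovery is the return value: every message that exhausts its retries survives in the dead-letter list, so it can be inspected and reprocessed after the underlying issue is fixed.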
Application Layer: Docker Containers
Deployments are reproducible via CI/CD, not backed up in the traditional sense
| Attribute | Detail |
|---|---|
| Container images | All Docker images stored in Azure Container Registry with version history |
| Previous versions | Prior deployment images retained in the registry and redeployable at any time |
| Source code | All source code stored in GitHub with full version history |
| Rollback | Azure App Services supports deployment slot swapping for instant rollback to the previous version |
2. Recovery Objectives
| Metric | Target | Basis |
|---|---|---|
| RTO (Recovery Time Objective) | 4 hours | Azure App Services auto-recovery, MongoDB Atlas failover, and CI/CD redeployment capabilities |
| RPO: Database | Near-zero | MongoDB Atlas continuous backups and replica set replication provide near real-time redundancy |
| RPO: Other systems | 1 hour | File storage and queue-based systems; messages in Azure Service Bus dead-letter queue are recoverable |
| Widget delivery RTO | Near-zero | Cloudflare CDN continues serving cached widget assets from 300+ edge locations even during origin outages |
RTO and RPO targets are based on infrastructure capabilities. Actual recovery times depend on the nature and scope of the incident and will be refined as operational history is established.
3. Recovery Capabilities
Database Recovery
| Scenario | Mechanism | Expected Time |
|---|---|---|
| Primary node failure | Replica set automatic failover: secondary promoted to primary | Seconds to 1–2 minutes (automatic) |
| Data corruption or accidental deletion | Point-in-time recovery to any point within backup window | 1–4 hours depending on data volume |
| Full cluster failure | MongoDB Atlas cluster restoration from backup snapshot | Depends on data volume and Atlas tier |
Application Recovery
| Scenario | Mechanism | Expected Time |
|---|---|---|
| Crashed container | Azure App Services health checks trigger automatic container restart | 1–3 minutes (automatic) |
| Faulty deployment | Rollback to previous Docker image via Azure App Services deployment slot swap | 5–15 minutes |
| Full service rebuild | GitHub Actions CI/CD pipeline: code → Docker build → Azure Container Registry → Azure App Services | 20–30 minutes |
Infrastructure Recovery
| Scenario | Mechanism |
|---|---|
| Azure availability zone outage | Azure App Services with zone redundancy or manual failover to secondary region |
| Cloudflare edge issue | 300+ global edge nodes provide inherent redundancy: widget delivery continues from other edges |
| Redis cache failure | Application falls back to direct database queries; Redis cache is rebuilt from database on restart |
| Azure Service Bus issue | Dead-letter queue preserves all messages; reprocessed after service restoration |
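The Redis fallback in the table above follows a cache-aside pattern. The sketch below assumes hypothetical `cache` and `db` interfaces and treats any cache error as a miss, so reads degrade to direct database queries when Redis is unavailable — which is why a cache failure costs performance, not availability.

```python
from typing import Any


def get_review(review_id: str, cache: Any, db: Any) -> Any:
    """Cache-aside read that survives a cache outage.

    Any cache failure is treated as a miss: the read falls through to the
    database (the source of truth), and the cache is repopulated on a
    best-effort basis once it is reachable again.
    """
    try:
        cached = cache.get(review_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache down: fall through to the database
    value = db.get(review_id)        # source of truth
    try:
        cache.set(review_id, value)  # best-effort repopulation
    except ConnectionError:
        pass  # ignore: cache is rebuilt naturally once Redis recovers
    return value
```

Because every write path ultimately lands in the database, the cache holds no unique state and needs no backup — restarting Redis and letting reads repopulate it is the entire recovery procedure.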
4. Disaster Recovery Scenarios
| Scenario | Response |
|---|---|
| Azure region partial outage | Auto-scaling redistributes load; MongoDB Atlas replica sets span availability zones for database resilience |
| Full Azure region outage | Failover to secondary Azure region; MongoDB Atlas can be configured for cross-region replicas |
| Cloudflare outage | Widget delivery may be impacted; review data and dashboard remain accessible through direct DNS fallback |
| MongoDB Atlas outage | Atlas SLA at 99.995% uptime; replica sets provide database-level redundancy; severe outage handled per Atlas DR procedures |
| Mass data corruption | MongoDB Atlas point-in-time recovery to last known good state; scope assessed before restoration to minimize data loss |
| Security breach requiring platform shutdown | Cloudflare can enable maintenance mode at the edge; Engineering Lead coordinates shutdown and recovery per Incident Response Plan |
5. Backup Testing
Current Practice
- MongoDB Atlas automatic failover is exercised routinely through normal replica set operations (e.g. elections during rolling maintenance)
- Application rollback is exercised through normal deployment operations (deployment slot swaps)
- Recovery procedures reviewed and updated as infrastructure evolves
Planned Formalization (2027)
- Periodic scheduled recovery drills: database point-in-time recovery test and application rollback test
- Formalized as part of the 2027 SOC 2 audit preparation
- Recovery tests documented with results and improvement notes
6. Roles and Responsibilities
| Role | Responsibility |
|---|---|
| Security Officer | Oversight of backup and recovery policy; approves changes to recovery objectives; accountable for recovery in a crisis |
| Engineering Lead | Implements and maintains backup configurations; executes recovery procedures; monitors backup health; maintains Azure, MongoDB Atlas, and AWS S3 configurations |
| Development Team | Ensures application code handles infrastructure failures gracefully (retry logic, fallback behavior); participates in recovery procedures as directed |
7. Monitoring and Alerting
| Monitoring | Implementation |
|---|---|
| Database health | Health-check endpoints monitor MongoDB Atlas connectivity; failure triggers immediate Slack alert |
| Cache health | Health-check endpoints monitor Redis connectivity; failure triggers Slack alert |
| Application health | Azure App Services health checks trigger automatic container restart; Sentry monitors error rates |
| Backup status | MongoDB Atlas provides backup status and alerts through the Atlas portal; Engineering Lead reviews periodically |
| Queue health | Azure Service Bus dead-letter queue depth monitored; elevated counts indicate processing issues |
| Storage health | Azure Blob Storage and AWS S3 availability monitoring through respective cloud portals |
All critical infrastructure alerts flow to the engineering team's Slack #alerts channel.
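A minimal health-check shape consistent with the table above — the component names, probe callables, and `slack_payload` format are illustrative assumptions; real probes would ping MongoDB Atlas and Redis, and the payload would be POSTed to the team's Slack incoming-webhook URL.

```python
import json
from typing import Callable, Dict, Optional


def health_report(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Run each named probe and collect per-component status.

    `probes` maps a component name to a zero-argument callable that raises
    on failure (e.g. a MongoDB ping or Redis PING in production).
    """
    report = {}
    for name, probe in probes.items():
        try:
            probe()
            report[name] = "healthy"
        except Exception as exc:
            report[name] = f"unhealthy: {exc}"
    return report


def slack_payload(report: Dict[str, str]) -> Optional[str]:
    """Format failing components into a Slack-webhook-style JSON payload."""
    failing = {k: v for k, v in report.items() if v != "healthy"}
    if not failing:
        return None  # nothing to alert on
    lines = "\n".join(f"- {k}: {v}" for k, v in failing.items())
    return json.dumps({"text": f"Infrastructure alert\n{lines}"})
```

Returning `None` when everything is healthy keeps the #alerts channel quiet by construction: only genuine failures produce a payload.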