Scale Readiness

Building an eCommerce platform at scale is a demanding endeavour that requires deep engineering effort across Performance Engineering, Site Reliability and Incident Management. This section describes our approach to building and operating large-scale retail platforms on JC.

Traffic modelling

We start by building a simulation model of all possible user journeys, funnels and conversion percentages for an eCommerce storefront.
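A funnel-based traffic model can be sketched in a few lines of Python. This is only an illustration: the step names, the conversion percentages and the `expected_users_per_step` helper are hypothetical placeholders, not figures or code from our platform. It shows how entry traffic is propagated through a journey so that the load on each step can be estimated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunnelStep:
    name: str
    conversion: float  # fraction of users from the previous step who reach this one


# Hypothetical storefront journey; names and percentages are placeholders.
CHECKOUT_FUNNEL = [
    FunnelStep("home", 1.00),
    FunnelStep("product_listing", 0.60),
    FunnelStep("product_detail", 0.50),
    FunnelStep("cart", 0.20),
    FunnelStep("checkout", 0.50),
]


def expected_users_per_step(entry_users: int, funnel: list[FunnelStep]) -> dict[str, float]:
    """Propagate entry traffic through the funnel, one step at a time."""
    users = float(entry_users)
    result = {}
    for step in funnel:
        users *= step.conversion
        result[step.name] = users
    return result


if __name__ == "__main__":
    for name, users in expected_users_per_step(10_000, CHECKOUT_FUNNEL).items():
        print(f"{name:16s} {users:8.0f}")
```

With these placeholder values, 10,000 users entering the storefront translate into roughly 300 users reaching checkout, which is the load a checkout scenario would then need to reproduce.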

Scalability Scenarios

The traffic model is then augmented with scenarios that simulate actual user behavior.

Scale Readiness Example

The example shows how these techniques are applied by our SRE teams. We use the traffic model to generate synthetic traffic, which is executed against a replica of the production environment.

Scenario: Checkout flow server side key performance metrics

Scenario: Checkout flow user arrival rate

Scenario: Checkout flow response time
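The arrival-rate and response-time views of a scenario can both be derived from raw request samples. The sketch below assumes each sample is a `(timestamp_seconds, response_time_ms)` pair and uses the nearest-rank method for the 95th percentile; the function name, the sample shape and the choice of p95 are assumptions made for illustration, not a description of our tooling.

```python
import math
from collections import defaultdict


def per_minute_stats(samples):
    """Group request samples by minute.

    samples: iterable of (timestamp_seconds, response_time_ms) pairs.
    Returns {minute_index: (arrivals, p95_response_ms)}.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[int(ts // 60)].append(latency_ms)

    stats = {}
    for minute, latencies in sorted(buckets.items()):
        latencies.sort()
        # Nearest-rank percentile: the smallest value that has at least
        # 95% of the minute's samples at or below it.
        rank = math.ceil(0.95 * len(latencies))
        stats[minute] = (len(latencies), latencies[rank - 1])
    return stats
```

Plotting the first element of each tuple over time gives the user arrival rate; plotting the second gives the response-time curve for the same scenario.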

Monitoring and Observability Tech Stack

Our monitoring tool-chain helps us watch and understand the system’s state using a predefined set of metrics.

Metrics Pipeline Architecture

Our metrics pipeline is central to our approach and uses proven open source components which record real-time metrics at scale. This helps teams stay on top of issues and get proactive feedback.

Examples of metrics we monitor
  • CPU Utilization - Percentage of CPU capacity in use
  • Database Connections - Number of client network connections to the instance
  • Read + Write IOPS - Average number of disk read and write operations per second
  • Disk Queue Depth - Number of read/write I/O requests waiting for disk access

Alerting Layer

The alerting tool-chain notifies teams about critical events and about metrics that exceed their threshold limits.

Alert Examples
  • Grafana: Slack and PagerDuty alerts when a threshold is breached within a 1-minute window
  • Sentry: Slack alerts when the code error rate exceeds 10 errors per minute
  • Business KPIs: order loss per minute, OMS processing lag above 1k per 5 minutes
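A threshold rule of this kind can be sketched as follows. The thresholds and windows come from the examples above; the rule names, the `AlertRule` type and the reading of "per N minutes" as "every per-minute sample in the last N minutes exceeds the threshold" are assumptions made for illustration. In practice the evaluation is done by the alerting tools themselves rather than by custom code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str             # hypothetical metric name
    threshold: float
    window_minutes: int


RULES = [
    AlertRule("code_error_count", threshold=10, window_minutes=1),
    AlertRule("oms_processing_lag", threshold=1000, window_minutes=5),
]


def breached(rule: AlertRule, per_minute_values: list[float]) -> bool:
    """True when every sample in the most recent window exceeds the threshold.

    Assumes one sample per minute, oldest first. A window that is not yet
    full never fires, which avoids alerting on partial data.
    """
    window = per_minute_values[-rule.window_minutes:]
    return len(window) == rule.window_minutes and all(
        value > rule.threshold for value in window
    )
```

Requiring the whole window to breach trades a little detection latency for fewer false alarms on short spikes; a rule that should fire on a single bad sample simply uses a 1-minute window.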

Traffic Flow Debugging

We’ve also created a custom blueprint that speeds up analysis and helps identify the exact issue between multiple points in the infrastructure. Here’s an example:

Production Readiness

We define Production Readiness as the process that ensures each feature push is production ready for live customers. We follow a checklist which audits each platform service and the cloud infrastructure. The high-level Production Readiness process is outlined below:

Platform Services
  • Monitoring Layer
  • Observability Layer
  • Alerting Layer
  • Pre-warming Data
  • Regression Test Report
  • Define backup plan

Cloud Infrastructure
  • Monitoring Layer
  • Observability Layer
  • Alerting Layer
  • Auditing

Peak Event Planning

Before a peak event we make sure systems are scaled sufficiently to handle the expected traffic:

  • SLA calculation for 5x the expected traffic
  • Performance Engineering cycles
  • Chaos Engineering
  • Production resource upgrades
  • Pre-warming data
  • War Room setup with a 24x7 support team
  • Sale dashboard setup for a unified view of events
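The 5x sizing step can be illustrated with a small capacity calculation. This is a sketch under stated assumptions: the function name, the per-instance throughput figure and the 70% target utilization are hypothetical, and only the 5x multiplier comes from the checklist above.

```python
import math


def instances_needed(
    expected_peak_rps: float,
    per_instance_rps: float,
    multiplier: float = 5.0,           # design for 5x the expected traffic
    target_utilization: float = 0.7,   # hypothetical headroom: run instances at 70%
) -> int:
    """Number of instances required to serve the design load with headroom."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    design_rps = expected_peak_rps * multiplier
    usable_rps_per_instance = per_instance_rps * target_utilization
    return math.ceil(design_rps / usable_rps_per_instance)
```

For example, an expected peak of 2,000 requests per second on instances measured at 250 requests per second each gives a design load of 10,000 requests per second and 58 instances. The per-instance figure would come from the Performance Engineering cycles rather than from an estimate.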
