Sending one SMS is trivial. Sending 10 million SMS messages per day — with sub-second delivery latency, 99.9%+ delivery rates, real-time status tracking, and zero single points of failure — is an engineering challenge that most platforms quietly struggle with. This article explains how it's done.
Whether you are evaluating an enterprise SMS provider or building your own messaging infrastructure, understanding these layers will help you ask the right questions and set realistic expectations.
Layer 1 — Carrier Connectivity: Direct Peering vs Aggregator Routing
Every SMS ultimately travels over a telecom operator's network. How a messaging platform connects to those networks is the most fundamental architectural decision — and the one with the greatest impact on delivery rates and latency.
Aggregator routing (the common approach)
Most SMS providers purchase capacity from wholesale SMS aggregators, who in turn connect to carriers. The typical chain looks like: your application → SMS platform → aggregator A → aggregator B → carrier → subscriber. Each hop adds latency (10–500ms per hop), introduces a potential failure point, and reduces delivery reliability. Grey routes — unauthorised paths that bypass carrier agreements — are common in aggregator networks and violate TRAI regulations.
Direct carrier peering (the enterprise approach)
A direct carrier connection means the messaging platform has a bilateral agreement with the telecom operator and connects via a dedicated SMPP (Short Message Peer-to-Peer) link directly to the carrier's SMSC (Short Message Service Centre). The chain becomes: your application → SMS platform → carrier SMSC → subscriber.
The benefits are substantial:
- ✓ Sub-100ms delivery latency for domestic messages
- ✓ Direct DLR (Delivery Receipt) from the carrier, not an estimated status from an aggregator
- ✓ No grey routing: 100% compliance with TRAI regulations
- ✓ Higher throughput limits negotiated directly with the carrier
- ✓ Dedicated connections that are not shared with other customers during peak periods
- ✓ Carrier-level visibility into why a message failed (handset off, number invalid, DND, etc.)
TextNation maintains direct SMPP connections to all four major Indian carriers: Airtel, Jio, BSNL, and Vi. This means every message is routed based on the subscriber's actual current carrier (not a static lookup), and delivery receipts reflect the actual handset status.
Layer 2 — The SMPP Protocol and Throughput Management
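SMPP (Short Message Peer-to-Peer) is the industry-standard protocol for SMS communication between messaging platforms and carrier SMSCs. It runs over TCP/IP on a long-lived, stateful session: the client binds once and then exchanges many requests and responses over the same connection, unlike a typical HTTP API call, where each request stands alone.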
Key SMPP parameters relevant to high-throughput deployments:
```
// Key SMPP throughput parameters
max_outstanding_pdus: 500    // concurrent unacknowledged messages per link
window_size: 256             // sliding window for pipelining
enquire_link_timer: 30s      // heartbeat to detect dead connections
session_init_timer: 10s      // connection establishment timeout
submit_sm_throttle: 1000/s   // per-link submit rate
```
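To sustain 1M+ messages per hour (about 280 messages/second on average, before accounting for bursts), a platform needs multiple parallel SMPP connections to each carrier. A single SMPP link with a window size of 256 and 500ms average latency can theoretically sustain ~500 messages/second, because at most 256 messages are in flight at once and each slot frees up after one round trip (256 / 0.5 s ≈ 512 messages/second). To reach 1,000+ messages/second per carrier, you need at least 2–4 parallel links, with automatic load balancing across them.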
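The link-count arithmetic can be written down as a small calculation. This is a minimal sketch: the function names and the `headroom` factor (the fraction of theoretical throughput you plan to rely on) are illustrative assumptions, not values from any carrier or SMPP specification.

```python
import math


def link_throughput(window_size: int, avg_latency_s: float) -> float:
    """Upper bound on messages/second for one link.

    At most `window_size` submits are in flight at once, and each slot
    frees up after one round trip of `avg_latency_s` seconds.
    """
    return window_size / avg_latency_s


def links_needed(target_mps: float, window_size: int, avg_latency_s: float,
                 headroom: float = 0.7) -> int:
    """Parallel links required to carry `target_mps` messages/second.

    `headroom` is an assumed planning factor: only this fraction of the
    theoretical per-link throughput is counted as usable.
    """
    usable_per_link = link_throughput(window_size, avg_latency_s) * headroom
    return math.ceil(target_mps / usable_per_link)


per_link = link_throughput(window_size=256, avg_latency_s=0.5)
print(per_link)                                    # 512.0
print(links_needed(1000, 256, 0.5))                # 3 links at 70% headroom
print(links_needed(1000, 256, 0.5, headroom=1.0))  # 2 links at the theoretical limit
```

With no headroom the estimate is 2 links, and with 70% usable capacity it is 3, which is consistent with the 2–4 range quoted above.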
Connection pooling — maintaining a pool of pre-established SMPP sessions rather than opening new connections per batch — is essential. SMPP session establishment takes 100–300ms; a cold-start approach would destroy throughput at scale.
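A rough illustration of the pooling idea follows. `SmppSession` here is a hypothetical stand-in that does no network I/O; a real pool would wrap sessions from an actual SMPP client library and rebind dropped links in the background. The point of the sketch is only that sessions are bound once and then reused round-robin, skipping any that are down.

```python
import itertools
from dataclasses import dataclass


@dataclass
class SmppSession:
    """Hypothetical stand-in for an already-bound SMPP session."""
    link_id: str
    bound: bool = True
    sent: int = 0

    def submit(self, msisdn: str, text: str) -> str:
        # A real session would send a submit_sm PDU here.
        self.sent += 1
        return f"{self.link_id}:{self.sent}"


class SessionPool:
    def __init__(self, link_ids):
        # Sessions are established once, up front, not once per batch.
        self.sessions = [SmppSession(link_id) for link_id in link_ids]
        self._round_robin = itertools.cycle(self.sessions)

    def acquire(self) -> SmppSession:
        """Return the next bound session, skipping links that are down."""
        for _ in range(len(self.sessions)):
            session = next(self._round_robin)
            if session.bound:
                return session
        raise RuntimeError("no bound SMPP sessions available")


pool = SessionPool(["link-1", "link-2", "link-3"])
pool.sessions[1].bound = False  # simulate a dropped link
ids = [pool.acquire().submit("910000000000", "test") for _ in range(4)]
print(ids)  # ['link-1:1', 'link-3:1', 'link-1:2', 'link-3:2']
```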
Layer 3 — Queue Architecture and Priority Management
Not all messages are equal. An OTP that expires in 60 seconds must be delivered immediately. A promotional campaign can afford to be delivered within a 2-hour window. Queue architecture must reflect this reality.
A production-grade messaging queue has at minimum three priority tiers:
| Tier | Examples | Throughput mode | Default TTL |
|---|---|---|---|
| P1 | OTPs, 2FA codes, payment alerts, banking notifications | Highest: direct to carrier with minimal queuing | 60–300 seconds |
| P2 | Order confirmations, delivery updates, appointment reminders | High: 5–30 second queue depth acceptable | 24 hours |
| P3 | Marketing campaigns, bulk announcements, promotional offers | Rate-limited: distributed across configured send window | 72 hours or campaign end |
At 10M messages/day, promotional campaigns represent the bulk of volume — often 70–80% of daily traffic. Rate-limiting P3 traffic is essential so that a sudden spike in promotional sends does not degrade OTP delivery times for other customers on the platform.
The queue itself is typically built on a distributed message broker (Apache Kafka, RabbitMQ, or AWS SQS, depending on the architecture). Consumer workers pull from the queue and submit to carrier SMPP connections. The number of consumer workers scales horizontally with load.
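To make the tiering concrete, here is a minimal in-process sketch of a priority queue with per-tier TTLs. It is not how a Kafka- or RabbitMQ-backed deployment would be built (those would typically use separate topics or queues per tier); the class names are hypothetical, the tiers are numbered 1 to 3 from most to least urgent, and the TTL defaults use the upper bounds from the tier descriptions above.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# Assumed defaults in seconds: 300 s, 24 h, and 72 h for tiers 1, 2, and 3.
DEFAULT_TTL_S = {1: 300, 2: 24 * 3600, 3: 72 * 3600}


@dataclass(order=True)
class QueuedMessage:
    priority: int                      # 1 = most urgent
    seq: int                           # tie-breaker: FIFO within a tier
    expires_at: float = field(compare=False)
    msisdn: str = field(compare=False)
    text: str = field(compare=False)


class TieredQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def enqueue(self, priority: int, msisdn: str, text: str, now: float) -> None:
        expires_at = now + DEFAULT_TTL_S[priority]
        heapq.heappush(
            self._heap,
            QueuedMessage(priority, next(self._seq), expires_at, msisdn, text),
        )

    def dequeue(self, now: float):
        """Pop the most urgent unexpired message, discarding expired ones."""
        while self._heap:
            msg = heapq.heappop(self._heap)
            if msg.expires_at > now:
                return msg
        return None


q = TieredQueue()
q.enqueue(3, "910000000001", "promo", now=0)
q.enqueue(1, "910000000002", "otp", now=0)
q.enqueue(2, "910000000003", "order update", now=0)
print(q.dequeue(now=10).text)   # otp
print(q.dequeue(now=10).text)   # order update
```

A worker that dequeues after the OTP's TTL has passed never sends it, which is usually the right behaviour for a code that has already expired on the application side.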
Layer 4 — DND Filtering, Blacklist Management, and DLT Validation
Before any promotional message enters the queue, it must pass through regulatory filters:
1. DND/NDNC registry check. TRAI publishes a National Do Not Call (NDNC) registry. Enterprise platforms download and index this list (updated daily by TRAI) and filter all promotional submissions against it before queuing. At 10M messages/day, this filter must be a fast in-memory lookup (a hash set or bloom filter), not a database query per message.
2. Internal blacklist / unsubscribe management. Numbers that have replied with "STOP," raised complaints, or are flagged as invalid must be excluded. This list is maintained in-platform and merged with the NDNC filter before every batch send.
3. DLT parameter validation. Every message must carry a valid Entity ID, Header, and Template ID. Invalid DLT parameters result in carrier rejection. Validation at submission time (before queuing) prevents wasted capacity and helps customers identify misconfigured API calls before they become a bulk delivery failure.
4. Time-window enforcement. Promotional SMS is restricted to 9am–9pm IST by TRAI. Messages submitted for promotional delivery outside this window must be held in the queue until the window opens, not silently dropped.
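The four checks above can be chained into a single pre-queue gate. The sketch below is illustrative only: the message field names (`msisdn`, `entity_id`, `header`, `template_id`) and the return labels are assumptions, the DND list and blacklist are plain Python sets, and the DLT step only checks that the fields are present rather than validating them against a DLT registry.

```python
from datetime import datetime, time, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))
WINDOW_OPEN, WINDOW_CLOSE = time(9, 0), time(21, 0)  # 9am to 9pm IST
DLT_FIELDS = ("entity_id", "header", "template_id")  # hypothetical field names


def check_promotional(msg: dict, dnd: set, blacklist: set, now: datetime) -> str:
    """Return 'filtered', 'reject_dlt', 'hold', or 'queue' for one message.

    `now` must be timezone-aware so it can be converted to IST.
    """
    number = msg["msisdn"]
    # Steps 1 and 2: in-memory set lookups, no per-message database query.
    if number in dnd or number in blacklist:
        return "filtered"
    # Step 3: presence check only; a real validator would verify the IDs.
    if not all(msg.get(name) for name in DLT_FIELDS):
        return "reject_dlt"
    # Step 4: outside the window the message is held, not dropped.
    local_time = now.astimezone(IST).time()
    if not (WINDOW_OPEN <= local_time < WINDOW_CLOSE):
        return "hold"
    return "queue"


msg = {"msisdn": "910000000001", "entity_id": "E1", "header": "TXTNTN",
       "template_id": "T1"}
midday = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)    # 11:30 IST
night = datetime(2024, 1, 15, 17, 0, tzinfo=timezone.utc)    # 22:30 IST
print(check_promotional(msg, set(), set(), midday))           # queue
print(check_promotional(msg, set(), set(), night))            # hold
print(check_promotional(msg, {"910000000001"}, set(), midday))  # filtered
```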
Layer 5 — Retry Logic and Failure Handling
Not every message delivers on the first attempt. Handsets are switched off, in areas with poor coverage, or temporarily unreachable. Carriers return specific error codes that indicate whether a retry is warranted:
| Error Code Category | Retry Strategy |
|---|---|
| Absent subscriber (handset off / unreachable) | Exponential backoff: 1 min → 5 min → 15 min → 30 min → 1 hr. Max 6 retries within TTL. |
| Subscriber busy (in a call, for older networks) | Retry after 2 minutes. Max 3 retries. |
| Invalid number / number not in service | No retry. Mark as permanently failed. Flag for customer. |
| DND blocked | No retry. Log as filtered. Update DND cache. |
| Carrier congestion (temporary throttle) | Retry with exponential backoff. Route to secondary carrier if available. |
| DLT validation failure | No retry. Return error to client immediately. Log DLT params for debugging. |
Intelligent retry logic — as opposed to blind retry-everything approaches — is critical for both delivery rates and cost efficiency. Retrying permanently invalid numbers wastes carrier capacity and inflates customer costs.
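The retry table can be expressed as a lookup from error category to a delay schedule. This is a sketch under stated assumptions: the category keys are hypothetical names, the sixth absent-subscriber retry reuses the final 1-hour delay (the table lists five delays but allows six retries), carrier congestion borrows the same schedule because the table gives no figures for it, and unknown categories are treated as non-retryable.

```python
from typing import Optional

ABSENT_BACKOFF_S = [60, 300, 900, 1800, 3600]  # 1 min, 5 min, 15 min, 30 min, 1 hr

RETRY_POLICY = {
    "absent_subscriber": {"delays": ABSENT_BACKOFF_S, "max_retries": 6},
    "subscriber_busy": {"delays": [120], "max_retries": 3},
    # Assumption: congestion reuses the absent-subscriber schedule.
    "carrier_congestion": {"delays": ABSENT_BACKOFF_S, "max_retries": 6},
    # Permanent failures: never retried.
    "invalid_number": None,
    "dnd_blocked": None,
    "dlt_validation_failure": None,
}


def next_retry_delay(category: str, retry_number: int,
                     elapsed_s: float, ttl_s: float) -> Optional[int]:
    """Seconds to wait before retry `retry_number` (1-based), or None to stop.

    Returns None for permanent failures, unknown categories, exhausted
    retries, or when the retry would land after the message TTL.
    """
    policy = RETRY_POLICY.get(category)
    if policy is None or retry_number > policy["max_retries"]:
        return None
    delays = policy["delays"]
    delay = delays[min(retry_number - 1, len(delays) - 1)]
    if elapsed_s + delay > ttl_s:
        return None
    return delay


print(next_retry_delay("absent_subscriber", 1, elapsed_s=0, ttl_s=86400))  # 60
print(next_retry_delay("absent_subscriber", 3, elapsed_s=400, ttl_s=86400))  # 900
print(next_retry_delay("invalid_number", 1, elapsed_s=0, ttl_s=86400))     # None
```

Checking the TTL before scheduling matters most for short-lived traffic: an OTP with a 300-second TTL gets at most the first retry or two before further attempts would be pointless.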
Layer 6 — Carrier Failover and Load Balancing
No single carrier is reliable 100% of the time. Planned maintenance windows, unplanned outages, and regional degradations happen. At 10M messages/day, a 30-minute carrier outage can mean 200,000+ undelivered messages if there is no failover.
Carrier failover requires:
- ✓ Real-time health monitoring of each carrier connection: tracking submission success rates, delivery rates, and response latencies on a per-minute basis.
- ✓ Automatic rerouting: when a carrier's success rate drops below a threshold (e.g., 95% within 5 minutes), traffic is automatically shifted to backup carriers.
- ✓ Subscriber MNP (Mobile Number Portability) awareness: routing messages to the subscriber's current carrier, not their original carrier. India has had significant MNP adoption; routing to the wrong carrier for a ported number adds an unnecessary hop.
Load balancing across carriers also improves cost efficiency. Carriers offer different pricing tiers for volume commitments. A well-designed platform routes messages to the most cost-effective carrier for each traffic type while maintaining delivery quality thresholds.
Layer 7 — Real-Time Delivery Monitoring
At 10M messages/day, delivery monitoring is not a reporting feature — it is an operational necessity. The monitoring stack needs to answer these questions in real time:
- What is the current delivery rate per carrier, per message type, per customer?
- Are any carrier connections degraded or failed?
- Is the queue depth growing (indicating a processing bottleneck)?
- Are any customers sending at abnormally high rates (potential abuse)?
- What is the average end-to-end latency for P1 messages in the last 5 minutes?
TextNation's analytics dashboard surfaces per-campaign and per-carrier delivery statistics in real time, with alerting for delivery rate drops below configurable thresholds. Enterprise customers can also receive webhooks for each delivery receipt event, enabling real-time status updates in their own applications.
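The first monitoring question, delivery rate per carrier, per message type, or per customer, is a grouped aggregation over delivery receipt events. The sketch below assumes each event is a dict with `carrier`, `tier`, and `status` keys; these field names are illustrative and do not describe TextNation's webhook payload or any carrier's DLR format.

```python
from collections import defaultdict


def delivery_rates(events, key_fields):
    """Delivery rate per group, where a group is the tuple of `key_fields` values."""
    counts = defaultdict(lambda: [0, 0])  # group -> [delivered, total]
    for event in events:
        key = tuple(event[name] for name in key_fields)
        counts[key][1] += 1
        if event["status"] == "delivered":
            counts[key][0] += 1
    return {key: delivered / total for key, (delivered, total) in counts.items()}


def below_threshold(rates, threshold):
    """Groups whose delivery rate has dropped below the alert threshold."""
    return sorted(key for key, rate in rates.items() if rate < threshold)


events = (
    [{"carrier": "airtel", "tier": "P1", "status": "delivered"}] * 3
    + [{"carrier": "airtel", "tier": "P1", "status": "failed"}]
    + [{"carrier": "jio", "tier": "P1", "status": "delivered"}] * 2
)
rates = delivery_rates(events, key_fields=("carrier",))
print(rates)                         # {('airtel',): 0.75, ('jio',): 1.0}
print(below_threshold(rates, 0.9))   # [('airtel',)]
```

Passing `("carrier", "tier")` as `key_fields` gives the per-carrier, per-tier breakdown from the same events.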
What to Look for in an Enterprise SMS Provider
Based on the architecture layers above, here are the questions to ask any enterprise SMS provider:
Q1: Do you have direct carrier connections or do you use aggregators?
Direct connections = lower latency, higher delivery rates, no grey routing.
Q2: Which Indian carriers do you have direct SMPP agreements with?
You want coverage of Airtel, Jio, BSNL, and Vi at a minimum.
Q3: What is your sustained throughput capacity per customer?
Ask for limits in messages/second, not messages/month. Burst capacity matters for OTP use cases.
Q4: How do you handle carrier failover?
Ask for specifics: detection threshold, failover time, whether it is automatic or manual.
Q5: Do you provide real DLR from carriers or synthetic status estimates?
Some providers return "delivered" based on submission confirmation, not actual carrier DLR.
Q6: What is your uptime SLA and how is downtime measured?
Get this in writing. Ask for historical uptime data.
Q7: How do you handle DLT compliance on my behalf?
Do they validate DLT parameters at submission? Do they manage DND filtering?
10M+ Messages. 99.9% Uptime. One API.
TextNation's enterprise messaging platform is built on the architecture described in this article — direct carrier peering, intelligent queuing, real-time monitoring, and full DLT compliance.