# How an Invisible MTU Ceiling Almost Killed an Industrial IoT Deployment

## The Symptom: A Network That “Mostly” Worked
Picture hundreds of sensors across remote sites publishing telemetry every second to a central broker. The data arrives—mostly.
But every few minutes, a burst of tags drops. Clients reconnect. Alarms flicker. Engineers blame the application.
The network team runs captures.
TCP retransmissions: 12–18%. SYN storms. MQTT keepalives timing out.
No ACL denies. No route flaps. No bandwidth saturation.
It’s a black hole—packets vanish without a trace.
## Why This Is Incredibly Difficult to Diagnose
MTU/MSS mismatches are the perfect crime in networking—not because they’re rare, but because they’re not an “error” in the classical sense. They’re a normal function of the network that only becomes pathological when an application refuses to tolerate TCP’s built-in resilience.
| Observable Symptom | Common Blame | Reality |
| --- | --- | --- |
| Intermittent latency | Application bug | TCP retransmitting oversized segments |
| Reconnects | Firmware issue | Keepalive PINGREQ dropped |
| High CPU on broker | DDoS | Handling endless retransmit storms |
| Tag loss | Sensor failure | One 1460-byte payload black-holed |
It’s not broken; it’s normal.
TCP is designed to survive loss. It retransmits, shrinks windows, and backs off.
Most applications (HTTP, email, file sync) tolerate 5–15% loss.
They just feel “slow.”
Only when you run real-time, low-latency, high-reliability protocols like MQTT, as industrial IoT does, does “normal” become catastrophic.
## The Culprit: A 100-Byte Invisible Toll Booth
The topology:

```
[Broker] → [Gateway] → [DMVPN Headend] → [Carrier MPLS] → [Spoke Router] → [Remote Gateway] → [Edge Node]
```

The DMVPN tunnel was configured with:

```
ip mtu 1400
ip tcp adjust-mss 1360
```
That’s roughly 100–120 bytes of headers per packet:
| Layer | Overhead (bytes) |
| --- | --- |
| Inner IP | 20 |
| Inner TCP | 20 |
| GRE | 4 |
| Outer IP | 20 |
| IPsec ESP (AES-GCM) | ~36–56 |
| **Total** | **~100–120** |
So the effective path MTU is ~1380–1400 bytes.
But the broker’s OS?
Local interface MTU = 1500 → advertises MSS = 1460.
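The arithmetic is worth checking explicitly. A minimal sketch, using the per-layer byte counts from the table above with the ESP figure taken at its low end:

```python
# Per-layer header overhead in bytes, matching the table above.
OVERHEAD = {
    "inner_ip": 20,
    "inner_tcp": 20,
    "gre": 4,
    "outer_ip": 20,
    "esp_aes_gcm": 36,   # low end; can reach ~56 with padding
}

PHYSICAL_MTU = 1500  # the underlay Ethernet link

# Tunnel overhead = everything wrapped AROUND the inner IP packet.
tunnel_overhead = OVERHEAD["gre"] + OVERHEAD["outer_ip"] + OVERHEAD["esp_aes_gcm"]
effective_inner_mtu = PHYSICAL_MTU - tunnel_overhead              # largest inner IP packet
safe_mss = effective_inner_mtu - OVERHEAD["inner_ip"] - OVERHEAD["inner_tcp"]

print(tunnel_overhead)      # 60
print(effective_inner_mtu)  # 1440
print(safe_mss)             # 1400
```

With ESP at its ~56-byte worst case, the effective inner MTU drops to ~1420, which is why the tunnel was conservatively set to `ip mtu 1400` / `adjust-mss 1360`: headroom for the encryption overhead to vary.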
## The Asymmetry: A One-Way Speed Trap
| Direction | SYN Source | Advertised MSS |
| --- | --- | --- |
| Broker → Edge | Broker | 1460 |
| Edge → Broker | Edge (via spoke) | 1360 |
The spoke router clamps inbound SYN-ACKs to 1360 to protect the tunnel.
The headend router does not clamp outbound SYNs from the broker.
Result:
Broker sends 1460-byte payloads → plus inner headers and tunnel overhead ≈ 1560 bytes on the wire → dropped at the headend.
No ICMP Type 3 Code 4 comes back → black hole.
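A sketch of why only one direction breaks (assuming ~60 bytes of tunnel overhead, the low end computed earlier):

```python
PHYSICAL_MTU = 1500
TCPIP_HEADERS = 40      # inner IP (20) + inner TCP (20)
TUNNEL_OVERHEAD = 60    # GRE + outer IP + ESP, low end

def wire_size(mss: int) -> int:
    """Bytes on the underlay for a full-sized segment at this MSS."""
    return mss + TCPIP_HEADERS + TUNNEL_OVERHEAD

for direction, mss in [("Broker → Edge", 1460), ("Edge → Broker", 1360)]:
    size = wire_size(mss)
    verdict = "dropped (black hole)" if size > PHYSICAL_MTU else "fits"
    print(f"{direction}: MSS {mss} → {size} bytes on the wire → {verdict}")
```

The clamped direction (1360 → 1460 bytes on the wire) squeezes through; the unclamped one (1460 → 1560 bytes) silently exceeds the underlay MTU on every full-sized segment.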
## Why Most Apps Survived (But MQTT Didn’t)
TCP is resilient. It does three things when packets drop:
- Retransmit (fast retransmit after 3 dup ACKs)
- Shrink the window (congestion control)
- Back off (exponential RTO)
For HTTP, email, or file transfers?
→ A few retries = a 200ms delay. Users don’t notice.
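The retry cost can be put in rough numbers. A sketch assuming a 200 ms initial RTO and simple doubling (real stacks follow RFC 6298 and use fast retransmit, so treat this as an upper-bound illustration):

```python
def cumulative_retry_delay(initial_rto: float, retries: int) -> float:
    """Total time spent waiting if `retries` consecutive retransmits
    all time out, with exponential backoff (RTO doubles each attempt)."""
    total, rto = 0.0, initial_rto
    for _ in range(retries):
        total += rto
        rto *= 2
    return total

print(cumulative_retry_delay(0.2, 1))  # 0.2 s — invisible to a web page
print(cumulative_retry_delay(0.2, 5))  # 6.2 s — and in a black hole, retries never succeed
```

In a black hole the oversized segment is dropped deterministically, so the backoff never ends, and even the tiny PINGREQ queued behind it in the same TCP stream never gets sent.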
For MQTT over persistent sessions?

| MQTT Trait | Why It Breaks |
| --- | --- |
| QoS 1/2 | Requires PUBACK/PUBREC — one lost segment = app-level retry |
| Keepalives | PINGREQ/PINGRESP every 20–60 s — a drop = disconnect |
| Small, frequent messages | 1 drop = 10–20% of a burst |
| Long-lived sessions | Old MSS locked in for hours |
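To make the traffic pattern concrete, here is a minimal MQTT 3.1.1 packet encoder (hand-rolled for illustration; a real deployment would use a client library, and the topic name is hypothetical). Note how small the control packets are: it’s the broker’s full-sized data segments, not these, that hit the ceiling.

```python
import struct

PINGREQ = b"\xc0\x00"   # 2 bytes — trivially fits any MTU

def _remaining_length(n: int) -> bytes:
    """MQTT variable-length 'Remaining Length' encoding."""
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        out.append(digit | (0x80 if n else 0))
        if n == 0:
            return bytes(out)

def publish_qos1(topic: str, payload: bytes, packet_id: int) -> bytes:
    """PUBLISH at QoS 1 (fixed header 0x32): needs a PUBACK to complete."""
    t = topic.encode()
    body = struct.pack("!H", len(t)) + t + struct.pack("!H", packet_id) + payload
    return b"\x32" + _remaining_length(len(body)) + body

pkt = publish_qos1("plant/line1/temp", b'{"v": 21.5}', packet_id=1)
print(len(pkt))  # a few dozen bytes — yet one lost segment stalls the whole session
```

Because every PUBLISH at QoS 1 must be acknowledged and TCP delivers in order, a single black-holed segment blocks every packet behind it, including the keepalive.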
Analogy:
Think of a highway (tunnel) with a 14-foot bridge.
Most cars (HTTP) are 12 feet tall — they slow down, wait, eventually pass.
MQTT is a convoy of 13.9-foot trucks in a tight formation.
One truck hits the bridge → the whole convoy stops, backs up, and tries again.
The highway works—but the convoy fails.
## The Fix: Two Levers, One Result

### 1. Clamp MSS to 1300 at both ends

```
! Data-center firewall (broker side, Cisco ASA)
sysopt connection tcpmss 1300

! Remote router (edge side, Cisco IOS — interface name illustrative)
interface Tunnel0
 ip tcp adjust-mss 1300
```

→ Every TCP session now advertises a 1300-byte payload
→ Total inner packet ≤ 1340 bytes → fits the 1400-byte tunnel with headroom
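When you can’t touch the network gear, the same clamp can be applied host-side on the broker. A minimal sketch using the `TCP_MAXSEG` socket option (Linux; it must be set before the handshake begins):

```python
import socket

def clamped_socket(mss: int = 1300) -> socket.socket:
    """TCP socket that will advertise at most `mss` in its SYN."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before connect()/listen(); the kernel caps the advertised MSS.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, mss)
    return s

s = clamped_socket()
# Note: on Linux, getsockopt(TCP_MAXSEG) reports the effective value only
# once the connection is established; before that it returns a default.
print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG))
```

This only fixes sessions originating from hosts you control, so the router/firewall clamp remains the right primary lever; the socket option is a stopgap.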
### 2. Allow ICMP Type 3 Code 4 (PMTUD safety net)

Firewall → Access Control → Add Rule
ICMP Type 3 Code 4 (Fragmentation Needed) → Allow (bidirectional)

→ If anything does exceed the path MTU, the sender gets told:
Next-Hop MTU: 1400 → and auto-adjusts
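What that message carries is easy to see by decoding one. A sketch of a parser for the 8-byte ICMP header of a “Fragmentation Needed” message (RFC 1191 places the Next-Hop MTU in bytes 6–7):

```python
import struct

def parse_frag_needed(icmp: bytes):
    """Return the Next-Hop MTU if this is ICMP Type 3 Code 4, else None."""
    if len(icmp) < 8:
        return None
    itype, code, _checksum, _unused, next_hop_mtu = struct.unpack("!BBHHH", icmp[:8])
    if itype == 3 and code == 4:
        return next_hop_mtu
    return None

# Synthetic message: Type 3, Code 4, zero checksum, Next-Hop MTU 1400.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))  # 1400
```

This is exactly the feedback PMTUD relies on: if a firewall eats the message, the sending stack never learns the 1400-byte limit and keeps retransmitting into the hole.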
## The Result

| Metric | Before | After |
| --- | --- | --- |
| TCP retransmissions | 12–18% | < 0.5% |
| MQTT reconnects | 40–60 / hour | 0 |
| Tag loss | 2–5% | 0% |
| Latency (pub/sub) | 300–800 ms | 40–80 ms |
## Key Takeaways (The “Speed Limit” Lessons)

| Concept | Technical Truth | Highway Analogy |
| --- | --- | --- |
| MTU | Max IP packet size on a link | Bridge height (14 ft) |
| MSS | Max TCP payload (MTU − 40) | Truck height (13 ft) |
| Clamping | Force-advertise a lower MSS | Install a governor on every truck |
| PMTUD | ICMP 3/4 feedback | Traffic cop with a radar gun |
| Black hole | No ICMP → endless retries | Speeding ticket never delivered |
## Final Thought: This Is Normal TCP… Until It Isn’t
TCP was built to survive packet loss.
It assumes the network will tell it when it’s wrong.
In 99% of enterprise LANs, that assumption holds.
In tunnel-heavy, carrier-overlay IoT, it doesn’t.
The fix wasn’t a hack.
It was engineering the truth into the handshake.
Architecturally, the network shouldn’t be used to force application behavior: networks are only so flexible. The network sets requirements, and applications should be configured to meet them. In industrial IoT, however, where control of the architecture is often decentralized, it is wise to enforce those requirements in the network itself.
And if you hit this issue after a deployment is underway or complete, fixing it in the network gives you an asymmetric advantage: one change at two chokepoints repairs every session at once, instead of reconfiguring hundreds of endpoints.
In a situation like this there is little reason not to, because the traffic must conform whether or not the applications are configured to do it. It is good insurance, but it doesn’t excuse leaving the applications misconfigured.
> “The network isn’t broken. It’s just honest.”
> — Every MTU debug at 2 a.m.
Share this, pass it along. Save a fellow engineer from the black hole.
