Reason for Outage (RFO) - cfly.co DNS Resolution Failure
Date of Incident: February 17, 2026
Duration: Approximately 03:00 MST to 05:16 MST (primary impact ~2 hours 16 minutes)
Affected Services: SIP registrations (via SBCs), Clearfly Portal API/management functions
Summary:
On February 17, 2026, at approximately 03:00 MST, SIP registrations began failing because internal DNS resolvers could not resolve cfly.co records, which are used throughout the core voice infrastructure. The failure was caused by a temporary upstream routing/peering issue that impacted reachability to the .co TLD anycast name servers from approximately 01:37 MST to 05:14 MST. Once cached cfly.co records expired, recursion could not re-fetch the zone's delegation (glue) from the .co servers, even though the authoritative cfly.co servers are co-located in our data centers. Services were restored via workarounds between 03:58 and 05:16 MST, and permanent preventive configurations have since been implemented and verified.
Impact:
SIP registrations and static peering failed at increasing rates throughout the outage window.
Secondary impact: Portal API failures after a cache flush during troubleshooting (~04:45–05:16 MST).
No evidence of a widespread .co TLD outage; the issue was path-specific to our upstream carrier/peering.
Timeline (all times in MST on 2026.02.17):
03:00 - Monitoring detects initial registration failures.
03:14 - Automated anomaly alert triggered (registration count threshold).
03:15 - Engineers paged.
03:19 - Incident conference bridge activated for coordinated troubleshooting.
03:38 - Troubleshooting narrows the issue to internal DNS recursion failures specific to cfly.co records; recursion for other domains is normal, and cfly.co resolves correctly when two of the three authoritative servers are queried directly (a reproduction of this check is sketched after the timeline).
03:58 - Temporary workaround: Added a cfly.co authoritative server as a direct resolver on the SBCs; registrations restored.
04:45 - A DNS cache flush during troubleshooting causes a secondary failure: Portal API/management becomes inaccessible because the flushed cfly.co records cannot be re-resolved.
04:51 - Stable workaround: Added cfly.co as an authoritative zone on the caching resolvers, bypassing Internet recursion for zones under our control. Root cause investigation continues.
05:16 - DNS config updated on Portal server, restoring full API functionality.
20:00 - Updated forwarding configuration phased in on internal caching servers. SBC configurations updated and DNS configurations reverted.
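For reference, the 03:38 check can be reproduced with a short script along the lines of the sketch below. It assumes the dnspython library is available; the resolver and authoritative addresses shown are placeholders, not Clearfly's actual servers.

```python
# Minimal sketch of the 03:38 diagnosis check, assuming dnspython is installed.
# Addresses below are placeholders, not Clearfly's real resolvers/servers.
import dns.message
import dns.query
import dns.rdatatype

ZONE = "cfly.co"
CACHING_RESOLVER = "10.0.0.53"                            # internal caching resolver (placeholder)
AUTHORITATIVE = ["10.0.1.53", "10.0.2.53", "10.0.3.53"]   # co-located authoritative servers (placeholders)

def check(server: str) -> str:
    """Send a direct UDP query for the zone's A record and report the outcome."""
    query = dns.message.make_query(ZONE, dns.rdatatype.A)
    try:
        response = dns.query.udp(query, server, timeout=3)
        answers = [r.to_text() for rrset in response.answer for r in rrset]
        return f"{server}: {answers or 'empty answer'}"
    except Exception as exc:  # a timeout here indicates the failing resolution path
        return f"{server}: FAILED ({exc})"

# During the incident, recursion via the caching resolver failed once cached
# records expired, while direct queries to the authoritative servers succeeded.
print(check(CACHING_RESOLVER))
for ns in AUTHORITATIVE:
    print(check(ns))
```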
Cause:
An upstream routing/peering issue (likely carrier-scheduled maintenance) caused one-way traffic blackholing toward the .co TLD anycast name servers from Clearfly’s network between approximately 01:37 MST and 05:14 MST. Once local cache TTLs expired, recursive resolvers could no longer re-fetch the cfly.co delegation (NS/glue records) from the .co servers, so cfly.co records stopped resolving, even though the authoritative cfly.co servers are co-located in our data centers. Depending on external recursion for a domain whose authoritative servers we operate created unnecessary exposure to external disruptions.
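As an illustration of the path-specific nature of the failure, a check along the lines of the sketch below (again assuming dnspython; the public resolver used to bootstrap the server list is an arbitrary placeholder) probes each .co TLD name server for the cfly.co delegation. From the affected path these probes would time out, while the same servers answered from other vantage points.

```python
# Sketch of a reachability check against the .co TLD name servers (dnspython assumed).
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

# Learn the .co TLD server names/addresses via a resolver that is still working.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["9.9.9.9"]  # any reachable public resolver; placeholder choice
tld_servers = []
for ns in resolver.resolve("co.", "NS"):
    for addr in resolver.resolve(ns.target, "A"):
        tld_servers.append((str(ns.target), addr.address))

# Probe each TLD server for the cfly.co delegation; we only care whether a
# response comes back at all, not about its contents.
for name, addr in tld_servers:
    query = dns.message.make_query("cfly.co.", dns.rdatatype.NS)
    try:
        dns.query.udp(query, addr, timeout=3)
        print(f"{name} ({addr}): reachable")
    except Exception as exc:
        print(f"{name} ({addr}): unreachable ({exc})")
```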
Preventive Actions (Implemented & Verified):
Configured domain-specific forwarders on all internal caching resolvers for authoritative domains under our control (e.g., cfly.co). Queries now forward directly to our authoritative servers, bypassing Internet recursion and external TLD/root dependencies; a verification sketch follows this list.
Updated SBC behavior to retain and use cached DNS records beyond TTL expiration during resolver failures (a serve-stale sketch illustrating the concept also follows below).
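The forwarder change can be spot-checked with a script along the lines of the following sketch (dnspython assumed; addresses are placeholders): it compares the answer returned by a caching resolver with a direct query to a co-located authoritative server.

```python
# Verification sketch for the domain-specific forwarders, assuming dnspython.
import dns.resolver

ZONE = "cfly.co"
CACHING_RESOLVER = "10.0.0.53"   # internal caching resolver (placeholder)
AUTHORITATIVE = "10.0.1.53"      # co-located authoritative server (placeholder)

def answers_from(server: str) -> set[str]:
    """Return the set of A records for the zone as seen by one server."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [server]
    return {rr.to_text() for rr in r.resolve(ZONE, "A")}

via_cache = answers_from(CACHING_RESOLVER)
direct = answers_from(AUTHORITATIVE)
print("forwarding OK" if via_cache == direct else f"mismatch: {via_cache} vs {direct}")
```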
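The second item describes serve-stale behavior. The sketch below illustrates the general idea in plain Python, using the standard library's resolver; it is an illustration of the concept only, not the SBC vendor's implementation.

```python
# Serve-stale sketch: keep answering from cache when resolution starts failing.
import socket
import time

_cache = {}  # hostname -> (addresses, expiry_timestamp)

def resolve_with_stale_fallback(hostname: str, ttl: int = 300) -> list[str]:
    now = time.time()
    cached = _cache.get(hostname)
    if cached and cached[1] > now:
        return cached[0]                       # fresh cache hit
    try:
        infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
        addresses = sorted({info[4][0] for info in infos})
        _cache[hostname] = (addresses, now + ttl)
        return addresses
    except socket.gaierror:
        if cached:
            return cached[0]                   # resolver failed: serve the stale entry
        raise                                  # nothing cached; surface the failure

# The first call populates the cache; later calls keep returning the last known
# addresses even if resolution begins failing after the TTL has expired.
print(resolve_with_stale_fallback("example.com"))
```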
Status:
Incident resolved. No ongoing impact.