In Pursuit of the Cloud-Native Router Instance

Perhaps the biggest technical barrier to cloud-native in telecom is reconciling microservice behavior with network service behavior.  If service QoS is limited by microservice and cloud-native implementation, then cloud-native suitability for telecom is in doubt.  If service setup, change, and restoration are limited by the way microservices are orchestrated, then cloud-native suitability for telecom is in doubt.  But do those limitations exist?  It depends.  We have to look at what conditions cloud-native software imposes, then see if telecom services can tolerate them.

The general vision of cloud-native is that functionality of any sort is created by stitching together a series of small components called “microservices”, which are designed so that they can be run anywhere, scale anywhere, and still be assembled into a workflow.  For this to work, it’s essential that they’re “stateless”, which means that microservices can’t store anything within themselves.  If they did, that internal information would be lost if the microservice failed (in which case its replacement wouldn’t function the same way) or was scaled (in which case the new instance wouldn’t have the information, and wouldn’t function the same way).
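To make the stateless point concrete, here is a minimal Python sketch.  Everything in it is my own illustration: `ExternalStore`, `StatefulCounter`, and `StatelessCounter` are hypothetical names, and the "store" is just an in-process dictionary standing in for whatever shared state service a real deployment would use.

```python
from typing import Dict


class ExternalStore:
    """Hypothetical shared state service that lives outside every instance."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}

    def get(self, key: str) -> int:
        return self._data.get(key, 0)

    def put(self, key: str, value: int) -> None:
        self._data[key] = value


class StatefulCounter:
    """Keeps its count internally, so the count dies with the instance."""

    def __init__(self) -> None:
        self.count = 0

    def handle(self) -> int:
        self.count += 1
        return self.count


class StatelessCounter:
    """Holds no data of its own; every request reads and writes the store."""

    def __init__(self, store: ExternalStore) -> None:
        self.store = store

    def handle(self, key: str) -> int:
        value = self.store.get(key) + 1
        self.store.put(key, value)
        return value


# Stateful case: the replacement instance starts over from zero.
old_instance = StatefulCounter()
old_instance.handle()
old_instance.handle()
stateful_after_replace = StatefulCounter().handle()

# Stateless case: the replacement picks up where the original left off.
store = ExternalStore()
first_instance = StatelessCounter(store)
first_instance.handle("session-a")
first_instance.handle("session-a")
stateless_after_replace = StatelessCounter(store).handle("session-a")
```

The stateful replacement answers as if the first two requests never happened, while the stateless replacement continues the sequence because the state was never inside the instance in the first place.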

Statelessness is insidious.  A lot of IP networking is seen as being stateless because it’s “connectionless”; there are no hard-allocated resources associated with a relationship between two IP endpoints.  That’s not the same as stateless, though.  A routing table, for example, is stored in a router.  If another instance of that router is spun up, either to replace the original or for load scalability, the new one would have to rebuild the routing table to function.  Not only that, if the new router was located anywhere other than the exact same place, on the same trunks, as the old, then routes could change, in which case other routing tables would also need to be updated.

In scaling IP, if we were to develop parallel paths to share a load, the normal practices of adaptive routing would indeed use them, but only after the routes had “converged” (the other routing tables had accommodated the new choice), and only if the routing protocols could properly detect the congested state of the first device and shift things.

The key point here is that a relationship between two IP addresses (which, in OSI terms, we could call a “session”) is stateful as far as the users are concerned.  They, in most cases, will actually employ a TCP process, which keeps state for sequencing and congestion control.  We can switch around the route of the session within the network, but we can’t make the substitution of one instance for another invisible, nor can we make scaling invisible, in impact-on-session terms.

You may wonder what this adds up to, so let’s get to it.  If we assume that we implemented routers as monolithic components, the result would be something not based on microservices and not rightfully cloud-native.  NFV, as defined, creates virtual devices which are monolithic components.  These can be spun up and scaled or whatever, as real physical routers are, and with the same effect on services.

But what about “cloud-native”?  If we were to construct a “cloud-native router” we’d want to decompose the router function into microservices.  This process poses a number of questions, some perhaps minor and some perhaps critical.  The thing we have to keep in mind is that the microservices are presumed to be independently hosted, so they have to be connected by a network.  There’s latency associated with that, so we’d not want to divide router functions into microservices in a way that forces us, for performance reasons, to put them all in the same place when we deploy.

The first question, and probably the most fundamental, is how we divide router functionality.  In traditional routers, all the functionality is controlled from the data plane, meaning that router management, topology exchange, and other control features are all riding along the same paths as the data.  If we were to follow the monolithic-router model, we would necessarily have a single microservice that had to represent a port/trunk process to converge the data and signaling.  Do we then have a microservice for every port and trunk?  Where do we do the routing part?

The SDN movement, at least in its purist form, might have suggested a solution.  If we were to separate the control and management plane from the data plane, which you may recall Amazon suggests in its Nitro model for security/stability purposes, the control-and-management piece could well be a good candidate for microservices.  This would suggest to me that the “ideal” model for cloud-native implementation of heavyweight forwarding devices like routers would be a white-box switch at the data plane, separating out control and management functions for implementation through microservices.

The notion of the separate but centralized controller in SDN is a bit counter to the scalable-resilient-distributed model of microservices, but separating the control and management planes doesn’t mean you necessarily need a central controller.

A simple model of a separate control/management plane would consist of white-box switches (or software instances) that are paired with an agent process that provides control and management features.  The network connecting these agents would be an independent “VPN”, and control and management traffic would be separated from the inline IP flow at the access level, and thereafter flow on this separate network.

If you like the notion of adaptive discovery, you could continue to exchange topology and status information about the network within this management/control network, and feed the topology changes to the associated switches/instances as forwarding table updates.  If you like a more centralized approach, you could feed a resilient process set with a network-wide database from the same management/control network, and have each switch obtain policy-filtered topology information (again as forwarding table updates) from that central point.
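As a rough sketch of the "feed topology changes to the switch as forwarding table updates" idea, here is how a control agent might behave.  This is my own illustration under stated assumptions: the topology is a simple cost-weighted graph, the agent uses Dijkstra's shortest-path algorithm (one choice among many), and `next_hops` and `table_updates` are hypothetical helper names, not anything from a real router or SDN API.

```python
import heapq
from typing import Dict, Optional

Topology = Dict[str, Dict[str, int]]  # node -> {neighbor: link cost}


def next_hops(topology: Topology, source: str) -> Dict[str, str]:
    """Dijkstra from `source`; returns destination -> first hop on the best path."""
    dist: Dict[str, float] = {source: 0}
    first_hop: Dict[str, str] = {}
    heap = [(0, source, "")]  # (cost, node, first hop used to reach node)
    done = set()
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node != source:
            first_hop[node] = hop
        for neighbor, link_cost in topology.get(node, {}).items():
            new_cost = cost + link_cost
            if neighbor not in done and new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                # Leaving the source, the neighbor itself is the first hop.
                heapq.heappush(heap, (new_cost, neighbor, hop or neighbor))
    return first_hop


def table_updates(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Entries the switch must change; None means withdraw the destination."""
    updates: Dict[str, Optional[str]] = {}
    for dest, hop in new.items():
        if old.get(dest) != hop:
            updates[dest] = hop
    for dest in old:
        if dest not in new:
            updates[dest] = None
    return updates


# Hypothetical three-node network as seen by the agent paired with switch "A".
topology = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
table_before = next_hops(topology, "A")

# The B-C trunk fails; the agent learns of it over the control network.
del topology["B"]["C"]
del topology["C"]["B"]
table_after = next_hops(topology, "A")

# Only the changed entries are pushed down to the data-plane switch.
pushed = table_updates(table_before, table_after)
```

The design point is that the switch never computes anything; it only receives the delta.  The same `table_updates` step would apply whether the new table came from a distributed exchange among agents or from a central, policy-filtered database.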

You could also take another approach altogether.  Suppose that every “edge” element in our future network operated as a node in what used to be called “Next-Hop Resolution Protocol” (NHRP).  The concept is similar to the way OpenFlow works when there’s no forwarding entry for a given packet, except that this time we could query a central point and receive a source-route vector that would steer that packet (and future flows to that destination) to the desired off-ramp.
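Here is a minimal sketch of that query-on-miss behavior at an edge element.  The names (`EdgeNode`, `central_resolve`, `CENTRAL_ROUTES`), the element names, and the address prefix are all hypothetical; a real central point would compute routes rather than look them up in a fixed dictionary.

```python
from typing import Callable, Dict, List

Route = List[str]  # ordered elements from the ingress edge to the off-ramp


class EdgeNode:
    """Edge element that asks a central point for a source route on a miss."""

    def __init__(self, name: str, resolve: Callable[[str, str], Route]) -> None:
        self.name = name
        self.resolve = resolve          # hypothetical query to the central point
        self.cache: Dict[str, Route] = {}
        self.queries = 0                # how many times we had to ask

    def route_for(self, destination: str) -> Route:
        if destination not in self.cache:
            self.queries += 1
            self.cache[destination] = self.resolve(self.name, destination)
        return self.cache[destination]


# Hypothetical central table: (ingress edge, destination) -> source-route vector.
CENTRAL_ROUTES = {
    ("edge1", "10.2.0.0/16"): ["edge1", "core3", "edge7"],
}


def central_resolve(ingress: str, destination: str) -> Route:
    return list(CENTRAL_ROUTES[(ingress, destination)])


edge = EdgeNode("edge1", central_resolve)
first_packet_route = edge.route_for("10.2.0.0/16")   # miss: queries the center
later_packet_route = edge.route_for("10.2.0.0/16")   # hit: served from the cache
```

Only the first packet to a destination pays for the round trip to the central point; later flows reuse the cached vector, which is what keeps the lookup out of the per-packet path.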

If a network error occurred, each “node” in the network would receive a “poison pill”, and if it received a source-route vector containing the failed element, it would request another route, and fix the source route vector accordingly.  If we wanted to scale a component, adding it to the inventory of used-in-routing elements would make it available for any new routes.  We could also selectively “poison” some existing routes to force a reconfiguration, which could then use our newly scaled element.
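The failure case might look something like the sketch below.  Again, this is an illustration under my own assumptions: `RouteService` and `EdgeNode` are hypothetical names, routes are chosen by a plain breadth-first search for the fewest hops, and the scaling case is not modeled.

```python
from collections import deque
from typing import Dict, List, Optional, Set

Route = List[str]


class RouteService:
    """Hypothetical central point that knows the links and the failed elements."""

    def __init__(self, links: Dict[str, List[str]]) -> None:
        self.links = links
        self.failed: Set[str] = set()

    def resolve(self, src: str, dst: str) -> Optional[Route]:
        """Breadth-first search for a fewest-hop route avoiding failed elements."""
        queue = deque([[src]])
        seen = {src}
        while queue:
            path = queue.popleft()
            if path[-1] == dst:
                return path
            for nxt in self.links.get(path[-1], []):
                if nxt not in seen and nxt not in self.failed:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None  # no usable route remains


class EdgeNode:
    """Edge element holding source-route vectors obtained from the service."""

    def __init__(self, name: str, service: RouteService) -> None:
        self.name = name
        self.service = service
        self.routes: Dict[str, Optional[Route]] = {}

    def route_for(self, dst: str) -> Optional[Route]:
        if dst not in self.routes:
            self.routes[dst] = self.service.resolve(self.name, dst)
        return self.routes[dst]

    def poison(self, element: str) -> None:
        """Poison pill: re-request every held route that crosses `element`."""
        for dst, route in list(self.routes.items()):
            if route and element in route:
                self.routes[dst] = self.service.resolve(self.name, dst)


# Hypothetical topology: two edges joined by two parallel core elements.
links = {
    "e1": ["c1", "c2"],
    "c1": ["e1", "e2"],
    "c2": ["e1", "e2"],
    "e2": ["c1", "c2"],
}
service = RouteService(links)
edge = EdgeNode("e1", service)

route_before = edge.route_for("e2")

# c1 fails: the service records it and the edge receives the poison pill.
service.failed.add("c1")
edge.poison("c1")
route_after = edge.route_for("e2")
```

Routes that don't cross the failed element are left alone, so a failure only disturbs the sessions that were actually using it.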

I’m not suggesting this is the best, or only, way of making cloud-native implementations of telecom services work, but it illustrates something that seems to address the key points.  What are they?  First, it’s not going to be easy to make a data-plane element into a microservice; you probably can’t afford the overhead of making it get its routing table from an external source every time it receives a new packet.  That means separation of the data plane from control and management (a good idea in any case) is required.  Second, you can easily implement control and management plane activity, including updating the routing tables, in a variety of ways, using microservices.  That’s really the software heavy lifting in today’s routers anyway.

This all suggests two things: that perhaps SDN had the right idea in control-plane separation, but didn’t pursue all the options for the separate implementation of that control plane, and that perhaps SDN’s model would be easier to translate to a cloud-native implementation than the model of monolithic routers.  Something to think about?