Should We Plan to Bypass Cloud Management for Virtual Functions?

The idea of stepping around or beyond the cloud seems almost heretical these days, but the fact is that if we consider “the cloud” to mean cloud platforms and infrastructure, we already step around/beyond it every day. The question is the best way to do that for network services.

The technology that deploys, redeploys, scales, and otherwise operationalizes applications is part of the cloud, but the way that work moves within an application isn’t. Applications may have run-time parameters set by Kubernetes, but the telemetry associated with internal application behavior isn’t part of cloud monitoring. This suggests that our middle virtual-network layer has things going on that are specialized to the application, or the service. That’s even more likely given that virtual network functions (VNFs) are hosted abstractions for physical devices, and those devices have their own management.

We have networks made up of appliances or devices today, and we manage those elements with specialized management systems. There are international standards, IETF standards, and so forth associated with this process, and both network operators and enterprises have network operations staff that depend on their training in these standards to do their jobs and keep networks running. If we imagine a network of a hundred “real” routers and one router virtual function, would we expect that single VNF to be managed differently in its routing behavior than the real devices? We might imagine it, but the operations people would surely push back strongly.

In the TMF and in the ETSI NFV ISG, standards gurus elected to retain the traditional management model associated with networks, the latter doing so even though it meant that for VNFs, there would be two management dimensions, one dealing with the VNFs as software committed to a resource pool, and the other dealing with VNF participation in a “network”. In some sense, this mimics the cloud’s operational model, because as I’ve already noted, cloud tools focus on deployment and scaling but not on workflow among application components. The question is whether there’s another “sense” here, one where the separation of management creates a potential risk to service stability.

Let’s assume a network of a hundred routers. Each of these routers has a management API that is used to configure it and to manage its software, interfaces, etc. Each also participates in a peer control-plane dialog that defines the best paths between endpoints, meaning routes. If a router fails, or a trunk connecting two routers fails, the combination of these two operational models responds, and the service is restored as long as an alternative route can be found.
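The route-restoration half of that model can be illustrated with a toy example. This is only a sketch under stated assumptions: the six-router topology and the `shortest_path` helper are hypothetical, and the search is a centralized breadth-first search, whereas real routers compute routes through distributed protocol exchanges.

```python
from collections import deque


def shortest_path(links, src, dst, failed=frozenset()):
    """Breadth-first search for a fewest-hop path, skipping failed routers."""
    if src in failed or dst in failed:
        return None
    previous = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # no alternative route exists


# Hypothetical topology: two disjoint paths between A and F.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}

before = shortest_path(LINKS, "A", "F")
after = shortest_path(LINKS, "A", "F", failed={"B"})
print("before failure:", before)
print("after B fails: ", after)
```

Service survives the loss of router B only because a second path exists; if both B and C failed, the function would return `None`, which corresponds to the "as long as an alternative route can be found" condition.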

Now let’s replace some of those routers with VNFs. With the VNF software we still have the same management APIs and the same peer control-plane interactions. A trunk or VNF failure could therefore be handled in the same way as we’d have handled it with a hundred real routers. However, our VNFs are hosted. They can be scaled and redeployed in ways that real routers cannot be. If a VNF fails, the best option might not be to reroute traffic at all, but to simply redeploy the VNF. That’s probably a broadly accepted benefit of VNFs, in fact.

The question is how the traditional router management processes know about that option, and how VNF deployment knows about whatever the router management processes are doing. Could we have a situation where two management systems work to respond to the same condition? Since the two have different inventories of possible remedies, what’s the chance their responses would end up dueling rather than cooperating?
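A minimal sketch of that risk, with every name hypothetical, might look like the following. Each manager is given exactly one remedy and no knowledge of the other, which is a deliberate simplification rather than a description of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Network:
    """Toy state visible to both managers. All names are hypothetical."""
    failed: set = field(default_factory=set)
    routed_around: set = field(default_factory=set)
    redeployed: set = field(default_factory=set)
    log: list = field(default_factory=list)


def router_manager_react(net: Network, node: str) -> None:
    # Router operations knows one remedy here: route around the failed node.
    if node in net.failed:
        net.routed_around.add(node)
        net.log.append(f"router-mgmt: rerouted around {node}")


def cloud_manager_react(net: Network, node: str) -> None:
    # Cloud orchestration knows one remedy here: redeploy the instance.
    if node in net.failed:
        net.failed.discard(node)
        net.redeployed.add(node)
        net.log.append(f"cloud-mgmt: redeployed {node}")


def uncoordinated_response(node: str) -> Network:
    net = Network(failed={node})
    # Both managers see the same fault and act without consulting each other.
    router_manager_react(net, node)
    cloud_manager_react(net, node)
    return net


net = uncoordinated_response("vnf-7")
# The VNF is healthy again, yet traffic is still being steered away from it.
stranded = net.redeployed & net.routed_around
print(net.log)
print("redeployed but still bypassed:", stranded)
```

Neither response is wrong on its own. The problem is the combined end state, where a freshly redeployed VNF sits idle because nothing told the routing side that the fault had been cleared another way.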

We can also look at a more mundane situation, which is that there’s a failure of a VNF that requires it to be redeployed and to “rejoin” the network, likely after whatever remedies the router-network operations model has taken to sustain connectivity have been applied. We can assume that the VNF software and hardware configurations would be properly set by the cloud tools, but what about the parameters that router operations would normally set? The peer-to-peer piece of the router operations model would result in adapting routing tables, but how about configuring the device? The router operations tools routinely use the VNF’s management API, but a cloud tool wouldn’t intervene at that level.

The questions that all this raises are 1) whether a service and an application are truly different and need a different management model, 2) whether the true benefits of the cloud can be applied to services without unifying the management model, and 3) just how far “unification” of the model would have to go to be effective.

I think the first question can be answered “Yes!” for no reason beyond the fact that current networks and network operations practices are based on the router-operations model, which the cloud obviously does not embrace. It would be a major task to substitute a different management model for current router operations.

The second question is a bit more problematic because there are two issues embedded in it. The first is the broad question of whether we can harness true cloud benefits for VNFs in a two-model world. Since replacing a VNF instance is surely easier than replacing a physical router, I think we can assume that that first piece can be answered with at least a qualified “Yes.” However, the re-parameterization issue raises a question of coordination between the two management models. Can the router management model know when to set up a new instance?

Question three is obviously the key question, and the most difficult, because answering it effectively would require that we assess how we’d achieve a unification to determine how far we could go. I pointed out in the past that there are two basic ways we could unify our models.

The first way would be to establish a higher-layer service management framework to which both cloud management and router management would be subordinate. In this approach, any management system would report a fault to something above, which would then have control of both management systems to remedy the fault, or report it up the line for higher-level action. This is my favorite approach, one I think we’ll eventually have to take, but also one that nobody seems inclined to support at this point in time.
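To make the higher-layer idea concrete, here is a rough sketch in which both management systems report faults to a single arbiter that picks one remedy. The `ServiceManager` class, the remedy names, and the policy of redeploying hosted elements while rerouting around physical ones are all assumptions made for illustration, not features of any existing framework.

```python
from typing import Callable, Dict, List


class ServiceManager:
    """Hypothetical higher-layer arbiter above both management systems."""

    def __init__(self) -> None:
        # Remedy name -> action carried out through a subordinate manager.
        self.remedies: Dict[str, Callable[[str], None]] = {}
        # Element -> remedy already chosen for its current fault.
        self.decisions: Dict[str, str] = {}

    def register_remedy(self, name: str, action: Callable[[str], None]) -> None:
        self.remedies[name] = action

    def report_fault(self, element: str, hosted: bool) -> str:
        # A second report of the same fault returns the earlier decision
        # instead of triggering a second, possibly conflicting, remedy.
        if element in self.decisions:
            return self.decisions[element]
        # Assumed policy: redeploy hosted elements, reroute around physical ones.
        choice = "redeploy" if hosted else "reroute"
        self.remedies[choice](element)
        self.decisions[element] = choice
        return choice


actions: List[str] = []
manager = ServiceManager()
manager.register_remedy("redeploy", lambda e: actions.append(f"cloud-mgmt redeploys {e}"))
manager.register_remedy("reroute", lambda e: actions.append(f"router-mgmt reroutes around {e}"))

# Both subordinate systems notice the same VNF failure and report it upward.
first = manager.report_fault("vnf-7", hosted=True)
second = manager.report_fault("vnf-7", hosted=True)
# A physical router failure is handled with the other remedy.
third = manager.report_fault("router-12", hosted=False)
print(actions)
```

The sketch never clears a decision once the fault is resolved, and it ignores escalation "up the line"; a real framework would need both, along with a far richer policy than a single hosted-or-not test.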

The second way is to subordinate one management model to the other. We could either provide an API whereby router management could communicate with cloud management to scale or redeploy, or provide an API where cloud management could communicate with router management to configure a VNF instance. It seems likely that the first option would be difficult to implement because so many different management tools would have to be changed to include reference to that new cloud API, so that leaves the second.

Obviously it would be possible for cloud orchestration to include some sort of “stub” function that could parameterize a router VNF. In a sense, it would be little more than an extension to the concept of containers that Kubernetes (for example) already recognizes. A container holds what’s needed to host an application component. Adding to what’s needed in some generalized way isn’t inappropriate, and in fact might even be helpful down the line if we identify other applications (other than VNFs) that need special setup. The Nephio project may be heading in this direction, though it’s too early to say for sure, or explore how it might work.
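One way to picture such a stub is as a post-deployment hook that pushes stored router parameters through the VNF’s management API. Everything below is hypothetical: `FakeVnf` stands in for the management API, `Orchestrator` stands in for the cloud tool, and the parameter names are invented for the example. It does not represent Kubernetes or Nephio behavior.

```python
from typing import Callable, Dict, List


class FakeVnf:
    """Stand-in for a router VNF and its management API (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.config: Dict[str, str] = {}

    def apply_config(self, params: Dict[str, str]) -> None:
        self.config.update(params)


class Orchestrator:
    """Stand-in for cloud orchestration with generic post-deploy hooks."""

    def __init__(self) -> None:
        self.post_deploy_hooks: List[Callable[[FakeVnf], None]] = []

    def add_hook(self, hook: Callable[[FakeVnf], None]) -> None:
        self.post_deploy_hooks.append(hook)

    def deploy(self, name: str) -> FakeVnf:
        vnf = FakeVnf(name)  # stands in for scheduling and starting the instance
        for hook in self.post_deploy_hooks:
            hook(vnf)  # special setup runs before the instance is handed back
        return vnf


# Parameters that router operations would otherwise set (invented values).
ROUTER_PARAMS = {"vnf-7": {"router_id": "10.0.0.7", "area": "0"}}


def router_config_stub(vnf: FakeVnf) -> None:
    # The stub only relays stored parameters; it makes no routing decisions.
    params = ROUTER_PARAMS.get(vnf.name)
    if params:
        vnf.apply_config(params)


orchestrator = Orchestrator()
orchestrator.add_hook(router_config_stub)
redeployed = orchestrator.deploy("vnf-7")
print(redeployed.config)
```

Because the hook mechanism is generic, the same orchestrator could carry a different stub for some other application that needs special setup, which is the "generalized" extension the paragraph above describes.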

This isn’t the end of our issues, though. We could, for example, ask what happens in the world of SDN. If our hundred-router network were replaced by a hundred-node SDN network with a central route controller, we would have a single point where routes and SDN node configuration would likely be stored. This would mean that the SDN controller could be an easy place to harmonize management models, which would mean that SDN-router management could easily take the superior role in the management hierarchy for VNFs. That would make the once-difficult option for harmonization easy.

We may also be missing issues from another evolution, which is the evolution mobile networks have already created in defining a multi-layer structure. We have a user plane and a control plane in 5G, the former of which is the IP network, which is itself a data plane and a control plane. Generally, as we climb away from the data plane we see our “VNFs” looking more and more like traditional cloud applications, and we see the relationship between planes being formalized. That seems to create an SDN-like situation that could again favor letting traditional device-centric management take the lead role. Are multi-planar structures inevitable? If so, then should we be accepting the challenges of letting device management take the lead now, since it could be inevitable down the line?

We’re not hearing much about this issue, or similar issues, because our initiatives to advance virtual elements in a network have been perhaps a bit too contained. Management is usually declared out of scope, and that disguises management inconsistencies that can arise. It would be nice to think about our network evolution holistically, because it’s evolving that way whether we like it or not.