Should, or Can, We Add CI/CD to NFV?

ETSI is now looking at how network features, specifically virtual network functions (VNFs), could be updated without disrupting the services they’re used in.  In the cloud and in application development, this has been a requirement for some time, and it fits into something called “continuous integration/continuous delivery” or CI/CD.  The idea is to allow changes to be made quickly to something that’s functionally a piece of a whole, without breaking the whole.

This seems like a great step forward, but it’s not as simple as it sounds, in part because networks and applications are actually very different.  Furthermore, there are different views on what the focus of virtualizing network features should be, and very different views on how to go about it.  Finally, to me at least, there’s a question as to whether feature updates in NFV are the real problem, because service disruptions from other sources are more likely.  The remedies for the latter should clearly be considered as at least a part of the strategy for dealing with CI/CD issues.

There are two pieces to CI/CD.  One piece deals with the functional congruence of a new version of a feature or application component with the rest of the application or service.  Is an enhancement implemented in a way that permits its introduction without disruption?  That’s a matter first of design and second of testing.  The second piece deals with the question of whether the deployment of a new feature would, by disrupting operations in some way, break (at least temporarily) the thing it’s a part of.

The challenge on the functional congruence side, as it relates to network services versus applications, is that the ETSI NFV ISG has presumed that VNFs are the analog of physical network functions or devices, and thus should be interchangeable at the VNF level.  Testing a VNF in a CI/CD framework is difficult because it’s hard (or perhaps even impossible) to determine what partner functions its features have to mesh with.  How many variations on VNF combinations might there be in a service?  A lot, particularly if there are many vendors offering the same VNF.
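Just to put a number on it, here’s a trivial back-of-the-envelope calculation.  The role and vendor counts are made up purely for illustration:

```python
# Back-of-the-envelope with made-up numbers: a service chain with five VNF
# roles, each offered by four interchangeable vendors, gives this many
# distinct chains a CI/CD test framework would have to validate.
roles = 5
vendors_per_role = 4
print(vendors_per_role ** roles)   # 1024 combinations
```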

It’s my own view, as a software type, that VNFs really have to be tested against class, meaning that a VNF should “implement” a general class or object that presents as an intent model and has a very specific set of external inputs, outputs, and characteristics.  Any VNF that matches class requirements would necessarily work in concert with other VNFs that could “see” or “depend on” only those external representations.  If you wanted to add CI/CD procedures to enhance this test-against-class paradigm, it would at least be possible.  Would it be worth it?  That’s up to you, or rather the operator deploying the stuff, to decide.
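To make the idea concrete, here’s a minimal sketch of what “test against class” could look like in software terms.  The class name, methods, and rule syntax below are my own illustrative assumptions, not anything from the ETSI specifications:

```python
from abc import ABC, abstractmethod
from typing import Optional


class FirewallVNF(ABC):
    """A hypothetical intent-model class for a "firewall" VNF: partner
    functions see only these external inputs, outputs, and characteristics,
    never the vendor's internal implementation."""

    @abstractmethod
    def apply_policy(self, rules: list[str]) -> None:
        """Load a policy expressed in the class's vendor-neutral rule syntax."""

    @abstractmethod
    def process(self, packet: bytes) -> Optional[bytes]:
        """Forward a packet (possibly modified), or return None to drop it."""

    @abstractmethod
    def characteristics(self) -> dict:
        """Report the attributes the class promises, e.g. maximum throughput."""


def conforms_to_class(candidate: FirewallVNF) -> bool:
    """Class-level conformance test: exercise only the external contract.
    Any vendor VNF that passes can substitute for any other, because its
    partners "see" and "depend on" nothing beyond this contract."""
    candidate.apply_policy(["drop-all"])
    dropped = candidate.process(b"sample-packet") is None      # must drop
    candidate.apply_policy([])                                  # empty policy = allow
    forwarded = candidate.process(b"sample-packet") is not None
    return dropped and forwarded and "max_throughput_mbps" in candidate.characteristics()
```

The key design point is that the conformance test exercises only the external contract; nothing in it depends on which vendor supplied the VNF or how it works inside.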

The other side of this is the question of whether you could introduce anything into a running service and have an acceptable level of impact.  When you upgrade the functionality of an application component and want to introduce it, you can either replace the current instance of the component (swap out/in) or you can parallel that instance and phase over to it gracefully.  Those two choices map to stuff we do all the time in cloud applications that are resilient (you can replace a broken part by redeployment), scalable (you can add instances to increase processing capacity), or both.  Thus, we can look at the way cloud redeployment and scaling are handled to see if it would work for network services and their features.
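Here’s a minimal sketch of those two introduction strategies.  The Service class is hypothetical, standing in for whatever load balancer or orchestrator actually spreads traffic across feature instances; real cloud platforms implement the second strategy as rolling or blue/green deployments:

```python
class Service:
    """A hypothetical stand-in for whatever actually spreads traffic across
    feature instances (a load balancer, an orchestrator, a virtual switch)."""

    def __init__(self) -> None:
        self.weights: dict[str, float] = {}   # instance name -> share of traffic

    def attach(self, name: str, weight: float = 0.0) -> None:
        self.weights[name] = weight

    def detach(self, name: str) -> None:
        self.weights.pop(name, None)

    def set_weight(self, name: str, weight: float) -> None:
        self.weights[name] = weight


def swap_out_in(svc: Service, old: str, new: str) -> None:
    """Strategy 1: replace the current instance outright.  Anything in flight
    through the old instance at the moment of the swap is at risk."""
    svc.detach(old)
    svc.attach(new, weight=1.0)


def parallel_and_phase(svc: Service, old: str, new: str, steps: int = 10) -> None:
    """Strategy 2: run old and new side by side and shift load gradually,
    the way a cloud rolling or blue/green deployment would."""
    svc.attach(new, weight=0.0)
    for step in range(1, steps + 1):
        frac = step / steps
        svc.set_weight(new, frac)
        svc.set_weight(old, 1.0 - frac)
    svc.detach(old)


# Example: phase a new version of a feature in gradually.
svc = Service()
svc.attach("feature-v1", weight=1.0)
parallel_and_phase(svc, old="feature-v1", new="feature-v2")
print(svc.weights)   # {'feature-v2': 1.0}
```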

Applications today are almost totally focused on transaction processing, which means the handling of multi-step business activities that often align with what we used to call “commercial paper”, the orders and receipts and invoices and payments that underpin the flow of goods and services.  Because we’ve tended over time to introduce transaction processing applications at the point of transaction, in real time, many of the steps in transaction processing can tolerate delays in handling with minimal impact.  We call this online transaction processing or OLTP.

In networking, the primary goal is to move packets.  This is, like OLTP, real-time, but a flow of packets is often far less tolerant of delay.  Flow control (provided by TCP in traditional IP networks) can handle some variability in transport rates, but the protocol itself may be impacted if delays are too long.  This “data plane” process is unlike traditional applications, in short.  However, there are often parallel activities related to user access and resource control that resemble transactions.  In today’s 4G wireless networks we have distinct “signaling” flows and distinct “data” flows.

In transactional or signaling applications, it’s often possible to scale or even replace components without impacting users noticeably.  In data-plane applications, it’s very difficult to do either of these things without at least the risk of significant impact because of the problem of in-flow packets.

Suppose I’m sending a sequence of packets, numbered from “1” to “20” for this example.  I get to Number 5 and I recognize that a feature in the packet flow has failed and has to be replaced, so I spin up another copy, connect it in, and I’m set, right?  Not so fast.

Where exactly did I recognize the problem?  Let’s assume that I had six “feature nodes” labeled “A” through “F” in the direction of my flow, and it was “D” that broke.  Probably Node C saw the problem, and it now routes packets to a new Node G that replaces D.  No worries, right?  Wrong.

First, my failed Node D probably contains some in-flight packets in buffers.  Those were lost, of course, and they can only be recovered if we’ve duplicated the packet flow within the network (paralleled our nodes) so we could switch from one flow to the other, and even then, only if we knew exactly where to stop delivering the old-flow packets and start with the new ones.  That requires some content awareness.

It gets worse.  Suppose that when Node D failed, the configuration A-B-C-G-E-F was very inefficient in terms of route, or perhaps not even possible.  Suppose that we needed to go back and replace C as well, so we have A-B-H-G-E-F as our sequence.  We still have packets en route from B to C, which might be lost or might continue in the flow.  They could arrive out of sequence with respect to the originating flow if the new A-B-H path was faster.  That can mess up applications and protocols too.
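A toy simulation shows both effects.  The latencies and the set of packets assumed to be stranded in Node D’s buffers are pure assumptions, chosen only to make the loss and reordering visible:

```python
# Toy model: 20 packets, one sent per time unit.  Packets sent before the
# failure travel the old path A-B-C-D-E-F; packets 3 and 4 are assumed to be
# sitting in Node D's buffers when it fails and are simply lost; packets sent
# after the failure travel the new, faster path (e.g., A-B-H-G-E-F).
OLD_PATH_LATENCY = 8.0      # arbitrary transit time for the original path
NEW_PATH_LATENCY = 2.5      # arbitrary (faster) transit time for the new path
FAILURE_TIME = 5            # Node D fails as packet 5 is being sent
LOST_IN_D_BUFFERS = {3, 4}  # packets assumed stranded in D when it fails

arrivals = []               # (arrival_time, packet_number)
for pkt in range(1, 21):
    send_time = pkt                          # packet n is sent at time n
    if send_time < FAILURE_TIME:
        if pkt in LOST_IN_D_BUFFERS:
            continue                         # lost along with D's buffers
        arrivals.append((send_time + OLD_PATH_LATENCY, pkt))
    else:
        arrivals.append((send_time + NEW_PATH_LATENCY, pkt))

arrivals.sort()                              # order seen by the receiver
print([pkt for _, pkt in arrivals])
# -> [5, 6, 1, 7, 2, 8, 9, ...]: packets 3 and 4 never arrive, and the faster
#    replacement path delivers 5 and 6 ahead of the earlier packets 1 and 2.
```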

The point is that you can’t treat data-plane features like application components.  Non-disruptive replacement of things, either because you have a new version or because you’re just replacing or scaling, isn’t always possible and is likely never easy.  How much effort, then, can you justify in CI/CD?

Then there’s the question of the mission of the application.  Most business applications are intrinsically multi-user.  You don’t have your own banking application, you share an application with other users.  That sharing means that there’s a significant value in keeping the application up and running and accurate and stable.  However, much of the focus of NFV has been on single-tenant services.  vCPE is like “real” CPE in that it’s not shared with others.  When you have single-tenant services, and when every feature and every operations process designed to assure the continuity of those features is charged to one service for one user, the economics can be radically different.  In fact, you may be unable to justify the cost at all.

Finally, there’s the question of the CI/CD process overall.  Software development for a cloud-agile environment is an established process at the application level, but if you define a model of management and deployment and scaling and redeployment that’s not aligned with these established practices, how much of CI/CD can you actually harness, even if you have credible benefits to reap?  NFV continues to develop procedures that are not aligned with the cloud, and so it continues to diverge from the cloud-native tools that are evolving to support the alignment of applications with business goals.  Given that, isn’t NFV going to have to make a choice here—to either cease to diverge and in fact converge with the cloud, or to develop the whole CI/CD ecosystem independent of the cloud?  It seems to me that’s not really a choice at all.