Can We Really Support Service Agility in NFV?

I blogged yesterday about the need to create mission-specific “upperware” to facilitate the development of new services and experiences.  The point was that NFV alone is not enough for that; you have to be able to develop and organize functional atoms for assembly according to some application model, or all you’re doing is pushing VMs around.

If upperware is what generates new services, then NFV’s service agility has to lie in its support of upperware.  That assumption, I think, offers us an opportunity to look in a more specific and useful way at the “services” NFV actually offers and how it offers them.  That can help us prepare for service agility by properly considering NFV features, and it can also help us assess where additional work is needed.

An “upperware application” would be a collection of virtual functions deployed in response to a need or request.  If we presume that NFV deployments are designed to be activated by a service order, then clearly any need or request could be made to activate NFV because it could trigger such an order.  The order-based path to “deployment” of upperware would be essential in any event, since it could be used to validate the availability of advertising revenue or direct payment for the service being committed.
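To make that order-driven path concrete, here’s a minimal sketch in Python of what it might look like; the `ServiceOrder`, `validate_funding`, and `activate` names are my own illustrative assumptions, not part of any NFV specification, and a real order process would obviously be far richer.

```python
from dataclasses import dataclass

@dataclass
class ServiceOrder:
    # Hypothetical order record: who is asking, what experience they want,
    # and how it will be paid for (subscription, usage, or ad-funded).
    customer_id: str
    service_template: str
    funding_model: str   # e.g. "subscription", "usage", "advertising"

def validate_funding(order: ServiceOrder) -> bool:
    # Illustrative check: the order is only actionable if we can confirm
    # a revenue source before committing any infrastructure.
    return order.funding_model in ("subscription", "usage", "advertising")

def activate(order: ServiceOrder) -> None:
    # The validated order is what triggers NFV deployment -- the original
    # "need or request" never touches the infrastructure directly.
    if not validate_funding(order):
        raise ValueError("no confirmed revenue source; order rejected")
    print(f"deploying template {order.service_template} for {order.customer_id}")

activate(ServiceOrder("cust-42", "upperware-video-experience", "advertising"))
```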

From the user’s perspective, this “service” isn’t just VNFs, it’s a complete packaged experience that will almost certainly contain components that aren’t SDN or NFV at all, but rather “legacy” services created by traditional network or hosting facilities.  We’d need to commit these too, or we have no experience to sell, which is why it’s essential that we think of NFV deployment in the context of complete service/experience deployment.  That means that you either have to extend the orchestration and management model of NFV software to embrace these non-VNF elements, or you have to consider NFV to be subordinate to a higher-level operations orchestration process.
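One way to picture that subordination is a hierarchical service descriptor in which the NFV-deployed pieces and the legacy pieces are both children of one service model, and a higher-level orchestrator dispatches each piece to the right provisioning system.  The sketch below is an assumption of mine, not an ETSI construct; the element kinds and the `deploy` dispatch are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceElement:
    name: str
    kind: str                      # "service", "vnf", "legacy-network", "hosting"
    children: List["ServiceElement"] = field(default_factory=list)

def deploy(element: ServiceElement) -> None:
    # A higher-level orchestrator walks the whole model; only the "vnf"
    # leaves are handed to NFV orchestration, everything else goes to the
    # appropriate legacy or hosting provisioning process.
    for child in element.children:
        deploy(child)
    if element.kind == "vnf":
        print(f"NFV orchestration: instantiate {element.name}")
    elif element.kind != "service":
        print(f"legacy provisioning: activate {element.name} ({element.kind})")

service = ServiceElement("video-experience", "service", [
    ServiceElement("access-line", "legacy-network"),
    ServiceElement("cdn-cache", "hosting"),
    ServiceElement("transcoder", "vnf"),
])
deploy(service)
```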

OK, let’s say we have our upperware-based service deployed.  In the orderly operation of the service, we might find that additional resources are needed for a virtual function, and so we’d want to scale that function horizontally.  The presumption of the ISG is that we’d either detect the need to scale within the functional VNFs and communicate it to VNF Management, or that a VNF Manager would detect the need.  The Manager would then initiate a scale-out.
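In rough Python terms, the two detection paths might look like the sketch below.  The `SCALE_THRESHOLD`, `vnf_reports_load`, and `VNFManager` names are hypothetical labels I’ve invented to illustrate the flow; they aren’t ISG-defined interfaces.

```python
SCALE_THRESHOLD = 0.8   # hypothetical utilization level that triggers scaling

def vnf_reports_load(vnf_manager, load_metric: float) -> None:
    # Path 1: the VNF itself detects pressure and tells its manager.
    if load_metric > SCALE_THRESHOLD:
        vnf_manager.request_scale_out(reason="self-reported load")

class VNFManager:
    # Path 2: the VNF Manager observes the metric and decides on its own.
    def poll(self, load_metric: float) -> None:
        if load_metric > SCALE_THRESHOLD:
            self.request_scale_out(reason="manager-observed load")

    def request_scale_out(self, reason: str) -> None:
        # In either path, the manager is what initiates the scale-out.
        print(f"initiating scale-out ({reason})")

mgr = VNFManager()
vnf_reports_load(mgr, 0.9)
mgr.poll(0.85)
```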

This poses some questions.  First and foremost, committing resources requires trusting the requesting agent.  Would a network operator let some piece of service logic, presumably specific not only to a service but to a customer, draw without limits on infrastructure?  Hardly, which means that the request for resources would have to be validated against a template that said exactly what was allowed.  This is a centralized function, obviously, and that raises the question of whether centralized functions should make the determination to scale in the first place.
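A minimal sketch of that validation, assuming a hypothetical per-service template that caps what the requester is allowed to draw, might look like this:

```python
# Hypothetical resource template attached to a service at order time.
TEMPLATE = {
    "max_instances": 4,                        # cap on copies of any one VNF
    "allowed_vnfs": {"firewall", "transcoder"},
}

def authorize_scale_out(vnf_name: str, current_instances: int, template: dict) -> bool:
    # The centralized check: customer- and service-specific logic never
    # commits resources directly, it only gets what the template allows.
    if vnf_name not in template["allowed_vnfs"]:
        return False
    return current_instances + 1 <= template["max_instances"]

print(authorize_scale_out("transcoder", 3, TEMPLATE))   # True
print(authorize_scale_out("transcoder", 4, TEMPLATE))   # False: cap reached
print(authorize_scale_out("dns-cache", 1, TEMPLATE))    # False: VNF not permitted
```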

The second question is how the scaling is actually done.  It’s easy to talk about spinning up a VM somewhere, but you have to connect that VM into the service as a whole.  That means not only connecting it to existing elements in an appropriate way, but also ensuring that the new instance of the function can share the load with the old.  That requires having load-balancing somewhere, and possibly also requires some form of “load convergence” where multiple instances of a front-end component must feed a common back-end element.
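Here’s a toy sketch of that connection step, assuming a hypothetical service graph kept as a simple dictionary; the point it illustrates is that the first scale-out has to splice a load balancer in where the single instance used to sit so upstream traffic keeps one stable attachment point.

```python
from typing import List

def scale_out(service_graph: dict, vnf: str, new_instance: str) -> None:
    # Illustrative connection logic, not a real orchestrator.
    instances: List[str] = service_graph.setdefault(vnf, [])
    if len(instances) == 1 and not service_graph.get(f"{vnf}-lb"):
        # First scale-out: insert a load balancer in front of the old copy.
        service_graph[f"{vnf}-lb"] = instances[:]
        print(f"inserted load balancer {vnf}-lb ahead of {instances[0]}")
    instances.append(new_instance)
    if f"{vnf}-lb" in service_graph:
        service_graph[f"{vnf}-lb"].append(new_instance)
    print(f"{vnf} now served by {instances}; traffic converges on one back end")

graph = {"front-end": ["front-end-1"]}
scale_out(graph, "front-end", "front-end-2")
```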

The third point is optimization.  When you start spinning up (or tearing down) components to respond to load changes, you’re changing the topology of the service.  Some of the stuff you do is very likely to impact the portion of the service outside NFV’s scope.  Front-end service elements, for example, are likely close to the points of traffic origination, so might scaling them impact the actual user connection?  Think of a load-balancer that’s sharing work across a horizontally scaled front-end VNF.  You used to be connected to a specific single instance of that VNF, but now the connection has to be to the load-balancer, which is likely not in the same place.  That means that your scaling and NFV stuff have to cause a change in service routing outside the NFV domain.  And that means that you may have to consider the issue of that out-of-domain connection when optimizing the location of the load-balancing virtual function you need.
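The placement trade-off can be reduced to a trivial cost comparison.  In the sketch below the site names and cost numbers are entirely made up; the thing to notice is that leaving out the user-access term gives you a different (and wrong) answer.

```python
# Hypothetical placement choice for the load-balancing function: the cost of
# a site is not just its proximity to the scaled VNF instances inside the NFV
# domain, but also the cost of re-routing the user's out-of-domain connection.
SITES = {
    "edge-a": {"to_instances": 4, "to_user_access": 1},
    "core-b": {"to_instances": 1, "to_user_access": 6},
}

def best_lb_site(sites: dict) -> str:
    # Minimize the combined in-domain and out-of-domain cost; optimizing on
    # "to_instances" alone would pick core-b and degrade the user connection.
    return min(sites, key=lambda s: sites[s]["to_instances"] + sites[s]["to_user_access"])

print(best_lb_site(SITES))   # "edge-a": cheaper once the access leg is counted
```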

I think this makes it clear that the problems of supporting upperware with service agility are going to be hard to solve without some holistic service model.  A service, in our SDN/NFV future, has to be completely modeled, period.  If it’s not, then you can’t automate any deployment or change that impacts stuff outside the NFV domain.  You may not need a common model (I think one would help a lot) but you darn sure need at least a hierarchical model that provides for specific linkage of the lower-level NFV domain modeling to the higher-level service domain modeling.

I also think that this picture makes it clear just how risky the whole notion of having separate, multiple VNF Managers could get.  The service provider will have to trust-manage anything that can impact network security and stability.  Giving a VNF Manager the ability to commit resources or even to test their status gives that manager the power to destroy.  We now need a major certification function, a kind of “VNF lifecycle management” corresponding to software Application Lifecycle Management, one that has to prove that a given new or revised VNF Manager does only what it’s supposed to do.

What it’s supposed to do could get complicated, too.  Does every VNF that might scale get a preemptive load-balancer element deployed so the scaling can be done quickly?  That’s a lot of resources, but the response of a service to load changes might depend on having such a balancer in place, and if it’s not there then load changes would probably result in disruption of the data path for at least the time needed to reconnect the service with the load balancer in place.  And of course failures and faults create even more complication because while we can predict what components we might want to horizontally scale, we can’t easily predict which ones will break.
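You could frame the preemptive-deployment question as an expected-cost comparison.  The sketch below is a deliberately crude illustration with invented parameter names; a real decision would also have to weigh the unpredictable failure cases just mentioned.

```python
def deploy_lb_preemptively(scale_probability: float,
                           reconnect_disruption_s: float,
                           disruption_cost_per_s: float,
                           idle_lb_cost: float) -> bool:
    # Illustrative trade-off: pay for an idle load balancer now, or accept a
    # data-path disruption later when one has to be spliced in on demand.
    expected_disruption_cost = (scale_probability
                                * reconnect_disruption_s
                                * disruption_cost_per_s)
    return expected_disruption_cost > idle_lb_cost

# A latency-sensitive service with a good chance of scaling probably justifies
# the standing resource; a quiet one probably does not.
print(deploy_lb_preemptively(0.6, 30.0, 2.0, 10.0))    # True
print(deploy_lb_preemptively(0.05, 30.0, 2.0, 10.0))   # False
```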

The biggest effect of this sort of assessment of future requirements is that it demonstrates we’re not fully addressing current requirements.  Which of the functions that upperware might expect are we willing to say will not be available early on?  No scaling?  No fault management?  No changes to a service by adding a feature?  Aren’t all those things missions that we’ve used to justify NFV?

If NFV doesn’t deliver service agility, then it makes no real contribution to operator revenues; it can only cut costs.  But without the ability to model a service overall, can it really even do that?  I think it’s clear that we need to think services with NFV and not just functions.