Hopefully your interest in NFV management prompted you to read yesterday’s blog and you’re ready to follow up. If not, you may want to review it before you read this one because I’m building on the last one with only a very brief level-set!
Let’s assume we have a VNF with four components, one of which is horizontally scalable and sits in front of the other three, which are in line. You can draw this out as four boxes left to right, with the user on the extreme left and the “service” interior on the right. The whole thing is supported on a subnet with private IP addresses (the usual 192.168.1.x sort). The leftmost VNF component has a port exposed via a mechanism something like Google’s Andromeda or Amazon’s Elastic IP, which for brevity here I’ll just call “SuperNAT”. Similarly, the rightmost component has an exposed port for the service connection.
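To make the example concrete, here’s a minimal sketch of that topology as a data structure. Everything here is illustrative: the component names, addresses, and the `Component` class are my assumptions, not anything ETSI defines.

```python
# Hypothetical sketch of the four-component VNF described above.
# All names and addresses are illustrative, not ETSI-defined.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Component:
    name: str
    private_ip: str                       # address on the private subnet
    exposed_port: Optional[int] = None    # port mapped through "SuperNAT", if any
    scalable: bool = False

# Four boxes, left to right: scalable on-ramp, then three in-line components.
vnf = [
    Component("on-ramp",  "192.168.1.10", exposed_port=443, scalable=True),
    Component("stage-1",  "192.168.1.11"),
    Component("stage-2",  "192.168.1.12"),
    Component("tail-end", "192.168.1.13", exposed_port=8443),
]

# Only the edges of the chain are visible outside the subnet.
exposed = [c.name for c in vnf if c.exposed_port is not None]
print(exposed)   # ['on-ramp', 'tail-end']
```

The point of the structure is simply that the interior of the service is invisible; only the SuperNAT-mapped edges can be reached from outside.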
Let’s assume that we have a lot of load on our VNF on-ramp element. The first obvious question is how we know that. In the ETSI model we have Element Managers (EMs) that are associated with the VNFs, and we also have a VNF Manager (VNFM). It would seem logical that if the VNFs themselves were capable of understanding their own load profiles, EMs could communicate a need for scaling. If not, it would have to come from “outside,” meaning that the state of actual network and hosting resources might be used.
Whatever the source, scaling would have to be initiated as a lifecycle process, and the VNFM would drive the VIM to allocate additional instances. That much is clear. What is less clear is what would happen in a case like my example here, where in order to support multiple instances of our head-end VNF we’d likely need to load-balance. We now need an additional functional component not part of the original picture. How does that get instantiated? Normally, the NFV Orchestrator would be responsible for this sort of decision. Remember that we have a coordinated need to deploy the load balancer and to then reconnect the front-end elements, including the connection to the user. (Note that for service availability reasons we might presume that every service with scaling had a predefined load-balancing element in the configuration to prevent interruptions during this reconnect phase.)
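The coordination problem above can be sketched in a few lines of code. This is a toy under my own assumptions: the class and method names are hypothetical stand-ins for the ETSI roles, which define responsibilities but not these APIs.

```python
# A runnable toy of the scale-out flow described above. All classes and
# methods are hypothetical illustrations, not ETSI-defined interfaces.

class VIM:
    """Stands in for the Virtualized Infrastructure Manager."""
    def __init__(self):
        self.counter = 0
    def allocate_instance(self, image):
        self.counter += 1
        return f"{image}-{self.counter}"

class LoadBalancer:
    def __init__(self):
        self.targets = []
    def set_targets(self, instances):
        self.targets = list(instances)

class Orchestrator:
    """The Orchestrator decides to add the balancer: a functional
    component that was not part of the original service graph."""
    def deploy_load_balancer(self):
        return LoadBalancer()

class Service:
    def __init__(self):
        self.head_end_instances = ["head-end-0"]   # original single instance
        self.load_balancer = None

def scale_out(service, vim, orchestrator):
    # 1. The VNFM drives the VIM to allocate an additional instance.
    service.head_end_instances.append(vim.allocate_instance("head-end"))
    # 2. With multiple instances, a load balancer must be inserted in front;
    #    that decision sits above the VNFM, with the Orchestrator.
    if len(service.head_end_instances) > 1 and service.load_balancer is None:
        service.load_balancer = orchestrator.deploy_load_balancer()
    # 3. Reconnect: the balancer now fronts all head-end instances.
    service.load_balancer.set_targets(service.head_end_instances)

svc = Service()
scale_out(svc, VIM(), Orchestrator())
print(len(svc.load_balancer.targets))   # 2
```

Notice that step 2 is where the ambiguity lives: the allocation in step 1 is clearly VNFM/VIM territory, but inserting a new component mid-service is an orchestration decision, which is why a predefined balancer in the configuration would simplify things.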
Faults are more complicated. If something in NFVI breaks, then we have two possible paths forward. First, we could assume that the fault would be recognized by the VNFs themselves based on conditions that would be visible to them on their interconnected pathways. So if a VNF fails in our hypothetical service, the VNFs connected to it would presumably recognize the problem. The other possibility is that the fault would be recognized by the infrastructure management systems, whatever they are.
The ETSI spec suggests that a VIM could notify a VNF manager of “changes in state”, and one might presume that this would mean that the VNFM could either undertake to replace the item on its own, or could notify an EM in the VNF, which would then start remediation. It seems to me that if you have VNFMs and EMs in the picture, you have to let both of them know what’s being managed.
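That “let both of them know” flow can be illustrated with a simple notification chain. Again, the callback structure here is my assumption for illustration; ETSI describes the notification in principle but not this shape.

```python
# Illustrative fault-notification flow: the VIM reports a state change,
# the VNFM starts remediation and also keeps the EM informed, since both
# managers need to know what's being managed. The structure is assumed.

notifications = []

class EM:
    def on_state_change(self, event):
        notifications.append(f"EM saw: {event}")

class VNFM:
    def __init__(self, em):
        self.em = em
    def on_state_change(self, event):
        notifications.append(f"VNFM remediating: {event}")
        self.em.on_state_change(event)   # keep the EM in the loop too

class VIM:
    def __init__(self, vnfm):
        self.vnfm = vnfm
    def report(self, event):
        self.vnfm.on_state_change(event)

VIM(VNFM(EM())).report("host-3 down")
print(notifications)
# ['VNFM remediating: host-3 down', 'EM saw: host-3 down']
```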
In a fault situation, we’re assuming that everything in real remediation terms is getting done by the VNFM, just as we did in the scaling example above. That’s reasonable given that the VNFM seems to have all the parametric data on the service, but it kind of makes the Orchestrator look like a rump function. I’d like to see a model where all of this was collected into a single software structure; I think it’s going to be difficult to build something with the separation of functions and the exposure of interfaces that the ETSI model defines, given the ambiguity of roles.
You can see the security ambiguity I talked about in the last blog more clearly now. VNFMs have the ability to command resources, which means that to control a VNFM is to control infrastructure, at least in some sense. The challenge here is that if the VNFM is specialized to the service itself, meaning we have per-service or per-VNF VNFMs, or even just if we have proprietary VNFMs, we’ve relaxed security on the network. A single service, or worse a single service instance and its associated customer, has the ability to call on infrastructure.
I understand that this could in theory be prevented, meaning that you could “authenticate” requests. The problem is that it’s hard to know what’s authentic. Remediation or scaling requires additional resources, which obviously impacts the shared pool. Under what conditions does a VNFM have the right to do that? Who enforces those conditions? If we say that the VNFM enforces its own security, we’ve just justified having no security at all, because that principle runs afoul of all the firewall and management integrity checks traditionally built into networks.
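To show why enforcement has to sit outside the VNFM, here’s a toy policy gate. The quota model and all the names are illustrative assumptions; the only point is that the check on the shared pool lives somewhere other than the component making the request.

```python
# Toy authorization gate for VNFM resource requests. The per-service
# quota model and all names are illustrative assumptions.

RESOURCE_QUOTAS = {"service-A": 4, "service-B": 2}   # max instances allowed
allocated = {"service-A": 3, "service-B": 2}          # instances in use

def authorize_request(service_id, requested):
    """Return True only if pool policy allows the allocation.
    The key point: this check is enforced outside the requesting VNFM."""
    quota = RESOURCE_QUOTAS.get(service_id, 0)
    return allocated.get(service_id, 0) + requested <= quota

print(authorize_request("service-A", 1))   # True: still within quota
print(authorize_request("service-B", 1))   # False: pool policy says no
```

If the VNFM itself held the quota table, a compromised or simply buggy service-specific VNFM could grant itself anything, which is the no-security-at-all case above.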
Then there’s operations integration. In scaling we’re spinning up additional VMs, and in remediation we’re replacing components because of a fault that could very possibly create an SLA violation. It’s hard to see how both these conditions wouldn’t have to be reflected broadly, but in three specific places: the service model for NFV, the network operations center, and the OSS/BSS.
Even for “NFV operations”, we have to maintain an accurate model of the service or we can’t respond to future change requirements. Imagine the challenges of fault management if scaled components didn’t show up in the service model. How does that updating happen, though?
I also wonder how a NOC finds out about a problem with a VNF. You could say that overloading of a VNF isn’t a NOC problem, but if NOCs are still expected to respond to customer complaints, how do they see the conditions that the NFV service itself is responding to?
Then there’s how this gets integrated with OSS/BSS. If a customer calls and asks something about the service, can a customer service rep dig into the details of the current service model and state? Right now there’s no interface expressed for that, or any specific detail on the model itself.
You might get the idea from this that I’m saying that NFV won’t work as described. I’m not saying that, but I am saying that I doubt that the ETSI model could be taken literally. Operators tell me that all of their PoC and lab work is building out from basic ETSI descriptions into what are essentially ad hoc extensions of NFV to cover all the bases. That’s not necessarily a bad thing; innovation and multiple approaches can be valuable. It does tend to negate the standards, though, because these innovations could very well be PoC-specific, service-specific, vendor-specific, and thus generate a bunch of silos.
What’s needed here? Well, the simple answer is that we need to define the abstractions themselves—the service models that MANO would use, the models that are used by MANO to drive the VIMs—and we need complete flow diagrams to describe explicitly what happens under the kind of conditions I’ve outlined here. You can’t define an architecture without testing your structure with the kind of things it’s expected to handle. That’s routine in software design. It needs to be done with NFV, and quickly.