Is There Really a Problem With OpenStack in NFV?

Telefonica has long been a leader in virtualization, and there’s a new Analysys Mason report on their UNICA model.  There’s also been increased notice taken of Telefonica’s issues with OpenStack, and I think it’s worth looking at the UNICA report and the OpenStack issues together to see where the problems might lie.  Is OpenStack a problem, is the application of OpenStack the issue, or perhaps is the ETSI end-to-end model for NFV at fault?  Or all of the above?

In ETSI NFV’s E2E model, the management and orchestration element interfaces to infrastructure via a Virtual Infrastructure Manager.  I’ve had issues with that from the start, because in my view we shouldn’t presume that all infrastructure is virtual; an “Infrastructure Manager” would be more appropriate.  The naming also showcases a fundamental issue with the VIM concept, one that UNICA might not fully address.

We have lots of different infrastructure, in terms of both technology and geography.  Logically, I think we should presume that there are many different “infrastructure managers,” supplied by vendors, by open-source projects, or even developed by network operators.  Each of these would control a “domain”.  It’s hard to read the full story from the report, but I’ve heard that Telefonica has had issues with the performance of OpenStack while deploying multiple VNFs, and in particular issues when requests to deploy or redeploy collide.

The solution to the issue in the near term is what Telefonica calls “Vertical Virtualization”, which really means vertical, service-specific silo NFV.  For vEPC, for example, they’d rely on Huawei.  This contrasts with the “horizontal” approach of UNICA, where (to quote the Analysys Mason paper) “Ericsson supplies the NFVI and related support for UNICA Infrastructure, which is the only infrastructure solution globally that will support VNFs.”

So here is where I think the issue may lie.  NFVI, in the ETSI document, is monolithic.  There is therefore a risk that a “domain” under NFVI control might be large enough to generate hundreds, thousands, or even more service events per minute.  There is a known issue with the way OpenStack handles requests: they are serialized (queued and processed one at a time), because it’s very difficult to manage multiple requests for the same resource from different sources in any other way.  The use of parallel NFV implementations bypasses this, of course, but there are better ways.
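To see why serialization is hard to avoid, consider a toy sketch (my own illustration, not OpenStack code): if two deployment requests checked the same pool in parallel, both could “see” the same free capacity and over-commit it, so the check-and-commit step is funneled through a single queue.

```python
# Toy illustration of serialized resource assignment, NOT real OpenStack code.
# All requests against a shared pool pass through one worker, one at a time.
import queue
import threading

class SerializedScheduler:
    def __init__(self, capacity):
        self.capacity = capacity       # free units in the shared pool
        self.requests = queue.Queue()  # every request funnels through here
        self.results = {}
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, req_id, units):
        """Queue a deployment request; returns an event that fires when done."""
        done = threading.Event()
        self.requests.put((req_id, units, done))
        return done

    def _run(self):
        while True:
            req_id, units, done = self.requests.get()
            # The check-and-commit is atomic because only this thread runs it.
            if units <= self.capacity:
                self.capacity -= units
                self.results[req_id] = "deployed"
            else:
                self.results[req_id] = "failed"
            done.set()

sched = SerializedScheduler(capacity=10)
events = [sched.submit(f"vnf-{i}", 3) for i in range(4)]
for e in events:
    e.wait()
print(sched.results)  # three 3-unit deployments fit in 10; the fourth fails
```

This is exactly the choke point: the queue keeps the pool consistent, but every colliding deploy or redeploy waits its turn behind it.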

Parallel implementations (“vertical virtualization”) create siloed resources, so that solution has only limited utility.  Better would be a “VIM” structure that allows for the separation of domains, so that different vendors and technologies can be isolated from one another.  Multiple VIMs can resolve this.  But you also need a way of partitioning rather than merely separating.  If OpenStack has a limited ability to control large domains, you first work to expand that limit, and then you create domains that fit within it.

The biggest problem with OpenStack scaling in NFV is the networking piece, Neutron.  Operators report that Neutron can tap out at fewer than 200 requests.  It’s possible to substitute more effective network plugins, and here my own experience with operators suggests that Nokia/Nuage has the best answer (not that Ericsson is likely to pick it!).

If you can’t expand the limits, then size within them.  An OpenStack domain doesn’t have to wrap around the globe.  A data center could have a thousand racks, and you can easily define groups of racks as domains, with the size of each group designed to ensure that you don’t overload OpenStack.  But now each resource pool is only, say, 200 racks.  How do you make that work?
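The partitioning step itself is trivial; here’s a minimal sketch (my own, with an invented domain-size parameter) of slicing a thousand-rack data center into domains small enough to stay under whatever ceiling OpenStack can comfortably handle:

```python
# Hypothetical sketch: split a large data center into OpenStack domains,
# each capped at a size chosen to stay within OpenStack's practical limits.
def partition_into_domains(total_racks, racks_per_domain):
    """Split rack IDs into fixed-size domains (the last may be smaller)."""
    racks = list(range(total_racks))
    return [racks[i:i + racks_per_domain]
            for i in range(0, total_racks, racks_per_domain)]

# A 1000-rack data center with a 200-rack ceiling per domain:
domains = partition_into_domains(1000, 200)
print(len(domains))  # 5 domains of 200 racks each
```

The hard part isn’t the slicing, it’s making the slices behave like one pool, which is where the hierarchy below comes in.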

Answer: with a hierarchy.  You have a “virtual pool” VIM that does gross-level resource assignment, not to a server but to a domain.  You pick a server farm at a high level, then a bank/domain within it, and finally a server.  Only the last of these steps requires OpenStack for hosting.  Networking is a bit more complicated, but only if you don’t structure your switching and connectivity hierarchically.  In short, it’s possible to use decomposition policies to break a generalized resource pool into smaller domains that can be handled easily.
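As a rough sketch of that gross-level assignment (the names and scoring policy here are invented for illustration, not drawn from any real VIM), the “virtual pool” VIM only has to walk down the levels; the server-level step at the bottom is the only one that ever touches OpenStack:

```python
# Hedged sketch of hierarchical, policy-driven resource assignment: pick a
# server farm, then a domain within it; only the final, server-level step
# would be delegated to that domain's own OpenStack instance.
def pick(candidates, policy):
    """Gross-level selection: choose the candidate the policy scores highest."""
    return max(candidates, key=policy)

def assign(data_centers, policy):
    dc = pick(data_centers, policy)       # level 1: server farm
    domain = pick(dc["domains"], policy)  # level 2: rack bank/domain
    # level 3 (not shown) hands off to the domain's OpenStack for a server
    return dc["name"], domain["name"]

data_centers = [
    {"name": "dc-a", "free": 120, "domains": [
        {"name": "dc-a-d1", "free": 80}, {"name": "dc-a-d2", "free": 40}]},
    {"name": "dc-b", "free": 300, "domains": [
        {"name": "dc-b-d1", "free": 100}, {"name": "dc-b-d2", "free": 200}]},
]

most_free = lambda node: node["free"]
print(assign(data_centers, most_free))  # ('dc-b', 'dc-b-d2')
```

Because each domain’s OpenStack only ever sees requests already routed to it, no single instance has to serialize the whole data center’s event load.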

It’s also possible, with a good modeling strategy, to describe your service decomposition so that a different VIM is selected depending on the service.  Thus you can do service modeling that makes a higher-level resource selection, and then use the same modeling strategy to decompose the resources.  If you’re interested in this, take a look at the annotated ExperiaSphere Deployment Phase slides HERE.
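In its simplest form, that service-driven VIM selection is just a decomposition policy carried in the model; this sketch (service types and VIM names are mine, purely hypothetical) shows the idea:

```python
# Hypothetical illustration of model-driven VIM selection: the service model
# carries a policy mapping each service type to the VIM (and thus the
# infrastructure domain) that should host it.
VIM_POLICY = {
    "vEPC":     "vim-telco-core",      # e.g. a vendor-specific vertical VIM
    "vCPE":     "vim-edge-openstack",
    "firewall": "vim-edge-openstack",
}

def select_vim(service_model):
    """Resolve the target VIM from the service model, with a default pool."""
    return VIM_POLICY.get(service_model["type"], "vim-general-pool")

print(select_vim({"type": "vEPC"}))       # vim-telco-core
print(select_vim({"type": "video-cdn"}))  # vim-general-pool
```

The same lookup-and-decompose pattern can then repeat at each lower level of the resource hierarchy.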

The point here is that the fault isn’t entirely with OpenStack.  You can’t assign resources for a dozen activities in parallel when they all draw from the same pool, so you have to divide the pool or nothing works.  You can make a pool bigger by having more efficient code, but in the end you’re allocating finite resources, and you come to a point where you have to check status and commit them one request at a time.  That’s a serial process.

This is a problem that’s been around in NFV for years, and many (including me, multiple times) have called it out.  I don’t think it’s a difficult problem to solve, but every problem you decline to face is insurmountable.  It’s not as if the issues haven’t been recognized; the TMF SID (Shared Information and Data Model) has separated service and resource domains for decades.  I don’t think they envisioned this particular application of their model, and I like other, more modern modeling approaches better, but SID would work.

No matter how you hammer vendors (or open-source groups, or both) to “fix” problems, the process will fail if you don’t identify the problem correctly.  Networks built up from virtual functions connected to each other in chains are going to generate a lot more provisioning activity than cloud applications do.  If there were no way to scale OpenStack deployments properly without changing OpenStack itself, then I think Telefonica and others could make a case for demanding faster responses from the OpenStack community.  But there is a way, and a better way.

Telefonica has a lot of very smart people, including some whom I really respect.  I think they’re just stuck in the momentum of an NFV vision that didn’t get off to a good start.  The irony to me is that there’s nothing in the E2E model that forecloses the kind of thing I’ve talked about here.  It’s just that a literal interpretation of the model encourages a rigid, limited structure that pushes too much downward onto open tools (like OpenStack) that were never intended to solve global-scale NFV problems.  I’d encourage the ISG to promote the “loose construction” view of the specs, and operators to push for that.  Otherwise we have a long road ahead.