Metaswitch Takes Another Big NFV Step

In the hype-ridden NFV space, Metaswitch has always stood out for me as a voice of both reason and experience.  As the founder of the Project Clearwater open-source IMS project, they’re knowledgeable about cloud-based carrier functionality and open source at the same time.  As a member of the CloudNFV™ group they are one of the few VNF providers who actually understand NFV in a complete ecosystemic context.  Now they have a new open-source project, Project Calico, that further extends both their credibility and the framework of NFV and large-scale cloud infrastructure.

Before I get into Project Calico I have to explain the issues it addresses.  Our conception of “cloud infrastructure” tends to be a simplistic vision of a glorious cloud universe that, when you look at it closely, is just a bunch of servers running a host OS, a hypervisor, guest OSs, and virtual functions, all under the control of OpenStack.  That makes NFV and the cloud little more than IaaS at the platform level, and that’s a problem in four dimensions.

Dimension one is something I’ve already blogged about.  If your goal is SaaS or NFV, the whole notion of IaaS is perhaps fatally stupid.  You’re going to set up a VM and load a guest OS for every piddling component of an application or feature of a service.  Many of these will have minimal utilization, so you’re wasting a ton of money, and you’re doing it in an application you think is saving you money.  What you’d really like for SaaS or NFV is something that lets you run a whole lot of very lightweight partitions (containers) into which you can stick something.

The second dimension of the problem is that OpenStack itself has limitations.  It’s a single-threaded implementation whose scalability limits mean it can only control a domain of servers of limited size.  How many?  It depends on a bunch of factors, but many operators I’ve talked to say that even a large data center may be too big for a single OpenStack instance.  In our future with a zillion distributed elements, OpenStack won’t scale.  So a carrier cloud hosting NFV would logically have to include multiple OpenStack domains, and services would somehow have to be connected effectively across them.
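To make that concrete, here’s a minimal sketch of what “multiple OpenStack domains” implies for whatever sits above them: something has to pick a domain before OpenStack can even be asked to deploy.  The domain names, capacity figures, and the pick_domain/deploy_service helpers are hypothetical illustrations, not any real OpenStack or orchestration API.

```python
# Hypothetical sketch: an orchestrator choosing among several OpenStack
# domains before deployment.  Domains, capacities, and helpers are
# illustrative only, not a real API.

DOMAINS = {
    "dc-east-rack-1": {"endpoint": "https://os-east-1.example.net:5000", "free_vm_slots": 120},
    "dc-east-rack-2": {"endpoint": "https://os-east-2.example.net:5000", "free_vm_slots": 15},
    "dc-west":        {"endpoint": "https://os-west.example.net:5000",   "free_vm_slots": 300},
}

def pick_domain(required_slots):
    """Pick a domain with room for the requested VNFs (simple heuristic)."""
    candidates = [(name, d) for name, d in DOMAINS.items()
                  if d["free_vm_slots"] >= required_slots]
    if not candidates:
        raise RuntimeError("no single OpenStack domain can host this service")
    return max(candidates, key=lambda item: item[1]["free_vm_slots"])[0]

def deploy_service(vnf_list):
    domain = pick_domain(len(vnf_list))
    print(f"deploying {len(vnf_list)} VNFs via OpenStack at {DOMAINS[domain]['endpoint']}")
    # The deployment calls would go to the chosen domain; connecting the
    # pieces *across* domains is the part OpenStack alone doesn't solve.

deploy_service(["vIMS-core", "vSBC", "vDNS"])
```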

The third dimension of the problem is that OpenStack visualizes application/service deployment in terms of what could be called “service-specific subnets”.  You take a bunch of virtual functions, instantiate them on hosts, and link them onto a VLAN that is then gateway-connected to the service users.  The problem with this model is that while it’s fairly easy to spread IP networks across a bunch of data centers, it’s not as easy to spread VLANs that way.  To create a service hosted across multiple data centers, I’d likely need to instantiate a local VLAN per data center and link them at Level 3, and that could make some of my inter-component links visible in the service address space, which is a security problem.
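As a toy illustration of that model, here’s a sketch of one service built as a VLAN per data center and then linked at Level 3.  The helper functions and addresses are hypothetical stand-ins, not OpenStack or Neutron calls; the point is simply that the routed link is where internal addresses start to show.

```python
# Hypothetical sketch of the "service-specific subnet" model: one Level 2
# segment per service per data center, linked at Level 3.  Helpers and
# addresses are illustrative only.

def create_service_vlan(data_center, service_id):
    """One VLAN per service, per data center."""
    return {"dc": data_center, "vlan": f"svc-{service_id}-{data_center}", "members": []}

def attach_vnf(segment, vnf_name, ip_addr):
    segment["members"].append((vnf_name, ip_addr))

def gateway_link(segment_a, segment_b):
    """Link two per-DC segments at Level 3.  The VNF addresses on each side
    become routable outside their own VLAN, which is the leakage problem
    described above."""
    return [ip for _, ip in segment_a["members"]] + [ip for _, ip in segment_b["members"]]

east = create_service_vlan("dc-east", "ims-1")
west = create_service_vlan("dc-west", "ims-1")
attach_vnf(east, "vSBC", "10.1.1.10")
attach_vnf(west, "vIMS-core", "10.2.1.10")
print("addresses exposed by the Level 3 link:", gateway_link(east, west))
```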

Then there’s the fourth dimension, management.  I’ve noted in prior blogs that you can’t expose the management interfaces of shared components to the management processes of service components or service users.  How do you avoid that?  If we have some kind of VLAN or subnet for the VNFs to live in, we’ll need to translate access to management data for users outside it, and also somehow provide these VNFs with access to the management data of other components.  That means sticking something inside each VLAN to act as a management agent, and even doing that raises the question of how everything is addressed.

You can see that there’s a networking dimension to all of these problems, and it’s that dimension that Project Calico aims to address.  What Project Calico does most directly is build a connection model for cloud components or VNFs that works at the IP level rather than at Level 2.  This model is based on a combination of BGP and an enhanced use of access control lists, and what it does is create what I’ll call a “component network” that appears to be as private as a VLAN would be, but is built at Level 3.  What makes it a component network is not that the address space is partitioned or private, but that the ACLs prevent components from talking “outside”, or others from talking through the boundary to components that aren’t supposed to be exposed.  This takes the notion of cloud resource pools and hosting points out of the subnet and into the wide world of the Internet and VPNs, at Level 3 of the OSI model.
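Here’s a conceptual sketch of that “component network” idea: every endpoint is just an IP address, and the privacy comes from ACL rules rather than from a partitioned Level 2 segment.  This models the concept as I’ve described it, not Calico’s actual implementation; the class, the addresses, and the notion of an “exposed” interface are my own illustrative assumptions.

```python
# Conceptual sketch of a "component network" built at Level 3: membership
# plus ACLs stand in for the privacy a VLAN would normally provide.
# Illustrative only, not Calico's code or API.

from ipaddress import ip_address

class ComponentNetwork:
    def __init__(self, name):
        self.name = name
        self.members = set()   # IPs of the components in this group
        self.exposed = set()   # IPs deliberately reachable from outside

    def add(self, ip, exposed=False):
        self.members.add(ip_address(ip))
        if exposed:
            self.exposed.add(ip_address(ip))

    def permits(self, src, dst):
        """ACL check: members may talk to each other; outsiders may reach
        only the explicitly exposed interfaces."""
        src, dst = ip_address(src), ip_address(dst)
        if src in self.members and dst in self.members:
            return True
        return dst in self.exposed

svc = ComponentNetwork("vIMS-service-1")
svc.add("172.16.0.10")                 # internal VNF component
svc.add("172.16.0.20", exposed=True)   # the service's user-facing interface

print(svc.permits("172.16.0.10", "172.16.0.20"))   # True: both inside the group
print(svc.permits("203.0.113.5", "172.16.0.10"))   # False: internal component, blocked
print(svc.permits("203.0.113.5", "172.16.0.20"))   # True: exposed interface
```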

One impact of this is to make networking easier for lightweight containers.  It’s far easier to provide a container app, or even just a thread in a multi-tasking operating system, with an IP address than to provide it with a virtual Ethernet adapter.  Many container environments don’t even offer low-level VLAN support.  You also have less overhead because you’re not creating an overlay network to partition tenants, so there are fewer headers in the stack and fewer resources used in processing them.
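For a rough sense of the overhead point, compare a flat Level 3 packet with one carried over a typical Level 2 overlay such as VXLAN.  The header sizes are the standard figures (VXLAN encapsulation adds roughly 50 bytes of outer headers per packet); the payload size is just an example I picked.

```python
# Back-of-the-envelope header overhead: VXLAN adds outer Ethernet (14) +
# outer IPv4 (20) + UDP (8) + VXLAN (8) = 50 bytes per packet on top of
# the workload's own frame.  A flat Level 3 design carries none of that.

PAYLOAD = 512                  # example payload in bytes
INNER = 14 + 20 + 20           # the workload's own Ethernet + IP + TCP headers

flat_l3 = INNER + PAYLOAD
overlay = 50 + INNER + PAYLOAD

print(f"flat Level 3 packet: {flat_l3} bytes")
print(f"VXLAN overlay packet: {overlay} bytes "
      f"({100 * (overlay - flat_l3) / flat_l3:.1f}% larger)")
```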

Metaswitch is obviously not eliminating OpenStack with Project Calico, given that the initial instance at least is an OpenStack plugin, but it can ease the burden of running multiple OpenStack instances.  It’s hard to orchestrate a service across OpenStack instances if the service is based on a VLAN; it’s much easier to do that if the service points are IP-addressable.  With Project Calico, you can arrange OpenStack domains to cover data centers or even collections of racks, and be pretty confident you can organize the connectivity when service orchestration crosses those boundaries.
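Here’s a minimal sketch of what that cross-domain stitching might look like when every service point is IP-addressable: there’s no VLAN to extend between domains, just addresses to distribute and whitelist.  The domains, addresses, and the deploy_in_domain helper are hypothetical, not a real orchestration interface.

```python
# Hypothetical sketch: one service split across two OpenStack domains,
# with every component directly IP-addressable.  Cross-domain "stitching"
# reduces to telling each component where its peers are and whitelisting
# those addresses.  All names and addresses are illustrative.

def deploy_in_domain(domain, vnf):
    """Stand-in for a per-domain deployment; returns the IP the domain assigned."""
    prefixes = {"dc-east": "172.16.1.", "dc-west": "172.16.2."}
    return prefixes[domain] + str(hash(vnf) % 200 + 10)

service = {"vSBC": "dc-east", "vIMS-core": "dc-west", "vDNS": "dc-west"}
endpoints = {vnf: deploy_in_domain(dom, vnf) for vnf, dom in service.items()}

for vnf, ip in endpoints.items():
    peers = sorted(addr for p, addr in endpoints.items() if p != vnf)
    print(f"{vnf} at {ip} -> allow peers {peers}")
```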

The whole issue of inter-component (intra-function or intra-service) connectivity might also be a lot easier with Project Calico’s help.  If we presume that all of the components of a service are “inside” a BGP area with ACLs controlling access, we can either pass or block outside access to any of the interfaces we build, and do it without worrying about subnet structures and the like.  In effect, each component could look a bit like a web server with access control in front of it.  Inside the area you can connect things as you like, and at the same time block access from outside, or from other tenants.
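Here’s what that “web server with access control in front of it” picture might look like as per-interface exposure rules.  The addresses, ports, and the exposure table are my own illustrative assumptions, not Calico’s configuration format.

```python
# Hypothetical per-interface exposure rules inside a BGP/ACL "area": each
# (address, port) pair lists the networks allowed to reach it from outside
# the area.  Anything not listed is internal-only.  Illustrative values.

from ipaddress import ip_address, ip_network

EXPOSURE = {
    ("172.16.0.20", 5060): [ip_network("0.0.0.0/0")],      # service interface: open
    ("172.16.0.20", 8080): [ip_network("192.0.2.0/24")],   # management: operator range only
    # 172.16.0.10 has no entries: internal-only component
}

def outside_access_allowed(src, dst, port):
    allowed = EXPOSURE.get((dst, port), [])
    return any(ip_address(src) in net for net in allowed)

print(outside_access_allowed("203.0.113.7", "172.16.0.20", 5060))  # True
print(outside_access_allowed("203.0.113.7", "172.16.0.20", 8080))  # False
print(outside_access_allowed("192.0.2.44",  "172.16.0.20", 8080))  # True
print(outside_access_allowed("203.0.113.7", "172.16.0.10", 22))    # False
```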

Which could be critical for management.  I remain convinced (despite the fact that so far the ETSI ISG hasn’t officially sanctioned the vision) that you have to virtualize management for NFV and for SDN.  That means that any real resource is buffered from direct management intervention by an i2aex-like repository, and that repository then creates a management view based on the properties not of the resources themselves but of the virtual functionality the resources collectively provide.  Making those management connections requires even more precise control of address access than intra-function connection does, and Project Calico could provide it.
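As a sketch of that repository-in-the-middle idea: raw resource metrics flow into a store, and what gets exposed outward is a derived view of the virtual function, never the shared resources themselves.  The data structures, metric names, and the derivation rule are all my own illustrative assumptions about how such a repository might behave.

```python
# Conceptual sketch of repository-buffered management: infrastructure-side
# collectors write raw resource data; service-facing queries see only a
# derived view of the virtual function.  Structure and fields are illustrative.

RAW_METRICS = []   # stand-in for an i2aex-style repository

def collect(resource_id, metric, value):
    """Only infrastructure-side collectors write raw resource data."""
    RAW_METRICS.append({"resource": resource_id, "metric": metric, "value": value})

def virtual_function_view(vnf_id, resources):
    """Build the view a service user may see: derived VNF health, with the
    underlying shared hosts kept invisible."""
    cpu = [m["value"] for m in RAW_METRICS
           if m["resource"] in resources and m["metric"] == "cpu_util"]
    return {"vnf": vnf_id,
            "status": "degraded" if cpu and max(cpu) > 0.9 else "healthy",
            "component_count": len(resources)}

collect("host-17", "cpu_util", 0.42)
collect("host-23", "cpu_util", 0.95)
print(virtual_function_view("vIMS-core-1", {"host-17", "host-23"}))
```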

What Metaswitch has done here, and what they did with Project Clearwater and IMS, was to look at NFV from the top down, visualize the ecosystem, and address specific issues that would impact multiple critical components.  In doing so they’ve raised issues that should have come up in formal NFV activity and didn’t.  You have to wonder whether we can make any official NFV process work top-down and ecosystemically.  I guess we’ll see.  If not, then Metaswitch may have to build all of NFV one open-source project at a time.