An Example of a Cluster-Computing Tool for Carrier Cloud

In my last blog, I talked about applying cluster technology to carrier cloud.  Today I want to use an example of cluster-based infrastructure to illustrate what might be done, and to explain the general application case better.  My selected example is Univa, which has two products that combine to create the essential cluster-carrier-cloud framework.  You’ll see the potential value of cluster technology for carrier cloud and, in particular, NFV.  You’ll also see why we’re not realizing it.

Univa classes its offering as a “software-defined computing infrastructure” package, something that provides scheduling and orchestration for diverse resource pools.  They are aiming at the “big compute” market, so far primarily in the enterprise space.  This illustrates my point that there’s a lot going on in virtualization that’s not being targeted at network operators, but nevertheless may be exceptionally useful or even critical.  I said there were two products/packages in play here, and they are the Univa Grid Engine and NavOps.  We’ll look at each of them, and at how they might combine.

Let me start by saying that there are multiple pieces to each of the two main elements in the Univa structure, and none of them are really all that well explained in their collateral.  Many features seem to be provided in several places, and many logical missions for the Univa suite would likely stitch several pieces together, though exactly why and how isn’t always clear.  The company could do with better high-level documentation, in short, but I’ve tried to dig through it, since I think the concept is strong.

Grid Engine is a package that creates one or more resource clusters, and these clusters can be employed in fairly traditional cloud missions, as parallel grid tools, and as high-availability and low-latency compute resources.  It’s a work scheduler and more: a means of allocating cluster resources to applications/elements, and it can be applied to bare metal, VMs, containers, public cloud, and private cloud.  Some of its features require add-ons, including license management and charge-back tools.
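To make that scheduling role concrete, here’s a minimal Python sketch of the idea.  It isn’t Univa’s API (the class and field names are my own invention); it just shows a job carrying resource requirements and a placement policy being matched against a mixed pool of bare-metal, container, and public-cloud capacity, which is the essence of what a work scheduler over heterogeneous clusters has to do.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str          # e.g. "bare-metal-7" or "aws-vm-pool"
    kind: str          # "bare_metal", "vm", "container", "public_cloud"
    free_cores: int
    free_mem_gb: int

@dataclass
class Job:
    name: str
    cores: int
    mem_gb: int
    allowed_kinds: tuple   # placement policy: where this job may land

def schedule(job: Job, pool: list[Resource]):
    """Pick the first resource that satisfies the job's requirements and policy."""
    for res in pool:
        if (res.kind in job.allowed_kinds
                and res.free_cores >= job.cores
                and res.free_mem_gb >= job.mem_gb):
            res.free_cores -= job.cores
            res.free_mem_gb -= job.mem_gb
            return res
    return None   # nothing suitable; a real scheduler would queue the job

pool = [
    Resource("bare-metal-7", "bare_metal", free_cores=32, free_mem_gb=256),
    Resource("k8s-node-3", "container", free_cores=8, free_mem_gb=32),
    Resource("aws-vm-pool", "public_cloud", free_cores=64, free_mem_gb=512),
]
job = Job("packet-inspector", cores=4, mem_gb=16,
          allowed_kinds=("bare_metal", "container"))
print(schedule(job, pool))   # lands on bare-metal-7
```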

It’s always difficult (and sometimes dangerous) to try to classify what a package like Grid Engine does under the covers, but I’ll have to in order to explain it.  In effect, Grid Engine creates clusters at multiple levels.  You have “clustering” specific to a particular public cloud, to a virtualization technology (data center VMs, containers), and at an overall level.  The clusters can be visualized as hierarchical, so elements of a high-level cluster like containers can be drawn from lower-level clusters like cloud providers.  Policies determine how resource contributions flow through this process.
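Here’s a hypothetical illustration of that hierarchy, again not Univa’s actual implementation: lower-level clusters (a public cloud region, a container pool, data center VMs) roll their capacity up into a top-level cluster, and a policy at the top decides which contributions are allowed to flow through.

```python
class Cluster:
    def __init__(self, name, capacity=0, policy=None, children=None):
        self.name = name
        self.local_capacity = capacity                 # capacity this cluster owns directly
        self.policy = policy or (lambda child: True)   # which children may contribute
        self.children = children or []

    def available(self):
        """Capacity visible at this level: local capacity plus whatever
        the policy allows each child cluster to contribute."""
        total = self.local_capacity
        for child in self.children:
            if self.policy(child):
                total += child.available()
        return total

# Lower-level clusters...
aws = Cluster("aws-region-1", capacity=500)
containers = Cluster("container-pool", capacity=120)
datacenter_vms = Cluster("dc-vms", capacity=200)

# ...rolled up into a top-level cluster whose policy excludes public cloud,
# say for a latency- or sovereignty-sensitive service.
top = Cluster("carrier-cloud",
              policy=lambda c: c.name != "aws-region-1",
              children=[aws, containers, datacenter_vms])

print(top.available())   # 320: the cloud contribution is filtered out by policy
```

The point is that the policy, not the physical topology, determines what the higher-level cluster actually sees.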

Work scheduling describes how applications are assigned to compute resources.  Policies determine how resources are selected and how they’re lifecycle-managed once committed.  Since applications are assigned requirements that these policies operate on, deployment can be viewed as “fire and forget”: the application is committed, and Grid Engine keeps it running and its workflow elements organized.  However, the basic model is to deploy and sustain, not to adapt to changes.  We’ll get to how that would be done later on.
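The sketch below shows what “deploy and sustain” means in practice, in deliberately simplified Python; the deployment call and health check are stand-ins I made up, not Grid Engine functions.  A desired instance count is set at commitment time, and the only ongoing action is replacing instances that fail, with no adaptation to load.

```python
import random
import time

def deploy(app_name: str, instance_id: int) -> dict:
    """Pretend to place one instance of the application on a resource."""
    return {"app": app_name, "id": instance_id}

def still_healthy(instance: dict) -> bool:
    """Stand-in for a real health check; here failures happen at random."""
    return random.random() > 0.2

def sustain(app_name: str, desired: int, cycles: int = 5) -> None:
    """Deploy 'desired' instances, then keep that count running.
    This is deploy-and-sustain: the target never changes in response
    to load; a failed instance is simply replaced."""
    instances = [deploy(app_name, i) for i in range(desired)]
    for _ in range(cycles):
        instances = [i for i in instances if still_healthy(i)]
        while len(instances) < desired:
            new_id = max((i["id"] for i in instances), default=-1) + 1
            instances.append(deploy(app_name, new_id))
        time.sleep(0.1)   # in reality this would be timer- or event-driven

sustain("vEPC-control", desired=3)
```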

A nice feature of this approach is that you can run pretty much any kind of application.  Some strategies for feature/component hosting would demand that the applications or services be organized in a specific way, like event-driven, transactional, or parallel.  Grid Engine doesn’t limit the kinds of applications that can be run, or the mix thereof.  For an application like NFV, you could support functions that provide low-latency IoT event processing, parallel computing for data analysis, and transactional high-availability services in any mix, provided of course that the applications/components themselves were designed right.  Some of this is supported best by adding NavOps elements, as we’ll see.  Similarly, you might want to add the Universal Resource Broker (URB, now listed as a NavOps element) if you have a lot of your own infrastructure to manage, since it provides Apache Mesos support.
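As a purely illustrative example of that mix, here are three workload profiles expressed as requirement descriptors a policy-driven scheduler could act on.  The field names are invented for this sketch and aren’t Univa (or NFV) syntax; the point is a single submission path handling very different application classes.

```python
# Three very different workload profiles, expressed as requirement
# descriptors.  The field names are hypothetical, for illustration only.
workloads = [
    {   # low-latency IoT event processing
        "name": "iot-event-handler",
        "max_latency_ms": 10,
        "placement": "edge",
        "instances": 20,
    },
    {   # parallel data analysis
        "name": "traffic-analytics",
        "parallel_slots": 256,
        "placement": "any",
        "deadline_hours": 4,
    },
    {   # transactional, high-availability service
        "name": "billing-frontend",
        "instances": 3,
        "anti_affinity": True,        # spread across failure domains
        "availability_target": 0.9999,
    },
]

def submit(workload: dict) -> None:
    # A real work scheduler would match these requirements against
    # cluster policies; here we just show one submission path
    # accepting all three profiles.
    print(f"queued {workload['name']}: {workload}")

for w in workloads:
    submit(w)
```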

Since license and usage management are available with the package, some of the specific issues that have already come up in NFV are addressed by Grid Engine.  This again illustrates that it would be better to exploit the tools already available than to invent a new model of virtual function deployment that has to be integrated (with some-to-considerable difficulty) to work with current software elements that do pieces of the NFV job.

NavOps is a set of tools aimed more at cloud computing than at “big computing”, but obviously the lines between the two are fuzzy; some applications would use one or the other, and some both, whichever space you’re in.  NavOps Command and URB are aimed at improving workload management for cloud deployments, including bursting and failover, and at integrating with currently popular cloud/cluster frameworks like Kubernetes and Mesos.  NavOps Launch is based on Tortuga, an open-source cluster/cloud management tool.
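Cloud bursting, at its core, is a placement decision like the one sketched below.  This is a conceptual Python illustration, not NavOps logic; it assumes a single local capacity number and a policy flag that allows overflow to a public cloud when local resources run out.

```python
def place(job_cores: int, local_free: int, cloud_enabled: bool) -> str:
    """Prefer the local cluster; burst to a public cloud only when
    local capacity is exhausted and policy allows it."""
    if job_cores <= local_free:
        return "local-cluster"
    if cloud_enabled:
        return "public-cloud-burst"
    return "queued"   # wait for local capacity to free up

# Local capacity shrinks as jobs arrive; the last job bursts.
free = 16
for cores in (8, 6, 4):
    target = place(cores, free, cloud_enabled=True)
    if target == "local-cluster":
        free -= cores
    print(f"{cores}-core job -> {target}")
```

Failover follows the same pattern: the placement decision is re-run when a hosting point disappears, with the failed resource excluded from the candidates.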

Functionally, NavOps extends the scale and efficiency of cloud computing and cluster/resource management.  It would probably be, with all tool pieces considered, a strong basis for carrier cloud and NFV.  A better basis, in my view, would be a melding of the two toolkits, which would provide something similar to Mesos (and, via URB, could include Mesos), but extended into the virtual-machine-and-IaaS-cloud world.

Univa’s model, like Apache Mesos and DC/OS, is still somewhat dependent on the ability to identify abnormal events in the application/service being hosted and to initiate a trigger to signal remediation.  These tools also aren’t attempting to resolve resource-level problems in a specific sense, only issues that impact what is being hosted on the resources.  Neither of these is a crippling defect, since any carrier cloud or NFV infrastructure solution would have the same issues.  However, NFV defines a management model (not an effective one in my view, but a defined one nevertheless) that could theoretically be accountable for the first of these.
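To show what that dependency looks like, here’s a small, hypothetical Python sketch of trigger-driven remediation; none of the names are Univa’s.  The framework can register and execute remediation handlers, but something outside it, the hosted service or its monitor, has to detect the abnormal condition and raise the trigger in the first place.

```python
from typing import Callable

# Remediation handlers keyed by event type.  The framework can act on a
# trigger, but the detection burden stays with the hosted service or its
# management layer.
handlers: dict[str, Callable[[str], None]] = {}

def on_event(event_type: str):
    def register(fn: Callable[[str], None]):
        handlers[event_type] = fn
        return fn
    return register

@on_event("instance_failed")
def redeploy(detail: str) -> None:
    print(f"redeploying: {detail}")

@on_event("latency_violation")
def relocate(detail: str) -> None:
    print(f"moving workload closer to its users: {detail}")

def signal(event_type: str, detail: str) -> None:
    handler = handlers.get(event_type)
    if handler:
        handler(detail)
    else:
        print(f"no remediation registered for {event_type}")

# The hosted service (or its monitor) must detect the problem and signal it.
signal("instance_failed", "vFW-2 on node k8s-node-3")
```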

The NFV ISG didn’t look at this kind of tool as the basis for hosting, and Univa doesn’t claim to support NFV or carrier cloud; their focus has been on the enterprise.  Operators might chase after Univa-like solutions on their own, of course, but most operators are conditioned to respond to the availability of tools, not to run around looking for them.  Might the company take a position in the service provider space?  Money talks, but marketing to service providers is something most enterprise-focused companies undertake rarely, and often almost accidentally.  As carrier cloud opportunities develop, they might think more about that space.  That they’ve not done so yet is an indicator that you can’t entirely blame operators for not seeing the benefits of current cluster and virtualization technologies in their carrier cloud plans.  The providers of the tools aren’t making it easy for them.

Whether Univa thinks about the carrier cloud applications of its capabilities or not, it’s something that the service provider space needs to be thinking about.  If you want facile, easily integrated, operationally efficient cloud and NFV deployment, you need proven, mature, feature-rich tools as your starting point.  Univa is an example of a widely used, proven, large-scale solution to the problem that NFV and carrier cloud proponents are just now realizing they need to face.  That need will become a critical problem if it’s not faced quickly.

In a story yesterday, Light Reading asked if a new Orange deputy CEO was the champion NFV needs there.  Perhaps senior executive sponsorship is helpful for NFV or any other new technology, but good executives won’t sponsor strategies that can’t prove benefits, and any who try won’t be successful (or likely remain executives for long).  Where leadership would be helpful now is in recognizing that the current NFV evolution will get to the right approach late, if ever.  If Orange or any other operator wants NFV to work, they have to stop expecting the standards process to produce implementations, and look to where the critical pieces of those implementations are already emerging.