What’s the Latest on NFV Justification?

The big question about any new network technology has always been whether it could raise revenues.  Cost reduction is nice, but revenue augmentation is always a lot nicer—if it’s real.  With NFV, the focus of the revenue hopes of operators has been virtual CPE (vCPE) that could offer rapid augmentation of basic connection services with add-ons for security, monitoring and management, and so forth.  In fact, because vCPE is a pretty easy service to understand, it’s also been the focus of NFV PoCs and of many vendors.

Operators aren’t completely sold on the concept, though, and the reason is that many have been encountering early issues in vCPE-based services.  In fact, some have told me they now believe they’ll have to reimagine their vCPE deployment strategy.  Nearly all are now “cautious” where many had been “hopeful” about the prospects of the service.  What’s wrong, and how can it be fixed?

The value proposition for vCPE seems simple.  You have a customer who needs some higher-layer services at their point of network connection.  Security-related services like firewall are the biggest draw, followed by VPN, application acceleration, and so forth.  These services would normally be provided by specialized appliances at each site, and with vCPE they can be hosted in the cloud and deployed on demand.

There are two presumptions associated with this model.  First, you have to presume that there are enough prospects for this kind of service.  Second, you have to assume that you can deliver the service to the customer at a lower cost point than the traditional appliance model could hit.  Operators report some issues in both areas.

Despite some vendors (Adtran, recently) advocating the use of NFV and cloud hosting as the platform for delivering services to SMBs, operators are indicating that interest in the vCPE model doesn’t seem to extend very far down-market.  To offer some numbers from the US market, we have about seven and a half million business sites here, which seems like a lot.  Fewer than half of them are network-connected in any way.  Of that half, only about 150,000 are connected with anything other than consumer-like broadband Internet access.  Globally, there are a bit more than half a million such sites.

Business broadband using consumer-like technology (DSL, cable, FiOS, etc.) is almost always sold with an integral broadband access device that includes all the basic features.  My surveys have always shown that these devices are very low-cost in both opex and capex terms, whoever buys them.  In fact, the number of SMBs who reported incurring any significant cost in broadband attachment, including add-on elements like security, was insignificant.  What this means is that about one site in fifty is a prospect for any sort of vCPE service unless you step outside the realm of what’s currently being used.

Operators also tell me that even those one-in-fifty odds can be optimistic.  The problem is that most companies who have network connectivity today had it a year ago, or more.  The need for a firewall or a VPN isn’t new, and thus it has probably been accommodated using traditional devices.  Users who already have what they need are uninterested in new services whose cost is higher because those services include vCPE features.

You can see the pattern of vCPE success already; where “managed services” are the opportunity, vCPE is much more likely to succeed.  MSPs offer the service and the CPE together, and if you can reduce the cost (both capex and opex) of fulfilling managed services you can earn more money.  However, most enterprises aren’t interested in managed services because they have professional network staffs available.  That squeezes the vCPE opportunity into the high end of SMBs, perhaps into professional-services-dominated sites where you have valuable people who aren’t particularly tech-savvy.

The cost problem remains, or at least the impacts remain.  The presumption with NFV has been that “cloud hosting” of virtual functions would offer significant economies of scale.  That’s probably true, providing you have a cloud to host in.  Most operators not only have no such cloud in place, they don’t have the opportunity density for vCPE to justify building one.  You can’t backhaul VPN access very far without incurring too much cost and delay, so centralized hosting isn’t easy.

This has given rise to the idea that “vCPE” really means an agile premises box, a kind of mini-server into which you load features that you call “VNFs”.  In point of fact, there’s no need to use any of the standard NFV features at all in such a configuration, unless you believe that you can evolve to a cloud-hosted model for vCPE, or can drive toward true NFV another way and then reuse those facilities for your vCPE deployments.

There is absolutely no question that there is an opportunity for vCPE created with this agile device, but it’s not a server or cloud opportunity.  There’s no question that it could evolve into a cloud opportunity if you have some other means of driving cloud deployment to reach a reasonable level of data center distribution near the access edge.  The problem is that this all raises the question of what’s going to create that driver for cloud deployment en masse.

This suggests that we’re spending too much time focusing on vCPE, because it’s not going to be the thing that really drives NFV success.  For that, you have to look to an application of NFV that has a lot more financial legs.  As I’ve noted in the past, there are two pathways toward broad-based NFV deployment—mobile infrastructure and IoT.

Operators love the idea of mobility as a driver for NFV; every mobile operator I’ve talked with believes that NFV would improve their agility, reliability, and capability.  They’re most interested in NFV as a part of a 5G rollout plan, since most of them believe that they’ll have to adopt 5G and will also have to transform their mobile core (IMS, EPC) infrastructure to accommodate and optimize 5G.  They also tell me that they are getting 5G-centric NFV positioning from at least two vendors, which means that there’s already competition in this area.

The challenge with 5G as a driver is twofold.  First, you have to wait for it to happen; most believe it will roll out no earlier than the end of 2018.  That’s a long time for NFV vendors to wait.  Second, the 5G driver seems to favor the mobile-network-equipment vendors, which means everyone else is pressing their noses against the candy-store window.

IoT looks more populist, more cloud-like, but the problem there is that operators are far from confident that they should take a cloud-hosting role in IoT.  They’d love to simply connect all the “things” using expensive wireless services and let the money roll in.  It’s not a totally stupid concept, if you presume that over time every home and office and factory with security, environment, and process sensors will end up being connected wirelessly.  If you think every sensor gets a 4/5G link, you’re imbibing something.

The problem with the IoT model is that unlike mobile infrastructure, operators don’t have anyone in the vendor space ringing the dinner bell for the revenue feeding.  There are a few IoT players (GE Digital with Predix) that actually have the right model, but operators don’t seem to be getting the full-court press on solutions they can apply to their own IoT services.

The net of all of this is that we are still groping for something that could create a large enough NFV service base to actually justify full-scope NFV as the standards people have conceptualized it.  We’re in the “NFV lite” era today, and vCPE isn’t going to get us out of that.  The winners in the NFV vendor space will be companies who figure out that the key to NFV success is justifying a cloud.

What’s the Connection Between “Open” and “Open-Source”?

The transformation of telecommunications and the networks that underlie the industry is coming to grips with what may seem a semantic barrier—what’s the relationship between “open” and “open-source?”  This seems to many a frustratingly vague problem to be circling at this critical time, something like the classic arguments about how many angels can dance on the head of a pin or how big a UFO is.  There’s more to it than that, though; there’s a lot we need to address if we’re going to meet operator transformation goals.

Operators have, in modern times at least, demanded “open” network models.  Such a model allows operators to pick devices on the basis of price and features, because the interfaces between the devices are standardized so that the boxes are at least largely interchangeable.  Nobody can lock in a buyer by selling one box that then won’t work properly unless all the boxes come from the same source.

I’ve been involved in international standards for a couple of decades, and it’s my view that networking standards have focused on the goal of openness primarily as a defense against lock-in.  It’s been a successful defense too, because we have many multi-vendor networks today and openness in this sense is a mandate of virtually every RFI and RFP issued by operators.

The problems with “open” arise when you move from the hardware domain to the software domain.  In software, the equivalent of an “interface” is the application program interface or API.  In the software world, you can build software by defining an API template (which, in Java, is called an “interface”) and then define multiple implementations of that same interface (Java does this by saying a class “implements” an interface).  On the surface this looks pretty much like the hardware domain, but it’s not as similar as it appears.

The big difference is what could be called mutability.  A device is an engineered symphony of hardware, firmware, and perhaps software that might have taken years to perfect and that can be changed only with great difficulty, perhaps even only by replacement.  A software element is literally designed to allow facile changes to be made.  Static, versus dynamic.

One impact of this is elastic functionality.  A router, by definition, routes.  You connect to a router for that purpose, right?  In software, a given “interface” defines what in my time was jokingly called the “gozintas” and “gozoutas”, meaning inputs and outputs.  The function performed was implicit, not explicit, just as with routers.  But if it’s easy to tweak the function, then it’s easy to create two different “implementations” of the same “interface” that don’t do the same thing at all.  Defining the interface, then, doesn’t make the implementations “open”.
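To make that concrete, here’s a minimal sketch, in Python rather than the Java terminology the discussion uses (the mapping is direct), with purely illustrative names: two “implementations” of the same interface, identical gozintas and gozoutas, completely different behavior.

```python
from abc import ABC, abstractmethod
from typing import Optional

# One interface: the same "gozinta" and "gozouta" (a packet in, a result out).
class TrafficFunction(ABC):
    @abstractmethod
    def handle(self, packet: str) -> Optional[str]:
        ...

# One implementation forwards traffic unchanged.
class PassThrough(TrafficFunction):
    def handle(self, packet: str) -> Optional[str]:
        return packet

# Another implementation, same interface, silently drops everything.
class Blackhole(TrafficFunction):
    def handle(self, packet: str) -> Optional[str]:
        return None
```

Both classes satisfy the interface contract perfectly, yet one forwards and one discards.  Nothing in the interface definition itself makes the two interchangeable, and that’s the openness gap.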

On the positive side, mutability means that even where different “interfaces” to the same function are defined, it’s not difficult to convert one into another.  You simply write a little stub of code that takes the new request and formats it as the old function expects, then invokes it.  Imagine converting hardware interfaces that way!  What this means is that a lot of the things we had to focus on in standardizing hardware are unimportant in software standards, and some of the things we take for granted in hardware are critical in software.  We have to do software architecture to establish open software projects, not functional architecture or “interfaces”.
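That “little stub of code” is just the classic adapter pattern.  Here’s a hedged sketch, with hypothetical function names standing in for a real old and new API:

```python
# Hypothetical legacy function: expects a single comma-joined string argument.
def legacy_firewall_check(src_and_port: str) -> bool:
    src, _, _port = src_and_port.partition(",")
    return not src.startswith("10.")

# The "new" interface callers expect: separate, typed arguments.
# The adapter stub just reformats the new request and invokes the old code.
def allow(source_address: str, port: int) -> bool:
    return legacy_firewall_check(f"{source_address},{port}")
```

The new caller never knows the old calling convention existed, which is exactly the kind of interface conversion that’s trivial in software and nearly impossible in hardware.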

IMHO, both SDN and NFV have suffered delays because what were explicitly software projects were run as though they were hardware projects.  Open-source initiatives like OpenDaylight or OPNFV were kicked off to try to fix the problem, which is how open-source got into the mix.

Open-source is a process, not an architecture.  The software is authored and evolved through community action, with the requirement that (with some dodges and exceptions) the stuff be free and the source code made available.  There are many examples of strong open-source projects, and the concept goes back a very long way.

You could argue that the genesis of open-source was in the university and research communities, the same people who launched the Internet.  The big, early winner in the space was the UNIX operating system, popularized by UC Berkeley in what became known as “BSD”, the Berkeley Software Distribution.  What made UNIX popular was that at the time it was emerging (the ‘80s), computer vendors were recognizing that you had to have a large software base to win in the market, and that only the largest vendors (actually, only IBM) had one large enough.  How could the rest band together without running afoul of anti-trust?  Adopt UNIX.

The defining property of open-source is that the source code is available, not just the executable.  However, very few users of open-source software opt to even receive the source code and fewer do anything with it.  The importance of the openness is that nobody can hijack the code by distributing only executables.  However, there have been many complaints through the years that vendors who can afford to put a lot of developers on an open-source project can effectively control its direction.

For network operators, open-source projects can solve a serious problem, which is that old bugaboo of anti-trust intervention.  Operators cannot form a group and work together to solve technical problems.  I was involved in an operator-dominated group, and one of the big Tier Ones came in one day and said their lawyers told them they had to either pull out of the group or wrap the group into a larger industry initiative that wasn’t operator-dominated, or face anti-trust action.  The problem of course is who buys carrier routers except carriers, and how can you preserve openness if you have to join forces with the people who are trying to create closed systems for their own proprietary benefit?

An open-source project is a way to build network technology in collaboration with competitors, without facing anti-trust charges.  However, it poses its own risks, and we can see those risks developing already.

Perhaps the zero-day risk to creating openness with open-source is the risk that openness wasn’t even a goal.  Not all software is designed to support open substitution of components, or free connection with other implementations.  Even today, we lack a complete architecture for federation of implementations in SDN and NFV for open-source implementations to draw on.  Anyone who’s looked at the various implementations of open office software knows that pulling a piece out of one and sticking it in another won’t likely work at all.  The truth is that for software to support open interchange and connection, you have to define the points where you expect that to happen up front.

Then there’s the chef-and-broth issue.  Let’s start with a “software router” or “VR” concept in open-source.  The VR would consist of some number of components, each defined with an “interface” and an “implementation” for the interface.  A bunch of programmers from different companies would work cooperatively to build this.  Suppose they disagree?  Suppose the requirements for a VR aren’t exactly the same among all the collaborators?  Suppose some of the programmers work for vendors who’d like to see the whole project fail or get mired in never-ending strife?

Well, it’s open-source, so the operators could each go their own way, right?  Yes, but that creates two parallel implementations (“forks”) that, if not managed, will end up breaking any common ground between them.  We’d then have every operator building their own virtual devices.  But even with coordination, how far can the forks diverge before there’s not much left that’s common among them?  Forking is important to open-source, though, because it demonstrates that somebody with a good idea can create an alternative version of something that, if it’s broadly accepted, can become mainstream.  We see a fork evolving today in the OpenStack space, with Mirantis creating a fork of OpenStack that uses Kubernetes for lifecycle orchestration.

Operators have expressed concern over one byproduct of community development and forking, which is potentially endless change cycles, version problems, and instability.  I’ve run into OpenStack dependencies myself, issues where you need specific middleware to run a specific OpenStack version, which you need because of a specific feature, and then you find that the middleware version you need isn’t supported in the OS distro you need.  Central office switches used to have one version change every six months, and new features were advertised five or six versions in advance.  The casual release populism of open-source is a pretty sharp contrast.

The next issue is the “source of inspiration.”  We’ve derived the goals and broad architectures for things like SDN and NFV from standards, and we already know these were developed from the bottom up, focusing on interfaces and not really on architecture.  No matter how many open-source projects we have, they can shed the limitations of their inspirational standards only if they deliberately break from those standards.

The third issue is value.  Open-source is shared.  No for-profit company is likely to take a highly valuable, patentable, notion and contribute it freely.  If an entire VR is open-source, that would foreclose using any of those highly valuable notions.  Do operators want that?  Probably not.  If there are proprietary interfaces in the network today, can we link to them with open-source without violating license terms?  Probably not, since the code would reveal the interface specs.

The bottom line is that you cannot guarantee an effective, commercially useful, open system with open-source alone.  You can have an open-source project that’s started from the wrong place, is run the wrong way, and is never going to accomplish anything at all.  You can also have a great one.  Good open-source projects probably have a better chance of also being “open”, but only if openness in the telco sense was a goal.  If it wasn’t, then even open-source licensing can inhibit the mingling of proprietary elements, and that could impact utility at least during the transformation phase of networking, and perhaps forever.

“Open” versus “open-source” is the wrong way to look at things because this isn’t an either/or.  Open-source is not the total answer, nor is openness.  In a complex software system you need both.  Based on my own experience, you have to start an “open system” with an open design, an architecture for the software that creates useful points where component attachment and interchange are assured.  Whether the implementation of that design is done with open-source or not, you’ll end up with an open system.

Optical Vendors Have a Shot at Respect–But Only ONE Shot

When is optical networking going to get some respect?  That’s certainly a question that optical vendors are asking, and if you think about it, logic would suggest that the area that produces bits should be a major focus when everyone says traffic is exploding.  Answering the “When?” question may require understanding why optics isn’t respect-worthy already.

Optical networking evolved from the old days of time-division multiplexing.  SONET (in the US) and SDH (Europe and elsewhere) were both evolutions of the older T-carrier/E-carrier electrical transport standards.  The focus of networking in those days was to carry 56kbps voice trunks and some leased-line traffic that included data.

When packet networks came along (early ‘80s) the TDM network was pressed into service to provide packet trunks between routers or switches.  The more packet we had, the more transport we needed, and everyone loved the growth curve.  But under the covers there was a hidden process evolving.  Network services, once integrated with transport and optics, were separating.

Everything you do that leaves your neighborhood is probably carried on optical trunks, but you may not “see” an optical interface directly and the service you purchase is probably IP or Ethernet, not an optical service.  Nothing focuses your attention like paying for it, and what you pay for are Layer 2/3 services on the old OSI chart, not Layer 1 where optics lives.  Out of sight, out of mind.

The real problem started when electrical-layer traffic demand grew to the point where you needed faster packet trunks.  When Ethernet at 10Mbps was considered obscenely fast, even a modest optical trunk had much more capacity.  There was also still a lot of old TDM hanging around, and so having optical multiplexing as a means of combining slower packet and legacy feeds was valuable.  As Ethernet standards evolved, and as optical interfaces directly to switches and routers evolved, and as TDM died off, we suddenly found ourselves in a world where optical trunks were just connectors for electrical-layer devices.

If all traffic is packet, and if the packet device can terminate an optical trunk, where is the “optical network”?  There isn’t one.  A wire doesn’t make a grid, so a string of glass doesn’t make a network.  Optical networks have been devalued by the evolution of electrical-layer services, and if nothing else happens that devaluation is likely to proceed, perhaps even accelerate.  The question of optical respect, then, will be answered by answering the question “What might happen?”

The most obvious thing that could happen is an increase in optical capacity that outstrips what a single electrical-layer aggregation device can utilize.  DWDM, for example, generates enormous potential capacity, more than most operators would want to stuff into a single box at L2/L3 even if one could carry all of it.  If the capacity of an optical fiber is very high, then there’s an incentive to utilize it fully to gain economies of scale and return on the deployment cost.

The problem with this seemingly wonderful and compelling driver is that we’ve had it for over a decade and it hasn’t made optics the center of the universe.  What other arrows could a respect-seeking optical layer have in the quiver?

How about the tried and true idea of robbing Peter and keeping the money?  To put it more gently, suppose that the optical layer could directly reduce costs at the electrical layer, meaning syphon off feature value from L2 and L3?  This is where “agile optics” comes in.

There are a lot of features in routers that could be displaced if you could add them below.  Resiliency at the optical level means you don’t see route failures at the electrical level.  A number of big Tier Ones have issued RFPs from time to time to explore the creation of what’s effectively an IP “deep core” based on agile optics, and some vendors have products aimed at this very thing.

The problem here is that there aren’t that many core networks.  Google said back in 2010 that if it were an ISP it would be the third-largest and fastest-growing in the world, and Google (as I’ll get to) was really using a different approach than agile optics, and using it at the 10GE level.  That doesn’t say a lot for the notion of fast optical networks as essential.

Just as Google might be dashing optical hopes, it could be raising some other hopeful areas.  One interesting point is that Google’s Internet-facing network is growing faster than the Internet.  Part of the reason for that is that most traffic today is Internet traffic, most Internet traffic is video, and most video is delivered over a content-delivery network (CDN) and never really gets to the “core” at all.  The other interesting point is the cloud.  Google’s cloud network (B4) is growing faster than its Internet-facing network (B2), and B4 is their inter-data-center network.  How these combine is really interesting.

Optical guys love the data center interconnect application (DCI) but they have a pretty pedestrian vision for it.  “Hey, if every enterprise had a hundred data centers and there are ten thousand enterprises globally, that’s a million data centers.  Mesh them and look what you get!  We’re rich!”  Of course 1) every enterprise doesn’t have 100 data centers, 2) you only connect your own data centers, not everyone’s, and 3) you don’t have to mesh them to connect them.

The real growth opportunity for DCI would come not from enterprises but from what’s driving Google’s B4: the internals of a cloud network.  Carrier cloud, if it were driven by a combination of CDN, NFV, and IoT, could produce over 100,000 new data centers globally, and these would have to be connected both to each other and to the service edge points.

The key point here is that fiber services could be multiplied if we were to consider the “user” to be an interior logical service element and not an end-user at all.  But another lesson of B4 is that Google has built an agile core, using SDN trunking and optics, that spoofs IP at the edge to look like a BGP core.

All of this could be done by optical vendors, even with simple agile optics with a veneer of IP-ness to adapt the core to an IP mission.  Add in SDN grooming and virtual wires (as B4 does) and you have a product that would look like the heart of an IP network and could serve a bunch of B4-like missions.

Where?  Metro.  The big optical opportunity is metro, simply because there are a lot of metros compared to cores.  In the US there are over 250 “standard” metro areas, and almost 700 that would qualify as distinct metro-network opportunities.  Globally there are 2700 suitable metro areas, which is a heck of a lot more than core networks.  Not only that, there’s a totally accepted and committed driver for metro spending already in place, waiting to be exploited.

What?  It’s 5G.  As always, we’ve gotten wrapped around the axle with 5G drivers.  Operators are committed to 5G for the same reason that somebody gets committed to a weapons program—it takes only one to start an arms race.  The trick for operators will be to make something good emerge from the inevitable investment.  5G will offer vendors a chance to demonstrate that this can be done, and it will do so in an area where large budgets are already being developed.

There will be a lot of fiber in 5G rollouts, and that massive commitment will both empower optics and stake out a position.  If the vendors can’t think of anything insightful they can propose, then they’ll have cemented themselves into a plumbing role, probably for all time, and the electrical layer and SDN players will make optics synonymous with “sand”.

Signs of Progress in the Standardization of NFV Orchestration

One of the positive signs emerging from the ETSI NFV ISG is recent interest in multiple orchestration layers.  The notion of a “split orchestration” model that separates traditional NFV Management and Orchestration (MANO) from “resource orchestration” could be a powerful enhancement to the basic NFV model, and perhaps even a step toward embracing a full and practical vision of top-down service orchestration.  It could also lead to issues.

The original NFV end-to-end model, which is largely unchanged today, called for a MANO function to control the deployment of the virtual network functions and components needed for a given feature or set of features.  From the first, it has been clear that there are serious complications associated with this simple single-layer orchestration model.  It doesn’t go high enough to envelop operations processes, it doesn’t go wide enough to cover the service end-to-end, and it doesn’t go deep enough to describe technology-specific requirements of various cloud or network platforms.

The cloud has confronted a similar problem.  OpenStack’s Quantum networking, which was renamed “Neutron”, is a very simple build-a-subnet-and-connect approach to cloud deployment that works fine as long as you’re deploying a single set of application components in a subnet, but creaks unmercifully when you try to scale it to lots of tenants or complex applications with shifting, dynamic, component structures.  In the cloud DevOps deployment and lifecycle scripting area, there are increasing signs of convergence on a more sophisticated, modular, approach.

The split of MANO orchestration creates a “Network Service Orchestrator” that talks to operations processes above and the VNF Manager (VNFM) below.  The VNFM talks with the Resource Orchestrator (RO), which in turn talks to various (presumably) “Virtual” Infrastructure Managers (VIMs, which I call Infrastructure Managers or IMs, since I don’t think they will always be managing “virtual” infrastructure).  The RO, in effect, orchestrates (V)IMs.

If we extrapolate this view, what we get is that VIMs are designed to be abstractions of resources (what does an RO orchestrate, if not a resource?).  In order for this to work, we need to have some overall model concept for a resource and a way of representing it in an RO structure.  That overall model would be composed of representations created by the VIMs.  By inference, there must be more than one VIM, and VIMs presumably represent (individually or collectively) a set of models above and real resources below.
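Here’s a minimal sketch of that layering, with purely illustrative names and intent strings (nothing here comes from the ETSI specs): each (V)IM is an opaque function that either realizes an intent or declines it, and the RO composes them without seeing the resources behind them.

```python
from typing import Callable, List

# An Infrastructure Manager is modeled as an opaque "realize" function:
# it either accepts an intent (returning True) or declines it (False).
InfrastructureManager = Callable[[str], bool]

class ResourceOrchestrator:
    def __init__(self, ims: List[InfrastructureManager]):
        self.ims = ims

    def orchestrate(self, intent: str) -> bool:
        """Offer the intent to each IM; the first that can realize it wins."""
        return any(im(intent) for im in self.ims)

# Two hypothetical IMs: one handles hosted-function intents, one handles
# network-connection intents.  Real IMs would front OpenStack, an SDN
# controller, and so on.
vm_im: InfrastructureManager = lambda intent: intent.startswith("vm:")
net_im: InfrastructureManager = lambda intent: intent.startswith("net:")
ro = ResourceOrchestrator([vm_im, net_im])
```

The point of the sketch is the opacity: the RO knows which IM claimed the intent, but nothing about how it was realized, which is what makes the VIMs true resource abstractions.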

This approach is a step forward for the simple reason that anything that makes orchestration a multi-layered structure is positive.  It conforms to what I think is basic logic, and it also conforms to published operator views of the structure of service orchestration overall.  It does pose some questions and raise some issues.

The first is those model representations for a VIM.  The goal of any abstraction model is to avoid brittle behavior, meaning the exposure of details within the model to the outside world.  If details leak, a change in the implementation (which is supposed to be opaque) becomes resource-specific: change resources, and you have to change all the models.  Logically, VIMs should be intent models in every sense.

This raises the question of what’s inside a VIM.  Normally you’d think that OpenStack was a model for a VIM, and that Nova deployed instances and Neutron did connections.  Leaving aside the point that Neutron isn’t ideal for the connection mission, the point is that in most real implementations you’d use a language like HEAT to orchestrate those processes.  That would place HEAT orchestration inside the VIM.  Does the ISG propose to use HEAT as the RO, or is there another hidden layer of orchestration inside the VIM?

Every time we establish a layer of orchestration, we commit in a very real sense to defining a set of abstractions that the layer orchestrates, a process for adding to that list as needed, an organization to administer the list and ensure consistency and lack of overlap, and so forth.  That’s not a problem unique to the ISG’s split agenda, but it’s one that has to be addressed and that so far is still not being handled.  Without such a process we end up with a virtual world even more complicated than the real world we’re trying to simplify.

The second issue is one of “coupling”.  The split in orchestration adds elements, and the elements in the ETSI E2E model are connected by interfaces, which implies interactivity.  Can a VNF and an Element Management System, separate software components that may exist separately for each service, “interface” with a common function like a VNFM, which must also interface with an RO?  At a given point in time, there could be thousands of deployed VNF-instanced services, each of which might be trying to tell a VNFM something.  The RO might be working on another service, or a set of them.  How do all these threads connect with each other?  Multiple threads?  If so, how do you make sure that you can block and unblock threads correctly, and that you don’t start a process that then dies forever because it’s dependent on something that’s in turn dependent on it?
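That “dies forever” case is a circular wait.  One way an orchestrator could guard against it, sketched here with hypothetical component names (this is my illustration, not anything in the ETSI model), is a cycle check on the “waits-on” graph before committing to a blocking dependency:

```python
from typing import Dict, List, Set

def has_circular_wait(deps: Dict[str, List[str]]) -> bool:
    """Depth-first search for a cycle in a 'waits-on' dependency graph."""
    visiting: Set[str] = set()   # nodes on the current DFS path
    done: Set[str] = set()       # nodes fully explored, known cycle-free

    def visit(node: str) -> bool:
        if node in visiting:     # looped back to our own path: circular wait
            return True
        if node in done:
            return False
        visiting.add(node)
        for dep in deps.get(node, []):
            if visit(dep):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

# VNFM waits on RO, and RO waits on VNFM: neither can ever proceed.
deadlocked = has_circular_wait({"VNFM": ["RO"], "RO": ["VNFM"]})
```

A real implementation would have to do this across thousands of live service threads, which is exactly why the coupling question deserves an explicit answer rather than an interface diagram.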

An issue that unites these roles is that of functional separation.  “Too many chefs spoil the broth” is the old saw.  If you have two or three or more places to orchestrate, there’s a major question about just what gets orchestrated where, and what technology is used.  In my view, orchestration layers should be based on logical functional subdivisions of the process of deployment.  There is no question that the highest level is the “service level”, which in the ETSI model (and in my view, and the view of operators) lives somewhere in operations software above the NFV implementation.  There’s no question that the bottom layer is what I’ll call “domain-specific resource orchestration”, meaning the stuff that’s needed to implement a specific model of deployment based on what you’re deploying on.  The stuff in the middle is a little more complicated.

I think that there’s ample proof in the industry, ranging from the structuring supported by the TMF SID to that of TOSCA, that anything above the domain-specific resource orchestration layer should be structured hierarchically, so that a model shows service element relationships and carries parameters.  This structure also lends itself to intent models.  There are also clear signs that we need to launch a serious effort to define the high-level abstractions on which intent models would be based.  For example, we should define what a “VPN” exposes as properties and parameters, and similarly what a “FunctionHostingPoint” (VM or container or even vCPE) has to expose.
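What a hierarchical, intent-based abstraction might expose can be sketched in a few lines.  The property and parameter names below are purely illustrative (nothing here comes from the TMF SID or TOSCA); the point is the shape: each model advertises what it promises and what can be set, and a parent composes children without seeing inside them.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an intent-model abstraction.  Property and
# parameter names are illustrative, not drawn from any standard.
@dataclass
class IntentModel:
    name: str
    properties: dict            # what the model promises (its "intent")
    parameters: dict            # what the orchestrator may set
    children: list = field(default_factory=list)  # hierarchical composition

vpn = IntentModel(
    name="VPN",
    properties={"topology": "any-to-any", "availability": "99.99%"},
    parameters={"endpoints": [], "bandwidth_mbps": 100},
)
hosting = IntentModel(
    name="FunctionHostingPoint",
    properties={"isolation": "vm|container|vCPE"},
    parameters={"cpu_cores": 2, "memory_gb": 4},
)
service = IntentModel("ManagedVPNService", {}, {}, children=[vpn, hosting])

# A parent model sees only what its children expose, not how they work.
print([c.name for c in service.children])   # ['VPN', 'FunctionHostingPoint']
```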

I am hopeful that we’re seeing signs of convergence between the buyers and the standards in this space.  The bad news for the vendors is that this probably means the opportunities for differentiation are going to pass them by unless they get out in front of the trends, establish credible product ecosystems, and make the business case for NFV in a compelling way.  Events are overtaking everyone here, and though the pace is slow at present, it could be accelerating.

Is There a Cloud/NFV Lesson in the Delta Outage?

How can it happen?  The news today is that a power failure at Delta Airlines’ Atlanta hub put their systems down for an extended period of time.  Obviously, not all of Delta’s problems were ongoing IT problems; once you mess up a ballet like airline flight and crew scheduling, it’s hard to get everyone dancing in order again.  However, a system outage started it all.

I had a client ask me how this was possible in this age of supposedly high availability, and while I can’t answer the question specifically for the Delta outage, I did have an experience in the past that could shed light on why something like a power hit can sometimes cause a massive, protracted mess, even setting aside the scheduling aftermath.

The client was in the healthcare industry, and like Delta they had an “old” data center system supporting a large number of old terminal devices.  They had made a decision to use a modern IP network to concentrate their terminal traffic at key points and aggregate it to transport back to their data center location.  I’ve worked with the airline protocols and they were similar to the ones used by the healthcare company.

The problem for my healthcare client started when, after setting up their new network over a weekend, they faced the first full, normal workday.  As traffic loads grew, there were reports of terminals losing communication, and these reports mounted quickly that Monday.  In fact, the number of terminals that had lost service was so large that the data center system, which was designed to self-diagnose problems, determined that there must be a problem in the software image and triggered a reload of the software.

The reload, of course, took all the terminals down, and this meant that the entire network of almost 17,000 terminals tried to come up at one time.  It didn’t come up.  In fact, for the next three days they couldn’t even get half the network running at once no matter what they did.  They called me in a panic on Thursday, asking if I could help, but my schedule prevented me from giving them the full week on-site that they wanted until two weeks later.  I asked them to send me some specific information by express so I could get a head start on analysis, and told them I’d give them a schedule of interviews I needed within a week.

It only took a day to find the problem once I got there, and less than eight hours to resolve it completely.  There were some ugly confrontations with vendors at meetings (including a memorable scene where a healthcare executive literally threw a binder of documents into the chest of a vendor executive), and a key manager for the healthcare company resigned.  It was the worst outage in the company’s history.

So what caused it, you wonder?  Two things.

The first problem was that old devices, particularly old terminals that use old protocols, don’t work as expected on modern networks without considerable tweaking.  Router networks are not wires; they have significant propagation delay at best, and during periods of congestion it can get worse.  Old terminal protocols that think they’re running on a wire will send a message, wait a couple of seconds, and then assume something got lost in transit, so they send it again.  After a couple of tries, the terminal or the computer will try to recover by reinitializing the dialog.  That works nearly all the time…except if an old message is stuck in a buffer somewhere in a router network.

If you’re on a wire you might have to deal with a lost message, but things don’t pass each other on a wire.  When you get an out-of-sequence message you assume something very bad has happened and you reinitialize.  As traffic in the healthcare network increased that Monday, more messages meant more delay and buffering, and eventually some of the terminals “timed out” and tried to start over.  Then the old messages came in, which caused them to start again…and again…and still more times.  And of course, when everything had to be restarted because the data center system reloaded, all these restarts were synchronized and created even more traffic.  The network was swamped, nobody got logical message sequences, and nothing worked.

The solution to the problem was simple: tell the routers not to buffer messages from the terminals.  A lost message can be recovered from, but one that arrives too late, after a recovery process has already started, will never be handled correctly.  One parameter change and everything worked.

Power failures can create this same situation.  Delta had a power hit, and their backups failed to engage properly.  If my healthcare client had a data center power failure that took down the computers, or even just took down the network connections, they’d have experienced exactly the same problem of synchronized recovery.  I can’t say if Delta saw this problem, but they might have.

What about the second of the two causes?  That was failure to address simultaneous-event overload in the design.  A network is usually designed for some significant margin over the peak traffic experienced.  In the case of my healthcare client it was designed for double the load.  The problem was that nobody considered what would happen if everything failed at the same time and then tried to restart at the same time.  The peak traffic that was generated was five times the normal peak load and over double the design load.  Nobody thought about how recovering from a failure might impact traffic patterns.

The cloud could resolve many or perhaps even all of these problems, but only if the cloud and the applications running in it were properly prepared.  Have we ever had a cloud outage created by a power hit?  Darn straight, but if we had a truly distributed cloud, one with hundreds or thousands of data centers, they wouldn’t all be likely to get a power hit at once.  If devices needed special startup processes, could those processes have been spawned in as many instances as needed to get things running, so the devices could be “turned over” to the main application in a good state?  Darn straight, if we designed the applications like that.

Are we designing for high availability and resiliency today?  Some will say we are, but the interesting thing is that they’re designing for the failure of an application component but not for the overloading of the recovery processes themselves.  How many new instances will have to be deployed in response to a failure?  How many new connections will be needed?  How many threads can we run through OpenStack Nova or Neutron instances at a time, and how many instances of Nova or Neutron do we have available?

A resilient system isn’t resilient if the processes that sustain its resiliency are themselves brittle.  It’s easy to say that we can replace a component if it fails or scale one that’s under load, but just how many of these operations can be done at one time, and are there not conditions—like a power hit—that could generate the need for a massive, synchronized recovery?  Remember that our vision of the cloud has evolved considerably from hosted server consolidation, but our tools are still rooted in that trivial model.  Delta’s experience may prove that we need to think harder about this problem.

Just What the Heck is an Event-Driven System Anyway?

In a number of my past blogs, I’ve talked about the value of an event-driven model for cloud and NFV deployment.  Since then I’ve gotten a few requests to explain just what the difference is between traditional and event-based models.  It’s a challenge to do that without dipping deeply into programming details, but I’m going to take a shot at it here.

Envision an assembly line in operation.  We have, let’s say, a car chassis enter the line at the beginning, and we then bolt various things to it along the way until an automobile emerges from the end.  We can, if we have a specific goal of the number of cars per hour, determine how fast the line will have to move to fulfill the production target.  This is pretty much how cars have been built since the days of Henry Ford, and we think of this as an orderly and well-controlled process.

We can draw an analogy between this approach and the way that a service would be created in the days of manual operations processes.  You get a service order.  You start taking steps to fulfill the order, in sequence.  When the steps are taken, the service is ready to turn over.  This process can be called a “workflow”, and it’s how manual things get done, including assembly lines.  When we translate manual processes to software, we tend to replicate this notion of a workflow, because it works and it fits our natural model of how things are done.

Let’s assume now that as our assembly line is humming along, we have a glitch in door delivery.  We can’t bolt the door in place because it’s not there.  We can’t do later steps that depend on door installation.  What are our options to meet our goals?

One option could be to have a number of assembly lines running in parallel, which means that we can stall a line and others will pick up the slack.  The problem with this is that it’s an overprovisioning strategy, meaning that we’re committing more resources than needed to allow for the failure of one of our workflows.  It’s going to leave facilities idle, because if everything is working OK we’d have to throttle back on production or we’d build more cars than we need.  Another option is just to let the line stall while we get a door.  In that case, not only do we risk missing our target, we’re spending money on the idle facilities even when work can’t get to them.

Suppose we kind of shunt the car that’s missing a door off to a siding and let other stuff pass?  We could do that, providing that we have some way of getting the door to the siding and getting it installed, then reintroducing the car to the flow.  But is there a gap in the flow, other than the one we created by putting our door-waiting car onto the siding?  Do we have to stop the line to create a gap?  If so, we’ve wasted the slot we created by side-lining the car, and we’ve introduced the need to synchronize the door with the car, then the car with the line.

Then let’s suppose that we have a shortage of another material down the line.  We get our door, we attach it, we stall things to get it back into the flow, and two steps later we find that we have to sideline it again.  Perhaps the car we held up didn’t have that later part dependency.  We could have done better by holding the car with the missing door till the later parts were ready.

You can apply this to service-building of course.  A single linear workflow is fine if everything goes in a linear way, but can that be expected?  We have a service order.  We’re waiting on some provisioning step, and if we have a single-threaded process we halt activity till we get what we want.  Meanwhile the customer alters or cancels the order and we don’t even need to take the step.  In a linear workflow, we’d have to write tests into the software to check for pending changes and cancellations while we were waiting for something to complete.  This makes the software “brittle” because every service might have different dependencies, different things to check for.

In software, we can have what’s called “multi-threading” which is analogous to having those parallel assembly lines.  When something is stalled waiting, it can be held up while another thread keeps going.  But just like assembly lines, you need to have resources for the threads and you’re wasting time and adding complexity by having things held up.  In addition, you still have the problem of knowing what you’re waiting on—if an order is waiting for something and is changed, or is dependent on something that’s still “down the line” and not available either, you have more waste.
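The resource cost of the thread-per-order model is easy to see in miniature: every order that is waiting on a slow step ties up a whole thread that exists only to hold its place.  A toy Python sketch (names and timings hypothetical, purely for illustration):

```python
import threading
import time

# Toy illustration of thread-per-order: each order that is waiting on
# a slow provisioning step occupies an entire thread, even though the
# thread is doing nothing but sleeping.
def process_order(order_id, results):
    time.sleep(0.2)            # stand-in for a stalled provisioning step
    results.append(order_id)   # the "work" itself is trivial

results = []
threads = [threading.Thread(target=process_order, args=(i, results))
           for i in range(50)]
for t in threads:
    t.start()
# While the orders wait, up to 50 threads exist just to hold their place.
waiting = sum(1 for t in threads if t.is_alive())
for t in threads:
    t.join()
print(len(results))            # 50 orders completed
```

Fifty stalled orders is trivial; fifty thousand is not, and the threads tell you nothing about whether the thing they’re waiting for is ever coming.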

What you need to make this thread-multiplying work is the ability to respond to the availability of something and handle it in an appropriate way.  In the world of car-building it’s hard to envision this because it would mean that every car would be built in a little “pod”, and that the start of the process would be the launching of a get-part activity based on the car’s parts list.  You’d then deal with the parts as they arrive and fit.  Obviously this is an event-driven approach, and just as obviously this would be hard to justify in physical manufacturing.

If we enter the software world, where building a facility isn’t an issue, we have a different story.  You can have “orders” that drive a task of marshaling resources, and when you have everything you need you can release the “product” for use.  However, since you’ve lost the inherent context of a linear workflow, you need your software order process to somehow keep track of where it is.

You normally build an event-driven model using what are called “state-event tables”.  You define specific “states” representing stages of progress.  You define events as things that have to be accommodated.  In each of your states, you assign an event to a process that does the right thing.  If you are in the state “WaitingForProvisioning” and you’ve activated a series of steps to allocate resources and you get a cancellation, you activate the “ReleaseResources” process.  If you’re in the state “WaitingForActivation” and you receive an Activate event, you start assigning stuff and enter WaitingForProvisioning.  If you were to receive another Activate in that state you know something is awry in your software.

Where this gets important with respect to NFV and the cloud is that we have tended to follow linear workflow models when we built a lot of the software and standards.  Not only does it make the software brittle, it makes it less efficient in handling large numbers of requests.

One of the subtle problems that can be created in this area is “interfaces” or “APIs” that are workflow-presumptive.  An example from the NFV world is easy to find.  We have an interface between a VNF and the VNF manager, and that interface is an API.  Unless the interface is explicitly handling the passing of events then the presumption is that it’s part of a workflow, and we’re back to our assembly line and missing door.

DevOps, the deployment management framework used in the cloud, has begun to adopt a limited event-driven model to couple infrastructure issues to application deployment models.  To me, this shows that the industry is starting to accept event-driven systems.  It proves that you can retrofit events to current structures, but surely if you want to be event-driven it’s better to get there holistically and from the start.

The benefits of an event-driven software system for service lifecycle management are profound, and so are the differences between software designed for workflows and software designed to be event-driven.  We need, for both the cloud and for NFV, a decision on whether we’re willing to forego the benefits of event-driven software, because we are doing that by default.

Think about my service-model-pod concept for a moment.  Any application, any service, any management process fits in it.  You can spawn models for each service, and the model keeps track of the service parameters and state and is the conduit through which events are coupled.  If you assume microservices or Lambdas are the processes, then you can spawn them as needed.  There’s no risk of swamping management systems with fault events or scaling demands created by a popular app or activity.

This is what an event-driven structure could look like.  If we have this, whether it’s in cloud deployment or NFV provisioning, we have the ultimate in scalability and efficiency.  An implementation based on event-handling would beat everything else.  We don’t have one today, but as I’ve noted we’re starting to see little steps toward recognizing the model as superior, and we may get one yet, even soon.  If we do, whoever fields it will reap some significant benefits.

Is There a Practical Pathway to Fiber-to-the-Premises?

Despite all the hype around Google Fiber, the fact is that getting fiber to the premises (FTTP) is a major economic challenge.  While everyone wants fast Internet, nobody really wants to pay for it, and with the Internet (unlike, say, the auto industry) there’s a deep-seated public perception that somehow what they want should be free, or nearly so.  We started the broadband revolution by exploiting wireline infrastructure deployed for other reasons—DSL exploited phone lines and cable broadband exploited broadcast TV delivery via CATV.  Fiber, as new deployment, poses a completely different challenge, which is paying for something that’s expensive from a service that people want to get for nothing.

The fiber profitability challenge is created by something I’ve researched for well over a decade—“demand density”.  Roughly stated, this is the dollars of GDP per mile passed, and it’s a measure of the ability of infrastructure to pay back on its investment.  A high demand density (Japan, Korea, Singapore) means you can easily connect enough users to earn a respectable return, and a low one (Australia, Canada, the US) makes getting to that happy point much harder.
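The arithmetic behind demand density is simple, and worth making explicit.  The figures below are made up purely to show the calculation; they are not actual market data.

```python
# Demand density as described: dollars of GDP per route mile passed.
# All numbers here are hypothetical, to illustrate the arithmetic only.
def demand_density(gdp_dollars, route_miles):
    return gdp_dollars / route_miles

dense_metro = demand_density(500e9, 10_000)     # $50M of GDP per mile
sparse_region = demand_density(500e9, 500_000)  # $1M of GDP per mile
print(dense_metro / sparse_region)              # 50.0
```

The same economy, spread over fifty times the miles, offers a fiftieth of the revenue potential per mile of plant, which is why the same technology can be an easy business case in Singapore and a hard one in Australia.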

Where demand density is lower, the market seems to have identified two paths toward achieving good FTTP penetration.  The first is to make the access network what’s essentially a public utility, with government support, and this was adopted in Australia in the form of “NBN” or the National Broadband Network in 2009.

NBN was, of course, a child of the political process, and so it generated a lot of debate and disputes from the first.  The goal of covering 93% of Australia’s households by 2021 was ambitious.  The fact that NBN was essentially taking over the incumbent telco’s (Telstra’s) role in access networking required compensation, the amount of which was controversial.  Cost estimates supplied for the project were disputed in the industry (including by me).  The head of NBN was Mike Quigley, who had been a senior executive at Alcatel, which also raised conflict-of-interest questions.  Finally, in the lead-up to the 2013 elections, the incoming government came up with “MTM”, meaning “Multi-Technology Mix”, which pulled back from full FTTP and introduced a mixture of other, less costly technologies, including HFC, FTTN, and even satellite.

Cost overruns and technology issues have marred NBN according to most sources.  In 2014 an independent cost-benefit analysis estimated that the best option would be to continue an unsubsidized rollout, presuming that this could be made to work.

This year, former NBN head Mike Quigley defended NBN in a talk.  In it, Quigley blamed most of NBN’s woes on the way it was covered in the media.  His quote, from Lenin, is illustrative of the view: “A lie, told often enough, becomes the truth.”  Quigley’s data shows that the original FTTP concept was actually working, though he admits there were some issues (like discovering asbestos contamination in some facilities) that had to be addressed.  He does note that while the original NBN deal with Telstra would have protected NBN from overruns and quality issues, the renegotiation of the deal reduced these protections, which put the project more at risk.  In any event, in the MTM plan, the reality was that most of the success came from the original FTTP passes.

I’ve looked at the project a number of times, and I think that Quigley makes some valid points and perhaps doesn’t emphasize one critical one.  The valid point is that the FTTP technology target does appear to be a better option than the MTM plan, in hindsight.  The issue that made the difference is that the cost and time to roll out the other MTM options was underestimated badly.  Bad estimates are also behind most of the cost issues; it wasn’t that cost control was bad as much as that estimates were unrealistic.

The critical point is that political projects have to sustain popular support.  It’s not enough to say that you were right a half-decade or more after the issues were debated.  You have to make your position clear, sell the constituents, and then work to keep them onboard.  Companies understand marketing, but governments don’t always seem to get the picture.

Where does this all lead?  NBN wasn’t costed out properly at the beginning, which is hardly a surprise given the typically large cost overruns on government projects globally.  The decision to change horses to MTM in 2013 was based on overly optimistic data on the rate and cost of deployment.  The project goals could probably have been realized at acceptable rates of return had NBN stuck with FTTP.  But the big issue is that government projects are rarely run well, though Quigley presented some examples (like the Erie Canal, which most people would say isn’t easily made relevant to broadband!) to the contrary.  In the net, I would have to say that NBN proves that a successful government project for broadband deployment would be very difficult to run.  I have doubts even about municipal programs, though it might be possible to get a better handle on costs and benefits in a smaller geography.

That brings us to the second option for FTTH justification, which is to improve the ROI of the project by controlling costs and elevating benefits.  Demand density is a measure of economic power available to exploit, per mile of infrastructure.  If you could exploit more of that economic power, and do so at a better cost point, you could do more fiber deployment.

In the US, you can see both the demand density issue and the exploitation opportunity emerging, in a comparison between Verizon and AT&T.

The demand density of the Verizon “core wireline” territory is, on the average, seven times that of the AT&T territory.  In some areas, the difference is as much as eleven times.  It’s hardly surprising that Verizon’s original broadband strategy was to run FiOS to the places where density was high and let the others alone.  Not only that, Verizon sold off lines in areas where their demand density was far less.  AT&T’s approach was far less fiber-intensive, and they’ve now taken the course of using satellite to deliver TV in their territory at large.

The TV dimension here is important.  The current data suggests that about 25% of users will pay for more than basic broadband wireline services, but this depends on how much of a premium has to be charged.  The challenge in fiber deployment is that current pass costs for fiber are probably five times or more the cost of passing households with HFC as cable providers do.  Plant maintenance is lower, though, and Internet is better with fiber.  What’s made both HFC and FTTH work in the US has been broadcast TV delivery.

The dependence of fiber and HFC on TV is troubling given that Verizon’s data shows that well over a third of its customers have elected to contract for its low-cost “Custom” bundle, and that the number of customers who renew at this lower price point is so great that it offsets new-subscriber gains.  It would appear that attempts to go down-market with FiOS, to get better penetration in the areas where FiOS is offered, are converting existing customers to lower-priced plans.  If broadcast TV is in fact in jeopardy, then the TV justification for HFC and FTTH is too.

One possible answer is offered by 5G, as Verizon has said.  If you were to feed 5G microcells with FTTN technology, you could hop into the home at a lower cost and get a microcell community that could be exploited for other services too.  Since mobile broadband and the user habits it’s enabled are arguably a big factor in the drop of interest in broadcast TV, this would exploit the cause to mitigate the effect.  However, shifting users to an on-demand TV model of any kind reduces the role and power of the TV provider, shifting it more to the networks who develop the programming.  In the long run, if the Internet becomes the delivery mechanism for programming, the effect is to switch users to a service with a lower profit, further complicating the justification for fiber.

Nobody is committed to universal fiber in the US, nor does anybody in the provider community believe it is possible—and that includes Google.  We’ve probably picked most of the low-hanging areas in a demand-density sense, and while a symbiosis of 5G and fiber to the node/neighborhood is promising, we can’t say how much it would save in cost or drive in incremental revenue.  We also know that viewing trends seem to be shifting away from broadcast TV, albeit slowly, which threatens the biggest justifier for fiber.  So, in the end, this path to universal fiber is risky too.

So is there hope?  Quigley makes some interesting points in the “hope” area, and there’s support for them in other industry trends.  First, pass cost with fiber is coming down because of technology improvements.  Verizon’s pass cost for FiOS is said to be less than half what it was originally, perhaps as low as a third.  Second, the original cost estimates for NBN could still be met with FTTP today.  Taken together, though, this says that most of the realizable fiber-technology savings on the table would be eaten up by other factors.  That suggests that revenue is critical, and the US experience as reported by Verizon is troubling in that area.

We may, if broadcast TV continues to slip slowly as the cash cow for access infrastructure, have to consider the unthinkable, which is Internet that stands on its own.  Those who might argue that you can get broadband only from cable and telco providers should consider that those who do take TV fund a big chunk of the access infrastructure, from which broadband-only can then be delivered at a lower marginal cost.  What happens if that subsidy goes away?  If Verizon’s experience with its low-cost “Custom” plan is an indicator, we’re seeing new pricing pressure on TV.

I reported on Verizon’s hope of marrying FiOS with 5G earlier, and that could allow Verizon to lower the connection cost for homes by eliminating a lot of the in-home elements and even the service call.  In the long run, though, it may be that we need 5G mobile services to play off those FTT-something microcells.  It’s possible to conceptualize mobile services created more by microcells than by traditional ones, but the technology issues are significant—roaming when you might pass through ten microcells just to get to the end of your street is an example.

Something in the mobile space is needed, IMHO.  Mobile infrastructure is getting the investment.  Mobile services are the direction the consumer is going.  If wireline broadband in general and FTTP in particular is the goal, then we’ll have to meet it increasingly with mobile contributions.  The technical issues, and even ownership and regulatory issues, will have to be overcome some way, and we should probably start thinking about that now.

Does the Secret of the Cloud Lie in Lambdas?

Sometimes it’s useful to take an extreme case, a kind of end-game view, to gain insight into how a technology shift might be happening and what it might mean to the rest of the tech world.  I’ve said several times in this blog that the cloud, to be truly revolutionary, had to be more than hosted virtualization.  What does it have to be?  It’s often hard to see the direction when you look at subtle details of application and cloud evolution, so let’s try direction-finding through something I mentioned earlier this week, Amazon’s AWS Lambda.

In programming, a Lambda function is one that operates only on what’s passed to it.  You give it arguments and it does its thing with them, using nothing else.  There’s ample documentation on how to use Lambda functions in all the popular programming languages, and Java 8 has specific support for them.  You may have seen reference to the programming style that’s based on Lambda—“functional programming”.
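The style is easy to show in a few lines.  Python is used here for consistency with the other sketches (the document mentions Java 8’s support; the idea is identical): a lambda operates only on its arguments, with no shared state and no side effects.

```python
# A lambda function operates only on what is passed to it: no shared
# state, no side effects.  Python's lambda and functional built-ins
# illustrate the style.
fahrenheit = [32, 68, 212]
to_celsius = lambda f: (f - 32) * 5 / 9
celsius = list(map(to_celsius, fahrenheit))
print(celsius)            # [0.0, 20.0, 100.0]

# The same idea composes: filter then transform, all stateless.
boiling = list(filter(lambda c: c >= 100, celsius))
print(boiling)            # [100.0]
```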

Lambda functions are clearly a very different style of programming.  It’s hard for me, a former software architect and programmer, to see an easy way of converting something to Lambda form—you’d pretty much have to start over.  It’s also probably hard for most non-programmers to see much of a value in Lambda.  Even in the IT space, some have told me that they wonder if the whole idea is just the invention of some geeks looking for novelty over utility.

But it’s not the abstract notion of Lambdas I want to get into here, but how Amazon seems to be using them in the cloud.  AWS Lambda is a service unlike any other; it’s not about hosting at all.  You don’t get a VM or container.  What you get is the right to run some Lambdas, and if you find that confusing I’m pretty sure visiting the Amazon site won’t help you much.  I’m going to get a bit inventive to try to explain why I think that Lambda is important to the cloud.

The world is a kind of giant event-space, a place where stuff happens.  That’s true of the global ecosystem and also of subsets like companies, offices, teams, households, even people.  Typically, we think of IT as a set of tools to handle information needs, to get stuff for us, store things, process things.  In most of these cases the IT tools are defined processes that you interface with through some device.  We think of a flow of processes, a workflow.

An event-space view is different.  Stuff happens, and the stuff provokes a process response.  Events trigger processes directly.  The event can be an outside stimulus coupled in some way into the cloud, from IoT or a device.  It could also represent a piece in an ongoing data flow, somewhere between origination and destination.  In a sense, any condition that can be detected and expressed can be an event, and can trigger a Lambda process.

Amazon’s basic material doesn’t make it clear, but it appears that their Lambdas are a bit looser than the strict definition; you can do a database access inside one, for example.  However, it appears that all Lambda processes are “RESTful” in that they don’t maintain state inside them, but process each request they get atomically.  That means that the AWS Lambda service can scale to handle a stream of events by instantiating multiple parallel Lambda processes.
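Why statelessness enables that scaling can be sketched in a few lines.  To be clear, this is not the AWS Lambda API; the handler shape and names are hypothetical.  Because the handler keeps no state between calls, any number of copies can process events in parallel with identical results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, stateless event handler (NOT the AWS Lambda API).
# Everything the handler needs arrives in the event itself, so any
# number of parallel instances produce the same results.
def handle_event(event):
    return {"device": event["device"], "alarm": event["reading"] > 100}

events = [{"device": f"sensor-{i}", "reading": i * 30} for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:   # parallel "instances"
    results = list(pool.map(handle_event, events))

print(sum(r["alarm"] for r in results))   # 4 readings exceed the threshold
```

Had the handler kept a running total or a session inside itself, the parallel copies would disagree; statelessness is what lets the service instantiate as many as the event stream demands.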

You can probably see the similarity between Lambdas and microservices already.  The difference, I think, is that a microservice is represented by a URL and invoked by somebody when they need what it represents.  It sits there waiting for action.  A Lambda is bound directly to an event, and it’s spawned when the event is recognized and disappears when it has done its work.

Amazon calls this “serverless” processing, and others would say it’s a step toward “NoOps”, meaning no operations processes are needed because there’s no explicit infrastructure to maintain.  I don’t think either of these notions does the concept justice, and in fact both can be distracting.  What AWS Lambda is, is a framework for distributed, real-time event handling that can scale fully.  And because you can visualize any business process as a generator of events, it’s a way to build applications for the cloud that are fully elastic.

It will probably be a long time before current applications are redesigned to be fully Lambda-ized, if they ever are.  However, virtually all IoT applications, process control applications, many mobile empowerment applications, and every piece of NFV could be done in Lambda form and gain significant value for their users.  Even today’s application deployment tasks, automated routinely in DevOps tools, are starting to become event-driven in part so that they can respond to things like excess work by spawning some new component instances.  In the long run, those spawned processes themselves could be Lambdas, but in the short term you could use Lambdas linked to the key events to activate the necessary scale-in-or-out steps.

Amazon offers a number of case studies on their Lambda page, and if you wade through them you can catch a sense of the power of the approach.  If elasticity is the essential property of the cloud, then Lambdas are the ultimate cloud.

You can also catch a fundamental truth, which is that you have to develop applications in Lambda form.  You can’t just “run” something as a Lambda.  This demonstrates that the cloud, as the future of computing, has its own rules for application-building, and you’ll have to conform to them to gain the most from the cloud itself.  It also demonstrates that Lambda-izing something is a major change of thinking.

One of my big concerns is that opportunities like NFV or IoT, both of which scream to be developed in Lambda form, are going to get pigeon-holed into a legacy model by the specifications that are developed.  There is no “interface” to a Lambda in a traditional sense, for example, so do standards that mandate interfaces end up foreclosing or diminishing Lambdas?

Exploiting Lambdas is the path to optimum usage of the cloud, and yet we don’t know a lot about how they might be used.  I think IoT, for example, should start as a vast cloud (literally) of Lambdas that are feeding repositories from both raw sensor data and from correlated/processed data created by other Lambdas.  How exactly do you architect this sort of thing?  What are the networking requirements as a user, and as a provider?  For sure the only durable thing about a Lambda cloud is the elastic connectivity, the virtual network.

The downside to Lambdas is the “I don’t know” problem.  Even in the software world, Lambdas and functional programming aren’t exactly sweeping the world.  I’ve seen a lot of interest in the software organizations of startups, particularly those who explicitly intend to deploy on the cloud.  In the broader space, not so much.  A straw poll I took of OpenStack supporters found fewer than 10% believed that Lambdas could help OpenStack (though they could be crucial in handling infrastructure events).

That might make Lambdas perversely attractive to some, though.  If this model is as good as I think it is, then using it would confer an enormous competitive advantage, not to mention a lot of bragging rights in the media.  I suspect this will take a bit more than eight months to shake out, but if Lambdas are really going to catch on, we might see signs by the end of this year.

Making NaaS the Center of the Cloud

Yesterday, I suggested that the tech giant of the future had to be built around a virtual networking portfolio.  I was aiming the comments at “tech giant” companies and specifically at how the companies should reorder their assets to match the needs of the cloud era.  I’d like to dig a bit deeper into that now.

“Virtual networking” or “Network-as-a-Service” or “NaaS” is the process of creating a connection community with its own address space and its own rules of traffic handling, without altering the state of devices in the real network that underlays it.  Your NaaS model can couple traffic issues downward, but it should not directly provision user services by changing network device properties as we would do with most VPN and VLAN offerings.

We learned the need for virtual networking with the advent of public cloud computing.  The problem you have in a cloud data center is that every user is an independent tenant who has to be partitioned from other tenants for security and compliance reasons.  Nicira, whose overlay “SDN” approach was bought by VMware and became NSX, was the first real success in the space.

If you follow the definition of NaaS into the technology domain, you end up with the notion of an overlay network, meaning a network where service connectivity is created by assembling tunnels (built on the underlay physical network) and nodes (that are usually hosted on servers somewhere).  This is very SD-WAN-like in a technology sense, so the concept isn’t revolutionary.  Nicira worked this way and NSX still does.

If you look into the planning domain, things get a bit more interesting.  A virtual machine is a kind of machine, but also a kind of “machine-in-waiting”.  It’s an abstraction that has to be made real, but at some level you can plan as though it were real already.  Can I, for example, use a well-designed DevOps tool to deploy an application on containers or VMs in the abstract?  If I do, is it possible that the “network-in-waiting”, created by NaaS, becomes the central element?

Let’s say that I have an application with five components.  I could draw that application as five component circles joined below by an oval that represents an IP subnetwork.  I could assign an RFC 1918 (private IP) address to each of the components and give them URLs that translate to those addresses.  I could add a default gateway to provide them Internet access, or NAT/PAT processes to translate the private addresses I assigned into some VPN or Internet address space.  What I would be doing is building a virtual network of abstract components.  Once I’ve done this, I could then host the stuff and have it all made real.
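That abstract-first address plan can be sketched in a few lines of Python; the component names and the 10.0.1.0/24 subnet here are purely illustrative, not tied to any real tool:

```python
import ipaddress

def build_abstract_subnet(components, cidr="10.0.1.0/24"):
    """Assign each abstract component a private (RFC 1918) address on a
    shared IP subnetwork, before anything is actually hosted anywhere."""
    hosts = ipaddress.ip_network(cidr).hosts()
    # A default gateway for Internet/NAT access, also still abstract.
    gateway = str(next(hosts))
    plan = {name: str(next(hosts)) for name in components}
    plan["default-gateway"] = gateway
    return plan
```

Nothing in this plan touches a real network; realizing it later, against whatever resources are chosen, is a separate step.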

This is, I think, an interesting way to look at things.  On reflection, most would agree that whether we’re talking about DevOps deploying highly componentized applications (particularly ones sharing microservices across applications) or NFV orchestration, there’s value in orchestrating the abstractions and then realizing them.  A model that does this would let you define deployment rules in a resource-independent way; it would let you structure things logically first and then physically.

So let’s run with this approach, both for the cloud and for NFV, and see what it would mean for vendors.

The first obvious point is that you need to have a virtual network architecture, an overlay SDN model, that lets you abstract NaaS fully.  The network is known by what it exposes as addressed resources, so you have to know how this exposure happens.  It creates what’s often called the “northbound interface” or NBI, and it’s the set of properties that intent models have to expose.  Which means that a NaaS is a very small step from the functional intent model that a network service or application actually represents.  Each exposed address is an API, which has functionality.  Everything that happens from this point forward realizes the functions that these exposed addresses/APIs represent.

The second point is that your virtual network architecture has to be extremely versatile.  The basic cloud network model (OpenStack’s Neutron, for example) focuses on deploying a VM or collection of VMs and linking them to a subnetwork.  That’s useful in basic cloud applications, but when you look at something complex like a whole community of microservices serving applications and designed to migrate among cloud providers and private hosting dynamically, you’re in a new world.  If the abstraction is the network, we have to be able to build a NaaS model that can cross all these boundaries.  A “subnet focus” like Neutron’s makes it hard to build a truly distributed cloud app or to deploy NFV.  The virtual network architecture is, for that reason, the differentiator for vendors.

There are a number of vendors who arguably have differentiable virtual network architectures, including Nokia (Nuage), Juniper (Contrail), and Cisco (ACI) plus (of course) VMware and HPE.  For these vendors, the challenge will be relating their offerings to cloud/NFV deployment and the orchestration tools needed.  For the vendors who don’t have their own virtual network architecture, the challenge will be embracing something in a way that doesn’t end up inviting a competitor into the tent.

OpenDaylight has been advocated as a kind of universal NaaS, but the challenge here I think is that ODL is really about realizing connectivity and not about building a virtual network architecture.  If you have the right modeling above it, ODL is a great way of achieving multi-vendor device and network control.  If not, you still don’t have a virtual network approach.

I similarly doubt that an SDN or OpenFlow device is the best way to create universal NaaS, because it’s limited in how it can be deployed.  A software-based, virtual device seems the best way to build virtual network architectures.  I also have concerns about just how multiple OpenFlow or ODL domains could be linked or federated properly.  I think it’s possible, but I don’t believe we have a full solution that’s tested and ready to deploy.

Some providers have elected to use virtual appliances (routers/switches) and VXLAN tunnels to construct virtual networks, and this strategy can work well if you have a DevOps or orchestration foundation to set up all the pieces.  I think that a vendor would be able to combine open-source appliances and tunnels to create an offering of their own, and this demonstrates that having full orchestration capabilities above could probably empower a lot of tunnel-and-node combinations even if they weren’t designed as a true NaaS framework.
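Here's a planning-level sketch, with invented node names and addresses, of how an orchestrator might compose such tunnel-and-node combinations; a real tool would emit VXLAN and appliance configuration from a structure like this:

```python
def build_overlay(nodes, links):
    """Compose a tunnel-and-node overlay: each link becomes a tunnel
    between two hosted virtual appliances.  Purely a planning model;
    nothing here touches the underlay network."""
    tunnels = []
    for a, b in links:
        if a not in nodes or b not in nodes:
            raise ValueError(f"link references unknown node: {a}-{b}")
        tunnels.append({"endpoints": (nodes[a], nodes[b]), "type": "vxlan"})
    return {"nodes": nodes, "tunnels": tunnels}
```

The output is exactly the kind of resource-independent structure an orchestration layer could hand to whatever tunnel-and-node technology happens to be available.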

That brings up the next point, which is that if a true overlay is created by a tunnel-and-node structure, then it’s traffic as far as the underlayment (the physical network) is concerned.  Each NaaS service then has a logical topology that’s entirely elastic, meaning that it extends pretty much to anywhere you say it should.  However, it may be useful to be able to control how the virtual and real network topologies relate to control traffic, enforce security, and provide QoS.  At the minimum you’d want real-network analytics to be coupled upward, and you’d want to be able to map tunnel and node locations to get your NaaS data on optimum routes in the real world.

I’ve said in prior blogs that while a new service request might be a useful trigger for rebalancing of resources below, it should not couple directly down.  The operator cannot allow service-specific details to change the state of shared resources or they risk major stability/security issues.  It’s better to say that some sort of class-of-service framework for tunnel hosting would be used, and that analytics would dictate how the resources were balanced.
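A sketch of that indirection, with hypothetical class names and latency figures: the service request selects a pre-provisioned class of service rather than changing the state of shared devices directly:

```python
def select_cos(service_request, cos_classes):
    """Map a request to the loosest pre-provisioned class that still
    meets it, so no service-specific change touches shared resources."""
    need = service_request["latency_ms"]
    # Consider classes from loosest to tightest latency guarantee.
    for name, klass in sorted(cos_classes.items(),
                              key=lambda kv: kv[1]["latency_ms"],
                              reverse=True):
        if klass["latency_ms"] <= need:
            return name
    raise ValueError("no class of service meets the request")
```

Rebalancing the capacity behind each class would then be driven by analytics across all services, not by any single service request.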

Optical vendors should probably be looking at this level of NaaS.  The most useful model would be to have optical transport, SDN tunnel grooming, and a set of strong orchestration tools that could then be used to either activate NaaS features already present in an overlay solution or deploy tunnel-and-node configurations as needed.  Ciena seems to have this capability, though I think their SDN coupling story could be stronger.

So there we are.  The cloud is a networked community first and foremost, and it has to be treated that way by anyone who wants to be successful.  Because the cloud is virtual, the network that creates it has to be virtual as well.  While each vendor may have to deal with this reality in a different way because of product line boundaries, it’s still reality for all.

What are the Right Product Elements for the Era of the Cloud?

I noted in an earlier blog that some of the tech giants seemed to be shedding parts of their businesses, parts that would at one time have been considered strongly symbiotic or even critical.  Credit Suisse has just suggested that HPE would be worth more to shareholders if it broke itself up and sold off the pieces.  At the same time, I’ve noted that some new technologies like NFV seem stalled because without a very broad product footprint, vendors can’t make enough on a deal to justify all the time and attention needed to close it.  It’s worth asking whether there’s a fundamental issue in moving tech forward under current conditions, and how it might be resolved.

“Sales account control” was the watchword for tech giants like IBM and Cisco.  The idea was that if you could earn enough on a given customer, and with those earnings justify sustaining a continuous sales presence, you’d have people in place to feed back customer feelings and plans, and to influence those plans in your favor.  This worked back in 2000, when Cisco was one of the top five companies in market cap.  If you look at the market-cap rankings as of August 1st, you find companies who succeed by selling a lot of inexpensive things to a lot of small buyers.  Obviously that kills the account control strategy.  Does it kill other stuff?

Consumer products are largely self-contained.  You don’t have to build a network to buy a phone, use social media, or buy a product online.  They also tend not to be subject to complex cost justifications; do people ask what the ROI on that new iPhone model would be, or do they just want the attention they’ll get from their friends for having it?  The question, then, is whether a simplistic purchase model is essential to tech success these days.  You can look at this issue from the IT side in cloud computing and from the network side in NFV.

Suppose I’m a server vendor.  Do I go out and sell cloud computing to my customer?  If so, am I not making them into at least a partial user of public cloud rather than of my own servers?  If public cloud is cheaper than self-owned data centers, wouldn’t a total transition to public cloud end up using fewer servers, reducing my total addressable market?

Suppose then that I’m a virtualization software provider.  I don’t have exposure to the server space.  I do have an opportunity to sell my software to public cloud providers or to enterprises who want to use cloud principles to get more from their servers (meaning spend less for equal application performance).  The cloud transformation doesn’t hurt me one bit.

But the problem here is the sales volume.  My server vendor, selling perhaps ten thousand servers, has a boatload of money on the table.  My software vendor has a limited licensing fee.  The server person, who could justify a significant sales effort given the money at stake, wins if no cloud decision is made.  The software person who has a big win from the cloud doesn’t make a lot on the victory and can’t justify anything like the same level of sales effort.  Which side will out-sing the other?

So is the public cloud provider the answer?  You can see from Amazon’s numbers that the total cloud computing market is still less than 5% of total IT spending.  It’s also hard to see how Amazon could expect to send cloud-acolytes out to sing the praises of the cloud from door to door.  And basic IaaS, the kind of cloud service that would be relatively easy to adopt, has thin margins.  The “real” cloud, the cloud that’s created by adding in the dozens of hosted AWS services, is likely to require application development skill to adopt.  Amazon door-to-door looking for architects and developers?  Give me a break.

Now to NFV.  Like the cloud, the essential value proposition for NFV was cost-based: replace expensive proprietary devices with software features hosted on commodity servers.  This generates four vendor constituencies.  We have the server people, the hosting providers.  We have the vendors who supply the virtual-function software that gets hosted.  We have the vendors who offer the NFV implementation that pulls all this together, and we have the network equipment vendors whose proprietary gear the whole process is trying to displace.  And we have the same question of “What’s in NFV for me?”

Server vendors could certainly be expected to rush out and promote NFV migration…except that the whole notion of NFV is commodity servers.  The challenge for a server vendor who wants to drive NFV is that they have no way of being sure they’d get the server business after they’d done all the education and justification work.

Network equipment vendors would have to see NFV as a “nowhere to go but down” game, so they tend to follow the path of reluctant support.  If a customer really wants NFV, it’s better to lose the difference in price between your NFV products and your proprietary boxes than to lose everything.  But if the customer is on the fence, let them sit and get comfortable.  That means these vendors aren’t trying to lead the charge.

The providers of virtual functions have, according to operators, seen NFV as a kind of license to kill.  If NFV is some sort of mandate operators have to follow no matter what, then it makes sense to get as much for your functions as possible.  Of course, NFV isn’t a mandate and the cost of VNF licensing is one of the impediments to NFV most often cited by operators.  While this group may want to push the market, they don’t have a realistic position to leverage…yet.

For the vendors who actually have NFV software, meaning those who can deploy and manage services based in whole or in part on VNFs, the dilemma is most acute.  On the one hand, they have the keys to the kingdom, meaning they can make the business case to drive the whole ecosystem.  On the other hand, they may not have any real way of monetizing their success, because the operators all expect the software to be open-source commodity stuff.

So where does this leave us for both the cloud and NFV?  It seems to me that the notion of ecosystemic sales is the correct one, but we may not have the right ecosystems in play.

This is clearly the era of the cloud, but at a deeper technical level it’s the era of virtualization.  Virtualization plus distributability of resources equals the cloud, equals NFV.  Anyone who wants to drive the market in the cloud or NFV will have to be able to resolve that equation with their own offerings, whether they’re products or open-source distros.  That means that today’s tech giants need to be built around two things—networking and virtualization-enabling software.  The software handles the translation between abstractions and resource systems and the networking connects the resource systems.

Seen in this light it’s clear that Cisco has all the pieces they need.  Rival network vendor Juniper is trying to add in a differentiable virtualization model, which would give them the parts they need.  HPE has networking and also has a cloud-and-virtualization vision, and so they have the pieces too.  IBM does not; they sold off their network business years ago (to Cisco).  The lack of network incumbency is IBM’s biggest problem, and the need to sustain the virtualization-plus-network model is the imperative that would have to guide any HPE shedding of product areas.

They could shed a lot, though.  I don’t think the resource systems themselves, the applications or the servers, are essential for success.  A vendor can partner with somebody to get reference implementations for deployment, and the purpose of these has to be to validate the generality of their solutions.  When you’re trying to move an industry to a whole new paradigm, a broad set of stuff that’s not particularly relevant to the transformation only makes your management focus and sales focus more difficult to direct properly.

“Could” doesn’t necessarily mean “should” though.  Every network vendor (save the optical guys) would stand to lose something by a cloud/virtualization/NFV transformation.  What servers could do is ease the impact of that transformation by giving them a stake in the hardware that’s growing to cover that which is shrinking.

So watch HPE as a signpost.  If they do shed more assets and if their shedding seems to accommodate the technologies that can drive, and at the same time protect against, the transition, then they’ve probably seen the light, and the rest of the industry probably has, or will.  If they seem to be just spinning around to gain shareholder value, then they and the industry may be in trouble with the cloud, and all it implies.