Is the New OPNFV Event Streams Project the Start of the Right Management Model?

One of those who comment regularly on my blog brought a news item to my attention.  The OPNFV project has a new activity, introduced by AT&T, called “Event Streams” and defined HERE.  The purpose of the project is to create a standard format for sending event data from the Service Assurance component of NFV to the management process for lifecycle management.  I’ve been very critical of NFV management, so the question now is whether Event Streams will address my concerns.  The short answer is “possibly, partly.”

The notion of events and event processing goes way back.  All protocol handlers treat messages as events, for example, and you can argue that even transaction processing is about “events” that represent things like bank deposits or inventory changes.  At the software level, the notion of an “event” is the basis for one form of exchanging information between processes, something sometimes called a “trigger” process.  The other popular form is called a “polled” process because in that form a software element isn’t signaled that something is happening; it checks to see whether it is.

Many of the traditional management and operations activities of networks have been more polled than triggered because provisioning was considered to be a linear process.  As networks got more complicated, more and more experts started talking about “event-driven” operations, meaning something that was triggered by conditions rather than written as a flow that checked on stuff.  So Event Streams could be a step in that direction.

A step far enough?  There are actually three things you need to make event-driven management work.  One, obviously, is the events.  The second is the concept of state and the third is a way to address the natural hierarchy of the service itself.  If we can find all those things in NFV, we can be event-driven.  Events we now have, but what about the rest?

Let’s start with “state”.  State is an indication of context.  Suppose you and I are conversing, and I’m asking you questions that you answer.  If there’s a delay or if you don’t hear me, you might miss a question and I might ask the next one.  Your answer, correct in the context you had, is now incorrect.  But if you and I each have a recognized “state” like “Asking”, “ConfirmHearing”, and “Answering” then we can synchronize through difficulties.

In network operations and management, state defines where we are in a lifecycle.  We might be “Ordered”, or “Activating” or “Operating”, and events mean different things in each state.  If I get an “Activate” in the “Ordered” state, it’s the trigger for the normal next step of deployment.  If I get one in the “Operating” state, it’s an indication of a lack of synchronicity between the OSS/BSS and the NFV processes.  It is, that is, if I have a state defined.
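
To make that concrete, here’s a minimal sketch in Python of what such a per-element state/event table might look like; the states, events, and process names are hypothetical, since defining them is exactly what today’s specifications don’t do.

```python
# A minimal sketch, with hypothetical state and event names, of a per-element
# state/event table: the same event means different things in different states.

STATE_EVENT_TABLE = {
    ("Ordered",    "Activate"):  "begin_deployment",    # the normal next step
    ("Operating",  "Activate"):  "report_sync_error",   # OSS/BSS out of sync
    ("Operating",  "Fault"):     "begin_recovery",
    ("Recovering", "Recovered"): "resume_operation",
}

def handle_event(state, event):
    """Look up the process to run for this state/event combination."""
    return STATE_EVENT_TABLE.get((state, event), "log_unexpected_event")

print(handle_event("Ordered", "Activate"))    # -> begin_deployment
print(handle_event("Operating", "Activate"))  # -> report_sync_error
```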

Let’s look now at a simple “service” consisting of a “VPN” component and a series of “Access” components.  The service won’t work if all the components aren’t working, so we could say that the service is in the “Operating” state when all the components are.  Logically, what should happen then is that when all the components are in the “Ordered” state, we’d send an “Activate” to the top-level “Service object”, and it would in turn generate an event to the subordinates to “Activate”.  When each had reported it was “Operating”, the service would enter the “Operating” state and generate an event to the OSS/BSS.
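
And here’s a small, purely illustrative Python sketch of how that parent/child coupling might behave, using hypothetical “Service”, “VPN”, and “Access” objects; it’s a thought experiment, not a proposed NFV interface.

```python
# Hypothetical sketch: a parent service object derives its own state from its
# subordinates and exchanges events with the layers above and below it.

class ServiceObject:
    def __init__(self, name, children=None):
        self.name = name
        self.state = "Ordered"
        self.children = children or []

    def send_event(self, event):
        if event == "Activate" and self.state == "Ordered":
            self.state = "Activating"
            for child in self.children:          # propagate the event downward
                child.send_event("Activate")
            self._check_children()
        elif event == "ChildOperating":
            self._check_children()

    def _check_children(self):
        # Leaf elements with no children become Operating immediately here; in
        # reality they would deploy resources and report back asynchronously.
        if not self.children:
            self.state = "Operating"
        elif all(c.state == "Operating" for c in self.children):
            self.state = "Operating"
            print(f"{self.name}: Operating -> event to OSS/BSS")

service = ServiceObject("Service", [ServiceObject("VPN"),
                                    ServiceObject("Access-East"),
                                    ServiceObject("Access-West")])
service.send_event("Activate")
```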

So what we have here is a whole series of event-driven elements, contextualized (state and relationship) by some sort of object model that defines how stuff is related.  It’s not just one state/event process (what software nerds call “finite-state machines”) but a whole collection of such processes, event-coupled so that the behaviors are synchronized.

This concept is incredibly important, but it’s not always obvious why, so here’s an example.  Suppose that a single VNF inside an Access element fails and is going to redeploy.  The failed VNF would have to signal the problem with an event, and the access element would then have to enter a new state, let’s call it “Recovering”.  Does that access element go non-operational immediately, or does it give the VNF some time?  Does it report even the recovery attempt to the service level via an event, or does it wait until it determines that the failure can’t be remedied?  All of this stuff would normally be defined in state/event tables for each service element.  In the real world of SDN and NFV, every VNF deployed and every set of connections could be an element, so the model we’re talking about could be multiple layers deep.

This has implications for building services.  If you have a three- or four-layer service model you’re building, every element in the model has to be able to communicate with the stuff above and below it through events, which means that they have to understand the same events and have to be able to respond as expected.  So what we really have to know about service elements in SDN or NFV is how their state/event processing works.

Obviously we don’t know that today, because until now we haven’t had even a consistent model of event exchange, which is what Event Streams would define.  But the project doesn’t define states, nor does it define state/event tables or standardized responses.  Without those definitions an architect couldn’t assemble a service from pieces, because they couldn’t be sure that all the pieces would talk the same event language or interpret the context of lifecycles the same way.

The net of this is that Event Streams are enormously important to NFV, but they’re a necessary condition and not a sufficient one.  We still don’t have the right framework for service modeling, a framework in which every functional component of a service is represented by a model “object” that stores its state and the table that relates events to handling processes in every possible state.

The question is whether we need that, or whether we could make VNF Managers perform the function.  Could we send them events?  There’s no current mandate that a VNFM process events at all, much less process some standard set of events.  If a VNFM contains state/event knowledge, then the “place” of the associated VNF in a service would have to be consistent or the state/event interpretation wouldn’t be right.  That means that our VNF inside an access element might not be portable to another access element because that element wanted to report “Recovering” or “Faulting” under different conditions.  IMHO, this stuff has to be in the model, not in the software, or the software won’t be truly composable.

I’m not trying to minimize the value of Event Streams here.  It’s very important, provided that it provokes a complete discussion of state/event handling in network operations.  If it doesn’t, then it’s going to lead to a dead end.

Will Operators Avoid the Same Mistakes They Say Vendors Make in Transformation?

Operators want open source software and they want OCP hardware, or so they say.  It would seem that the trend overall is to stamp out vendors, but of course neither of these things really stamps out vendor relationships.  They might have an impact on the buyer/seller relationship, though, and on the way that operators buy and build networks.  If the model crosses over into the enterprise space, which is very likely, then it could have a profound impact on the market overall.

If there are multiple stages of grief, operators have experienced their own flavor in their relationship with their vendors.  Twenty years ago, operators tended to favor big vendors who could provide complete solutions because it eliminated integration issues and finger-pointing.  Ten years ago, operators were starting to feel they were being taken advantage of, and many started creating “procurement zones” to compartmentalize vendors and prevent one from owning the whole network.  Now, as I said, they’re trying to do what’s really commodity or vendor-independent procurement.  That’s a pretty dramatic evolution, and one driven (say almost all operators) by growing concern that vendors aren’t supporting operator goals, but their own.

What created the divergence of goals can be charted.  Up until 2001, technology spending by both network operators and enterprises was driven by a cyclical transformation of opportunity that, roughly every fifteen years, introduced a new set of benefits that drove spending growth.  New opportunities normally demand new features from technology products, new differentiators appear, and operators have new revenue to offset costs.  Thus, operators tend to be looking for ways to realize that new opportunity quickly, and vendors are their friends.

In 2001, the cycle stalled and has not restarted.  If you go back to both surveys and articles you can see that this inaugurated a new kind of tech market, where the only thing that mattered was reducing costs.  That’s a logical response to a static benefit environment—you can improve financials only by realizing those benefits more cheaply.  But cost management usually starts with price pressure on purchases, and that’s what launches an “us-versus-them” mindset among operators and vendors.

The old saying “we have met the enemy and they are us” might well apply now.  Both open-source software and “COTS” hardware are fundamentally different from traditional products because they are commodities.  They present operators with a new problem, which is that vendor support for innovation depends on profit from that support.  Absent a differentiable opportunity, nobody will support transformation unless operators pay for professional services.  Ericsson probably foresaw this shift, and as a result focused more on professional services in its own business model.  While the shift to commodity network elements has been slower to develop than Ericsson may have hoped, it’s still clearly underway.

But OK, operators want open-source and COTS.  Will they be able to get it, and if so who will win out?

If you look at transformation as operators do, you see that their primary goal is what could be called “top-end engagement”.  They have benefits in view—opex and capex reduction and revenue augmentation—and they need architectures that engage those benefits directly.  Traditional technology specifications and standards, starting as they normally do at the bottom, don’t even get close to the business layer where all these benefits have to be realized.  That’s why the operator approaches seem to focus “above” the standards we know.

So the most important point here is to somehow get architectures to tie back to the benefits that, while they were expanding, drove industry innovation.  A friend of mine, John Reilly, wrote a nice book (available from Amazon in hard copy or Kindle form) called “The Value Fabric: A Guide to Doing Business in the Digital World” that might be helpful in framing business benefits for multiple stakeholders.  It’s a way of describing how a series of “digital bridges” can establish the relationship among stakeholders in a complex market.  It’s based on a TMF notion, which means it would be applicable to operator ecosystems, and vendors could explore the notion of providing digital bridges to link the stakeholders in an operator ecosystem together.  Advertising, content providers, operator partners, hosting providers, and even software providers who license their stuff, are examples of stakeholders that could be linked into a value fabric.

But all of this good stuff is likely to be software, and however valuable software is in the end, it’s not going to cost as much as a couple hundred thousand routers.  In fact, it’s reducing that hardware cost that’s the goal for operators.  Network vendors are not going to embrace being cost-reduced till they vanish to a point.  And if COTS servers and open-source software are the vehicle for diminishing network vendor influence, who’s incented to take their place in the innovation game?  In the architectures that operators are promoting, no major vendor is a long-term winner.

I’m not saying this is bad, or even that vendors don’t deserve some angst.  I’ve certainly told enough of them where things are heading, and tried to get them to address the issues while there was still time, to have lost sympathy.  What I am concerned about is how the industry progresses.  Is networking to become purely a game of advertising/marketing at the operator level, and of cheap fabrication of white boxes at the hardware level?  If so, are the operators now prepared to drive the bus by investing some of their savings in technologists who can do the deep thinking that will be needed even more in the future?

The network of the future is going somewhere other than the simple substitution game that operators envision.  You can see, for example, that personal agents are going to transform networks.  So will IoT, when we finally accept it’s not about making every sensor into a fifty-buck-a-month LTE subscriber.  The limitation in vendor SDN/NFV architectures is that they try to conserve the legacy structure that progress demands we sacrifice.  The limitation in operator architectures is that they constrain their vision of services to the current network, and so forego the longer-term revenue-driven benefits that have funded all our innovations so far.

What’s above commoditizing services?  Revolutionary services, we would hope.  So let’s see some operators step up and show more determination to innovate than the vendors they’re spurning.  Let’s find values to build a fabric around.

Network Feature Composition, Decomposition, and Microservices

At the TMF event in Nice, Verizon opened yet another discussion, or perhaps I should say “reopened” it, because the topic came up way back in April 2013 and it was just as divisive then.  It’s the topic of “microservices”, or breaking down virtual functions into very small components.  NetCracker also had some things to say about microservices, and so it’s a good thing to be talking about.

If we harken back to April of 2013, we’re at a point where the NFV ISG had just opened its activity.  There was still plenty of room to discuss scope and architecture, and there was plenty of discussion on both.  This was the meeting where I launched the CloudNFV project, and it was also the meeting where a very specific discussion on “decomposition” came up.

Everyone knows that the purpose of NFV was to compose services from virtual functions.  Anything that composes a whole from some parts will be sensitive to just how granular the parts are.  We know, for example, that if you compose virtual CPE from four or five functional elements (firewall, NAT, etc.) you get some benefits.  If you had a virtual function that consisted of all of these things rolled into one and that was as granular as you got, it’s hard to see how a physical appliance wouldn’t serve better.  Granularity equals agility.

The “decomposition” theme relates to this granularity.  Here, the suggestion was that operators require that virtual functions be decomposed not only into little feature granules, but even further into what today we’d call “microservices”.  There are a lot of common elements in things like firewall, VPN, and NAT, so the decomposition camp says; why not break things down into smaller elements so that even totally new stuff can be built from the building blocks of the old?  It carries service composition downward to function composition.

The operators really liked this, and so did some vendors (Connectem introduced it in a preso I heard), but the major vendors really hated it.  They still do, because this sort of decomposition not of services but of functions threatens their ability to promote their own VNFs.  But the fact that buyers and sellers are in conflict here is no surprise.  The question is whether decomposition is practical, and if it is whether microservices are a viable approach.

Virtually all software that’s written today is already decomposed, in that it’s made up of classes or modules or functions or some other internal component set.  My memory of programming techniques goes back to the ‘60s, and I can honestly say that even then there was tremendous pressure from development management to employ modular structures.  Even in programming languages like assembler, or machine language, there were features to support “subroutines” or modular elements that called directly on the computer’s instruction set (for those interested, look up “Branch and Link”).

One might think that this long history of support for modularity would mean that it would be no big thing to decompose functions.  Not necessarily.  Then, as today, the big problem is less dividing software into modules than it is assembling those modules in any way other than the way originally intended.

Most software that’s composable is really designed to be composed at development time.  There are frequently no convenient means provided to determine what data elements are needed and what format they’re expected to be in.  Worse yet, the flow of control among the components may implicitly depend on efficient coupling—local passing of parameters and execution.  For something to be a “service” or “microservice” in today’s terms, it would have to accept loose coupling through a network connection.  That’s something that adds complexity to the software (how do you know where the component is and whether it’s available?) and also can create enormous performance issues through introduction of network delays into frequently used execution paths.

The point is that it’s an oversimplification to say that everything has to be decomposed and recomposed.  There are plenty of examples of things that shouldn’t, or couldn’t, be.  However, there are also examples of vendor intransigence and a desire to lock in customers, and quite a few of the functions that could be deployed for NFV could be decomposed further.  Even more could be designed to be far more modular than they are.  We have to strike a balance somehow.

NetCracker’s concept of making more of NFV and operations modernization about microservices is an example of how that could be done.  If there’s a service whose lifecycle events are so frequent that they are almost data-plane functions, that service has a serious problem no matter how you deploy it.  Generally, management and operations processes have relatively few “events” to handle.  State/event tables are the most common way to represent lifecycle process phases and their response to events, and the intersection of the states and events defines a component, a “microservice” if you like, and one that’s probably activated infrequently enough that it could be network-coupled.  I’ve advocated this approach from the first, back to that 2013 meeting of the ISG.
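
As a hedged illustration of what “a microservice per state/event intersection” could look like, here’s a Python sketch in which each cell of the table names a network-hosted process; the URLs and service names are invented, and the point is only that infrequent lifecycle events can tolerate network coupling.

```python
import json
from urllib import request

# Hypothetical: each state/event cell maps to a network-hosted microservice.
LIFECYCLE_TABLE = {
    ("Ordered",    "Activate"): "http://ops.example.com/deploy",
    ("Operating",  "Fault"):    "http://ops.example.com/recover",
    ("Recovering", "Timeout"):  "http://ops.example.com/escalate",
}

def dispatch(state, event, context):
    """Invoke the microservice bound to this state/event intersection."""
    url = LIFECYCLE_TABLE.get((state, event))
    if url is None:
        return None  # unexpected event in this state; log and ignore
    req = request.Request(url, data=json.dumps(context).encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:   # network coupling is acceptable here
        return json.load(resp)           # because lifecycle events are rare
```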

Event-driven OSS/BSS is one way of stating a goal for operations evolution—another is “agile”.  Whatever the name, the goal is to make operations systems respond directly to events rather than imposing a flow as many systems do.  This goal was accepted by the TMF almost a decade ago, but most operations systems don’t achieve it.  A microservice-based process set inside a state/event lifecycle structure would be exactly what the doctor (well, the operator) ordered.

If we want to go further than this, into something composable even when the components have to stay local to each other, then we need to define the composition/execution platform much more rigorously.  An example, for those who want more detail, is the Java-based OSGi (Open Services Gateway initiative) framework, which has both local and remote service capabilities.  Relatively few network functions now residing in physical network devices conform to this kind of architecture, which means you’d have to rewrite stuff or apply the microservices-and-decomposition model to new functions only.

It’s hard for me to see this stuff and not think of something like CHILL or Erlang or Scala—all of these are specialized languages that could be applied to aspects of virtual-function development.  If you’re going to develop for a compositional deployment that ranges from local to network-coupled, you might want to make the location and binding of components more abstract.  If you want to be able to do this in any old language you may need to define a PaaS in which stuff runs and make binding of components an element of that, so you can adapt to the demands of the application or to how its owners want to deploy it.
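
Here’s a rough sketch, with invented names, of what “making binding an element of the platform” might mean in practice: a composed function calls a component the same way whether the platform bound it locally or over the network.

```python
import json
from urllib import request

class LocalBinding:
    """Component lives in the same process; call it directly."""
    def __init__(self, func):
        self.func = func
    def call(self, payload):
        return self.func(payload)

class RemoteBinding:
    """Component is network-coupled; same interface, different transport."""
    def __init__(self, url):
        self.url = url
    def call(self, payload):
        req = request.Request(self.url, data=json.dumps(payload).encode(),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.load(resp)

class Platform:
    """The 'PaaS' decides at deployment time how each component is bound."""
    def __init__(self):
        self.bindings = {}
    def bind(self, name, binding):
        self.bindings[name] = binding
    def invoke(self, name, payload):
        return self.bindings[name].call(payload)

# The composed function never knows (or cares) where "nat" actually runs.
platform = Platform()
platform.bind("nat", LocalBinding(lambda p: {"translated": True, **p}))
print(platform.invoke("nat", {"flow": "10.0.0.1:80"}))
```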

Microservices, composable operations, and “decomposition” of network functions are all good things, but there’s a lot more to this topic than meets the eye.  Software agility at the level that operators like Verizon or vendors like NetCracker want demands different middleware and different programming practices.  The big challenge isn’t going to be accepting the value of this stuff, or even getting “vendor support” of the concept.  It’s going to be finding a way to advance something this broad and complex as a complete architecture and business case.  We’ve not figured that out for something relatively simple, like SDN or NFV.

Vendors Aren’t Driving SDN/NFV Anymore, so What Now?

There is an inescapable conclusion to be drawn from recent industry announcements:  Vendors have lost control of SDN and NFV, which means they’ve lost control of the evolution of networking.  Operators, in a state of self-described frustration with their vendors’ support for transformation goals, have taken matters into their own hands.  I’ve gotten emails over the last ten days from strategists and sales types in the vendor community, and they’re all asking the same question, which is “What now?”  It’s a good question in one sense, and it’s too late to ask it in another—or at least too late to have the full set of choices on the table.  But there are always paths forward, some better than others, so we need to look at them.

In a prior blog I made the point that commoditization of connection services was inevitable, and that it was also inevitable that operators will spend less on capital equipment at L2/L3 than they have in the past.  Accepting this truth, I’ve said, is critical to vendors who have historically depended on these layers for their revenue and profits.

The up-front truth for this blog is that it is no longer possible for vendors to control the SDN and NFV revolution even if they were to step up now and do what should have been done all along.  I’ve noted many times what should have been done, and in any case it’s too late to do it.  Buyers have taken their own path now, and vendors need to fit into the operators’ programs and not try to define their own.  I’m not saying they don’t need to pay attention to the focus on opex or to the need to develop a holistic SDN/NFV business case, only that doing so won’t give them control of the game anymore.

The key to accommodating operator initiatives seems to start with sophisticated service modeling.  All SDN and NFV modeling, along with the associated APIs and orchestration, derives from the software concept of “DevOps”, which defined a way of describing the deployment of software elements and their connection into the systems we’d call applications.  There have always been two models of DevOps, one that describes the steps to take (called “prescriptive”) and one that describes the desired end-state (initially called “declarative” but increasingly called the “intent model”).  The critical first step vendors need to take in modeling is to adopt declarative/intent modeling.

“How-to” modeling cannot be general—it has to be a process description that naturally depends on what you’re doing and where you’re doing it.  If you describe a system of VNFs in terms of its intent, you can deploy it on any convenient platform.  If you say how to deploy it, you can deploy only on the target upon which your instructions were based.  All the emerging operator architectures make it clear that a wide variety of platforms, including legacy “physical network functions” or PNFs, have to be supported for any feature.  The thing, as they say, speaks for itself (for Latin/legal fans, “res ipsa loquitur”).
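
The contrast is easier to see in a small sketch.  Below, a prescriptive description spells out platform-specific steps, while a declarative/intent description states only the desired end-state; both fragments are hypothetical and not drawn from any operator’s actual format.

```python
# Prescriptive ("how-to"): tied to one target environment and its tooling.
prescriptive_firewall = [
    "create VM on host-cluster-7 with image fw-2.3",
    "attach vNIC to VLAN 210",
    "run configure.sh --policy corp-default",
]

# Declarative / intent: the desired end-state, portable across platforms.
intent_firewall = {
    "function": "firewall",
    "sla": {"throughput_gbps": 1, "availability": 0.9999},
    "connections": {"inside": "customer-lan", "outside": "vpn-access"},
    "policy": "corp-default",
    # No mention of hosts, hypervisors, or scripts: any implementation
    # that can satisfy this intent is acceptable to the layer above.
}
```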

I personally think an intent model approach would be ideal across the board, meaning everywhere from top to bottom in an implementation.  However, it is essential only at certain key points in the structure of an SDN/NFV package:

  • At the top, where SDN/NFV software interfaces with current OSS/BSS systems.
  • Underneath “End-to-End Orchestration” or EEO, to define the way that infrastructure-based behaviors are collected into functional units.
  • At the “Infrastructure Manager” boundary, to describe how a given behavior is actually deployed and managed for one or more of its hosting options.

Each of these points represents a hand-off that operators are insisting be open, which means that the implementation below has to be represented to the implementation above.  Intent modeling makes that mutual representation practical.

The second point that vendors have to enforce in their implementations is the notion of a VNF PaaS.  All of the APIs that a VNF presents as an interface have to be connected with a logical paired function, and all of the SDN/NFV and management APIs that a VNF would be expected to use have to be offered in a uniform way to the “virtual space” in which VNFs run.  This same requirement exists in a slightly different form for SDN, but in my view it would be met by the support of an intent-model “above” the SDN controller.

This is going to be the most important issue for NFV, I think.  Absent a PaaS-like framework, there is no meaningful portability/onboarding, and no way to contain integration cost and risk.  Commercial VNF vendors are likely to tie up with NFV partners (as they have already) and integrate only with these partners, which opens a risk of each setting licensing terms that operators will find offensive because there’s little or no competition.  Open-source could be totally excluded from the picture.

A “VNF” is a system, a black box that provides a feature or features, asserts an explicit SLA, and contains a range of deployment options that could adapt to conditions by scaling or replacement.  All of this good stuff should happen inside the box, with specific contained APIs to link the functionality with the rest of the service ecosystem.  Absent that, we have no reliable integration, and we are absent that now.

The next point is perhaps the largest problem, and a problem that would have to be solved in order to solve the VNF PaaS challenge.  Management, meaning lifecycle management at all levels, has to be defined explicitly or nothing can be integrated at all—no VNFs, no NFVI, nothing.  The current model is kind of like the software equivalent of the universal constant (“That number which, when multiplied by my answer, yields the correct answer.”)  We have the VNF Manager that might be integrated with each VNF, or it might be centralized, or a combination of both.  What is integrated with a VNF is part of a tenant-service, and what is centralized is part of the management system.  You can’t float between these two environments because it’s not secure or reliable to do so, any more than you can let applications change the operating system.

The really big problem here is that the industry approached all this from the bottom, and you can’t really do management right except from the top.  You manage services against the SLA.  You manage service components against the behavior that you set for them to secure the SLA, and you manage resources to the standards required to make those component-level behaviors work.  Management should be linked to modeling, so that every model layer has appropriate SLAs and management definitions.  That way you have management of the system of functions that make up a service, down to the system of resources that support the functions.
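
A minimal sketch of that top-down idea, assuming a simple availability-style SLA at each model layer; the classes and numbers are illustrative only.

```python
# Hypothetical sketch: every model layer carries its own SLA, and management
# works top-down from the service SLA to component behaviors to resources.

class ModelNode:
    def __init__(self, name, sla, children=None):
        self.name = name
        self.sla = sla                      # e.g. {"availability": 0.999}
        self.children = children or []
        self.measured = {}                  # filled in by monitoring

    def in_violation(self):
        return any(self.measured.get(k, v) < v for k, v in self.sla.items())

    def manage(self, depth=0):
        """Top-down: only drill into layers whose own SLA is at risk."""
        if not self.in_violation():
            return
        print("  " * depth + f"{self.name}: SLA at risk, examining children")
        for child in self.children:
            child.manage(depth + 1)

vpn = ModelNode("VPN", {"availability": 0.999})
access = ModelNode("Access", {"availability": 0.995})
service = ModelNode("CorpVPN-Service", {"availability": 0.999}, [vpn, access])

service.measured = {"availability": 0.997}   # service SLA violated
access.measured = {"availability": 0.990}    # the culprit is the access layer
service.manage()
```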

The final point for SDN/NFV vendors is to focus strongly on federation, not only across operator boundaries but across implementations of SDN and NFV at the lower level.  “Federation” in my context means supporting an autonomous implementation at some level by representing it as an opaque model to the level above.

A good modeling approach will take you a long way toward federation support of this sort because an intent model makes the “who” and “how” opaque to the higher-level orchestration process.  However, there are a number of commercial relationships possible among operators, and there’s always going to be a number of different approaches to sharing management data.

Accommodating the commercial relationship is an implementation issue with intent modeling.  The decomposition of a model representing a federated lower-level (or partner) element just means activating whatever that lower-level process might be, at any appropriate level.  So you could have a “treaty federation” where billing data didn’t have to be exchanged, or one where the order process in one domain was activated by the orchestration in another.

The management stuff could be more complicated, depending on how good the management model is to start with.  If you presume my preferred approach, which is a repository in which all “raw” management data is collected and from which management APIs present query interfaces, then there’s no real issue in controlling what a partner sees or how it should be interpreted.
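
Here’s a hedged sketch of that repository-and-query model; the scopes, policies, and record formats are invented, but they show how the same raw data can present different views to a partner and to internal operations.

```python
# Hypothetical sketch of the "repository plus query API" management model:
# raw telemetry goes in once, and each consumer (including a federation
# partner) gets a query view filtered by policy.

RAW_TELEMETRY = []   # in practice a time-series store, not a list

def ingest(record):
    RAW_TELEMETRY.append(record)

def query(requester, metric):
    """Return only the data this requester's policy allows it to see."""
    visible_scopes = {"partner-A": {"service"},
                      "internal-ops": {"service", "resource"}}
    allowed = visible_scopes.get(requester, set())
    return [r for r in RAW_TELEMETRY
            if r["metric"] == metric and r["scope"] in allowed]

ingest({"scope": "resource", "metric": "cpu", "value": 0.72})
ingest({"scope": "service",  "metric": "latency_ms", "value": 18})

print(query("partner-A", "latency_ms"))   # sees service-level data only
print(query("partner-A", "cpu"))          # resource detail stays hidden -> []
```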

In some respects, operator architectures could make it easier on vendors.  If they fit in the architecture they don’t have to offer a complete solution.  If they fit in the architecture they don’t have to sell the entire SDN/NFV ecosystem.  It could create focused procurements and shorter sales cycles.  It certainly will facilitate more limited service-specific applications of SDN and NFV, as long as they can be fit into the operator’s holistic model.  It’s also surely an indication that the SDN/NFV space is maturing, moving from media hype to the real world.  It’s just important to remember that doesn’t mean media hype becomes the real world.  Operator architectures are the proof of that.

The Critical Open-Source VNF: How We Could Still Get There

One of the most logical places for operator interest in open-source software to focus is in the area of virtual network functions (VNFs).  Most of the popular functions are available in at least one open-source implementation, and operators have been grousing over the license terms for commercial VNFs.  It would seem that an open-source model for VNFs would be perfect, but we seem to have barriers to address in making the approach work.

VNFs are the functional key to NFV because they’re the stuff that all the rest of the NFV specifications are aimed at deploying and sustaining.  Despite this, VNFs have in some sense been the poor stepchild of the process.  From the first, everyone has ignored the fundamental truth that defines VNFs—they’re programs.

Virtually all software today is written to run on a specific platform, with hardware and network services provided through application program interfaces (APIs) presented either by an operating system or by what’s called “middleware”, system software that performs a special set of useful functions to simplify development.  In some cases, the platform (and in particular the middleware) is independent of the programming language, and in others it’s tightly integrated.  Open-source software is no exception.

A convenient way to visualize this is to draw a box representing the program/component, and then show a bunch of “plugs” coming out of the box.  These plugs represent the APIs the program uses, APIs that have to be somehow connected to services when it’s run.  Let’s presume these plugs are blue.

When something like NFV comes along, it introduces an implicit need for “new” middleware because it introduces at least a few interfaces that aren’t present in “normal” applications.  If you look at the ETSI diagrams you see some of these reference interfaces.  These new APIs add new plugs to the diagram, and if you envision them in a different color like red, you can see the challenge that NFV poses.  You have to satisfy both the red and blue APIs or the software doesn’t run.

A piece of network software of the sort that could be turned into a virtual function also has implicit external network connections to satisfy.  A typical software component might have several network ports—one for management access, one as an input port and another as an output port.  Each of these ports has an associated protocol—for example, a management port might support either SNMP or a web API (port 80).  Data ports might have IP, Ethernet, or some other network interface (to connect to a tunnel, for example).

Then there are what we might call “implicit” plugs and sockets.  Virtual functions have a lifecycle process set, meaning that they have to be parameterized, activated, sustained in operation, perhaps scaled in or out—you get the picture.  This lifecycle process set may or may not be recognized by the software.  Scaling, for example, could be done using load balancing and control of software instances even if the software doesn’t know about it.  But something has to know, because the framework has to connect all the elements and make everything work, even when there are many components with many plugs and sockets to deal with.

What this means is that when a piece of open-source software is viewed as a virtual function, it will have to be deployed in such a way that all the plugs from the software align with sockets in the platform it runs on, and all the sockets presented by NFV interfaces line up with some appropriate plug.  How that might happen depends on how the software was developed.
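
A rough sketch of what an onboarding check against such a plug-and-socket description might look like; the descriptor fields, the platform “socket” list, and the NFV “plug” list are all hypothetical.

```python
# Hypothetical onboarding check: do the VNF's plugs all find platform sockets,
# and does every socket the NFV platform expects to fill find a plug?

vnf_descriptor = {
    "name": "open-firewall-1.4",
    "plugs":   {"mgmt": "snmp", "data-in": "ip", "data-out": "ip"},  # what it uses
    "sockets": {"lifecycle": None},   # what it lets NFV drive (unmodified code: little)
}

platform_offers = {"snmp", "ip", "ethernet", "dns", "dhcp"}            # "blue" sockets
nfv_requires    = {"lifecycle", "scale-in-out", "parameterize"}        # "red" plugs

def onboarding_report(vnf):
    unmet_plugs = [p for p, kind in vnf["plugs"].items() if kind not in platform_offers]
    unmet_nfv   = [s for s in nfv_requires if s not in vnf["sockets"]]
    return {"unmet_platform_needs": unmet_plugs,
            "nfv_functions_needing_a_stub_or_generic_manager": unmet_nfv}

print(onboarding_report(vnf_descriptor))
# An empty report would mean the software can be deployed as-is; anything in
# the second list has to be supplied by a VNFM stub or a generic lifecycle service.
```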

If we presume that somebody built an open-source component specifically for NFV, we could presume that the software itself would harmonize all the plugs and sockets for all the features.  The same thing could be true if the software was transplanted from a physical appliance and altered to work as a VNF.  Operators tell me that there is very little truly customized VNF software out there in any form, much less open-source.

The second possibility is to adopt what might be considered a variation on the “VNF-specific VNF Manager (VNFM).”  You start with a virtual function component that provides the feature logic, and you combine it with custom stuff that harmonizes the natural plugs and sockets and connectivity expected by the function with the stuff needed by NFV.  This combination of functional component and management stub then forms the “VNF” that gets deployed.  Operators tell me that most of the VNFs they are offered use this approach, but also that only a very few open-source functions have been so modified.

The final possibility is that you define a generic lifecycle management service that talks to whatever plugs are available from the function component, and makes the necessary connections inside NFV to do deployment and lifecycle management.  I’ve proposed this approach for both the original CloudNFV project and my ExperiaSphere model, but operators tell me that they don’t see any signs of adoption by vendors so far.

All of these options for open-source virtual functions expose two very specific issue sets—deployment (the NFV Orchestrator function) and lifecycle management (VNFM).  For each issue set, current trials and tests have exposed a “most-significant-issue” challenge.

In deployment, the problem is that open-source software’s network connection expectations are quite diverse.  In some cases the software uses one or more Ethernet ports, and in others it expects to run on an IP subnet, sometimes with other components, and nearly always with the aid of things like DNS and DHCP services.  One challenge this presents is that “forwarding graphs” showing the logical flow relationship of a set of VNFs may do little or nothing to describe how the actual network connectivity would have to be set up.

In the lifecycle management case, there are two challenges.  One is to present some coherent management view of the VNF status.  In the ETSI model this is the responsibility of the VNFM, which is often integrated with the VNF, but I don’t think this is workable because the VNF may be instantiated in multiple places because of horizontal scaling.  The other challenge is getting the VNF information on its own resources.  You can’t have a tenant service element accessing real resource management data, particularly if it plans to then change variables to control behavior.

I’ve said in prior blogs that VNF deployment should be viewed as platform-as-a-service (PaaS) cloud deployment, where the platform APIs come from a combination of operating system and middleware tools deployed underneath the VNFs, and connectivity and control management tools deployed alongside.  We have never defined this space properly, which means that there is no consistent way of porting software to become a VNF and no consistent way to onboard it for use.

What’s needed here is a simple plug-and-socket diagram that defines the specific way that VNFs talk to NFV elements, underlying resources, and management systems.  The diagram has to show all of the plugs and sockets, not only for the base configuration of the VNF but also for any horizontally scaled versions, including any load-balancers needed.

Open source is not the answer to this problem; like any other software it has to run inside some platform.  In fact, the lack of a platform puts the application of open-source software to VNFs at particular risk, because adapting the software demands significant resources, and in the open-source world the commercial interest in covering that cost is diminished.

Operator initiatives like the recent architecture announcements from AT&T and Verizon take a step in the right direction, but they’re not there yet.  I’d love to see these operators step up and define that VNF PaaS framework now, so we can start to think about the enormous opportunity that open-source VNFs could open for them all.

What Operators Think Vendors Should Do To Counter Spending and Transformation Risk

These are the times that try the souls of networking sales management.  Most of you know that I have an ongoing dialog with salespeople in many companies, and that dialog says that network spending overall is under pressure.  Legacy infrastructure investment is slow-rolling because of ROI issues, and vendors who have presented next-gen architectures have failed to make a business case for their deployment.  In SDN/NFV, all the sales people tell me that they are undershooting their goals.  Cisco, Juniper, and even Brocade have reported anemic spending by network operators, and Wall Street isn’t liking the equipment space.  What can be done?  I asked some of my operator contacts to find out.

Business as usual isn’t, or shouldn’t be, on the list of options.  In 2013 when SDN and NFV were getting a lot of early attention, there was a chance of redefining networking in such a way as to preserve a great deal of the legacy equipment model.  That opportunity has passed forever at this point, as both vendor financial performance and operator architecture evolution have shown.  However, vendors should still (as I’ve noted in prior blogs) provide specific support for opex reduction to reduce the pressure on capex.

Capex for connection technology, at Levels 2 and 3 of the “true” OSI model, is expected to decline for as far out as operators have any visibility.  Initially, operators expect to slow-roll spending on these layers and put price pressure on vendors (outside the US, shifting to Huawei is a popular approach).  In the longer term, they expect to move to “gray” and “white” boxes, meaning commodity devices that would increasingly include server-hosted switch/router instances.

Even at Level 1 (optical) operators aren’t expecting to generate a near-term windfall, which comes as no surprise to the optical vendors I’m sure.  My contacts tell me that operators have been prepared to shift spending downward providing that the optical vendors presented a strong architecture to reduce costs higher up.  That means shifting functionality downward to a “virtual wire” layer and perhaps facilitating the virtualization of Levels 2/3.  The operators tell me that optical vendors have not been prepared to define that strong architecture, so optical spending is stuck in the general ROI backwater generated by continuing profit-per-bit pressures.

One clear operator reaction to the problem is “elevation”, as one CFO calls it.  Instead of focusing on infrastructure changes (for which nobody is presenting a credible model), the operators are focusing on plastering a service-and-operations skim coat over the cracking foundations.  This process can be as casual as offering portal-based solutions for customer care, which have to be linked to current operations and management systems, or as sophisticated as end-to-end orchestration to unify legacy technology with the various service-specific flavors of SDN and NFV now gaining favor.

Vendor reactions aren’t quite this clear, but perhaps they should be.  If you ask operators what vendors should do, they present the following points.

First, present a model-driven top-end to their legacy, SDN, and NFV offerings that can be incorporated into an end-to-end orchestration (EEO) element.  The number one issue with operators is preventing siloization of operations and management processes so they can harmonize their “skim coat” solutions with their evolving infrastructure.  Their number two issue is agile service development, and they see the EEO-to-network link as being critical in addressing both agility and silos.  It’s less important at this late point for vendors to promote a “complete” strategy (which in any case collides with the operators’ open vision) than to fit into an EEO scheme.

Operators, especially those who had taken the trouble to articulate their approach to network evolution, have been somewhat surprised by vendors’ lack of enthusiasm in supporting the initiatives.  The need to integrate with EEO is absolute, the value to vendors themselves should be clear (you can fit your stuff in if you conform, and you cannot if you do not), yet there’s no rush to define the necessary models.

The second recommendation of operators is to get VNF and software vendors to adopt realistic license practices.  Some operators claim that if you were to apply current VNF license policies to vCPE services for businesses, the cost to operators would be higher than that represented by appliances.  These same operators believe that NFV kingpins have packed their partner programs with vendors who envisioned NFV as being just another dimension of the old gravy train.  They want open-source VNFs now, but they’d accept license terms that didn’t totally contaminate their business case.

Part of this issue seems to arise from the fact that many VNF providers aren’t appliance vendors and have no experience with that side of the market.  If you don’t realize that your customers have been selling firewalls (for example) for decades, you might be forgiven for thinking that their desire to sell firewall VNFs is your chance to make your numbers.  Revenue-sharing, as one vendor put it.  Taking advantage, as one operator responded.

The third recommendation is to think gray.  Operators see established network equipment vendors refusing to develop commodity switching/routing solutions or OpenFlow switches in order to protect their network equipment sales.  According to the operators, that will only accelerate the development of credible white-box competitors.  If, instead, established vendors brought out lightweight intermediary “gray box” devices with optional proprietary features that would help make the business case or support orderly evolution, they could win acceptance.

Most of the major vendors have dismissed white-box networks, and yet operators have been increasingly committed to them, in no small part because they see vendors’ lack of acceptance as new evidence of manipulative intransigence.  The problem, though, is that operators say that even white-box or hosted-instance vendors present their stuff as one-off alternatives to switches and routers and not as a part of an architectural shift.  Operators say that the shift is the goal, and one-off-ness isn’t an option.

No industry willingly accepts radical transformations to its business model, but when the driver of change comes from outside, when the evidence that change is needed is overwhelming, and when buyers start to take defensive actions as a result of the forces of change, it’s time to make the best of things.  We seem to be at that point now.

For IoT, Forget Network Virtualization and Think “Thing Virtualization!”

How can we best accommodate the notion of virtualization to the application of IoT?  That’s a question that more and more operators and vendors are wrestling with, and it’s a good one.  The answer might be interesting and disruptive—think less about virtualizing the network and more about virtualizing the “things”, the sensors and controllers.

I’m not denying that there are “things” that we might want to access that aren’t currently connected.  I’m not denying that 4G/5G might be a useful way to connect some of them, but I think everything we already know about security and environmental monitoring and process automation proves that connection isn’t really the issue.  You’re probably sitting in the midst of a “thing network” as you read this blog, and it’s based on pedestrian sensor/controller technology that doesn’t put any of the “things” directly on the Internet, or on a 4G network either.

So is the whole IoT thing a colossal media/analyst fraud?  Maybe, but there is still a grain of value in the notion if you look beyond the aspirations of vendors and operators for easy money.  The question is how to empower the market in general with the knowledge of what are now (and are likely to remain) “private things” that are neither online nor accessible in any form to general application development.

I’ve talked about one model that could harness the things, so to speak.  If we were to build a massive set of repositories that held the collected knowledge we extract from our things, we could then run queries/analytics that would let people exploit all this data and still (through the analytics apps) apply the necessary protections to ensure stability, security, and privacy.  In this approach, IoT is a database presented by a series of APIs.  I think this is a good approach, certainly better than just sticking all these sensors and controllers directly on the Internet and hoping everyone would behave.

There’s another approach, though.  Suppose that we want to preserve the literal “Internet of Things” model but recognize that everything that’s on the Internet isn’t necessarily directly and discretely connected.  We could then employ virtualization to create a series of “virtual things” that are constructed from, related to, the real things.  These could be presented on the Internet through traditional web-like APIs, but the real stuff that supports the virtual presence could be hidden, connected as it is now, and the APIs could still apply policy controls to protect the integrity of the data and the security of the users.

With this model, each “thing” is represented as though it were a kind of website; you could read and write to it and potentially even access it through a web browser.  Like any “website” it could be either on a VPN or on the open Internet, and it could apply encryption and access controls.  In programming terms, it’s a resource accessed with a RESTful API.

Behind each “thingsite” is a process that links it to the real sensor or controller, or set thereof.  This process is similar to that used behind websites to link to transactional applications.  In theory it could operate asynchronously, gathering data and posting it to the thingsite based on policy-determined timing, or it could be triggered by an inquiry to the thingsite.  The process could also be doing database dips, meaning that the thingspace could be a front-end for the repository of thingdata I’ve been talking about.
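
Here’s a minimal sketch of a hypothetical “thingsite” facade, using only the Python standard library; the sensor fields, the policy filter, and the URL path are invented for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical "thingsite": a virtual thing presented as a RESTful resource.
# The real sensors stay on their private network; this facade applies policy.

PRIVATE_SENSOR_CACHE = {"temperature_c": 21.5, "door_open": False}  # fed by a back-end poller
PUBLIC_FIELDS = {"temperature_c"}   # policy: door status is not shared

class ThingSite(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/thing/lobby-sensor":
            visible = {k: v for k, v in PRIVATE_SENSOR_CACHE.items() if k in PUBLIC_FIELDS}
            body = json.dumps(visible).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # curl http://localhost:8080/thing/lobby-sensor -> {"temperature_c": 21.5}
    HTTPServer(("", 8080), ThingSite).serve_forever()
```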

This model of IoT would preserve the notion of a set of on-the-web sensors and controllers that could be exploited, but it would buffer that idea with the same kind of virtualization that currently keeps tenant networks separate.  If your company wants to expose a set of sensors/controllers to partners, you simply define a thingspace for them, and let the back-end technology populate it with the information you’re willing to share.  They can do whatever they want with the exposed things, and you don’t have to coordinate with them as long as you’re happy with the data that’s being shared.

“Public” things, meaning things that would be available for use without contractual arrangements, are also possible with the model; you simply expose a thingspace directly online and you apply only the policy filters that are required to conform to evolving privacy regulations.  In theory you could even build in security and load-balancing with this model, spawning multiple virtual things that represent the same set of real ones to share the load of mass access.

Since the back-end applications that feed the thingsites would be able to gateway data from a private sensor/controller network based on any technology, you can immediately harness all the stuff that’s already deployed, or at least that part of the current base that its owners are prepared to open up.  You could also construct, with the proper access to either things or thingsites deployed elsewhere, your own “derived thingsites” that represent analytics-based digestions of one or more sensors, or that introduce data from stuff outside the thingspaces—like retail pricing or personal presence.

What about the original sensors-online model?  Well, if you wanted you could augment virtual things with real ones, but I think that eventually somebody is going to get smart and realize that the cost of supporting a complete online presence with policy and security filters for every “thing” is going to kill the opportunity completely.  A better approach would be to have the real things, even new ones, front-ended by virtual thingsites that could handle all the variables of security, policy, and performance.

So will this approach rise up and take over?  Probably not, because so much of technology these days is about creating buzz rather than creating opportunity.  What could happen, and I think will happen eventually, is that the real IoT opportunities will end up migrating to a practical platform, which could be the thingspace concept or the analytics model.  Somebody who manages to figure this out up front could end up making some big bucks.

Can We Build Agile Infrastructure with the Overlay/Underlay Model?

Let us suppose for a moment that the goal of operators is to reduce equipment and operations cost in concert and at the same time increase their ability to provision current services quickly and flexibly, and develop new services just as quickly.  Let us further suppose that they have addressed the higher-level operations/portal implications of this.  What would the ideal network approach be?

Since it’s clear that operators do want exactly what’s presented in the last paragraph, this is a fair question.  Since the answer to the question will dictate infrastructure spending in the future, it’s an important one.  Interestingly, we have an answer for it, and it’s been around for a fair period of time.

If we go back to a point in my last blog, operators need to be able to make changes to costs and revenues without forcing a fork-lift, large-scale, change-out of infrastructure.  There is simply no way to bear the risk of a large transformation and at this point no time to prove out alternative infrastructure technologies to the degree needed to contain that risk.  We have to evolve with some grace into the future.

My conversation with the MEF’s CTO convinced me that their Third Network model has merit, provided that the model embraces something that is strongly hinted at but not featured—the concept of an overlay technology.  If the lower three layers of the OSI model (what the model says is actually in the network) are Levels 1 through 3, then let’s call this overlay layer Level I, or Li for short.

The basic notion for Li is that services would be defined and delivered at this new layer, which would then consume tunnels (“virtual wires”) created at the layers below.  Since services would now be using existing network technology only as a physical layer, you’d be able to change out any or all of that stuff at whatever pace you find optimal because lower-layer implementations are opaque to the higher layers.

Overlay connections are based on a header that’s appended to data payloads before they’re encapsulated for handling by the tunnel protocol.  These headers subdivide the traffic at any tunnel-point, and at each such tunnel-point the handling element can either extract the traffic with a given header and deliver it to a user access point, or “cross-connect” it to another tunnel.  It’s in how this is done that the efficiency and value of the Li model is determined.
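
A simple sketch of that per-tunnel-point decision, with invented header values and tunnel names; a real implementation would obviously work on packet headers rather than Python structures.

```python
# Hypothetical sketch of the per-tunnel-point decision in an overlay (Li) layer:
# an overlay header rides inside the tunnel payload, and each endpoint either
# delivers the packet locally or cross-connects it to another tunnel.

FORWARDING = {
    # overlay_header -> ("deliver", access_port) or ("cross-connect", next_tunnel)
    "svc-42": ("deliver", "access-port-7"),
    "svc-99": ("cross-connect", "tunnel-to-metro-east"),
}

def handle_overlay_packet(overlay_header, payload):
    action, target = FORWARDING.get(overlay_header, ("drop", None))
    if action == "deliver":
        return f"deliver {len(payload)} bytes to {target}"
    if action == "cross-connect":
        return f"re-encapsulate and forward on {target}"
    return "drop: unknown overlay header"

print(handle_overlay_packet("svc-42", b"user data"))
print(handle_overlay_packet("svc-99", b"transit data"))
```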

In the original Nicira overlay-SDN model, a LAN or VLAN or VPN architecture created the tunnel paths, and these connected physical network/IT elements like servers.  The SDN overlay then subdivided access by tenant.  In theory, each server could either extract header-identified traffic for its local users or cross-connect it onward.  This is not unlike how lower OSI layers relate to higher layers; you can pull traffic from a LAN (Level 2) and connect it to another LAN through a WAN connection, via a router.

The current SD-WAN products have a slightly different approach but use the same overlay concept.  Here, a series of connections made at a lower level to the same access point are effectively united by a higher overlay that can ride on any of the low-level options.  This higher layer then presents the user interface.

The general overlay model that might be viewed as the basis for MEF’s Third Network should be able to work with any of the following tunnel-models:

  1. The lower-level tunnels can connect all the way to the access points, creating a virtual mesh. The overlay technology would then provide only service-specific handling and addressing, and each tunnel access point would simply forward a packet on the right tunnel.  This would work for modest-scale virtual networks where a fully scalable forwarding technology (like SDN switches) was used.
  2. The lower-level tunnels connect to some number of aggregation points hosted within the network based on traffic topology. At these points, forwarding rules would cross-connect them.  This is the structural model that would optimize the use of hosted/virtual router instances.
  3. The lower-level tunnels, in addition to one of the above approaches, cross a protocol or administrative boundary where tunnel-to-tunnel connection is not available, and where tunnels from each side must therefore terminate. The Li layer now has to cross-connect the tunnels appropriately just to pass across the boundary.

The issue that can mess up a good overlay strategy could be called “tunnel granularity”.  If you have too little tunnel granularity, then you can’t create tunnels to the access points for an overlay-based service without a lot of tunnel cross-connecting.  Not only does this process increase delay and packet loss risk, the fact that it’s happening for a concentration of users sharing an inadequate number of lower-level tunnels means it might well grow in demand to the point where addressing it with a hosted router instance would be difficult.  You’d like to get your lower-level tunnel mesh as close to serving all the access points as possible.  The MEF has been working to improve Ethernet’s ability to support connected-path multiplicity efficiently, and that’s good.

Here is where “universal SDN” might be very helpful.  If you think of an OpenFlow-driven concatenation of forwarding table entries as a kind of “naked tunnel”, you see that SDN could create any arbitrary tunnel configuration end to end if desired.  If you combine this with agile optics (ROADMs) then you’d have a highly functional physical layer over which you could overlay any convenient L2/L3 service protocol while largely ignoring issues like topology and even path failures (because they’d be handled or controlled below).
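
As a hedged illustration of that “naked tunnel” idea, here’s a sketch of a controller stitching one forwarding entry per switch along a path; these are illustrative data structures, not actual OpenFlow protocol messages.

```python
# Hypothetical sketch of a "naked tunnel": a controller stitches one flow
# entry per switch so that a matching flow travels an arbitrary end-to-end path.

def build_naked_tunnel(match, path):
    """path is an ordered list of (switch, out_port) hops."""
    flow_entries = []
    for switch, out_port in path:
        flow_entries.append({
            "switch": switch,
            "match": match,                 # e.g. a tag or header signature
            "action": {"output": out_port}, # just forward; no L2/L3 awareness needed
        })
    return flow_entries

tunnel = build_naked_tunnel(
    match={"vlan": 210},
    path=[("edge-sw-1", 4), ("core-sw-9", 12), ("edge-sw-5", 2)],
)
for entry in tunnel:
    print(entry)
```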

The overlay approach would be easy to apply to mobile infrastructure because it’s already heavily based on tunnels (EPC).  It would also be easy to apply to business virtual network services and to cloud application services.  It’s not as clear that you could adopt an overlay model for the Internet, which suggests that you’d either want to retain standard Internet routing at least in the core and augment it with SDN forwarding, or at least retain it for non-content traffic, since content delivery is already supported largely from CDNs.

There’s no shortage of potential vendors to support the model, starting with the classic overlay-SDN Nicira/VMware play and extending to SD-WAN vendors like Talari, Citrix/CloudBridge, Silver Peak, and Riverbed/SteelConnect.  In addition, most virtual routers (software router instances) can interconnect tunnels and so could be used to build an overlay-modeled service framework.  However, vendors have been shy so far in committing to the approach, preferring to sell to enterprises in more limited missions rather than to operators.  Even the SD-WAN vendors whose products could easily frame an overlay model (even within the Third Network approach) haven’t played that capability as a differentiator.

The likely reason for this is that selling SD-WAN to enterprises is working, and selling it as a mainstay for next-gen networking is a Great Unknown, particularly for vendors who don’t call on operator CTOs and don’t participate in emerging-network standards.  Despite the resistance, I think it’s clear that overlay networking could play a major role in next-gen infrastructure, perhaps the dominant one.  It may be that the evolution of the MEF’s Third Network will finally legitimize the approach and address the critical question of overlay/underlay relationships.

How Equipment Vendors Can Counter Cautious Operator Spending

With the exception of Huawei, network equipment vendors are facing tightening spending by operators.  The reason, obviously, is the compression in profit-per-bit I’ve been talking about, the same compression that’s led to operator support for “transformation” and their interest in SDN and NFV.  Since SDN and NFV have not evolved fast enough and far enough to generate the kind of radical improvements in cost and revenue operators had hoped for, their only response is to slow capital spending.  The impact is greatest in wireline, because wireless is too competitive for anyone to skimp on improvements.  Vendors like Juniper, which have little credible wireless contribution to make, obviously suffer, but nearly every vendor that isn’t a price leader is feeling the pinch.

So what’s to be done?  While I might be (and am) confident that there’s a way out of the compression problem, I’m not the guy who’s going to be stuck with an enormous technical albatross if the method doesn’t pan out.  Operators have long capital cycles, so they’re unusually risk-averse with respect to writing down failures.  Either the risk of doing the wrong thing has to be reduced, or they have to do nothing; more precisely, they have to do nothing architecturally different or risky, and simply build with cheaper components.  Hence, Huawei.

Vendors across the board have failed to deal with this resistance to failure, and some add insult to injury by promoting the OTT notion of “fast fail” as a model the operators need to adopt.  There is no way of fast-failing a trillion-dollar infrastructure.  What operators need, and have always needed, is a set of new-approach hypotheses that link directly to a benefit and that can be proven out in a modest-scale trial.  The biggest casualty of bottom-up specifications is the potential to fulfill this need.  The early work has no business context in which it can prove either benefits or realization.

But OK, we’re here and it’s now.  AT&T and Verizon have both issued papers describing an architecture model and these should resolve a lot of the issues, right?  Not so fast.

What AT&T has done is set goals, which can define benefits.  What Verizon has done is frame a solution set inside an architecture.  You can do a lot with these two, particularly if you can somehow combine them, but guess what?  Vendors are singularly unimpressed with, perhaps even unhappy with, the two approaches.  “Selling” has become not a fitting of your product plans to the demands of your buyers, but rather a coercion or manipulation of your buyers into accepting what you’ve decided to produce.  The reason, of course, is that vendors want to make money, and their fondest wish is that operators just suck it up, buy stuff, and forget all this newfangledness.

What we need now is recognition that resolving the problem of cost reduction without killing vendor support for your initiatives must lie in focusing on costs other than equipment costs, and in achieving transformation on a large scale without making infrastructure changes on a large scale.  As I’ve said in past blogs, this means focusing on opex reductions that can be achieved by a higher layer of orchestration, one that accommodates both legacy and new SDN/NFV technologies.  That would let you test and realize benefit-based changes without forcing you to commit to major infrastructure upgrades that could only be justified if you were sure they’d work.
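
In case it helps, here is a minimal Python sketch of what “a higher layer of orchestration that accommodates both legacy and new SDN/NFV technologies” could look like structurally: one deployment abstraction, with interchangeable legacy and SDN/NFV back-ends behind it.  The class and method names are my own assumptions, not any standard’s.

    from abc import ABC, abstractmethod

    class DeploymentBackend(ABC):
        """One abstraction the higher orchestration layer talks to."""
        @abstractmethod
        def deploy(self, component: str) -> str: ...

    class LegacyBackend(DeploymentBackend):
        def deploy(self, component):
            return f"provisioned {component} through existing EMS/NMS procedures"

    class SdnNfvBackend(DeploymentBackend):
        def deploy(self, component):
            return f"instantiated {component} as a virtual function via NFV orchestration"

    def orchestrate(service):
        # The higher layer automates the lifecycle without caring what realizes each piece.
        return [backend.deploy(component) for component, backend in service.items()]

    print(orchestrate({"Access-East": LegacyBackend(), "vFirewall": SdnNfvBackend()}))

The opex benefit comes from automating at the level of the abstraction, which is why the back-end can stay legacy for as long as economics dictate.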

The problem in getting to that (happy?) goal is a combination of two things: when you do a bottom-up spec you don’t get to the top until the end of the process (if at all), and there’s that same old issue of vendor self-interest.  Network equipment vendors are reluctant to embrace top-down, abstraction-based operations because it anonymizes network equipment and threatens incumbencies.  IT vendors in the SDN/NFV space are similarly reluctant because these top-down approaches don’t sell servers right away.  And everyone is reluctant because, in the main, they don’t have the top-layer tools in place.

One of the most important developments in this area is the emergence of operator-driven initiatives to define holistic SDN/NFV architectures.  Verizon, in particular, has emphasized the notion of a layered orchestration model that would allow a higher-level orchestrator to harmonize not only legacy and emerging network technologies but also multiple vendor-specific implementations.  This compensates for the fact that neither the SDN nor the NFV standards address modernizing operations practices or evolving future networks out of legacy deployments.

Another potential solution is the use of a generalized orchestration model, championed by some of the six vendors who have complete NFV solutions (ADVA, Ciena, and HPE in particular).  This approach could in theory be applied two ways—a top-to-bottom orchestration architecture and a selective architecture.  With the former, the vendors’ solutions would be accepted as the only orchestration approach, and this seems to run afoul of the current service-specific SDN/NFV evolution trends.  With the latter, you’d adopt the generalized orchestration model where there’s no competing implementation, and use a stub/adapter to incorporate the competing models by supporting abstractions that represent them.
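
The “selective” option is essentially an adapter pattern, and a sketch makes the point: the top-level orchestrator keeps speaking its own abstraction, and a stub translates that abstraction into whatever the competing implementation expects.  Both classes and their methods here are hypothetical stand-ins, not any vendor’s interface.

    class ForeignVendorController:
        """Stands in for a competing, vendor-specific orchestration implementation."""
        def create_service_chain(self, elements):
            return {"chain_id": 42, "elements": list(elements)}

    class ForeignModelStub:
        """Presents the foreign implementation as just another abstract model element."""
        def __init__(self, controller):
            self.controller = controller

        def deploy(self, component):
            # Translate the top-level abstraction into the foreign model's own terms.
            result = self.controller.create_service_chain([component])
            return f"deployed {component} via foreign service chain {result['chain_id']}"

    print(ForeignModelStub(ForeignVendorController()).deploy("vCPE-Site-12"))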

It’s this last approach that shows the most promise, but vendors have not been enthusiastic in promoting it.  Part of the reason is that most still hope to achieve their own “lock-in” of early NFV deployments, and fear that embracing an open model would hurt them as often as it helped them.  Part is the fact that you have to implement the stubs to represent “foreign” models, which of course means that there has to be some foreign model structure to represent.  At this point, absent any specific intent-model requirement for NFV or SDN (SDN’s is coming along), that could be challenging.  In particular, it would leave the top-level orchestration vendor at risk from changes made by vendors below.

That problem, unfortunately, can happen in any multi-layer orchestration approach, and that’s why in the end the operator models may be the only hope.  Verizon or another Tier One could compel vendors to open and stabilize their models so that lower-level service-specific implementations would fit inside an end-to-end orchestration model.  Vendors, on their own, almost surely could not.

Everything comes back to the point I made about vendor differentiation and model-based abstraction.  If operators think of equipment as simply a realization of a given abstract model, then it’s harder for vendors to differentiate.  Operator-driven models would probably not include special differentiating features from vendors, given operator demands for an open approach.  Vendors need to somehow support open-network goals and retain some opportunity to exploit their own special sauce.

The first-quarter slump in operator spending (which vendors want to believe is just a blip on an otherwise untroubled horizon of spending growth despite ROI compression) argues for taking decisive action.  A vendor could develop an operations-savings approach that would at least mitigate the problem of lost differentiation.  For example, vendors could develop their own models to link their lower-level management systems to end-to-end orchestration (EEO) tools, models that could then exploit their own differentiation.  As long as those models only enabled these special capabilities and didn’t mandate them, would operators refuse them?  Probably not, and they might even use the features if they were valuable, even at the cost of openness.
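
A small sketch of the “enable but don’t mandate” idea: the vendor model advertises differentiating capabilities as optional, and the higher-level EEO tooling decides whether to use them.  The model fields below are illustrative assumptions only.

    vendor_node_model = {
        "type": "edge-router",
        "required": {"forwarding": "ipv4/ipv6", "management": "standard interface"},
        "optional": {"telemetry_streaming": True, "fast_reroute_analytics": True},
    }

    def capabilities(model, use_optional=False):
        features = dict(model["required"])
        if use_optional:
            features.update({k: v for k, v in model["optional"].items() if v})
        return features

    print(capabilities(vendor_node_model))                     # open, baseline behavior
    print(capabilities(vendor_node_model, use_optional=True))  # differentiated behavior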

Remember too that Verizon and AT&T are emphasizing a shift to white-box products, which means products non-differentiable at the data plane level.  Verizon has also explored the notion of displacing physical routers with software instances, recognizing that hardware acceleration may be required for the hosting.  I think that widespread use of router instances will also require “virtual-wire” partitioning of traffic at L1/L2 to eliminate large L3 aggregation missions that servers are never likely to be able to support efficiently.

I said early on that if vendors did not find a way to secure significant non-capex benefits through SDN and NFV, operators would re-architect networks to reduce spending on switching and routing, and also achieve opex savings through L2/L3 simplification.  I think that’s happening.  I think that everything happening in the network market today demonstrates a need for vendors to push an operations-savings approach, and to take control of the way their own orchestration and management tools integrate with emerging high-level EEO tools.  In fact, I think that vendors have already lost millions by not having this capability, money operators would have spent on infrastructure had the profit compression pressure been relieved by operations savings.  Not losing any more should be a priority.