What the Network Operator Literati Think Should Be Done to Accelerate NFV

I am always trying to explore issues that could impact network transformation, especially relating to adopting NFV.  NFV offers a potentially radical shift in capex and architecture, after all.  I had a couple of emails in response to some of my prior blogs that have stimulated me to think about the problem from a different angle.  What’s the biggest issue for operators?  According to them, it’s “openness”.  What are the barriers to achieving that?  That’s a hard topic to survey because not everyone has a useful response, so I’ve gathered some insight from what I call the “literati”, people who are unusually insightful about the technical issues of transformation.

The literati aren’t C-level executives or even VPs and so they don’t directly set policy, but they are a group who have convinced me that they’ve looked long and hard at the technical issues and business challenges of NFV.  Their view may not be decisive, but it’s certainly informed.

According to the literati, the issues of orchestration and management are important but also have demonstrated solutions.  The number of fully operational, productized solutions ranges from five to eight depending on whom you talk with, but the point is that these people believe we have already solved the problems there; we just need to apply what we have effectively.  That’s not true in other areas, though.  The literati think we’ve focused so much on “orchestration” that we’ve forgotten to make things orchestrable.

NFV is an interplay of three as-a-service capabilities, according to the literati.  One is hosting as a service, to deploy virtual functions; one is connection as a service, to build the inter-function connectivity and then tie everything to network endpoints for delivery; and one is function as a service, which covers implementing network functions as virtual network functions (VNFs).  The common problem with all three is that we don’t base them on a master functional model for each service/function, so let’s take the elements in the order I introduced them to see how that could be done.

All hosting solutions, no matter what the hardware platform or hypervisor is, or whether we’re using VMs or containers, should be represented as a single abstract HaaS model.  The goal of this model is to provide a point of convergence between diverse implementations of hosting from below and the composition of hosting into orchestrable service models above.  That creates an open point where different technologies and implementations can be combined, a kind of buffer zone.  According to the literati, we should be able to define a service in terms of virtual functions and then, in essence, say “DEPLOY!” and have the orchestration of the deployment and lifecycle management harmonize to a single model no matter what actual infrastructure gets selected.
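To make that concrete, here’s a rough Python sketch of what a single HaaS abstraction might look like; the class and method names are my own illustrative assumptions, not anything drawn from the ETSI specs.

```python
# A minimal sketch (hypothetical names) of a single HaaS abstraction that hides
# whether a virtual function lands on a VM or in a container.
from abc import ABC, abstractmethod


class HostingService(ABC):
    """The abstract hosting-as-a-service model the service layer orchestrates against."""

    @abstractmethod
    def deploy(self, vnf_image: str, cpu: int, mem_gb: int) -> str:
        """Deploy a virtual function and return an opaque instance handle."""


class VmHosting(HostingService):
    def deploy(self, vnf_image: str, cpu: int, mem_gb: int) -> str:
        # A real VIM would call a hypervisor or cloud API here.
        return f"vm://{vnf_image}?cpu={cpu}&mem={mem_gb}"


class ContainerHosting(HostingService):
    def deploy(self, vnf_image: str, cpu: int, mem_gb: int) -> str:
        # A real VIM would call a container orchestrator API here.
        return f"container://{vnf_image}?cpu={cpu}&mem={mem_gb}"


def deploy_service(functions: list[str], haas: HostingService) -> list[str]:
    # The service model just says "DEPLOY!"; the abstraction decides how.
    return [haas.deploy(f, cpu=2, mem_gb=4) for f in functions]


if __name__ == "__main__":
    print(deploy_service(["firewall", "nat"], ContainerHosting()))
```

The service-layer code above never changes when the infrastructure does; only the adapter behind the abstraction would.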

Connection-as-a-service, or NaaS if you prefer, is similar.  The goal has to be to present a single NaaS abstraction that gets instantiated on whatever happens to be there.  This is particularly important for network connectivity services because we’re going to be dealing with infrastructure that evolves at some uncontrollable and probably highly variable pace, and we don’t want service definitions to have to reflect the continuous state of transition.  One abstraction fits all.

The common issue that these two requirements address is “brittleness”.  Service definitions, however you actually model them in structure/language terms, have to describe the transition from an order to a deployment to the operational state, plus the other lifecycle phases involved in maintaining that state.  If the service-level definitions have to reference specific deployment and connection technology, they would have to be changed whenever that technology changed, and if new technologies like SDN and NFV were deployed unevenly across infrastructure as they matured, every service definition might have to be multifaceted to reflect how the deployment/management rules would change depending on where the service was offered.

The as-a-service goal says that if you have an abstraction to represent hosting or connection, you can require that vendors who supply equipment also supply the necessary software (a “Virtual Infrastructure Manager”, for example, in ETSI ISG terms) to rationalize their products to the abstractions of the as-a-service elements their stuff is intended to support.  Now services are insulated from changes in resources.

The literati say that this approach could be inferred from the ETSI material but it’s not clearly mandated, nor are the necessary abstractions defined.  That means that any higher-level orchestration process and model would have to be customized to resources, which is not a very “open” situation.

On the VNF side we have a similar problem with a different manifestation.  Everyone hears, or reads, constantly about the problem of VNF onboarding, meaning the process of taking software and making it into a virtual function that NFV orchestration and management can deploy and sustain.  The difficulty, say the literati, is that the goal is undefined in a technical sense.  If we have two implementations of a “firewall” function, we can almost be sure that each will have a different onboarding requirement.  Thus, even if we have multiple products, we don’t have an open market opportunity to use them.

What my contacts say should have been done, and still could be done, is that virtual functions should be divided into function classes, like “Firewall”, and each class should then have an associated abstraction—a model.  The onboarding process would then begin by having the associated software vendor (or somebody) harmonize the software with the relevant function class model.  Once that is done, then any service models that reference that function class would deploy the set of deployment instructions/steps that the model decomposed to—no matter what software was actually used.
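As a purely hypothetical illustration of the function-class idea, the Python sketch below defines a “Firewall” class model and two vendor VNFs harmonized to it through thin onboarding adapters; the method names and rule format are my assumptions, not any standard.

```python
# Illustrative only: a hypothetical "Firewall" function-class model, with two
# vendor VNFs harmonized to it during onboarding via thin adapters.
from abc import ABC, abstractmethod


class FirewallClass(ABC):
    """The abstract function class every firewall VNF must be mapped onto."""

    @abstractmethod
    def apply_rules(self, rules: list[dict]) -> None: ...

    @abstractmethod
    def health(self) -> str: ...


class VendorAFirewall(FirewallClass):
    def apply_rules(self, rules: list[dict]) -> None:
        # Translate the generic rule format into vendor A's own API calls.
        print("vendorA:", rules)

    def health(self) -> str:
        return "OK"


class VendorBFirewall(FirewallClass):
    def apply_rules(self, rules: list[dict]) -> None:
        # Vendor B needs a different payload shape; the adapter hides that.
        print("vendorB:", [{"match": r["src"], "verdict": r["action"]} for r in rules])

    def health(self) -> str:
        return "UP"


def deploy_firewall(fw: FirewallClass) -> None:
    # The service model references only the function class, never the vendor.
    fw.apply_rules([{"src": "10.0.0.0/8", "action": "allow"}])


deploy_firewall(VendorAFirewall())
deploy_firewall(VendorBFirewall())
```

Any service model that references the Firewall class would decompose the same way no matter which vendor’s software was onboarded.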

The problem here is that while we have a software element in NFV that is at least loosely associated with this abstract-to-real translation on the resource side (though it lacks the rigorous model definitions needed and a full lifecycle management feature set built into the abstraction), we have nothing like it on the VNF side.  The closest thing we have is the notion of the specialized VNF Manager (VNFM), but while this component could in theory be tasked with making all VNFs of a function class look the same to the rest of NFV software, it isn’t tasked that way now.

There are similarities between the view of the literati and my own, but they’re not completely congruent.  I favor a single modeling language and orchestration approach from the top to the bottom, while the literati believe there’s nothing whatsoever wrong with having different models and orchestration at the service layer and in the decomposition of the abstractions I’ve been talking about.  They also tend to put service modeling up in the OSS/BSS and the model decomposition outside/below it, where I tend to integrate both OSS/BSS and network operations into a common model set.  But even in these areas, I think I’ve indicated that I can see the case for the other approach.

One point the literati and I agree on is that orchestration and software automation of service processes is the fundamental goal here, not infrastructure change.  Most of them don’t believe that servers and hosting will account for more than about 25% of infrastructure spending even in the long run.  They believe that if opex could be improved through automation, some of the spending pressure (the pressure that, for example, resulted in a Cisco downgrade on Wall Street) would be relieved.  SDN and NFV, they say, aren’t responsible for less spending—the profit compression of the converging cost/price per bit curve is doing that.  The literati think that the as-a-service abstraction of connection resources would let operators gain cost management benefits without massive changes in infrastructure, but it would then lead them into those changes where they make sense.

It seems to me that no matter who I talk with in network operator organizations, they end up at the same place but by taking different routes.  I guess I think that I’ve done that too.  The point, though, is that there is a profit issue to be addressed that is suppressing network spending and shifting power to price leaders like Huawei.  Everyone seems to think that service automation is the solution, but they’re all seeing the specific path to achieving it in a different light.  Perhaps it’s the responsibility of vendors here to create some momentum.

What Microsoft’s LinkedIn Deal Could Mean

Microsoft announced it was acquiring business social network giant LinkedIn, and the Street took that as positive for LinkedIn and negative for Microsoft.  There are a lot of ways of looking at the deal, including that Microsoft, like Verizon, wants a piece of the OTT and advertising-sponsored service market.  It seems more likely, though, that there’s a more direct symbiosis, particularly if you read Microsoft’s own release on the deal.

LinkedIn, which I post on pretty much every day, is a good site for business prospecting, collaboration, and communication.  It’s not perfect, as many who, like me, have tried to run groups on it can attest, but it’s certainly the winner in terms of engagement opportunity.  There are a lot of useful exchanges on LinkedIn, and it borders on being a B2B collaboration site without too much of the personal-social dimension that makes sites like Facebook frustrating to many who have purely business interests.

Microsoft has been trying to get into social networking for a long time, and rival Google has as well, with the latter launching its own Google+ platform to compete with Facebook.  There have been recent rumors that Google will scrap the whole thing as a disappointment, or perhaps reframe the service more along LinkedIn lines, and that might be a starting point in understanding Microsoft’s motives.

Google’s Docs offerings have created cloud-hosted competition for Microsoft, competition that could intensify if Google were to build in strong collaborative tools.  Google also has cloud computing in something more like PaaS than IaaS form, and that competes with Microsoft’s Azure cloud.  It’s illuminating, then, that Microsoft’s release on the deal says “Together we can accelerate the growth of LinkedIn, as well as Microsoft Office 365 and Dynamics as we seek to empower every person and organization on the planet.”

Microsoft’s Office franchise is critical to the company, perhaps as much as Windows is.  Over time, like other software companies, Microsoft has been working to evolve Office to a subscription model and to integrate it more with cloud computing.  The business version of Office 365 can be used with hosted Exchange and SharePoint services.  Many people, me included, believe that Microsoft would like to tie Office not only to its cloud storage service (OneDrive) but also to its Azure cloud computing platform.

Microsoft Dynamics is a somewhat venerable CRM/ERP business suite that’s been sold through resellers, and over the years Microsoft has been slow to upgrade the software and quick to let its resellers and developers customize and expand it, to the point where direct customers for Dynamics are fairly rare.  There have also been rumors that Microsoft would like to marry Dynamics to Azure and create a SaaS version of the applications.  These would still be sold through and enhanced by resellers and developers, targeting primarily the SMB space but also in some cases competing with Salesforce.

Seen in this light, a LinkedIn deal could be two things at once.  One is a way of making sure Google doesn’t buy the property, creating a major headache for Microsoft’s cloud-and-collaboration plans, and the other is a way to cement all these somewhat diverse trends into a nice attractive unified package.  LinkedIn could be driven “in-company” as a tool for business collaboration, and Microsoft’s products could then tie to it.  It could also be expanded with Microsoft products to be a B2B platform, rivaling Salesforce in scope and integrating and enhancing Microsoft’s Azure.

Achieving all this wondrous stuff would have been easier a couple years ago, frankly.  The LinkedIn community is going to be very sensitive to crude attempts to shill Microsoft products by linking them with LinkedIn features.  Such a move could reinvigorate Google+ and give it a specific mission in the business space, or justify Google’s simply branding a similar platform for business.  However, there is no question that there is value in adding in real-time collaboration, calling using Skype, and other Microsoft capabilities.

The thing that I think will be the most interesting and perhaps decisive element of the deal is how Microsoft plays Dynamics.  We have never had a software application set that was designed for developers and resellers to enhance and was then migrated to be essentially hybrid-cloud hosted.  Remember that Azure mirrors Microsoft’s Windows Server platform tools, so what integrates with it could easily integrate with both sides of a hybrid cloud and migrate seamlessly between the two.  Microsoft could make Dynamics into a poster child for why Azure is a good cloud platform, in fact.

Office in general, and Office 365 in particular, also offer some interesting opportunities.  Obviously Outlook and Skype have been increasingly cloud-integrated, and you can see how those capabilities could be exploited in LinkedIn to enhance group postings and extend groups to represent private collaborative enclaves.  Already new versions of Office will let you send a link to a OneDrive file instead of actually attaching it, and convey edit rights as needed to the recipient.

So why doesn’t the Street like this for Microsoft, to the point where the company’s bond rating is now subject to review?  It’s a heck of a lot of cash to put out, but more than that is the fact that Microsoft doesn’t exactly have an impressive record with acquisitions.  This kind of deal is delicate not only for what it could do to hurt LinkedIn, but for what it could do to hurt Microsoft.  Do this wrong and you tarnish Office, Azure, and Dynamics, and that would be a total disaster.

The smart move for Microsoft would be to add in-company extensions to LinkedIn and then extend the extensions to B2B carefully.   That way, the details of the integration would be worked out before any visible changes to LinkedIn, and it’s reasonable to assume that B2B collaboration is going to evolve from in-company collaboration because it could first extend to close business partners and move on to clients, etc.

From a technology perspective this could be interesting too.  Integrating a bunch of tools into a collaborative platform seems almost tailor-made for microservices.  Microsoft has been a supporter of that approach for some time, and its documentation on microservices in both Azure and its developer program is very strong.  However, collaboration is an example of a place where just saying “microservices” isn’t enough.  Some microservices are going to be highly integrated with a given task, and thus things you’d probably want to run almost locally to the user, while others are more rarely accessed and could be centralized.  The distribution could change from user to user, which seems to demand an architecture that can instantiate a service depending on usage without requiring that the developer worry about that as an issue.  That could favor a PaaS-hybrid cloud like that of Microsoft.
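To illustrate the placement point, here’s a toy Python sketch of a usage-driven placement decision; the thresholds and service names are invented for the example.

```python
# A toy placement policy (all thresholds are assumptions) that decides whether
# a collaboration microservice should run near the user or in a central region.
from dataclasses import dataclass


@dataclass
class ServiceProfile:
    name: str
    calls_per_minute: float      # how often this user invokes the microservice
    latency_sensitive: bool      # e.g. real-time co-editing vs. archival search


def place(profile: ServiceProfile) -> str:
    if profile.latency_sensitive or profile.calls_per_minute > 10:
        return "edge"            # instantiate close to the user
    return "central"             # rarely used; keep it pooled centrally


profiles = [
    ServiceProfile("co-editing", 30, True),
    ServiceProfile("document-search", 0.5, False),
]
print({p.name: place(p) for p in profiles})
```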

This is also a pretty darn good model of what a “new service” might look like, what NFV and SDN should be aiming to support.  Network operators who are looking at platforms for new revenue have to frame their search around some feasible services that can drive user spending before they worry too much about platforms.  This deal might help do that.

Perhaps the most significant theme here is productivity enhancement, though.  We have always depended as an industry on developments that allow tech to drive a leap forward in productivity.  That’s what has created the IT spending waves of the past, and what has been lacking in the market since 2001.  Could this be a way of getting it all back?  Darn straight, if it works, and we’ll just have to wait to see what Microsoft does next.

Server Architectures for the Cloud and NFV Aren’t as “Commercial” as We Think

Complexity is often the enemy of revolution because things that are simple enough to grasp quickly get better coverage and wider appreciation.  A good example is the way we talk about hosting virtual service elements on “COTS”, meaning “commercial off-the-shelf servers”.  From the term and its usage, you’d think there was a single model of server, a single set of capabilities.  That’s not likely to be true at all, and the truth could have some interesting consequences.

To understand hosting requirements for virtualized features or network elements, you have to start by separating them into data-plane services and signaling-plane services.  Data-plane services are directly in the data path, and they include not only switches/routers but also things like firewalls or encryption services that have to operate on every packet.  Signaling-plane services operate on control packets or higher-layer packets that represent exchanges of network information.  There are obviously a lot fewer of these than the data-plane packets that carry information.

In the data plane, the paramount hosting requirements include high enough throughput to ensure that you can handle the load of all the connections at once, low process latency to ensure you don’t introduce a lot of network delay, and high intrinsic reliability because you can’t fail over without creating a protracted service impact.

If you looked at a box ideal for the data plane mission, you’d see a high-throughput backplane to transfer packets between network adapters, high memory bandwidth, CPU requirements set entirely by the load that the switching of network packets would impose, and relatively modest disk I/O requirements.  Given that “COTS” is typically optimized for disk I/O and heavy computational load, this is actually quite a different box.  You’d want all of the data-plane acceleration capabilities out there, in both hardware and software.

Network adapter and data-plane throughput efficiency might not be enough.  Most network appliances (switches and routers) will use special hardware features like content-addressable memory to quickly process packet headers and determine the next hop to take (meaning which trunk to exit on).  Conventional CPU and memory technology could take a lot longer, and if the size of the forwarding table is large enough then you might want to have a CAM board or some special processor to assist in the lookup.  Otherwise network latency could be increased enough to impact some applications.
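A trivial Python illustration of the lookup burden: the routine below does a longest-prefix match by scanning an entire (made-up) forwarding table, which is the per-packet work a CAM or TCAM collapses into a single hardware lookup.

```python
# A naive software longest-prefix match, to show the work a CAM/TCAM does in
# one hardware cycle.  The table contents here are invented.
import ipaddress

FORWARDING_TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): "trunk-1",
    ipaddress.ip_network("10.1.0.0/16"): "trunk-2",
    ipaddress.ip_network("0.0.0.0/0"): "default-trunk",
}


def next_hop(dst: str) -> str:
    addr = ipaddress.ip_address(dst)
    # Scan every prefix and keep the longest match: O(table size) per packet,
    # which is exactly the cost a content-addressable memory avoids.
    best = max(
        (net for net in FORWARDING_TABLE if addr in net),
        key=lambda net: net.prefixlen,
    )
    return FORWARDING_TABLE[best]


print(next_hop("10.1.2.3"))   # -> trunk-2
```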

The reliability issue is probably the one that gets misunderstood most.  We think in terms of having failover as the alternative to reliable hardware in the cloud, and that might be true for conventional transactional applications.  For data switching, the obvious problem is that the time required to spin up an alternative image and make the necessary network connections to put it into the data path to replace a failure would certainly be noticed.  Because the fault would probably be detected by a higher level, it’s possible that adaptive recovery at that level might be initiated, which could then collide with efforts to replace the failed image.  The longer the failure the bigger the risk of cross-purpose recovery.  Thus, these boxes probably do have to be five-nines, and you could argue for even higher availability too.

Horizontal scaling is less likely to be useful for data-plane applications, for three reasons.  First, it’s difficult to introduce a parallel path in the data plane because you have to introduce path separation and combination features that could cause disruption simply because you temporarily break the connection.  Second, you’ll end up with out-of-order delivery in almost every case, and not all packet processing will reorder packets.  Third, your performance limitations are more likely to be on the access or connection side, and unless you parallelized the whole data path you’ve not accomplished much.

The final point in server design for data-plane service applications is the need to deliver uniform performance under load.  I’ve seen demos of some COTS servers in multi-trunk data-plane applications, and the problem you run into is that performance differs sharply between low and high load levels.  That means a server assigned to run more VMs is going to degrade all of them, so you can’t pack multiple VMs onto a server and still adhere to stringent SLAs.

The signaling-plane stuff is very different.  Control packets and management packets are relatively rare in a flow, and unlike data packets that essentially demand a uniform process—“Forward me!”—the signaling packets may spawn a fairly intensive process.  In many cases there will even be a requirement to access a database, as you’d see in mobile/IMS and EPC control-plane processing.  These processes are much more like classic COTS applications.

You don’t need as high hardware reliability in the signaling-plane services because you can spawn a new copy more easily, and you can also load-balance these services without interruption.  You don’t need as much data-plane acceleration unless you plan on doing a lot of different signaling applications on a single server, because the signaling packet load is smaller.

Signaling-plane services are also good candidates for containers versus virtual machines.  It’s easier to see data-plane services being VM-hosted because of their greater performance needs and their relatively static resource commitments.  Signaling-plane stuff needs less and runs less, and in some cases the requirements of the signaling plane are even web-like or transactional.

This combination of data-plane and signaling-plane requirements makes resource deployment more complicated.  A single resource pool designed for data-plane services could pose higher costs for signaling-plane applications, because those applications need fewer resources.  Obviously a signaling-plane resource is sub-optimal in the data plane.  If the resource pool is divided up by service type, then it’s not uniform and thus not as efficient as it could be.

You also create more complexity in deployment because every application or virtual function has to be aligned with the right hosting paradigm, and the latency and cost of connection has to be managed in parallel with the hosting needs.  This doesn’t mean that the task is impossible; the truth is that the ETSI ISG is already considering more factors in hosting VNFs than would likely pay back in performance or reliability.

It seems to me that the most likely impact of these data-plane versus signaling-plane issues would be the creation of two distinct resource pools and deployment environments, one designed to be high-performance and support static commitments, and one to be highly dynamic and scalable—more like what we tend to think of when we think of cloud or NFV.
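Here’s a deliberately simple Python sketch of that two-pool idea; the pool attributes and the selection rule are my own assumptions, meant only to show the kind of placement decision involved.

```python
# A hedged sketch of the two-pool idea: pool attributes and the selection rule
# are illustrative assumptions, not a standard.
DATA_PLANE_POOL = {"accel_nics": True, "availability": "5x9s", "unit": "VM"}
SIGNALING_POOL = {"accel_nics": False, "availability": "3x9s", "unit": "container"}


def choose_pool(vnf_type: str) -> dict:
    # Data-path functions (routing, firewall, encryption) go to the static,
    # high-throughput pool; control/signaling functions scale elastically.
    if vnf_type in {"router", "firewall", "encryptor"}:
        return DATA_PLANE_POOL
    return SIGNALING_POOL


print(choose_pool("firewall"))   # -> data-plane pool
print(choose_pool("mme"))        # -> signaling pool
```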

The notion of COTS hosting everything isn’t reasonable unless we define “COTS” very loosely.  The mission for servers in both cloud computing and NFV varies widely, and optimizing both opex and capex demands we don’t try to make one size fit all.  Thus, simple web-server technology, even the stuff that’s considered in the Open Compute Project, isn’t going to be the right answer for all applications, and we need to accept that up front and plan accordingly.

“The Machine” and the Impact of a New Compute Model on Networking

The dominant compute model of today is based on the IBM PC, a system whose base configuration when announced didn’t even include floppy disk drives.  It would seem that all the changes in computing and networking would drive a different approach, right?  Well, about eight years ago, HPE (then HP Labs) proposed what it called “The Machine”, which is a new computer architecture based on a development that makes non-volatile memory (NVM) both fast and inexpensive.  Combine this with multi-core CPUs and optical coupling of elements and you have a kind of “computer for all time”.

Today, while we have solid-state disks, their NVM is far slower than traditional memory, which means you still have to consider a two-tier storage model (memory and disk).  With the new paradigm, NVM would be fast enough to support traditional memory missions and of course be a lot faster than flash/rotating media in those missions.  It’s fair to ask what the implications could be for networking, but getting the answer will require an exploration of the scope of changes The Machine might generate for IT.

One point that should be raised is that there aren’t necessarily any profound changes at all.  Right now we have three memory/storage options out there—rotating media, flash, and standard DRAM-style volatile memory.  If we assumed that the new memory technology was as fast as traditional volatile memory (which HPE’s material suggests is the case) then the way it would likely be applied would be cost-driven, meaning it would depend on its price relative to DRAM, flash, and rotating media.

Let’s take a best-case scenario—the new technology is as cheap as rotating media on a per-terabit basis.  If that were the case, then the likely result would be that rotating media, flash, and DRAM would all be displaced.  That would surely be a compute-architecture revolution.  As the price rises relative to the rotating/flash/DRAM trio, we’d see a price/speed-driven transition of some of the three media types to the new model.  At the other extreme, if the new model were really expensive (significantly more than DRAM), it would likely be used only where the benefits of NVM that works at DRAM speed are quite significant.  Right now we don’t know what the price of the stuff will be, so to assess its impact I’ll take the best-case assumption.

If memory and storage are one, then it makes sense to assume that operating systems, middleware, and software development would all change with respect to how they use both memory and storage.  Instead of the explicit separation we see today (which is often extended with flash NVM into storage tiers) we’d be able to look at memory/storage as being seamless, perhaps addressed by a single petabyte address space.  File systems and “records” are now like templates and variables.  Or vice versa, or they’re both supported by a new enveloping model.
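The closest everyday approximation we have today is memory-mapping a file, where the same bytes are addressed like memory but persist like storage; the little Python sketch below is only an analogy for the seamless model, not an implementation of it.

```python
# Today's rough approximation of memory/storage unity: mmap lets file bytes be
# addressed like memory.  With fast NVM the distinction itself would fade.
import mmap

with open("state.bin", "wb") as f:
    f.write(b"\x00" * 4096)          # pre-size the "persistent memory" region

with open("state.bin", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mem:
        mem[0:5] = b"hello"          # ordinary byte addressing...
        mem.flush()                  # ...but the bytes are durable, like storage
```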

One obvious benefit of this in cloud computing and NFV is that the time it takes to load a function/component would be shorter.  That means you could spin up a VNF or component faster and be more responsive to dynamic changes.  Of course, “dynamic changes” also means you’d be able to spin up a new instance of a component faster.

The new-instance point has interesting consequences in software development and cloud/NFV deployments.  What happens today when you want to instantiate some component or VNF?  You read a new copy from disk into memory.  If memory and disk are the same thing, in effect, you could still do that and it would be faster than rotating media or flash, but wouldn’t it make sense just to use the same copy?

Not possible, you think?  Well, back in the ‘60s and ‘70s when IBM introduced the first true mainframe (the System/360) and programming tools for it, they recognized that a software element could have three modes—refreshable, serially reusable, and reentrant.  Software that is refreshable needs a new copy to create a new instance.  If a component is serially reusable it can be restarted with new data without being refreshed, providing that it’s done executing the first request.  If it’s reentrant, then it can be running several requests at the same time.  If we had memory/storage equivalence, it could push the industry to focus increasingly on developing reentrant components.  That concept still exists in modern programming languages, by the way.
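A toy Python contrast makes the distinction clear: the first component below keeps request state inside itself and so needs a fresh copy per concurrent user, while the second is reentrant because everything it touches is passed in.  The names are illustrative.

```python
# A toy contrast between a component that must be refreshed per instance and a
# reentrant one.  Names are illustrative.
class RefreshableCounter:
    def __init__(self):
        self.total = 0               # internal mutable state

    def handle(self, value: int) -> int:
        self.total += value          # two concurrent requests would collide
        return self.total


def reentrant_counter(total: int, value: int) -> int:
    # No hidden state: the caller owns 'total', so one copy of this code can
    # serve any number of requests at the same time.
    return total + value
```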

There are always definitional disputes in technology, but let me risk one by saying that in general a reentrant component is a stateless component and statelessness is a requirement for RESTful interfaces in good programming practice.  That means that nothing used as data by the component is contained in the component itself; the variable or data space is passed to the component.  Good software practices in creating microservices, a hot trend in the industry, would tend to generate RESTful interfaces and thus require reentrant code.  Thus, we could say that The Machine, with a seamless storage/memory equivalence, could promote microservice-based componentization of applications and VNFs.

Another interesting impact is in the distribution of “storage” when memory and storage are seamless.  We have distributed databases now, clusters of stuff, DBaaS, and cloud storage and database technologies.  Obviously all of that could be made to work as it does today with a seamless memory/storage architecture, but the extension/distribution process would break the seamlessness property.  Memory has low access latency, so if you network-distribute some of the “new memory” technology you’d have to know it was distributed and not use it where “real” memory performance was expected.

One way to mitigate this problem is to couple the distributed elements better.  HPE says The Machine will include new optical component coupling.  Could that new coupling be extended via DCI?  Yes, the speed of light would introduce latency issues that can’t be addressed unless you don’t believe Einstein, but you could surely make things better with fast DCI, and widespread adoption of the seamless memory/storage architecture would thus promote fast DCI.

The DCI implications of this could be profound for networking, of course, and in particular for cloud computing and NFV.  Also potentially profound is the need to support a different programming paradigm to facilitate either reentrancy/statelessness or REST/microservice development.  Most programming languages will support this, but many current applications/components aren’t reentrant/RESTful, and for virtual network functions it’s difficult to know whether software translated from a physical device could easily be adapted to this.  And if management of distributed seamless memory/storage is added as a requirement, virtually all software would have to be altered.

On the plus side, an architecture like this could be a windfall for many distributed applications and for something like IoT.  Properly framed, The Machine could be so powerful an IoT platform that deals like the one HPE made with GE Digital (Predix, see my blog yesterday) might be very smart for HPE, smart enough that the company might not want to step on partnership deals by fielding its own offering.

The cloud providers could also benefit mightily from this model.  Platforms with seamless memory/storage would be, at least at first, highly differentiating.  Cloud services to facilitate the use of distributed seamless memory/storage arrays would also be highly differentiating (and profitable).  Finally, what I’ve been calling “platform services” that extend a basic IaaS model into PaaS or expand PaaS platform capabilities could use this model to improve performance.  These services are then a new revenue source for cloud providers.

If we presumed software would be designed for this distributed memory/storage unity, then we’d need to completely rethink issues like placement of components and even workflow and load balancing.  If the model makes microservices practical, it might even create a new programming model that’s based on function assembly rather than writing code.  It would certainly pressure developers to think more in functional terms, which could accelerate a shift in programming practices we already see called “functional programming”.  An attribute of functional programming is the elimination of “side effects” that could limit RESTful/reentrant development, by the way.

Some of the stuff needed for the new architecture is being made available by HPE as development tools, but they seem to want to make as much of the process open-source-driven as they can.  That’s logical, providing that HPE ensures that the open-source community focuses on the key set of issues and does so in a logical way.  Otherwise it will be difficult to develop early utility for The Machine, and there will be a sensitivity to price trends over time; if pricing factors change the way the new memory model is used, those changes could then impact programming practices.

A final interesting possibility raised by the new technology is taking a leaf from the Cisco/IBM IoT deal.  Suppose that you were to load up routers and switches with this kind of memory/storage and build a vast distributed, coupled, framework?  Add in some multi-core processors and you have a completely different model of a cloud or vCPE, a fully distributed storage/compute web.  Might something like that be possible?  Like other advances I’ve noted here, it’s going to depend on price/performance for the new technology, and we’ll just have to see how that evolves.

IoT is Creeping Toward a Logical Value Proposition!

All too often in technology, we see concepts with real possibilities go stupid on us.  Lately many of the key concepts have doubled down on stupidness, departing so far from relevant market benefits that there’s little hope of success.  IoT, probably the most hyped of all technology concepts in recent times, has surely had its own excursions into the realm of stupid, but unlike many tech notions it seems to have a chance of escaping back to reality.  You can see useful points being recognized, and it’s not too late for IoT to realize its potential.

The notion of a literal “Internet of Things” where all manner of sensors and controllers are put online to be accessed, exploited, and probably hacked by all, is one of the dumber excursions from a value perspective.  The notion that an IoT strategy is a strategy to manage the devices themselves isn’t any better.  From the first, it should have been clear that IoT is a big-data, analytics, and cloud application, and that it has to first exploit those sensors already deployed on largely private networks, often using local non-IP protocols.  Now we’re seeing signs of a gradual realization of what the real IoT needs to be.

Startups in the IoT space have provided some support for data storage and analytics for well over a year.  A Forbes article summarizes some of the key players in the space, but if IoT is a potential market revolution then startups are really selling themselves and not their technology.  IoT adopters will generally want to bet on somebody with a big name, and that’s particularly true of network operators looking for a realistic IoT service strategy.

In November of last year, I blogged about the GE Digital Predix platform, which as I said was the first credible IoT story from a major provider.  With strong analytics and a good strategy for capturing current sensor data, Predix has all the pieces to be a universal IoT framework, but the company has stressed “industrial IoT” rather than the universality inherent in its platform.  One thing the breadth of Predix may have done is to encourage other IoT vendors to focus their efforts on specific applications in either a horizontal or vertical sense.

One example of focus is addressing what many IoT users would see as the high first-cost barrier to IoT applications.  The cloud is a natural haven for a logical ease-into-IoT model, and so it’s not surprising that cloud providers have IoT service offerings:

  • Amazon, the cloud leader, has an IoT offering that focuses on a unified model for device and cloud applications and facilitates the integrated use of a variety of “foundation services” hosted by AWS. Their approach is more development-centric than productized.
  • Google has a streaming and publish/subscribe distribution model that adds predictive analytics and event processing to IoT, all based on Google’s ubiquity as a cloud provider. Their Cloud Dataflow programming model may be a seminal reference for both batch and streaming IoT development.
  • Microsoft offers both premises tools for developing IoT applications and Azure cloud tools. They also integrate Cortana capability with inquiries and analytics, and they’ve won some very public deals recently.
  • Oracle offers its IoT Cloud Service, which focuses explicitly on the two key truths about IoT—you have to exploit current sensors connected through legacy private networks and you have to focus on data storage and analytics.
  • Salesforce’s IoT Cloud extends sensor and analytics concepts to websites and other customer information, and offers event triggering of cloud processes. The focus, not surprisingly, is on CRM but it appears that broader in-company use would also be possible.
  • Operators like AT&T and Verizon have IoT services that focus on connectivity, but AT&T also provides industry-specific integrated solutions.

Then, back in May, HPE talked about its IoT model in “platform” terms, which is how the media and the market are now distinguishing between the sensor-driven IoT nonsense and the more logical application-and-repository concept.  The HPE story had an unfortunate slant in its title, in my view: “Hewlett Packard Enterprise Simplifies Connectivity Across the IoT Ecosystem”.  The announcement does contain device and connectivity management elements, and the title tended to focus everyone on that aspect.  But HPE also provided a repository, data conversion, and analytics platform vision that should have been the lead item.  HPE is also partnering with GE Digital to power the Predix platform, which may suggest the company wants to be an IoT host for multiple software frameworks.

The most recent announcement is from IBM and Cisco.  The companies have agreed to provide Cisco hosting of Watson analytics so that event processing can be managed locally, making control loops shorter.  The move is not only potentially critical for both Cisco’s and IBM’s IoT differentiation, it’s an illustration that one of the key values of IoT in process management and event handling would best be supported by functionality hosted close to the sensors.  This explains why so many cloud IoT stories are gravitating toward complex event processing, and it also illustrates why IoT could be a very powerful driver for NFV.  Data centers close to the edge could host IoT processes with a shorter control loop, and that could help justify the data center positioning.  Edge data centers could then host service features.

Cisco’s ultimate IoT position could be critical.  The company has, in the past, been dazzlingly simplistic in its view of the future—everything has to come down to more bits for routers to push.  You could view complex-event-process (CEP) IoT that way, of course.  On the other hand, you could view it as an example/application of “fog computing” or the distribution of intelligence across the network.  The latter view would be helpful to Cisco, IBM, and IoT overall, and given that Cisco has recently had some management changes that suggest it’s moving in a different direction than Chambers had taken it, perhaps there’s a chance we’ll see some real IoT insight and not just another report on traffic growth.

Insight is what IoT is really about.  We could expect to capture more contextual data from wide use of IoT if we could dodge the first-cost problems, privacy issues, and security challenges of the inherently destructive model of “everything on the Internet”.  This is a big data and analytics application in one sense, complex event processing in another sense, and that’s how IoT has to develop if it’s going to develop in any real sense.  At some point, some platform vendor is going to step up and frame the story completely, and that could put IoT on the fast track.  Wouldn’t it be nice for some “revolutionary” technology to actually revolutionize?

How Will Virtual-Network Services Impact Transport Configuration?

One of the important issues of multi-layer networking, and in fact multi-layer infrastructure, is how things at the top percolate down to the bottom.  Any kind of higher-layer service depends on lower-layer resources, and how these resources are committed or released is an important factor under any circumstances.  If the agility of lower layers increases, then so does the importance of coordination.  Just saying that you can do pseudowires or agile optics, or control optical paths with SDN, doesn’t address the whole problem, which is why I propose to talk about some of the key issues.

One interesting point that operators have raised with me is that the popular notion that service activities would directly drive transport changes—“multi-layer provisioning”—is simply not useful.  The risks associated with this sort of thing are excessive because it introduces the chance that a provisioning error at the service level would impact transport networks to the point where it would impact other services and customers.  What operators want is not integrated multi-layer provisioning, but rather a way to coordinate transport configuration.

Following this theme, there are two basic models of retail service, coercive/explicit and permissive/implicit.  Coercive services commit resources on request—you set them up.  Permissive services preposition resources and you simply connect to them.  VPNs and VLANs are coercive and the Internet is permissive.  There are also two ways that lower-layer services can be committed.  One is to link the commitment below to a commitment above, which might be called a stimulus model, and the other is to commit based on aggregate conditions, which we might call the analytic model.  This has all been true for a long time, but virtualization and software-defined networking are changing the game, at least potentially.

Today, it’s rare for lower network layers, often called “transport”, to respond to service-level changes directly.  What happens instead is the analytic model, where capacity planning and traffic analysis combine to drive changes in transport configuration.  Those changes are often quite long-cycle because they depend on making physical changes to trunks and nodes.  Even when there’s some transport-level agility, it’s still the rule to reconfigure transport from an operations center rather than with automatic tools.

There are top-down and bottom-up factors that offer incentive or opportunity to change this practice, providing it can stay aligned with operator stability and security goals.  At the bottom, increased optical agility and the use of protocol tunnels based on anything from MPLS to SDN allow for much more dynamic reconfiguration, to the point where it’s possible that network problems that would ordinarily have resulted in a higher-layer protocol reaction (like a loss of a trunk in an IP network) can instead be remediated at the trunk level.  The value of lower-layer agility is clearly limited if you try to drive the changes manually.

From the top, the big change is the various highly agile virtual-network technologies.  Virtual networks, including those created with SDN or SD-WAN, are coercive in their service model, because they are set up explicitly.  When you set up a network service you have the opportunity to “stimulate” the entire stack of service layers, not to do coupled or integrated multi-layer provisioning but to reconsider resource commitments.  This is what I mean by a stimulus model, of course.  It’s therefore fair to say that virtual networking in any form has the potential to change the paradigm below.

There are two possible responses, then, to the way lower-layer paths and capacity are managed.  One is to adopt a model where service stimulus from above drives an analytic process that rethinks the configuration of what’s essentially virtual transport.  An order with an SLA would then launch an analytics process that would review transport behavior based on the introduction of the new service and, if necessary, re-frame transport based on how meeting that SLA would alter capacity plans and potentially impact target resource utilization and the SLAs of other services/customers.  The other is to shorten the cycle of the analytic model, depending on a combination of your ability to quickly recognize changes in traffic created by new services and your ability to harness service automation to quickly alter transport characteristics to address the changes.  Which way is best?  It depends on a number of factors.

One factor is the scale of service traffic relative to the total traffic of the transport network.  If a “service” is an individual’s or SMB’s personal connectivity commitment, then it’s very likely that the SLA for the service would have no significant impact on network traffic overall, and it would not only be useless to stimulate transport changes based on it, it would be dangerous because of a risk of overloading control resources with a task that had no helpful likely outcome.  On the other hand, a new global enterprise VPN might have a very significant impact on transport traffic, and you might indeed want to reflect the commitment the SLA for such a service reflects even before the traffic shows up.  That could prevent congestion and problems, not only for the new service but for others already in place.

Another factor is the total volatility at the service layer.  A lot of new services and service changes in a short period of time, reflecting a variety of demand sources that might or might not be stimulated by common drivers, could generate a collision of requests that might have the same effect as a single large service user.  For example, an online concert might have a significant impact on transport traffic because a lot of users would view it in a lot of places.  It’s also true that if services are ordered directly through an online portal rather than through a human intermediary there’s likely to be more and faster changes.  The classic example (net neutrality aside for the moment) is the “turbo button” for enhanced Internet speed.

The final factor is SLA risk.  Even a fast-cycle, automated, analytic model of transport capacity and configuration management relies on traffic changes.  If those changes ramp rapidly, then it’s likely that remediation will lag congestion, which means you’re going to start violating SLAs.  There’s a risk that your remedy will create changes that will then require remediation, creating the classic fault avalanche that’s the bane of operations.
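Pulling the three factors together, you could imagine a decision rule along the lines of the Python sketch below; the thresholds are pure assumptions on my part and would be set by an operator’s own capacity planning.

```python
# A toy decision rule (thresholds are assumptions) combining the three factors:
# traffic scale, service-layer volatility, and SLA risk.
def transport_strategy(traffic_share: float, orders_per_hour: int,
                       ramp_minutes: float) -> str:
    significant_load = traffic_share > 0.05     # service is >5% of transport load
    high_volatility = orders_per_hour > 100
    fast_ramp = ramp_minutes < 10               # congestion would outrun analytics

    if significant_load or fast_ramp:
        return "stimulus: review transport capacity when the order is placed"
    if high_volatility:
        return "fast-cycle analytics: shorten the traffic-analysis loop"
    return "conventional analytics: long-cycle capacity planning"


print(transport_strategy(traffic_share=0.08, orders_per_hour=5, ramp_minutes=60))
```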

I think where this ends up is that virtual networking at multiple layers will need to have layer or layer-group control, with behavior at the higher layer coupled by analytics and events to behavior at the lower layer.  You don’t provision transport with services, but you do stimulate the analysis or capacity planning of lower layers when a service-layer change is announced.  That lets you get out in front of traffic changes and prevent new services from impacting existing ones.  Since virtual networks are explicit rather than permissive, they present a unique opportunity to do this, and it might be that the ability to stimulate transport-layer analytic processes will be a critical byproduct of virtual network services.

Events–The Missing Link in Service Automation

In my blog yesterday I talked about service modeling, and it should be clear from the details I covered that lifecycle management, service automation, and event handling are critical pieces to NFV.  The service model ties these elements together, but the elements themselves are also important.  I want to talk a bit more about them today.

Almost a decade ago, the TMF had an activity called “NGOSS Contract” that later morphed into the GB942 specification.  The centerpiece of this was the notion that a service contract (a data model) would define how service events related to service processes.  To me, this was the single most insightful thing that’s come out of service automation.  The TMF has, IMHO, sadly under-realized its own insight here, and perhaps because of that the notion hasn’t been widely adopted.  The TMF also has a modeling specification (“the SID”, or Service Information and Data model) that has the essential features of a model hierarchy for services and even a separation of the service (“Customer-Facing”) and resource (“Resource-Facing”) domains.

Service automation is simply the ability to respond to events by invoking automated processes rather than manual ones.  In yesterday’s blog I noted that the rightful place to do the event steering to processes is in the model (which is where the TMF’s seminal effort put it, and incidentally where Ciena’s DevOps toolkit makes it clear that TOSCA, and thus Ciena, can also put it).  What we’re left with is the question of the events.  Absent service events there’s no steering of events to processes and no service automation.

The event stuff can’t be ignored, and it’s more complicated than it looks.  For starters, there’s more than one kind of service event.  We have resource events that report the state of resources that host or connect service features.  We have operations events that originate with the OSS/BSS, customer service rep, network operations center, or even directly with the customer.  We also have model events that originate within a service model and signal significant conditions from one model element (a lower-level dependent one) to another (the higher-level one), for example.  Finally, with NFV, we have virtual network function (VNF) events.  Or we should.

One of the glaring gaps in NFV work so far is the relationship between virtual functions as elements of a service and both the resources below and the service structures above.  The current NFV work postulates the existence of an interface between a virtual function (which can be made up of multiple elements, including some management elements) and the rest of the NFV logic, meaning the orchestration and management components.  That’s at least an incomplete approach if not the wrong one; the connection should be based on events.

The first reason for this is consistency.  If service automation is based on event steering to appropriate processes, you obviously need events to be steered, and it makes little sense to have virtual functions interact with service processes in a different way.  Further, if a virtual function is simply a hosted equivalent of a physical device (which the NFV work says it is), and if physical devices, through management systems, are expected to generate resource events, then VNFs should be expected to generate them as well.

The second reason for this is serialization and context.  Events are inherently time-lined.  You can push events into a first-in-first-out (FIFO) queue and retain that context while processing them.  If you don’t have events to communicate service conditions at all levels, you can’t establish what order things are happening in, which makes service automation totally impossible.

Reason number three is concurrency and synchronization, and it’s related to the prior one.  Software that handles events can be made multi-threaded because events can be queued for each process and for multiple threads (even instances of a single process).  That means you can load-balance your stuff.  If load balancing is an important feature in a service chain, doesn’t it make sense that it’s an important feature in the NFV software itself?  And still, with all of this concurrency, you can always synchronize your work through events themselves.

Generating events is a simple matter; software that’s designed to be event-driven would normally dispatch an event to a queue, and there the event could be popped off and directed to the proper process or instance or thread.  Dispatching an event is just sending a message, and you can structure the software processes as microservices, which is again a feature that Ciena and others have adopted in their design for NFV software.  When you pop an event, you check the state/event table for the appropriate service element and you then activate the microservice that represents the correct process.

State/event processes themselves generate events as one of their options.  In software, the typical behavior of a state/event process is to accept the input, generate some output (a protocol message, an action, or an event directed to another process) and then set your “next-state” variable.  Activating an ordered service works this way—you get an Activate event, you dispatch that event to your subordinate model elements so they activate, and you set your next-state to ACTIVATING.  In this state, by the way, that same Activate event is a procedure error because you’re already doing the activating.
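Here’s a minimal Python sketch of that kind of event steering: a queue, a per-element state/event table, and processes that return the next state.  The table contents and process names are illustrative assumptions, not anything out of the ETSI or TMF material.

```python
# A minimal state/event steering sketch.  Table contents and process names are
# illustrative, not a spec.
from collections import deque

event_queue: deque = deque()


def activate_subordinates(elem: dict, event: str) -> str:
    for child in elem["children"]:
        event_queue.append((child, "Activate"))      # dispatch downward
    return "ACTIVATING"


def procedure_error(elem: dict, event: str) -> str:
    print(f"error: {event} not valid in state {elem['state']}")
    return elem["state"]


# (state, event) -> process; each process returns the element's next state.
STATE_EVENT_TABLE = {
    ("ORDERED", "Activate"): activate_subordinates,
    ("ACTIVATING", "Activate"): procedure_error,
    ("ACTIVATING", "ChildActive"): lambda elem, event: "ACTIVE",
}

vpn = {"name": "VPN", "state": "ORDERED", "children": ["Access-1", "Access-2"]}
event_queue.append((vpn, "Activate"))

while event_queue:
    element, event = event_queue.popleft()
    if isinstance(element, dict):                    # a model element we manage
        process = STATE_EVENT_TABLE[(element["state"], event)]
        element["state"] = process(element, event)
    else:                                            # a child handled by its own model
        print(f"dispatch {event} to {element}")
```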

Can we make a VNF generate an event?  Absolutely we can, just as we can make a hardware management system generate one.  Who do we dispatch a VNF event to?  To the service model element that deployed the VNF.  That element must then take whatever local action is appropriate to the event, and then dispatch events to higher- or lower-level elements as needed.

Phrased this way, the NFV notion of having “local” or “proprietary” VNF Manager (VNFM) elements as well as “central” elements actually can be made to work.  Local elements are part of the service object set that deploys the VNF—a resource-domain action in my terms.  Central elements are part of the service object set that defines functional assembly and collection—the service-domain behaviors.  In TMF terms these are Resource-Facing- and Customer-Facing Services (RFS and CFS, respectively).

If everything that has to be done in NFV—all the work associated with handling conditions—is triggered by an event that’s steered through the service data model, then we have full service automation.  We also have the ingredients needed to integrate VNFs (they have to generate an event that’s handled by their own controlling object) and the tools needed to support a complete service lifecycle.

You also have complete control over the processes you’re invoking.  Any event for any service element/object can trigger a process of your choice.  There’s no need for monolithic management or operations systems (though you can still have them, as collections of microservices) because you can pick the process flavor you need, buy it, or build it.  This, by the way, is how you’d fulfill the goal of an “event-driven OSS/BSS”.

This approach can work, and I think any software architect who looks at the problem of service automation would agree.  Most would probably come up with it themselves, in fact.  It’s not the only way to do this, but it’s a complete solution.  Thus, if you want to evaluate implementations of NFV, this is what you need to start with.  Anything that has a complete hierarchical service model, can steer events at any level in the model based on a state/event relationship set, and can support event generation for all the event types (including VNF events and including model events between model elements) can support service automation.  Anything that cannot do that will have limitations relative to what I’ve outlined, and as an NFV buyer you need to know about them.

A Deep Dive into Service Modeling

The question of how services are modeled is fundamental to how services can be orchestrated and how service-lifecycle processes can be automated.  Most people probably agree with these points, but not everyone has thought through the fact that if modeling is at the top of the heap, logically, then getting it right is critical.  A bit ago I did a blog on Ciena’s DevOps toolkit and made some comments on their modeling, and that provoked an interesting discussion on LinkedIn.  I wanted to follow up with some of the general modeling points that came out of that discussion.

Services are first and foremost collections of features.  The best example, and one I’ll use through all of this blog, is that of a VPN.  You have a “VPN” feature that forms the interior of the service, and it’s ringed by a series of “Access” features that get users connected.  The Access elements might either be simple on-ramps or they might include “vCPE” elements.  When a customer buys a VPN they get what almost looks like a simple molecule: the central VPN ringed with Access elements.

Customers, customer service reps, and service architects responsible for building services would want to see a service model based on this structure.  They’d want Access and VPN features available for composition into VPN services, but they would also want to define a “Cloud” service as being a VPN to which a Cloud hosting element or two is added.  The point is that the same set of functional elements could be connected in different ways to create different services.

Logically, for this to work, we’d want all these feature elements to be self-contained, meaning that when created they could be composed into any logical, credible, service and when they were ordered they could be instantiated on whatever real infrastructure happened to be there.  If a given customer order involved five access domains, and if each was based on a different technology or vendor, you’d not want the service architect to have to worry about that.  If the VPN service is orderable in all these access domains, then the model should decompose properly for the domains involved, right?

This to me is a profound, basic, and often-overlooked truth.  Orchestration, to be optimally useful, has to be considered at both the service level and the resource level.  Service orchestration combines features, and resource orchestration deploys those features on infrastructure.  Just as we have a “Service Architect” who does the feature-jigsaws that add up to a service, we have “Resource Architects” who build the deployment rules.  I’d argue further that Service Architects are always top-down because they define the retail service framework that is the logical top.  Resource Architects could be considered to be “bottom-up” in a sense, because their role is to expose the capabilities of infrastructure in such a way that those capabilities can couple to features and be composed into services.

To understand the resource side, let’s go back to the Access feature, and the specific notion of vCPE.  An Access feature might consist of simple Access or include Access plus Firewall.  Firewall might invoke cloud hosting of a firewall virtual network function (VNF), deployment of a firewall VNF in a piece of CPE, or even deployment of an ordinary firewall appliance.  We have three possible deployment models, then, in addition to the simple Access pipeline.  You could see a Resource Architect building up deployment scripts or Resource Orchestrations to handle all the choices.

Again applying logic, AccessWithFirewall as a feature should then link up with AccessWithFirewall as what I’ll call a behavior, meaning a collection of resource cooperations that add up to the feature goal.  I used the same name for both, but it’s not critical that this be done.  As long as the Service Architect knew that the AccessWithFirewall service decomposed into a given Resource Behavior, we’d be fine.

So, what this relationship would lead us to is that a Resource Architect would define a single Behavior representing AccessWithFirewall and enveloping every form of deployment that was needed to fulfill it.  When the service was ordered, the Service Architect’s request for the feature would activate the Resource model, and that model would then be selectively decomposed to a form of deployment needed for each of the customer access points.

If you think about this approach, you see that it defines a lot of important things about modeling and orchestration.  First, you have to assume that the goal of orchestration in general is to decompose the model into something else, which in many cases will be another model structure.  Second, you have to assume that the decomposition is selective, meaning that a given model element could decompose into several alternative structures based on a set of conditions.
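
A minimal Python sketch of that selective decomposition, using the vCPE example and entirely hypothetical site attributes, might look like this.

  # Hypothetical sketch of selective decomposition: one AccessWithFirewall
  # behavior that decomposes differently depending on what the site supports.
  def decompose_access_with_firewall(site):
      """Return the deployment structure chosen for this access point."""
      if site.get("has_cloud_pool"):
          return ["AccessPipe", "FirewallInCloud"]     # VNF on pooled servers
      if site.get("has_agile_cpe"):
          return ["AccessPipe", "FirewallvCPE"]        # VNF in a premises box
      return ["AccessPipe", "FirewallAppliance"]       # ship a physical appliance

  # The same service-level request yields different resource structures
  print(decompose_access_with_firewall({"has_cloud_pool": True}))
  print(decompose_access_with_firewall({"has_agile_cpe": True}))
  print(decompose_access_with_firewall({}))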

So a higher-level model element can “contain” alternatives below, and can represent either decomposition into lower-level elements or deployment of resources.  Are there other general properties?  Yes, and they fit into the aggregation category.

If two lower-level things (Access and VPN, for example) make up a service, then the status of the service depends on the status of these things, and the deployment of the service is complete only when the deployment of each of the subordinate elements is complete.  Similarly, a fault below implies a fault above.  To me, that means that every object has an independent lifecycle process set, and within it responds to events depending on its own self-defined state.

Events are things that happen, obviously, and in the best of all possible worlds they’d be generated by resource management systems, customer order systems, other model elements, etc.  When you get an event directed at a model element, that element would use the event and its own internal state to reference a set of handler processes, invoking the designated process for the combination.  These processes could be “internal” to the NFV implementation (part of the NFV software) or they could be operations or management processes.

If we go back to our example of a VPN and two AccessWithFirewall elements, you see that the service itself is a model element, and it might have four states: ORDERED, ACTIVATING, OPERATING, and FAULT.  The AccessWithFirewall elements include two sub-elements, the Firewall and the AccessPipe.  The Firewall element could have three alternative decompositions: FirewallInCloud, FirewallvCPE, and FirewallAppliance.  The first two of these would decompose to the deployment of the Firewall VNF either in a pooled resource or in an agile premises box, and the third would decompose to an external-order object that waited for an event that said the box had been received and installed.

If we assume all these guys have the same state/events, then we could presume that the entire structure is instantiated in the ORDERED state, and that at some point the Service model element at top receives an Activate event.  It sets its state to ACTIVATING and then decomposes its substructure, sending the AccessWithFirewall and VPN model elements an Activate.  Each of these then decomposes in turn and also enters the ACTIVATING state, waiting for the lowest-level deployment to report an Operating event.  When a given model element has received that event from all its subordinates, it enters the OPERATING state and reports that event to its own superior object.  Eventually all these roll up to make the service OPERATING.

If a lower-level element faults and can’t remediate according to its own SLA, then it reports a FAULT event to its superior.  The superior element could then either try remediation or simply continue to report FAULT up the line to eventually reach the service level.  When a fault is cleared, the model element that had failed now reports Operating and enters that state, and the clearing of the fault then moves upward.  At any point, a model element can define remedies, invoke OSS processes like charge-backs, invoke escalation notifications to an operations center, etc.
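
Here’s a toy Python version of that roll-up, assuming the same four states.  The class and method names are mine and it glosses over real deployment, but it shows how Operating and Fault events aggregate upward through the hierarchy.

  # Toy roll-up of the state/event example; class and method names are mine.
  class ModelElement:
      def __init__(self, name, children=None):
          self.name = name
          self.children = children or []
          self.parent = None
          self.state = "ORDERED"
          for child in self.children:
              child.parent = self

      def activate(self):
          """Activate cascades down; Operating events roll back up."""
          self.state = "ACTIVATING"
          if not self.children:              # lowest level: "deploy" and report
              self.enter("OPERATING")
          for child in self.children:
              child.activate()

      def child_event(self, event):
          if event == "Operating" and all(c.state == "OPERATING" for c in self.children):
              self.enter("OPERATING")
          elif event == "Fault":
              self.enter("FAULT")            # or attempt local remediation first

      def enter(self, state):
          self.state = state
          print(f"{self.name} -> {state}")
          if self.parent:
              self.parent.child_event("Operating" if state == "OPERATING" else "Fault")

  service = ModelElement("VPNService", [
      ModelElement("VPN"),
      ModelElement("AccessWithFirewall", [ModelElement("AccessPipe"), ModelElement("Firewall")]),
  ])
  service.activate()

The same child_event path is where per-element remediation, escalation, or OSS processes would be hooked in, which is exactly the independent lifecycle behavior described above.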

Another aggregation point is that a given model element might define the creation of something that lower-level elements then depend on.  The best example here is the creation of an IP subnetwork that will then be used to host service features or cloud application components.  A higher-level model element defines the subnet, and it also decomposes into lower-level deployment elements.
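
A small, hypothetical sketch of that dependency pattern: the higher-level element provisions the subnet once, and the elements it decomposes into are then deployed against that shared resource.

  # Hypothetical sketch: the parent element provisions a subnet once, then the
  # elements it decomposes into are deployed against that shared resource.
  def create_subnet(cidr):
      return {"subnet_id": "subnet-demo-1", "cidr": cidr}   # stand-in for a real RA call

  def deploy_into(name, subnet):
      print(f"deploying {name} into {subnet['subnet_id']} ({subnet['cidr']})")

  def deploy_cloud_service_element():
      subnet = create_subnet("10.1.0.0/24")                 # created at this level
      for component in ("FirewallVNF", "AppComponent"):     # lower-level elements reuse it
          deploy_into(component, subnet)

  deploy_cloud_service_element()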

I would presume that both operators and vendors could define model hierarchies that represent services or functional components of services, and also represent resource collections and their associated “behaviors”.  The behaviors form the bottom-level deployment process sets, and so if two different vendors offered slightly different implementations they could still be used interchangeably if they rationalized to the same behavior model, which could then be referenced in a service.
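
As a sketch of that interchangeability (the vendor names and functions here are placeholders, not real products or APIs), a behavior registry could map the same behavior name to whichever vendor-specific deployment process is installed in a given domain, so the service model never references the vendor directly.

  # Placeholder vendor functions, purely illustrative; both rationalize to the
  # same behavior name, so a service model never names the vendor at all.
  def deploy_firewall_vendor_a(params):
      print("vendor A controller deploys firewall VNF:", params)

  def deploy_firewall_vendor_b(params):
      print("vendor B controller deploys firewall VNF:", params)

  BEHAVIOR_REGISTRY = {
      ("AccessWithFirewall", "domain-east"): deploy_firewall_vendor_a,
      ("AccessWithFirewall", "domain-west"): deploy_firewall_vendor_b,
  }

  def deploy_behavior(behavior, domain, params):
      BEHAVIOR_REGISTRY[(behavior, domain)](params)

  deploy_behavior("AccessWithFirewall", "domain-east", {"site": "NYC-01"})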

This is a lightweight description of how a service model could work, and how it could then define the entire process of service creation and lifecycle management.  All the software features of SDN, NFV, and OSS/BSS/NMS are simply referenced as arguments in a state/event table for various model elements.  The model totally composes the service process.

The final logical step would be to develop a tool that lets operator architects drag and drop elements to create higher-level service models all the way up to the service level, and to define resources and deployment rules below.  These tools could work from the service model catalog, a storage point for all the model elements, and they could be based on something open-source, like the popular Eclipse interactive development environment used for software-building.

You might wonder where the NFV components like MANO or VNFM or even VNFs are in this, and the answer is that they are either referenced as something to deploy in a behavior, or they’re referenced as a state/event-driven process.  You could build very generic elements, ones that could be referenced in many model element types, or you could if needed supply something that’s specialized.  But there is no single software component called “MANO” here; it’s a function and not a software element, and that’s how a software architect would have seen this from the first.

A data-model-driven approach to NFV is self-integrating, easily augmented to add new functions and new resources, and always updated through model-data-driven activities, not software development.  A service model could be taken anywhere and decomposed there with the same kind of model software, and any part of a model hierarchy could be assigned to a partner provider or a different administrative zone or a different class of resources.

This is how NFV should be done, in my view, and my view on this has never changed from my first interactions with the ETSI process and the original CloudNFV project.  It’s what my ExperiaSphere tutorials define, for those interested in looking up other references.  This is the functional model I compare vendor implementations against.  They don’t have to do everything the way I’d have done it, but they do have to provide an approach that’s at least as workable and covers the same bases.  If you’re looking for an NFV implementation this is the reference you should apply to any implementations out there, open source or otherwise.

There’s obviously a connection between this approach and the management of VNFs themselves.  Since this blog is already long, I’ll leave that issue for the next one later this week.

Why a Model-Driven NFV Architecture is Critical (and How Ciena’s Looks)

In the fall of 2013 I had a meeting with five Tier One operators to discuss what was needed to make NFV work.  At the meeting, one of the key figures in the NFV ISG and a representative of a big operator made two comments I think are critical and have stood the test of time.  The first, referencing the fact that capex savings for NFV were now expected to be at most 20%, was “If I want a 20% reduction in capex I’ll just beat Huawei up on price!”  That meant that savings for NFV had to come from opex, and to demonstrate opex savings, my contact said, “You have to present an NFV architecture in the context of a service lifecycle.”

A service lifecycle, to an operator, has four distinct phases:

  1. Architecting, which is the process of defining a service as a structure made up of feature components that eventually decompose downward into resource commitments.
  2. Ordering, meaning the connection between an orderable service model and a customer, made through a self-service portal or a customer service rep.
  3. Deployment, the conversion of a service model that’s been ordered into a set of resource commitments that fulfill the service description and SLA.
  4. Management, the support of the service in its operational state, responding to conditions that would break the SLA or create inefficiency risks, and to customer-initiated changes or termination.

I took this to heart in my ExperiaSphere project, which divides the tutorial slides into these same four phases, and I’ve used these phases to assess the various vendor “NFV solutions”.  Most of the solutions were incomplete, as you can probably see just from the names of the phases.  Even where there were complete solutions, the specifics available from a vendor in online or distributed documentation were rarely enough to allow me to present the solution in detail.

Ciena is one of the six vendors I believe have a complete service-lifecycle story.  Their just-announced Blue Planet DevOps Toolkit also provides the requisite detail in their analyst preso, which I’m referencing here, and so I want to talk about it to illustrate why that first architecting phase is the key to the whole NFV story.

The Architect phase of a service lifecycle is really a series of iterations that are intended to build a decomposable model of a service that can then be offered as a retail product.  It’s always been my view that the model should be hierarchical and should describe a service as a feature composition, each feature of which is eventually linked to some resource management/deployment task.  This corresponds to the “declarative” model of DevOps, for those familiar with the software development world.

There are two pieces to the Architect phase, one to model the services and the other to control the resources.  This corresponds to Ciena’s notion of a Service Template and Resource Adapters, with the latter being roughly equivalent to the ETSI NFV ISG’s Virtual Infrastructure Managers.  Ciena uses the OASIS TOSCA (Topology and Orchestration Specification for Cloud Applications) language, which I’ve said for several years now is the most logical way to describe what are effectively cloud-deployed services.  The real intent of the Ciena Blue Planet DevOps Toolkit is to build these Templates and RAs and to then provide a framework in which they can be maintained, almost like the elements of an application, in a lifecycle management process that’s somewhat independent from the lifecycle management of the deployed services.

The Template and RA separation corresponds to what I’ve called “service domain” and “resource domain” activities in the Architect phase of a service.  The service domain is building services from feature elements that have been previously defined.  These can be augmented as features evolve, and revised as needed for market reasons, and it’s this add-and-revise process that’s analogous to software evolution for a running application.  The RAs associate in most cases with management systems or resource controllers that can configure, deploy feature elements, and change resource state.  Service Templates, at some point, reference RAs.

Service providers can use the Toolkit to build all this themselves, conforming to their own requirements and their own network infrastructure.  Vendors or third parties can also build them and submit them for use, and it would be logical IMHO to assume that many vendors would eventually realize that building an RA or Infrastructure Manager is essential if their stuff is to be used to host virtual functions, connect them, etc.

There is nothing as important to NFV’s success as the notion of service and resource models, yet little or nothing is said in the NFV specifications about how these things would be created and used.  The Ciena approach uses TOSCA to describe service characteristics and parameters (the NFV ISG is now looking at a liaison with OASIS/TOSCA to describe how the parameters they’re suggesting would be stored in a model).  These fall into three categories—parameter values used to guide decomposition of model elements, parameters used to describe service conditions upward toward the user, and parameters that are sent directly or indirectly to RAs for deployment and management.
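
Purely as an illustration of those three categories (the field names are mine, not TOSCA’s or Ciena’s), a model element’s parameters might be grouped like this.

  # Purely illustrative grouping; field names are hypothetical, not TOSCA's
  # or Ciena's.
  access_with_firewall = {
      "name": "AccessWithFirewall",
      # 1. values that guide how this element decomposes
      "decomposition": {"prefer": "FirewallInCloud", "fallback": "FirewallAppliance"},
      # 2. values reported upward to describe service conditions to the user
      "status_upward": {"state": "OPERATING", "sla_margin_ms": 4},
      # 3. values passed directly or indirectly to an RA for deployment/management
      "ra_parameters": {"bandwidth_mbps": 100, "firewall_ruleset": "standard-v2"},
  }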

Ciena’s focus on model-building here is critical because it would facilitate just what their preso says, which is vendor independence and facile integration of resources.  The only thing needed to make this story complete is a way of authoring the lifecycle processes too.

Lifecycle processes could be defined in this kind of model, for each model element.  That’s critical in making the models reusable, since each model element is a kind of self-contained and self-described service unit that knows how to deploy and sustain itself.  Since ordering a service is logically the first lifecycle stage, the entire operations process can be defined and composed this way, at every level.

The implicit Ciena approach is that a service template, or model, has built into each element the rules associated with that element’s life, both as an order template and when instantiated, as a piece of a service.  No matter where you put this element, geographically or in a service context, its model will determine how it is sustained, and nothing else is needed.  Give a model to a partner or subsidiary, or to a customer, and if they have the right RAs they can deploy it.  That will be true if you can define lifecycle processes into the service templates, and if you can’t then there’s a hole that will invalidate at least some of the benefits of the Ciena approach.  That means the question of whether the Toolkit supports lifecycle definitions, and how that works, is critical.

Most of you know that I don’t accept verbal inputs from vendors on important points because they’re not public disclosures.  I’d invite Ciena to respond by commenting on LinkedIn to my question here.  Alternatively, I’d like to see a document that describes the approach to lifecycle definition, without an NDA so I can reference it.

The good news is that I have been assured by two TOSCA experts that there would be no problem defining lifecycle processes in state/event form within a TOSCA template.  It’s only a matter of providing the mechanism to build it through something like the Toolkit, then steering events through it.  I’d love to see someone describe this in public, and if your company has it, I’d like to hear from you.

I’ve also realized with this announcement that I need a way of getting information on purported NFV implementations in depth and with confidence that the vendor really has what they say.  That’s the kind of information I want to base my assessments on, and pass on to those who read this blog.  I’m going to be working out an approach, and I’ll publish it as an open document for vendors when I’m done.

Where Are We in SDN/NFV Evolution?

Some examples of operator architectures for SDN and NFV networks have been out for a couple of weeks or more.  We’re now starting to hear more about vendor changes and even a bit more about the conditions in the marketplace.  I’ve certainly been hearing plenty, and so I want to provide a status report.

It has been clear for some time that there is no compelling movement to totally transform infrastructure using SDN and NFV in combination.  Optimally, SDN and NFV could represent almost half of operator capex by 2022 but that figure is now unlikely to be reached because it would require a holistic commitment to infrastructure change.  The progress toward next-gen infrastructure is now most likely to be service-specific, and the primary area of opportunity lies in mobile networking today and IoT down the line.

What separates mobile/IoT opportunity from other services (enterprise vCPE, for example) is scope.  Mobile/IoT is explicitly cloud-hosted so anything you build becomes reusable and leverageable, and there’s enough of either to build up a fairly significant repertoire of data centers.  This lowers both cost and risk for add-on services, providing that the architecture under which you deploy is suitable for extension.

One dimension of suitability is the breadth of the business case.  The problem with service-specific business cases is their specificity.  You could easily frame an opportunity so specialized in benefits that it would justify deployment where other opportunities would not.  Operators say that even mobile/IoT service opportunities are hampered by a lack of vertical integration of the technology elements involved in SDN and NFV.  Operators today believe that neither SDN nor NFV has a mature management model, which makes it nearly impossible to assess operations costs.  Since opex represents more of each revenue dollar than capex, the lack of clarity there inhibits anything but very specialized and localized investment.

The good news is that one early problem with both SDN and NFV, which is the dependence of success on coexistence with legacy elements, is now seen as solved.  Most vendors now promise such integration; Ciena has just announced a kind of next-gen-network-SDK that facilitates model-building for both resources and services, a key step to solving the legacy problem.  In theory, modeling also supplies the first step toward operational integration, but the rest of the steps here are taking longer, probably because of the continued debate over the best management approach.

We are making progress here, too.  At the beginning of this year there were no credible models for a complete future-network architecture, and we now have public models from both AT&T and Verizon and some vendors (Netcracker, most recently) have published their own top-to-bottom vision.  While even many within the operators’ organizations don’t think all the details are ironed out yet, there’s general acceptance of eventual success.  The median year for availability of a complete solution from credible sources is 2018 and a very few operators think that by this time next year there will be at least one proven model that addresses the full range of issues.

Leveraging vCPE opportunity may be critical for NFV because nearly every large operator has at least some of it, and most will see it either as a lead-off opportunity or a quick follow-on.  The challenge is that credible vCPE deployments so far are more likely based on premises hosting of functions in generalized appliances than on resource pools.  The applications are justified by a belief that agile function replacement and augmentation presents a revenue opportunity, a view that is somewhat credible in the MSP space but not as clearly so in the broad market.  And MSPs don’t usually deploy much infrastructure; they rely on wholesale/overlay relationships.

Many operators believe that consumer vCPE could be the escape from the premises-hosting trap, but the challenge has been that the cost of operationalizing a large population of cloud-fulfilled consumer edge functions isn’t known, and the savings available are limited by the low cost of devices and the need to have something on premises to terminate the service and offer WiFi connectivity to users.

SDN at present is primarily a data center play, which means that it would be likely to deploy primarily where larger-scale NFV deploys—mobile infrastructure.  According to operators, the big barrier to SDN deployment has been the lack of federated controllers to extend scope and the ability to efficiently support a mixture of SDN and legacy.  As noted above, these problems are now considered to be solved at the product level, and so it’s likely that trials will develop quickly and move to deployments.

Getting SDN out of the data center is the real challenge, according to operators.  There is a hope that SDN could build up from the fiber layer to become a kind of universal underlay network, and some operators point to the Open Optical Packet Transport project within the Telecom Infra Project (which both ADVA and Juniper just joined) as a possible precursor to an agile electro-optical underlayment.  Since that project is purely optical at this point, such an evolution would take time.

ADVA and Ciena both have strong orchestration offerings, acquired from Overture and Cyan, respectively, and at least some operators hope they’ll present something in the form of an integrated optical/SDN-tunnel picture.  Brocade might also play in this to exploit their Vyatta routing instances to build virtual L3 networks.  Nobody is moving very fast here, say operators, and so it probably can’t happen until 2018.

Vendors are facing their own challenges with SDN and NFV, as both the comments I’ve received from their salesforces and a recent article in The New IP show.  The primary problem, say vendors, is that the buyer wants a broad business case that most product offerings can’t deliver because of scope issues.  What’s needed is an integration project to tie deployment to operations efficiently, and while there are products that could build this vertical value chain (and all of it could be offered by half-a-dozen vendors), the sales cycle for something like this is very long, and vendors have been pushing “simplified” value propositions, getting them back to those service-specific deployments that cannot win on a large scale.

IoT could be an enormous boost for both SDN and NFV if the concept is recognized for what it is, which is a cloud-analytics and big-data application first and foremost.  IoT could be an even larger driver for edge data center deployments, and so could revolutionize both SDN and NFV—the former for DCI and the latter for more feature hosting than any other possible service.  HPE has made a few announcements that could be considered as positioning IoT this way, but they’ve not taken that position aggressively and the industry still sees this as a wireless problem.

Who’s ahead?  It’s hard to say, but operators think the vendors considered most likely to “win” at large-scale SDN/NFV modernization are Huawei and Nokia, with Ericsson seen as an integrator winner.  Both are seen as having a strong position in mobile infrastructure, which is where the first “big” success of both SDN and NFV is expected, and both have credible SDN/NFV stories that link to management and operations.  Brocade, who just demonstrated a hosted-EPC deployment in concert with Telefonica, and Metaswitch, who has had a cloud-IMS option for some time, are also seen as credible players in integrator-driven projects.

The integrator angle may prove interesting.  Ericsson has been the classic telco integrator, though both Nokia and Huawei have expanded their professional-services credibility.  Ericsson also has a strong OSS/BSS position, but they’re not seen by operators as leading the OSS/BSS evolutionary charge.  Operators are also wondering whether the Cisco relationship will end up tainting Ericsson as an objective integrator, but on the other hand it gives Ericsson a bit more skin in the game in terms of upside if a major SDN/NFV deployment is actually sold.

All of the IT vendors (Dell, HPE, IBM, Intel/Wind River, Microsoft, Oracle, and Red Hat) are increasingly seen as being hosting or platform players rather than as drivers of a complete (and benefit-driven) deployment.  That’s not an unreasonable role, but it lacks the ability to drive the decision process because that depends so heavily on new service revenues or operational efficiencies.  HPE and Oracle do have broad orchestration and potentially operations integration capability, but these capabilities are (according to operators and to their own salespeople) not being presented much in sales opportunities for fear of lengthening the selling cycle.

Vendors with fairly limited offerings can expect to benefit from increased support for from-the-top orchestration, both from vendors like Amdocs and Netcracker and in operator-developed architecture models.  However, the OSS/BSS-framed orchestration options have to be presented to the CIO, who has not been the driver for SDN/NFV testing and trials.  In fact, most operator CIOs say they are only now becoming engaged and that proving the operations benefits will take time.

Operators and vendors still see more issues than solutions, but the fact is that solutions are emerging.  I think that had a respectable top-down view of SDN/NFV been taken from the first, we’d have some real progress at this point.  That view is being taken now, but it seems likely it will rely more on professional services for custom integration than on any formal standard or even open-source project.  It’s going to take time, and every delay will lose some opportunity.  That’s why it’s important to confront reality right now.