More on NFV Orchestration and Open Source

Carol Wilson of Light Reading did a nice piece on the operators' mixed position on open source, quoting comments from the Light Reading NFV Carrier SDN event.  I've blogged about some of the points of the discussion, but I wanted to cover those that I hadn't covered, or perhaps hadn't covered fully.  The fall is the season of networking shows for the carrier space, so it always turns up interesting new things to review.

The opening point of the article is very important in my view; the four operators on the panel said they didn't see value in waiting for standards, because standards just slowed things down.  That's the view I get generally from operators these days, though there are differences in perspective depending on what operator organization you're talking to.  Most now realize that the market isn't going to wait for standardization, which is why a path like open source is important.

Operators' standards processes originated at a time when operators planned to support a future demand source whose emergence was under their control.  Today, the only service that matters much is Internet/broadband, and for that service the OTTs are the only relevant driver.  That means operators risk getting dangerously behind the curve if they can't respond to outside influences.  Standards groups have never worked, and probably never will work, nearly fast enough.

Where I disagree with the operators on the panel, and where at least a slight majority of operators I’ve talked with also disagree, is that APIs and abstraction layers are enough to replace models.  There are situations where that can be true, but also situations where APIs and abstractions are a doorway into the wrong room.

I don’t really like the term “abstraction layer” unless it’s coupled with the basic truth that an abstraction layer abstracts something into something else.  The first “something” is a set of resources/infrastructure available to host or support services.  The “something else” is a set of things that will be mapped to that first “something” set.  A Virtual Machine (VM) is an abstraction of a server.  The point is that you can’t have an abstraction layer without having the abstractions, which if they aren’t models are something that sure looks like models.

I agree with the comments DT made in its keynote at the event; the industry would benefit from defining the abstractions as models.  It goes back to the notion of an “intent model” which is an abstraction of a set of functional behaviors, linked to a data model that defines the parameters and status variables.  A “router” might be an example of an intent-model abstraction, and if that were so, the implementations that could claim to be a router would be those that could map to the abstraction.  Beyond that capability, nothing else would matter.  That seems to me to be a key to open implementations of networks.
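
To make the intent-model idea a little more concrete, here's a minimal Python sketch; the class, fields, and values are purely hypothetical, not drawn from any standard.  The point is that the abstraction separates the parameters a service asks for from the status variables an implementation reports back.

```python
# Hypothetical intent-model sketch; names and fields are illustrative only.
from dataclasses import dataclass, field

@dataclass
class RouterIntent:
    """Abstract "router": anything that can honor these parameters and
    report these status variables could claim to be one."""
    # Parameters: what the service asks for.
    ports: int = 4
    throughput_gbps: float = 10.0
    routing_protocols: tuple = ("BGP", "OSPF")
    # Status variables: what the chosen implementation reports back.
    status: dict = field(default_factory=lambda: {"operational": False,
                                                  "packets_forwarded": 0})

def realize(intent: RouterIntent, implementation):
    """Map the abstraction to a concrete implementation; a physical router,
    a virtual router, or a white-box device could all plug in here."""
    return implementation(intent)

# A trivial "virtual router" realization that just accepts the intent.
vrouter = realize(RouterIntent(), lambda i: {"type": "vRouter", "intent": i})
print(vrouter["type"], vrouter["intent"].throughput_gbps)
```

Any implementation, physical or virtual, that could honor that contract would qualify as a "router"; nothing else about it would matter.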

The alternative to having a set of abstract models like “router” is to allow everyone to create their own abstraction, like “router 1” through some arbitrary “router n”.  If we do that, then any vendor could create their own proprietary abstraction for “router”, and none of them would be assured as being interoperable.  Differences in the properties of the abstraction would mean differences in how the abstraction could be realized from below, and used from above, in a service.  That means everything would be vendor-specific and brittle.

What I’m really not comfortable with is the notion that API specifications solve anything.  An API exposes the features of a component.  The nature of the API depends on what the features are, what data is associated with them, and to a degree what the architecture of the two components that the API links is supposed to be.  At one extreme, you could say that every intent-model abstraction has an API, which gets us to the point of having to explicitly code for model relationships.  At the other (the right one) you could say that there is really only one API, the “intent-API”, which passes the data elements as a payload.  That allows a general event-and-model-driven relationship.
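
As a rough illustration of that single "intent-API" notion, assuming an invented element name and payload (none of this reflects a real specification), a generic entry point might look like the sketch below, with all the variability carried in the data payload rather than in a proliferation of interfaces.

```python
# Sketch of a single generic "intent-API"; element names and payload fields
# are invented for illustration.
import json

def intent_api(element_id: str, event: str, payload: dict) -> dict:
    """One interface for everything: which modeled element, what happened,
    and a free-form data payload.  New features change the payload and the
    models, not the API itself."""
    # A real system would dispatch this into the element's state/event
    # handling; here we just acknowledge and echo the request.
    return {"element": element_id, "event": event, "accepted": True,
            "payload": payload}

response = intent_api("access-router-7", "ModifyBandwidth",
                      {"committed_mbps": 500, "burst_mbps": 1000})
print(json.dumps(response, indent=2))
```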

What most operators mean by “API” is a high-level interface, not to service orchestration or composition or lifecycle management, but to a service portal that is then linked downward to that detailed stuff.  The portal APIs don’t compose, they expose.  If they expose lousy design, then the best of them is still a lousy API.  You worry about exposure when you have an effective underlying service lifecycle architecture that you can expose.

Another comment I can't agree with is that MANO's issues arise from the fact that it has to "interact with so many different pieces and layers, there is no open source or traditional vendor solution that tackles that, it's the nature of the beast…."  MANO's problems are far deeper, and different, in my view.  In fact, its largest problem is that it was scoped to interact in one very simple and limited way—as a means of creating the functional equivalent of a device from the virtual functions that make it up.  That means that everything that's involved in composing services as a whole is out of scope.  You can't push service architecture up from below, you have to impose it from above.

OSSs, with their migration issues and complexity, really aren't the problem either.  They were designed to interact with something very close to a set of service/device abstractions, so a good model-based service lifecycle management approach would be able to mesh with OSS/BSS relatively easily.  The fact that we're expecting OSS/BSS to change or adapt to MANO shows that we're pushing spaghetti uphill again.  We already know how OSS/BSS looks at services; we just have to model how resources are harnessed to create them and we have a nice meet-in-the-middle.

The point here is that yes, standards are a waste of time.  So is open-source if it’s not done right, meaning not based on intent modeling and event-driven systems.  Right now, we don’t have any example of open-source that is, and so right now we don’t have anything that I think will solve the problems of service lifecycle automation.  Could we get there?  Sure, and in my view fairly quickly.  I think that the whole situation could be defined correctly in six months if it were to be approached the right way.  I tried to lay out the best model in my ExperiaSphere project, and I still believe that’s the way to do it.

All the smoke around the NFV issues we’re now seeing is a good sign; you can’t fix a problem if you don’t recognize and accept it.  It’s also true that accepting a problem doesn’t fix it.  Open source is a software project strategy and a business model, not a software architecture, and you can’t have good software without a good software architecture.  In fact, without a good architecture, any project is a boat anchor not a stepping-stone, because it tends to freeze thinking more and more as it goes on.  We’re rapidly running out of time to get this right at the open-source level, which is why I think that ultimately a vendor will get to the right answer first.

Does Nokia’s Wireline Slicing Change the Game?

There is a lot of 5G that is real.  There is a lot that is almost surely a pipe dream, and there's some in between.  Network slicing is in that in-between group.  The assumption for network slicing is that a 5G network can be divided into separate, parallel, virtual slices that act as independent networks.  These can then partition services, offering things like better isolation of MVNOs in wireless.  Nokia now plans to bring network slicing to wireline broadband to make the concept (and Nokia's influence) more universal, so we need to take a look at the broad mission and see if slicing is in your future.

Probably the best way to think about network slicing is to compare it to virtual machines sharing a server.  Each VM “acts” like a server and is largely partitioned from the others, but they do share underlying resources and so those resources are somehow partitioned too.  That could create either a permanent division of resources, or some form of elastic by-the-rules division of capacity.
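
A toy sketch may help show the difference between those two options; the slice names, capacities, and the sharing rule here are all invented, one of many that could apply.

```python
# Toy comparison of fixed versus elastic ("by-the-rules") capacity division.
# Slice names, capacities, and the sharing rule are all invented.
TOTAL_MBPS = 1000

def fixed_partition(slices: dict) -> dict:
    """Permanent division: each slice keeps its configured share whether or
    not it uses it, which can waste capacity."""
    return dict(slices)

def elastic_partition(demands: dict, minimums: dict) -> dict:
    """Rule-governed division: guarantee each slice its minimum, then share
    the remaining capacity in proportion to current extra demand."""
    allocation = dict(minimums)
    spare = TOTAL_MBPS - sum(minimums.values())
    extra = {s: max(0, demands[s] - minimums[s]) for s in demands}
    total_extra = sum(extra.values()) or 1
    for s in allocation:
        allocation[s] += spare * extra[s] // total_extra
    return allocation

print(fixed_partition({"iptv": 400, "data": 500, "iot": 100}))
print(elastic_partition({"iptv": 300, "data": 700, "iot": 50},
                        {"iptv": 200, "data": 200, "iot": 50}))
```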

The idea behind this is that there are applications of broadband that demand sharing, and today there's very little that can be done to directly apportion the stuff being shared.  For mobile services, as I noted above, Mobile Virtual Network Operators or MVNOs sell service by riding on the cellular networks of others.  We also have, in the wireline world, a number of examples of division of resources.  Verizon's FiOS, for example, separates data and TV by the wavelength (which is inversely proportional to frequency) of the light used to transport them.  The mobile example shows a case where sharing is somewhat disorderly; direct control over the resources used by MVNOs is complicated at best.  The FiOS example shows a fixed partitioning that might in some cases waste capacity.

There are two architectural models of what Nokia calls “Software-Defined Access Networks” or SDAN (being standardized by the BBF as TR-370).  In the first model, the slicing is done by having a management-plane process that controls a network abstraction layer over the broadband access service.  This doesn’t differ much from the current approach to MVNO hosting, except that the techniques for creating, maintaining, and connecting with the SDANs are standardized.  The second model assumes that network nodes would actually perform the resource partitioning, which in theory might then use an overlay or encapsulation approach to create virtual pipes over a real one.

It’s not hard to find applications for SDAN.  We already have several, as I’ve noted.  It’s also possible that things like VoIP, IPTV streaming, IoT and home control, business services delivered “as-a-service”, and so forth could also use the SDAN approach.  However, the question really isn’t what could be done with slicing, but what couldn’t be done without it.  That’s a harder question to answer.

The distinction here is that slicing is going to cost something, particularly in its node-implemented form.  You’d have to replace or upgrade current devices to make it work.  The software to do it isn’t likely to be free, there’s overhead in both capacity and operations efforts to sustain it…you get the picture.  Somehow that cost downside has to be offset by an upside, an application that is so clearly better with slicing that the business case for adopting it can be made fairly easily.

Do we have that?  The biggest technical shift slicing creates is the ability to support independent handling of the slices, meaning that for mobile services, for example, MVNOs would have the latitude to do more of their own authorization and service signaling and coordination.  That could bring about a set of mobile services very different from those of the host mobile operator’s services.  The next-biggest shift is (mostly in the node-controlled model) the ability to create very strict partitioning of resources, which could be a benefit to services that needed precise service quality standards.

For the negatives, the biggest downside to slicing is that additional cost.  Not only does slice management add costs for the primary broadband operator, exercising the benefits of independent service signaling and control could be expected to add to the costs borne by the virtual “slice operators”.  Next-biggest is the uncertainty of the migration.  The management-sliced model and the APIs are designed to facilitate an evolution from the current devices to future node-sliced services, but there are two key transitions to be made; first to the management-sliced APIs and then to the node-sliced devices.  Both have to be justified, and it’s not certain either will be.  In particular, if the node-sliced approach that delivers the most decisive differences in capabilities doesn’t come about, the management-sliced approach would have been pretty much a waste of time and money.

I’m not convinced that the market currently justifies slicing, and I’m not sure that it will be easy to create the justification.  The problem is the classic demand/availability linkage; nobody will depend on something or even plan to use it without some assurance of availability, yet nobody will offer something nobody wants.  Thus, it seems to me that Nokia and others in the pure 5G-slicing space are depending on the operators to behave as they did when they were regulated monopolies, meaning build stuff in anticipation of demand.  Even then the operators would have to give slicing credibility.  I don’t think operators will rush to offer slicing because they don’t see credible revenue.

That means that slicing comes along as a result of orderly modernization that can carry through the changes it requires with little or no incremental cost.  There is probably nothing out there on which this orderly modernization could ride, other than 5G.  In 5G/FTTN mm-wave hybrid form, 5G will shift the TV dynamic decisively to streaming.  In mobile form, it’s essential to up the capacity of cells to support a mobile population that’s growing and streaming more.

That means that we probably can't expect any form of network slicing until we get 5G core deployed in quantity, which probably won't happen for at least another six years.  We'll get 5G much sooner, but in the 5G-over-4G-EPC Non-Standalone or NSA form.  Nokia might hope that this new and broader approach to slicing will be applicable (in management-sliced form, at least) to the 5G NSA stuff, and that perhaps supporting even node-slicing for wireline broadband will influence how node-slicing could be used to divide a 5G/FTTN RF path.

5G/FTTN’s relationship to slicing is the big question for me.  If Nokia really wants to push slicing into the mainstream, I don’t think there’s a better way to do it than to focus on the 5G/FTTN hybrid.  Since this is new deployment, it isn’t hampered by legacy equipment that would be obsolete if the node-slicing model were to be deployed.  It seems to me that slicing the hybrid home/business broadband access path would quickly establish a framework to validate the business case for slicing.

An imponderable here is regulatory policy.  Some countries, including the US, had regulations in place that would discourage access providers from separating services from the Internet or even creating “fast lanes”.  That policy has been reversed by the FCC in the US, but the reversal is under appeal.  Elsewhere, the way that dodging the Internet with parallel access services would be interpreted by regulators is a question.

This is a smart play for Nokia, despite the questions.  I think it could be rendered a smarter play if Nokia were to focus on 5G/FTTN hybrid access, but there's still time for Nokia to take some concrete steps to create that linkage.  If they do, they may be on to something.

Is Open-Source or Proprietary NFV Orchestration the Winner?

What makes orchestration so hard?  Another Light Reading piece on NFV asks whether open-source is the right choice for NFV orchestration.  It’s a good question, but only part of the bigger question, which is how we’d define “rightness” to begin with.  It’s pretty obvious that there are multiple visions of orchestration, both within the NFV community and between that community and cloud computing.  How the heck did we get so far astray on even the basics, when orchestration is so critical?

The first issue was taking a limited-scope, bottom-up, view of the problem.  The NFV ISG, from the first, saw their task as being one of creating a simple framework that substituted a collection of virtual network functions (VNFs) for a traditional network device, or physical network function (PNF).  That process, they perceived, would involve things like chaining a succession of VNFs together, hence the focus on “service chaining”.  To the ISG, “orchestration” and “management” were directed first at the deployment of the VNFs that made up the virtual device, and second to manage the things that happen within the VNFs that don’t directly relate to the management of the function, which they presumed would be handled the old way, via EMS/NMS/SMS and OSS/BSS processes.
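
As a purely illustrative sketch of that service-chaining idea (the "VNFs" here are trivial stand-ins, not real network functions), the virtual device is simply traffic passed through a succession of functions in order.

```python
# Toy "service chain": the virtual device is traffic passed through a
# succession of VNFs in order.  These functions are trivial stand-ins.
def firewall(packet):            # hypothetical VNF 1
    packet["inspected"] = True
    return packet

def nat(packet):                 # hypothetical VNF 2
    packet["src"] = "203.0.113.1"
    return packet

def service_chain(packet, chain):
    """Apply each VNF in sequence; the chain as a whole stands in for the
    physical network function (PNF) it replaces."""
    for vnf in chain:
        packet = vnf(packet)
    return packet

print(service_chain({"src": "10.0.0.5", "dst": "198.51.100.9"},
                    [firewall, nat]))
```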

Most of the work on NFV orchestration has tracked this approach, which means that the NFV work really only addressed the incremental orchestration mission of virtual functions, not the broader mission of orchestrating network services that might or might not contain VNFs.  This was fine when it was presumed that NFV could be justified by the capex savings associated with device replacement by virtual functions, but not when opex efficiency and service agility were included as benefits.  That opex and agility were essential was recognized by the end of 2013, but the scope of NFV orchestration wasn’t broadened.

The second issue, which I touched on in yesterday's blog, was ignoring the cloud.  Perhaps it would be fairer to say that the early work focused on the cloud for only one component of the architecture, a component that lived below orchestration.  At the time of the early work on orchestration within the ISG, we already had a cloud evolution of orchestration.  Kubernetes, probably the best known of the cloud "orchestrators", traces its roots to Google's internal cluster-management projects begun roughly 15 years ago, and it was open-sourced as Kubernetes in 2014.

The cloud principle of orchestration, embodied not only in Kubernetes but in other tools, including DevOps tools, is that lifecycle management is an overlay, a layer that lives on top of a set of resource abstractions and that operates below a series of “goals” or “end-states” that drive the process.  The orchestration layer takes definitions of those end-states (which represent applications or services) and applies them to the resource abstractions, which in turn are mapped to the real resources.  This differs from the NFV approach, which focuses orchestration only on that last-stage mapping.
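
A minimal sketch of that principle, assuming invented component names and instance counts (this mimics the Kubernetes-style reconciliation pattern; it isn't Kubernetes code), might look like this:

```python
# Sketch of desired-state reconciliation over a resource abstraction.
# Component names and counts are invented; this mimics the pattern, it is
# not Kubernetes code.
desired_state = {"web-frontend": 3, "vpn-gateway": 2}   # the declared goal
actual_state  = {"web-frontend": 1, "vpn-gateway": 2}   # what's running now

def reconcile(desired: dict, actual: dict) -> list:
    """Compare the goal with reality and emit the actions needed to
    converge.  The orchestration layer never touches real servers; the
    resource abstraction below it handles that mapping."""
    actions = []
    for component, want in desired.items():
        have = actual.get(component, 0)
        if want > have:
            actions.append(("scale_up", component, want - have))
        elif want < have:
            actions.append(("scale_down", component, have - want))
    return actions

print(reconcile(desired_state, actual_state))
# -> [('scale_up', 'web-frontend', 2)]
```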

The third issue with NFV orchestration was building toward the "right" solution from the wrong baseline.  AT&T's ECOMP is a massive positive step in terms of functionality, but what ECOMP (now absorbed into ONAP) did was to create a kind of middle-management layer that was still designed to overlay the original NFV orchestration concept.  This approach turned ONAP into what's essentially an extension of the management system, not an orchestrator.  It's neither abstraction-based nor model-based, which is what the cloud would have mandated.

The same bend toward the ISG orchestration approach has impacted the vendor orchestration approaches, for the most part.  There are a few vendors who have a model-driven orchestration strategy, but most of these are still trying to link their approach to the NFV ISG orchestration model, and none that I’ve seen are based on the basic cloud orchestration standard, Kubernetes.  Does that mean that Kubernetes isn’t suitable?  More likely it means that Kubernetes and other orchestrators could play a role in a higher-level master orchestration approach, but if we started our processes by accepting the NFV model of orchestration, where would Kubernetes fit?  It goes back to the first problem.  We didn’t think of the future of NFV orchestration, or more broadly of service orchestration and lifecycle management, as a model-driven approach with each element in the model being an intent model, free to decompose what it contains in whatever way seems optimum.

This relates to the question that the Light Reading piece asks, but perhaps not the way it might appear.  The fact is that we don’t have the right model for orchestration today, period.  The question then isn’t whether open-source or proprietary orchestration is right, because neither is.  The question is whether open-source orchestration can get to the right answer faster than proprietary approaches could, and that’s a tough question to answer.

I don’t believe that ONAP will evolve to the right answer.  I’ve had some concerns about it, temporarily relieved when the ONAP people told me that they were going to be releasing model-based orchestration.  The problem is that they’ve not done that, and it’s clear to me from reviewing the material and talking to operators involved in ONAP that they’re not able to pivot the architecture of ONAP to the right approach, the approach that the cloud has already proven out.

I also believe that the ETSI Zero-Touch Automation (ZTA) stuff is going to miss the boat.  First, it’s jumping off explicitly from the NFV ISG stuff.  Second, these ETSI ISG processes have proven to take forever and follow the traditional standards-group model, which is unlikely to result in anything useful in any reasonable period of time.  That means that the answer to the question “Is open-source the right approach to NFV orchestration” is “No.”  We don’t have it, and we don’t have a convincing framework to achieve it.

That might also be true of proprietary stuff, of course, but a proprietary initiative to solve the problem of orchestration has two advantages.  First, there are no competing directions to harmonize, as there always are with open source.  Second, there's the powerful profit motive.  Somebody who gets this right can hope to make a lot of money selling it.  That's why I've been saying that I think it's going to be up to vendors to solve the orchestration problem.  Some have made progress in doing that already, and I expect that by the end of next year we'll see some real NFV and service orchestration offerings available.

Here’s How Operators are Seeing 2019 Technology

The traditional fall technology planning cycle for network operators isn’t over yet, but I’m already hearing a lot of interesting stuff, and also that most of the good stuff has already been discussed.  Thus, it’s a good time to take a look at what operators really think about their 2019, and the specific technology areas that are going to advance toward (if not all the way to) reality.

Let’s start with the obvious, which is 5G.  The good news, really good news for a lot of people, is that 5G is going to move with increased speed next year, and that’s only the beginning.  The mixed news (bad for traditional mobile players and good for other network vendors) is that 5G isn’t going to move in as revolutionary a direction as some had hoped.

The two areas of primary 5G focus for 2019 are millimeter wave, in the US and in some other markets, and New Radio (NR) almost everywhere.  I’m naming mm-wave first because despite the fact that only a half-dozen operators are committed to it next year (an eighth of those who gave me input), the dollar value for deployment is the largest for the 5G space.  Half the operators on the list will likely account for over 80% of the money spent.  New Radio has breadth, but it’s still taking baby steps even next year, and so while every operator on my list said they’d be doing some NR deployment, nobody thought it would amount to much more than “pre-positioning” assets for a bigger 2020 move.

What makes mm-wave so much an investment favorite is the fact that since it's going to deploy as an FTTN hybrid to offer residential and some business broadband, it's effectively a closed-ended technology.  Operators provide both the nodal part and the home/business part, and so they can pick whatever flavor makes sense and be sure that everything will talk.  NR requires mobile devices, and mobile devices without NR services are illogical, so there's the classic who-moves-first-and-takes-the-risk problem.  It can resolve only one way; operators will have to pre-deploy to encourage handset players to offer new devices, and users to take a chance and buy them.

Operators tell me that there will be some phones, primarily Android models (they say) with 5G (Qualcomm) chips, around the end of Q1, and some operators will deploy real 5G NR sites (over the LTE core, in what's called "Non-Standalone" or NSA form) at about the same time.  They expect early devices to cost something like a 50% premium over standard handsets, but these new gadgets will also support the traditional 4G LTE and even HSDPA bands.  Think of them as targeted at early adopters, bragging rights primarily.

5G NSA is expected to expand quickly next year, though.  By the end of the year, the operators I’ve gotten data from think they’ll have 5G coverage in most of the major metro areas they serve.  They expect that by 2020, mainstream 5G devices will be available, and that by 2021 they’ll be offering 5G NSA services “widely through their service area”.  By that time, 5G device premiums will have fallen to less than 15%.

The second thing operators say has gotten the stamp of approval is carrier cloud, but according to what they’re telling me, there’s still a lot of haze surrounding what’s going to happen there.  One reason is that about a third of operators call what they’re expecting to start deploying “carrier cloud”, another third “edge computing” and the final third simply lumps hosting in with other projects, ranging from 5G to video and ad delivery.  That spreading of commitment is what’s making me nervous; it seems to me that they’re lumping a lot of hopes into a “carrier cloud” bucket, and most of those hopes would have to be fulfilled to make something really happen.

The problem with carrier cloud is that it’s not an incremental commitment.  For carrier cloud to work economically and operationally, an operator would need to deploy several thousand servers and the proper cloud and lifecycle software tools.  None of the operators said they had a budget to just build things in the hope that applications and paying customers would simply come along.  To make matters even more complicated, there didn’t seem to be much of a consensus on what the software framework for carrier cloud would be.

Everyone, of course, thinks Linux is the foundation.  The next level up, tool-wise, would normally be a cloud stack and orchestration.  OpenStack only gets a nod from about 20% of operators, another 20% or so say they like containers and Kubernetes, and the rest are still "considering alternatives".  For lifecycle management, the truth is that only ONAP is out there at this point, but only about 15% of the operators thought there was any chance they'd adopt ONAP next year, and only a third of those were sure they would, which comes to 4 out of the group.

One factor that’s confusing operators who have a general commitment to both 5G and carrier cloud is how SDN and NFV fit in.  Fewer than 10% of the operators believed either SDN or NFV would be “major drivers” of either 5G or carrier cloud, and just over 10% thought SDN or NFV could contribute “significantly” to either 5G or carrier cloud deployments.  However, nearly all operators believe that SDN and NFV will play a role in both 5G and carrier cloud.

Of the two, SDN is the one with the clearest mission.  Every operator who believes in either technology believes in SDN deployment in carrier cloud, both for data center networking and for data-center interconnect (DCI).  Most expect to pick a flavor of SDN from a vendor rather than adopting an open model, largely for integration and stability reasons.  SDN in 5G has the support of less than half the operators, largely because there's still confusion on exactly what the 5G ecosystem will end up looking like.

NFV has the widest variation in support of any of our new technologies.  About a third of operators see NFV in virtual CPE (vCPE) missions, but most admit that their deployments will likely not follow the ETSI specs because they’re overkill for simple function-in-vCPE hosting.  About 20% think they might use NFV in 5G down the line, perhaps 2021, but again they admit that what they’re really likely to deploy are cloud instances of packet core functions that don’t rely on the ETSI specs.

The biggest problem operators cite for NFV is the licensing fees for the VNFs.  They say that the pricing that prevails so cripples the NFV business case that it often doesn’t make any sense at all to deploy it.  Only about a quarter of operators will also admit that the rest of the NFV ecosystem isn’t in place—that in particular they need better lifecycle management tools.

This adds up to a 2019 that's really mostly about the cloud, one that perhaps finally accepts that the future of network infrastructure isn't a bunch of standards groups, but the adoption of cloud practices that have been around for as long as or longer than either SDN or NFV.  That's good news in one sense; it shows a maturing of the thought processes.  It's bad news in that it says we're spinning our wheels on a lot of initiatives that aren't paying off.

How Convincing is the Operator Commitment to Open Source?

Just how committed are operators to open source?  We see a lot of stories about their sweeping shift to open-source technology, but is that shift a real one?  I did a quick survey of operators last month, and over three-quarters of them said that their current open-source deployment (beyond things like Linux and other software hosting elements) was less than 5%.  Almost the same number said that it wouldn't likely grow much until after 2020.  What's really going on?

There is no question that operators want to be more committed to open source.  In that same impromptu survey, only one operator said they didn't think open-source technology would figure in their transformation plans to any greater degree than it does today (as a hosting platform for OSS/BSS).  That operator didn't see transformation being driven by a shift to hosted features at all; every operator who does see that shift coming sees open-source elements playing a big role in it.

What, then, is holding them up in the near term?  There are three dominant issues, and all three of them could tell us a lot about how our transformed future might emerge.

The biggest problem cited by operators is the lack of a complete open-source ecosystem to fulfill requirements.  About half the operators said that they couldn’t find all the pieces of transformation in open-source form, and the other half said they could find what they needed but that they were uncomfortable about the integration.

This issue seems to emerge from the counterpressures of two points.  First, the industry tends to focus on narrow offerings in almost every space for the simple reason that broad technology shifts are difficult to justify.  Oil the squeaky wheel.  Second, though, operators need a high-level vision of what the total new infrastructure architecture would be, or they can’t fit things into it with any confidence, or build to it through their own efforts.  Lifecycle management seems to be the particular focus of these points.

You can see, in initiatives like AT&T’s ECOMP (now ONAP), that there’s still a tendency to view the lifecycle management process as an “application” in a traditional sense, something that runs in a monolithic way, takes “inputs” in the form of orders, changes, and faults, and somehow makes things work.  It’s been clear for the last decade, in my view, that this approach isn’t workable, but it’s still the intuitive pathway to solving the problem of automating complex services built on complex infrastructure.

Enthusiasm for ONAP, once at a level of almost 80%, has dipped to roughly 50%, and while operators don’t tend to state their reservations in software architecture terms, they do see problems with “integration”, “scalability”, and so forth.

The second problem operators list for open source is the disorderly change and release processes they find in open-source groups.  A few operator planners remember the old days of telco switches and their releases, which were planned on a two-year timeline and released every six months.  Most of them, while they don't yearn for this nice structured framework, do believe that there are simply too many changes made too quickly, and that all the dancers in the open-source ecosystem aren't hearing the same music.  A release of something fundamental can break a bunch of things because of dependency issues, and yet that release may be needed to get some other key feature supported.

There’s some truth in the view that open-source changes a lot, and with a certain lack of coordination across multiple projects.  However, there’s also truth in the view that the changes are driven by market needs, and so a more orderly release process would likely compromise feature evolution, perhaps to the point where it took too long to finally implement what was really needed for a given service or feature to deploy usefully.

We have open-source companies (Red Hat, in particular), and even telecom initiatives from those companies, but so far those initiatives aren’t complete, meaning they miss the boat on the first of our issues.  It might be fair to argue that unless we can resolve that first point of completeness, we’ll never get a solution to the problem of release disorder because the releases won’t solve operators’ problems.  The obvious solution to this would be for Red Hat to take the lead and define/assemble a complete ecosystem.  They’ve not done that, and that could be because they don’t know how (they don’t know what’s needed), because they can’t find the pieces out there, or because they haven’t tried.

The third problem is product direction, integration and support.  With respect to product direction, open-source stuff is a bit like a Ouija board; it goes where consensus pushes it.  With regard to integration and support, you’re back to the issue of project scope for the former, and to “community” support for the latter.

Support is perhaps the biggest point in this particular issue, and that clearly points at open source's model of community support.  Operators have run into integration and support issues with things like NFV, and they see that the broader ecosystem they know is needed (even if that ecosystem is still a hazy picture at best) will have even more of those issues.  They're used to having somebody be responsible for everything, and it's interesting that while operators are eager to avoid "vendor lock-in" or "proprietary" strategies, they seem to want the one-stop-shop benefits that come with those downsides.

This is where I think it's fair to say that operators' mindset has to be changed.  As sci-fi writer Heinlein once said, "There's no such thing as a free lunch".  You don't get free support with free software; you get no support in the sense that operators have come to interpret the term.  To coin another whimsical phrase, "You can't back-seat drive a self-driving car."  Operators can't expect to have a decisive role in directing a project they elect not to resource decisively.

It would be exciting to see operators really shift decisively to open source, which of course is why we see so many stories about it.  The publicity isn't helpful, though, because it hides the issues operators themselves cite, and hidden issues rarely get addressed.  The three issues operators cited to me are going to be difficult enough to resolve without adding the problem of hiding them.  As is often the case with operators who shy away from "vendor lock-in", their future may end up being decided by a vendor or vendors who step up to solve the problems.

Both Red Hat and VMware seem to have specifically targeted the operator space, and this could bring about a more ecosystemic vision of operator software, but so far neither firm has covered even the most critical software spaces thoroughly.  It’s hard to say whether they intend to, because in this day and age, it’s not uncommon for vendors to start something with a lot of flair, then slow-roll the hard stuff.  If that happens here, we may have to wait a long time for the open-source revolution to really hit the network operators.

Is Broadcom’s Bet on WiFi vs 5G a Good One?

We finally have a vendor coming out to say that maybe WiFi and not 5G is the answer to IoT connectivity requirements.  An SDxCentral report says Broadcom isn’t depending on 5G to open up connectivity, but instead points out that when 4G came along, operators had aspirations of replacing WiFi in buildings, only to come back to WiFi in the end.  So, is this an opportunistic play on Broadcom’s part, or maybe even the start of new realism?  The concept of the “Internet of things” has an inherent imperfection in the definition of both “Internet” and “things”.

Do IoT elements have to be “on” the Internet in that they are directly connected to it via some fixed or wireless technology, or can they simply be accessible from the Internet via some intermediary hub or controller?  Popular thinking seems to favor the former, and the concept of open public sensors and controllers, but reality suggests the latter.  Today, and for the foreseeable future, it would seem, the vast majority of IoT devices are installed as subordinate to a controller or hub.

The second issue is harder, and it creates a demand underpinning for the first.  As I’ve noted in earlier blogs, we already have a rich market for facility-installed sensors and controllers, in the home security market and for the increasingly hot space of active home control through voice agents (Alexa or Google Home) or smartphones.  Here, we are seeing a bit of a market shift that has to be accounted for in our thinking about the future.

Home security sensors and even process control sensors and controllers have traditionally used either hard wiring or a specialized wireless protocol that can run over the air or via home powerline.  Think Insteon, X10, or Zigbee.  There are legions of vendors supporting these devices, they’re cheap and open, and they are also supported by home control hubs that can then expose them to home agents, voice or otherwise.

These gadgets tend to be geeky, though.  Setting up a home security system using any of the traditional technology options is difficult for the average user, and in any event there's a trend to focus increasingly on WiFi-connected elements when new applications (like doorbells, security cameras, and thermostats) are introduced.  These devices are often supported with published APIs that enable integration not only with apps and voice control, but also with other applications.

What this seems to be doing is gradually shifting the market focus from the technology-centric sensor/controller frameworks of the past to an app-centric, WiFi-connected, framework.  At one level, that increases the chance that Broadcom is making the right choice in backing WiFi for IoT connectivity, but there’s still more happening.

Home WiFi isn’t the most secure thing in the world, as many people have learned.  One big problem is that the security of WiFi-connected devices is variable because vendors don’t always design them to be secure and don’t always upgrade their firmware to respond to news of a vulnerability.  Once a device on the home network is compromised, it’s possible to interfere with or even access other devices on the same network.  In addition, home Internet connections are almost always subject to tampering by the simple expedient of cutting the wires (or fiber) at the entry point.  Enter cellular.

Many higher-end home security systems have the option for a cellular backup connection, for which the user pays a monthly fee.  This connection can’t have its wires cut; it’s run off the system battery, and that battery raises a counterpressure to the WiFi connection trend.  Running home sensors from WiFi requires a lot more power, which means that fully WiFi-connected security grids may be too power-hungry to run long from batteries.  Thus, the desire for tamper-proof notification of authorities in the event of a problem may drive at least higher-end users (who are still the largest source of “IoT” deployments) to use less WiFi and more direct wiring, low-power local telemetry, and cellular connections.

5G itself could change this dynamic, not so much through "mobile" 5G but through millimeter wave and the 5G/FTTN hybridization we're already seeing emerge.  If we presume that a given facility (home, office, or factory) were to be connected via 5G/FTTN, then we'd have two things; a local 5G transceiver and antenna in the facility, and a wireless connection to the outside world.  Could this combination impact the situation?

Current PCs, phones, tablets, doorbells, cameras, and thermostats don’t have 5G connectivity.  The early 5G/FTTN hybrids will use millimeter wave that’s not suitable for mobile phone use, so many of these devices will never have 5G mm wave connectivity.  What could happen, though, is that network operators who deploy 5G/FTTN will provide their own facility-control hubs based on that technology, or include features to support facility control in their broadband gateways.  Third-party controllers and devices could then connect via WiFi.  That, of course, brings us back to WiFi.

The situation is different for IoT elements that aren't in facilities (homes, offices, or other places where there's WiFi or 5G/FTTN connectivity).  There, we'd need to either connect them with wiring or via cellular, including mobile 5G.  This is the 5G mission operators are excited about, for the obvious reason that they could get a lot of new 5G service customers.  However, it's likely that those interested in most IoT applications won't want to pay the freight for that connection.  One major city's planner told me that even if their mobile operator offered to discount connections for IoT devices to a third or even a quarter of the basic cell-service price, the cost would be too high.  He said the city would wire the sensors instead.

As was the case with 4G, then, 5G has some crippling issues if it's used to target IoT applications.  The biggest one is the same—cost of service.  That doesn't mean that 5G chips won't be needed, only that where they're needed is likely to be the same kind of mobile devices that use earlier-generation chips.  For that space, both the quantities of chips and the likely market leadership are already established.

It's really the millimeter-wave stuff that Broadcom and others should be looking at.  There seems little doubt that the 5G/FTTN hybrid method of delivering "wireline" broadband will be broadly deployed in urban and suburban areas, and even in rural pockets where density is high enough.  Obviously, this version of 5G will have more range than WiFi, enabling "roaming" within a neighborhood and even reducing the need for 5G mobile data plans.

Then there’s the possible impact of 5G/FTTN hybrids on legacy wireline providers.  Might they decide to offer 5G mm-wave services from their own node points, just to get in on the neighborhood service or (if they’re a cable MSO) offer mobile service over a wider area?  We don’t know how far 5G mm-wave would reach or if it would be added to the inventory of mobile device frequencies, but it’s possible.

For right now, Broadcom may have made the right choice by dodging the 5G wave and staying out of 5G IoT.  For later on?  We’ll have to see on that point.

Are Open APIs a Revolutionary Opportunity or a Cynical Trap?

APIs are good and perhaps even great, and “open” APIs are even better.  That’s the party line, and it has many elements of truth to it.  Look deeper, though, and you can find some truly insidious things about APIs, things that could actually hamper the software-driven transformation of networking.

“API” stands for “Application Program Interface”, and in software design it’s the way that two software components pass requests between them.  If a software component wants to send work to another, it uses an API.  “Services”, which are generalized and independent components, are accessed through APIs too, and so are “microservices”.  In fact, any time application behavior is created by stitching work through multiple software elements, you can bet that APIs are involved in some way.  The very scope of APIs should alert you to a fundamental truth, which is that there are a lot of different APIs, a lot of variations in features, styles, mechanisms for use, and so forth.

At the highest level of usage, one distinction is whether an API is “open”, meaning that its rules for use are published and available to all, without restrictive licenses.  Many software vendors have copyrighted not only their software but their APIs to prevent someone from mimicking their functionality with a third-party tool.  That practice was the rule in the past, but it’s become less so as buyers demand more control over how they assemble software components into useful workflows.  Certainly today we don’t want closed APIs, but opening all APIs doesn’t ensure that a new operator business model would emerge.

One of the reasons is that most pressure to open up APIs is directed at opening connection services to exploit their primitive elements.  The thinking (supposedly) is that by doing this, operators would be able to make more money selling pieces than selling entire services.  That’s not turned out to be true in other times and places where wholesaling service elements was tried, and few if any operators today believe this is a good idea.  We need to have new things, new features, exposed by our new APIs, and those new APIs have to expose them correctly, meaning optimally.

At the functional or technical level, you can divide APIs into two primary groups—the RESTful and the RPC groups.  RESTful APIs are inherently client-server in structure; there is a resource (the server) that can deliver data to the client on request.  The behavior of the server is opaque to the client, meaning that it does its thing and returns a result when it’s ready.  RPC stands for “remote procedure call”, and with an RPC API you have what’s essentially a remote interface to what otherwise looks like a piece of your own program, a “procedure”.
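
To illustrate the structural difference (these are local stand-ins, not wired to any real server, and the endpoint and procedure names are invented), the two styles look roughly like this:

```python
# Conceptual stand-ins only; nothing here talks to a real server, and the
# endpoint and procedure names are invented.

# RESTful style: the client asks an opaque resource for a representation;
# the server does its thing and returns a result when it's ready.
def rest_get_service_status(service_id: str) -> dict:
    # e.g. GET /services/{service_id}/status against a resource server
    return {"service": service_id, "state": "active", "uptime_hours": 412}

# RPC style: the call looks like a local procedure that happens to run
# remotely; the caller knows the procedure and its signature.
def rpc_restart_service(service_id: str, graceful: bool = True) -> bool:
    # e.g. restart_service(service_id, graceful) marshaled over the wire
    return True

print(rest_get_service_status("vpn-123"))
print(rpc_restart_service("vpn-123", graceful=False))
```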

Within each of these groups, APIs involve two distinct elements—a mechanism of access and a data model.  The former describes just how a request is passed, and the latter what the request and response look like.  Generally, APIs within a given group can be “transformed” or “adapted” to match, even if the mechanisms of access and data model are somewhat different.  There’s a programming Design Pattern called “Adapter” to describe how to do that.
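
Here's a minimal sketch of that Adapter idea, with hypothetical class and field names; the point is that neither side has to change for the two to interwork.

```python
# Minimal Adapter sketch; class and field names are illustrative only.
class LegacyInventoryAPI:
    """Existing API: a flat record keyed by vendor-specific names."""
    def fetch(self, dev: str) -> dict:
        return {"devName": dev, "opState": "UP", "ifCount": 24}

class InventoryPort:
    """The interface the calling code expects."""
    def get_device(self, device_id: str) -> dict:
        raise NotImplementedError

class LegacyInventoryAdapter(InventoryPort):
    """Adapter: transforms the legacy access mechanism and data model into
    the one the caller wants, without changing either side."""
    def __init__(self, legacy: LegacyInventoryAPI):
        self._legacy = legacy

    def get_device(self, device_id: str) -> dict:
        raw = self._legacy.fetch(device_id)
        return {"id": raw["devName"],
                "operational": raw["opState"] == "UP",
                "interfaces": raw["ifCount"]}

print(LegacyInventoryAdapter(LegacyInventoryAPI()).get_device("edge-router-3"))
```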

Where things get complicated is in the implied functional relationship of the components that the API links.  One important truth about an API is that it effectively imposes a structure on the software on either side of it.  If you design APIs to join two functional blocks in a diagram of an application, it’s likely that the API will impose those blocks on the designer.  You can’t have an interface between an NFV virtual network function and an element management system without having both of those things present.

We saw this in the NFV ISG’s end-to-end model, which defined functional blocks like the Management and Orchestration (MANO), VNF Manager (VNFM), and Virtual Infrastructure Manager (VIM).  While the diagram was described as a functional model, these blocks were the basis for the creation and specification of APIs, and those APIs then mandated that something like the functional block structure depicted was actually implemented.

The functional diagram, in turn, presumes the nature of the workflow between components, and in this case it presumes a traditional monolithic management application structure.  That's a problem because a service is made up of many elements, each of which could be going through its own local transition at a given point.  In traditional management systems, an element has a management information base (MIB) that represents its state, and this information is read by the management system for handling.  Thus, you get a management flow that consists of an element state, and processes then determine what to do about that state.  Everything has its own state in a service, and so it's easy to see how deciding what to do about a single element's state in the context of the grand whole could be difficult.

The notion of state here naturally gives rise to the issue of stateful versus stateless processes and APIs.  In theory, RESTful APIs should be stateless, meaning that the resource or server side doesn’t remember anything between messages.  That makes it possible for clients to (in theory, at least) access any instance of a server/resource to get the same result.  It also means you can fail something over by reinstantiating it, and you can also scale something under load.

All of this has to be related to some broad software-architecture goal to be truly useful, and as I’ve said many times, I think that goal is the intent-data-model-and-event-driven structure similar to that proposed a decade ago by the TMF.  In this structure, an event is analyzed based on the current state of the modeled element it’s associated with, and this analysis (based on a state/event table) kicks off an associated process and sets a successor state if needed.

In an event-driven process, everything reduces to an event.  That means that higher-layer things like service orders or management status requests are events, and events activate the processes in a standard way.  Which is how?  The answer is that there should be a standard “process linkage” API that is used by the event-handling tool, one that presumes a process link (in the form of a URI, the general case of our familiar URL) is present in the state/event table, and that then activates that process.  The exact mechanism isn’t critical, but if you want the process to be scalable, it should be stateless and designed to a REST API.
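
A hedged sketch of that process-linkage idea, with an invented state/event table and process URIs, might look like this:

```python
# Invented state/event table and process URIs, for illustration only.
STATE_EVENT_TABLE = {
    ("ordered",   "Deploy"):       ("deploying", "proc://orchestrate/commit"),
    ("deploying", "DeployDone"):   ("active",    "proc://ops/notify-activate"),
    ("active",    "Fault"):        ("degraded",  "proc://ops/remediate"),
    ("degraded",  "FaultCleared"): ("active",    "proc://ops/notify-restore"),
}

def handle_event(element: dict, event: str) -> str:
    """Look up (current state, event), activate the linked process, and set
    the successor state.  Because the process is named in the table rather
    than hard-coded, it could be any stateless, scalable component."""
    key = (element["state"], event)
    if key not in STATE_EVENT_TABLE:
        return f"event {event} ignored in state {element['state']}"
    next_state, process_uri = STATE_EVENT_TABLE[key]
    element["state"] = next_state
    return f"activated {process_uri}; new state {next_state}"

service = {"name": "vpn-service-42", "state": "ordered"}
print(handle_event(service, "Deploy"))
print(handle_event(service, "DeployDone"))
```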

What’s the data model passed?  The answer is the element in the data model that’s currently in focus, the one whose state/event table is being used.  Whatever data that intent-modeled structure has available is available to the process being activated, along with the event data itself.  It’s fairly easy to transform data elements to match process requirements, and so this kind of API would be very easy to define and use.

The process of posting events could be more complicated, but not necessarily.  My own ExperiaSphere work showed that for the TMF’s approach to data-coupling of events and processes to work, it was essential that there be a rule that events could only be exchanged among adjacent model elements—superior/subordinate elements, in other words.  This limits the need to make the entire model visible to everything, and also simplifies the way that events could be exchanged.  Presumably it’s easy to make a model element “see” its neighbors, and if each neighbor was identified then the posting of an event to it would be easy.

There is a complexity at the “bottom” of a model hierarchy, where the model element encloses not a substructure of elements but an implementation of a feature.  Real hardware/software events would have to be recognized at the level of the implementation, and the implementation of a primitive “bottom” element would then have to be able to generate an event upward to its superior element.  Only “bottom” elements enclosing actual implementations would have to worry about this kind of event translation or correlation.
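
Here's a small sketch of those two rules together, adjacency-limited event posting and bottom-element event translation, using invented element names:

```python
# Invented model elements illustrating adjacency-limited event posting and
# bottom-element event translation.
class ModelElement:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        if parent:
            parent.children.append(self)

    def post_event(self, event, source):
        # Enforce adjacency: accept events only from the parent or a child.
        if source is not self.parent and source not in self.children:
            raise ValueError("non-adjacent event rejected")
        print(f"{self.name} received {event} from {source.name}")
        if source in self.children and self.parent:
            self.parent.post_event(event, self)   # escalate one level up

class BottomElement(ModelElement):
    def resource_event(self, raw):
        # Translate a raw hardware/software event into a model event and
        # post it to the superior element only.
        self.parent.post_event(f"Fault({raw})", self)

service = ModelElement("service")
subnet  = ModelElement("access-subnet", parent=service)
impl    = BottomElement("vFirewall-implementation", parent=subnet)
impl.resource_event("vm-restart")
```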

If we had true model-driven service definition and lifecycle management, the only APIs we’d really need are those that generate an event, one to be passed into the model’s event-to-process orchestration to drive a change in state or inquire about status.  These APIs would be very simple, which means that they could be transformed easily too.  The barriers to customization and to the creation of useful services would fall, not because the APIs enabled the change but because they didn’t prevent what the fundamental architecture enabled.

Which is the point here.  With APIs as with so many topics, we’re seizing on a concept that’s simple rather than on a solution that’s actually useful.  We need to rethink the structure of network features, not the way they’re accessed.  Till we do that, APIs could be a doorway into nothing much useful.

What Exactly is “Cloud-Native” and How Do We Get There?

We are still hearing feedback from operators on the importance of “cloud-native” in things like NFV.  As an example, Fierce Telecom has run THIS piece on the topic.  I’m still concerned that we’re jumping onto a terminological bandwagon thinking it will have a tangible effect on things, and that’s not the case.  We need to look at the specific problems of NFV, and the way “cloud-native” might impact them.

As I’ve noted before, one foundational issue that NFV fell prey to from the first was focusing on the transformation of physical network functions (PNFs) into virtual network functions (VNFs).  The underlying presumption of that focus was that a network is built from a very static set of “network functions” that are currently embodied in specialized devices and should instead be hosted software instances.  This foundational point gives rise to the next generation of issues.

The first of those second-gen issues is that the fundamental nature of the network, as a collection of connected network functions, doesn’t get transformed by NFV.  There is a difference between a “router” and a “virtual router”, but the more you get fixated on the notion that the two will be used the same way, the less you get from the process of virtualization.  For example, a network of virtual routers hosted in fixed places and replaced by re-hosting them when something breaks, behaves a lot like a network of real routers.

The second of our issues is that the narrow focus on PNF-to-VNF limits the scope of impact of NFV.  The network functions are still used and managed the same way, which means that the only thing that NFV can really do is manage the efficiency of the PNF-to-VNF transformation; the rest stays the same.  Operations improvements are limited, and we’ve long since learned that it’s got to be opex efficiency and agility improvements that make a broad business case for NFV.

The PNF-to-VNF transformation is itself an issue.  If there exists a “network function” or NF that can live in multiple forms, physical and virtual, then software architects would say that you have to start your transformation architecture by defining NFs as a kind of super-intent-model, the implementation of which then has to be matched to the interfaces the NF specifies.  Onboarding is then the process of implementing the NF and meeting those interfaces, which is at least a specific task.  However, the NFV ISG didn’t do that.

Cloud-native advocates seem to be suggesting that their approach could resolve all these problems, but that assertion has its own fundamental flaw, which is that you can’t be just part cloud-native.

I could define and develop an implementation of a VNF that would be cloud-native, and at the same time would fall into every one of the issue traps I’ve just noted.  That’s because the problem isn’t just how a VNF is implemented, it’s how NFV works as a software system.  A cloud-native implementation of the wrong architecture is still wrong…and the fact it might be more efficiently wrong won’t make it right.  Let’s face a simple truth here.  There is no way of doing an effective cloud-native VNF within the current NFV architecture model.

Can we have a cloud-native VNF being controlled by a monolithic MANO or VNFM?  Can we grow apples on an orange tree?  Here again, it’s not a matter of saying that we’d employ cloud-native tactics to implement MANO or VNFM.  I submit that the biggest thing wrong with something like NFV’s MANO or VNFM isn’t how they are implemented, but that we think they exist as discrete elements at all.

A service should be represented by an abstract data model, defining a related collection of functional elements that correspond to network functions, whether they exist as PNFs today or are invented new to support emerging service opportunities.  This model, as the TMF has long described it, is used to associate service events with service processes.  What NFV wants to call “MANO” or “VNFM” isn’t a software element, it’s a composed event/process relationship set.  One service event is a “Deploy” order, for example.  That event activates the process of committing implementations to the abstract model NFs, which is “orchestration”.  If an error occurs, that error is an event that then activates a set of processes to correct it, which is “management”.  There is no MANO or VNFM, only a model-coordinated event-to-process-set relationship.  We compose management like we compose services.
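
As a rough sketch of "composing management like we compose services" (the model elements, events, and process names are all invented), the same event can bind to different process sets depending on where it lands in the model, so there's no fixed MANO or VNFM component at all:

```python
# Invented model elements, events, and process names; the point is that
# "orchestration" and "management" are whatever the model binds, per
# element, to each event.
service_model = {
    "vpn-core": {
        "Deploy": ["allocate-hosting", "instantiate-vrouter", "verify"],
        "Fault":  ["rehost-vrouter", "notify-ops"],
    },
    "vcpe-edge": {
        "Deploy": ["push-image-to-uCPE", "verify"],
        "Fault":  ["reboot-uCPE", "escalate-if-repeated"],
    },
}

def dispatch(model: dict, element: str, event: str) -> list:
    """Return the process set the model composes for this element/event
    pair; there is no fixed MANO or VNFM component behind it."""
    return model.get(element, {}).get(event, [])

print(dispatch(service_model, "vpn-core", "Deploy"))
print(dispatch(service_model, "vcpe-edge", "Fault"))
```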

This also represents my concerns about something like ONAP.  It was clear from the genesis of ONAP in AT&T’s ECOMP that it was, like NFV, based on the presumption of a set of connected and specific software processes, not on an event-to-process-via-model approach.  In fact, it didn’t really have a good or complete model structure.  About a year ago, I blogged that the ONAP people were promising to integrate data modeling, and I was concerned about the pace of that integration and the extent to which it could fundamentally shift a monolithic design to an event-driven one.  We are still not where ONAP said they wanted to be, and I’m more concerned than ever that they’re proving that not only is it very difficult to evolve to the event-to-process-via-model approach, it may be impossible.

Web giants like Facebook, Google, and Twitter have designed “cloud-native” applications, and they didn’t do it by taking monolithic software and somehow transforming it.  NFV, or ZTA, or 3GPP, aren’t going to get to cloud native any differently, if they really hope to get there.  All these players invented technologies, defined new architectures, to get massive scalability and agility of features.  That’s how network operators have to do it.  That’s what “cloud-native” really means.

I gently disagree with Telus’ Bryce Mitchell (quoted in the Fierce Telecom piece) that this is a matter of management mindset.  This is a matter of software architecture, and if we expect telecom managers to drive software design we have a very long and unhappy road ahead.

Are We Running Out of Ways to do Network Software?

At the recent Linux Foundation Open Networking Summit, operators had a lot to say about open-source, zero-touch automation, and NFV.  While it was clear (as Light Reading reported HERE) that operators remain very optimistic about open-source and open-system approaches to network evolution, it’s not all beer and roses.

One interesting thing that attendees of the event told me was that they’re finding the open-source landscape much less unified and helpful than they expected.  One comment was “I didn’t expect there to be a half-dozen different projects aimed at the same outcomes.”  The problem, say operators, is that many of the benefits of open-source, particularly interoperability of solutions and elements, are defeated by the multiplicity of platforms.

Despite this, most of the operators told me that they weren’t really in favor of having some over-arching body (3GPP-like) take over everything and meld a common vision.  There are several reasons for this, all of them good in my view.

The biggest of the reasons is that operators aren’t sure that any 3GPP-like body is capable of defining something that’s going to be software-centric.  It seems to most operators that what’s really needed is an admission that old-style standardization is pretty much over except perhaps as a guideline for hardware.  However, they are really uncomfortable about the alternatives.

The band-of-brothers framework of open-source seems to many operators to lend itself to dispersion of effort and disorder in architecture and approach.  However, operators who contacted me were literally split on whether that was bad or good.  Some believed that an initial dispersal of projects and approaches could be the only pathway to finding the best approach.  Let the projects vie for attention and the best one win.  Others say that this would take too long.  The first group counters that just deciding on what single approach to pursue would take even longer.  You get the picture.

While there’s a difference in how operators view the one-or-many-projects issue, there was pretty solid convergence on the fact that having many different projects because you’ve taken a single logical requirement and divided it arbitrarily into sub-projects is a bad idea.  NFV’s MANO (and other elements) were the subject of a lot of direct criticism in this regard.

I’ve never been a fan of the “contain your project to be sure you get it done” approach, when it’s at the expense of doing something broad enough to be a complete solution.  As I’ve said before, I spoke in public in the spring of 2013 to the NFV ISG meeting on that point.  Operators at the time were on the fence, and didn’t pursue the topic or change directions of the group.  Now, they tell me they’re sorry they didn’t do that.

NFV orchestration and management is based on a presumption that’s both basic and fundamentally limiting.  The goal of NFV (implicitly) is to create virtual network functions that represent the same feature set as physical network functions, meaning appliances/devices.  NFV presumed that the existing PNF management practices would manage the functionality and that NFV only had to take care of the “inside of the box” stuff.  MANO deploys VNFs that, once deployed, are managed the old way from the OSS/BSS/NMS side.  This is a problem, of course, because it means NFV itself can deliver no operational efficiency improvements.

The experience of the NFV ISG is one reason operators expressed concern about whether a “standards” or “specification” group of the old style could be trusted to do anything useful at all.  “Obviously, we didn’t as a body have the right approach here, and we’re still the same kind of people we were then,” said one operator.  This operator wondered whether the traditional mindset of element-network-service management (decades old) was still controlling everyone’s thinking.

This last point was particularly concerning to operators who are now looking at zero-touch automation.  Operators are concerned that the new ETSI ZTA group will be “another NFV ISG” in terms of how the project is structured and how long it takes.  Several were particularly concerned that the group seemed to be building on the NFV ISG work, which they felt was already in the wrong place.

The specific concerns of operators about automation and NFV center on what seems to them to be disorder in the area of orchestration.  If you divide the general problem of service lifecycle automation into little enclaves, each with its own orchestration and management framework, you face the problem of having different orchestration and management strategies everywhere, which operators see as inviting inefficiency and errors.  They’d like to see a unified approach, a single orchestrator that can manage everything.

This goal has led some operators to question the ONAP model for lifecycle orchestration and management.  ONAP seems to many to be limited in scope of application, and perhaps too willing to accommodate multiple layers of orchestration and management based on different technologies.  I noticed that operator views on ONAP after the conference seemed a bit more cautious than they’d been a couple months ago.

This is a tough one for me.  On the one hand, I think that multiple layers of semi-autonomous orchestration and management are inevitable because of the variety of implementations of network automation and cloud technology already in place.  The big benefit of the notion of intent modeling, in my view, is that these can be accommodated without interfering with management efficiency overall.  On the other hand, many of you know that I’ve advocated a single orchestration model all along because surely it’s easier to learn and use one model efficiently than to learn/use many.

It’s my impression that operators don’t really understand intent modeling, in large part because of the current tendency to make an intent-model story out of everything short of the proverbial sow’s ear.  This would reinforce operators’ own views that they may not have the right mindset for optimum participation in the task at hand.  Software experts are needed for software-centric networking, and operators’ software resources tend to be in the CIO OSS/BSS group, which isn’t exactly on the leading edge of distributed, cloud-centric, event-driven software design.

That, finally, raises what I think is the main point, also mentioned to me via email after the event.  “I guess we need to take more control of this process,” said an operator.  I guess they do.  I’ve said before that too many operators see open-source as meaning “someone else is doing it”.  There isn’t anyone else; if you expect your needs to be met, you have to go out there and ensure they are.  Hoping vendors will somehow step up on your behalf is inconsistent with the almost-universal view that vendors are out for their own interests.

How do operators do that?  ONAP does have many of the failings that have rendered NFV and MANO less than useful, and it was launched as an operator open-source project (as ECOMP, by AT&T).  The underlying problem is that the business software industry is overwhelmingly focused on transaction processing, and network operations is about event processing.  There are relatively few software experts who have the required focus, and as NFV showed (in its end-to-end model, published in the summer of 2013), there’s a tendency for us all to think about software as a collection of functions to which work is dispatched, rather than as a set of processes coordinated via a data model to handle events.  This fundamental difference in approach, if it’s not corrected early on, will fatally wound anything that’s done, no matter what the forum.

Creating an Optimum Platform for IoT Functions

There are a lot of axioms in the world, but in networking some of them are more like what we used to call “postulates” in my geometry days.  Postulates are statements presumed to be true, and it just might be that some of our cherished views fall more into that category.  That’s particularly true with IoT and event processing.

Let me start by saying that it’s an axiom that, all other things being equal, lower latency in event processing is better.  Latency lengthens the loop between a reported condition, meaning an event, and the arrival of the response at the appropriate control point.  Let that loop get long enough and you could do anything from alienating the user of the application to actually causing damage.  However, of course, all other things are rarely equal.  Networking and the cloud are trade-offs of various things against cost.

There’s another axiom, which is that when you have a problem with multiple contributing factors, the best course is to attack the factors that make the greatest contribution.  In IoT and event terms, that means that if there are many sources of incremental delay in a control loop, you should address the largest sources first.

Propagation delay, meaning the time it takes for an event to get to the place where it’s to be processed, is a source of latency.  As I’ve noted in other blogs, signals propagate in fiber at a rate of about 100,000 miles per second, which means they cross a hundred miles in a millisecond.  An autonomous car traveling at 60 mph would go 0.088 feet, or about an inch, in that time.  A hundred miles is probably a long way from a hypothetical edge, and a millisecond doesn’t sound like a lot to many people, including some who have commented on my blogs in the past.  On the surface, they could be right that propagation delay isn’t a big thing for IoT.  How about deeper than the surface?
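For those who like to check the arithmetic, here’s the back-of-the-envelope version in Python.  The 60 mph vehicle speed is my own assumption, chosen because it’s what the 0.088-foot figure implies.

# Back-of-the-envelope check of the numbers above.
FIBER_SPEED_MPS = 100_000          # miles per second, the round figure used above
distance_miles = 100

prop_delay_ms = distance_miles / FIBER_SPEED_MPS * 1000
print(prop_delay_ms)               # 1.0 millisecond for a hundred miles of fiber

car_speed_fps = 60 * 5280 / 3600   # 60 mph is 88 feet per second
feet_per_ms = car_speed_fps / 1000
print(feet_per_ms)                 # 0.088 feet, about an inch, per millisecond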

After discussions with some industrial control and event specialists, I’ve done some modeling, and the results are interesting.  They don’t say that edge computing isn’t justified, only that it may not be justified for the reasons we think it is.  That means there may be alternatives to edge computing for IoT that should be reviewed.

The standard mechanism discussed for IoT and event-handling tasks is the serverless, or “functional/lambda” model.  With this model, there’s a series of what we could call “function hosts” that have the software platform elements needed to host small stateless process elements.  Above these function hosts is an orchestration layer that uses event sources, policies, and resource availability to decide where a function will be run.  The function is then loaded and run, receiving the event.
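Here’s a rough Python sketch of that dispatch model, with no specific cloud provider’s API implied.  The host names, the placement policy, and the cold/warm distinction are all illustrative assumptions of mine.

# Illustrative sketch of the serverless/functional dispatch model described above.
import random

class FunctionHost:
    def __init__(self, name):
        self.name = name
        self.loaded = set()          # function images already resident on this host

    def load_and_run(self, fn_name, event):
        cold = fn_name not in self.loaded
        self.loaded.add(fn_name)
        # Cold starts are where the load-and-run penalty comes from.
        return f"{fn_name} ran on {self.name} ({'cold' if cold else 'warm'} start) for {event}"

hosts = [FunctionHost("edge-1"), FunctionHost("metro-1")]

def place_and_run(event, fn_name):
    # The placement choice stands in for event-source, policy, and resource checks.
    host = random.choice(hosts)
    return host.load_and_run(fn_name, event)

print(place_and_run({"sensor": "door-17", "state": "open"}, "intrusion_check"))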

My contacts in the event and IoT world say that, in their experience, the process of loading and running a function takes a lot more time than a millisecond.  In fact, most report load-and-run times ranging from 100 to 200 milliseconds, and many even longer.  Obviously that dwarfs propagation delay as a source of end-to-end delay.

The obvious way to eliminate the functional computing delay is to eliminate the serverless concept.  If we had resident processes, running perhaps in containers, that were handling the events, there’d be no delay in loading and running them.  The problem with that is that it’s likely that events are a relatively uncommon thing in the IoT world, making it expensive to have processes sitting around waiting for them to happen.

An enhancement to this approach would be to customize our function-hosting platforms, perhaps providing a very large amount of RAM to hold a lot of processes-in-waiting and also eliminating a lot of the normal overhead that even containers have.  That’s workable because functions are a very lightweight form of logic, not requiring a lot of middleware or operating system services.  Most of my contacts think this approach could cut the load-and-run delay to less than 10 milliseconds.  The efficacy of this process-in-waiting model depends on how many different processes are involved.  This is perhaps one place where edge computing could come in.
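A toy comparison, using the rough figures quoted above plus my own assumptions about distance, shows why dispatch latency matters more than fiber miles.

# Toy control-loop comparison: 150 ms is within the reported cold-start range,
# 10 ms is the estimate for a customized process-in-waiting platform, and the
# 500-mile distance is my own assumption.
COLD_LOAD_AND_RUN_MS = 150       # serverless cold start (reported range 100-200 ms)
WARM_DISPATCH_MS = 10            # process-in-waiting on a stripped-down host
PROPAGATION_MS_PER_100_MILES = 1

def control_loop_ms(distance_miles, dispatch_ms):
    prop = distance_miles / 100 * PROPAGATION_MS_PER_100_MILES
    return 2 * prop + dispatch_ms     # event in, response back, plus dispatch time

print(control_loop_ms(500, COLD_LOAD_AND_RUN_MS))   # 160.0 ms: cold start dominates
print(control_loop_ms(500, WARM_DISPATCH_MS))       # 20.0 ms: propagation now matters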

It’s reasonable to assume that different kinds of events would be present at different frequencies depending on the specific location being monitored.  Putting it another way, it’s reasonable that in many cases there might be only a small number of “processes-in-waiting” for a given area, based on the specific applications of IoT there.  It’s also reasonable to assume that the most time-critical stuff could be hosted via processes-in-waiting, with deeper processing and insight whose timing was less critical hosted deeper and more conventionally.  Thus, process-in-waiting combined with edge computing could make sense because it reduces load-and-run latency, not propagation delay.

There’s some evidence that load-and-run latency is worse when you go to “deeper” and larger data centers.  A few contacts who have tried to test that presumption say larger resource pools appear to take longer to orchestrate, increasing latency to as much as twice that of a “local” small, or edge, platform.  There’s not much good data on this front, though.

My modeling says that if you were to host functions in a process-in-waiting model with a customized OS platform, you could fit one hundred times as many in a server as you could “normal” containers, and perhaps forty times as many as with a streamlined, function-optimized container, all without assuming you’d have to load-and-run for each event.  The number of processes-in-waiting could be increased by adding more RAM, and beyond the RAM limit you could reduce latency with high-speed solid-state storage.
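To illustrate how those ratios might play out, here’s a density calculation in Python.  The server size and per-process memory footprints are assumptions I’ve picked to be consistent with the 100x and 40x ratios from my modeling, not measured values.

# Assumed footprints, chosen to be consistent with the density ratios above.
SERVER_RAM_GB = 256
ORDINARY_CONTAINER_MB = 400        # conventional container with full middleware stack
LEAN_CONTAINER_MB = 160            # streamlined, function-optimized container
PROCESS_IN_WAITING_MB = 4          # stripped-down resident function

ram_mb = SERVER_RAM_GB * 1024
for label, footprint in [("ordinary container", ORDINARY_CONTAINER_MB),
                         ("lean container", LEAN_CONTAINER_MB),
                         ("process-in-waiting", PROCESS_IN_WAITING_MB)]:
    print(label, ram_mb // footprint)   # 655, 1638, 65536 resident processes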

What all this suggests is that we should examine exactly what event processing requires in terms of function loads and runs, and customize function hosting according to what we find.  The challenge that poses for everyone is obvious; we run the risk of specializing the resource pool to the point where there’s a risk that we can’t easily switch resources from broad application or feature hosting missions to event missions.  It may be possible to resolve that issue by looking at function hosting as a specialized container mission.

I think that the Amazon and Microsoft models, which orchestrate function operations on top of a separate hosting model, could probably be applied to a container system as easily as it could to a specialized function-hosting system.  To me, that seems the best path forward.