More Action in the SD-WAN Space

The SD-WAN space has been percolating for years, and there are some recent signs that it may finally be sticking its head out beyond the old extend-the-VPN-to-small-sites mission. One thing that seems to stand out in two announcements is a managed services connection. As always, though, there’s no shortage of competitive intrigue in the mix, and so the outcome of all of this is still a bit murky.

SD-WAN is the most recent virtual-networking strategy to emerge, but not the first. From the early days of virtualization and the cloud, virtual networks were the go-to strategy for sharing network and hosting infrastructure among multiple tenants, including organizations within a company whose information technology had to be kept separate for security/compliance reasons. Virtual networks create, in some way, an overlay on top of ordinary IP networks, and this overlay strategy allows for connectivity management at a highly refined level, without impacting or depending on features of the underlying IP transport. Connectivity is separated from transport, in short.

Enterprises have so far embraced virtual networking for more mundane reasons. Some have adopted it in their data centers to improve security, and most recently many have used SD-WAN to extend corporate VPNs to locations where MPLS is either not available or not economical. The refined connectivity control that virtual networking and SD-WANs could offer wasn’t the user priority, which is why most SD-WAN products did little, and still do little, to turn connectivity control into added value.

Not all SD-WAN products have been so myopic. When Juniper acquired 128 Technology, they acquired what I’ve said from the first was the best overall SD-WAN and virtual-networking strategy available. The biggest selling point for 128 Technology was and is its session-aware handling of traffic, which means that it can identify user-to-application relationships and assign traffic priorities and even connection permission based on that. Integration of 128T’s SD-WAN with Juniper’s AI platform (Marvis and Mist) provides improved operations automation, and the combination of the two offers both network operators and managed service providers (CSPs and MSPs) a strong basis for a managed SD-WAN offering.

Managed services are quietly getting hot, and especially managed SD-WAN. Part of the reason is that enterprises are generally having issues with acquiring and maintaining the skilled network operations professionals they need, given competition from cloud and network providers and equipment vendors. Another part is that in the small-site locations where SD-WAN is the only VPN option, local technical skills are likely very limited, and a managed service is the only realistic solution. My data shows that CSP and MSP managed SD-WAN services are the largest source of new SD-WAN sites.

Then there’s the whole SASE thing. When you start talking about a service edge device and security in the same breath, it’s hard not to think of SD-WAN and what security it might bring to the table. The session-awareness approach can offer zero-touch security by letting users define what sessions are allowed, and barring everything else. Prioritization for QoE is a natural feature of SASE too.

Finally, the cloud. The largest source of cloud applications for enterprises is the front-ending of core business applications in the data center with friendly web-centric portals. This is how most companies have provided for user access to their data, including online ordering, and it’s also increasingly how companies want to support worker access, particularly given that many workers now want to use smartphones, tablets, Chromebooks, and other stuff in their work-from-home mode of operation. If the cloud is where these new application on-ramps are located, then workers need to get to the cloud rather than directly access the data center. Even before the Juniper acquisition, 128 Technology had a strong hybrid and multi-cloud story, as good as or better than the rest of the players.

If all of this stuff is important, and if Juniper has acquired an SD-WAN player that can do all of it and well, then it would be truly surprising if there weren’t competitive counterpunches in process, and we had two of them recently.

Extreme Networks is acquiring the Ipanema SD-WAN division of Infovista, with the apparent goal of positioning its network offerings to compete in the managed SD-WAN space. Extreme already has its ExtremeCloud managed service portfolio, but the Ipanema SD-WAN has “application intelligence”, allowing it to make decisions on QoE based on the specific importance of applications using the network. It’s also able to support dynamic routing in hybrid and multi-cloud applications. Finally, it pushes a “cloud-native” implementation. I think it’s clear that Extreme will be enhancing the application-intelligence features to extend their utility in security and access control, moving them closer to feature parity with Juniper’s 128 Technology and Marvis/Mist capability.

Cisco sees the handwriting on the wall too, but they may be speaking a slightly different language. Recall that Cisco has its own Cisco Plus network-as-a-service and expensing-versus-capitalizing offering. Cisco also has the most strategic influence over enterprise network buyers of any vendor; in fact, they have almost twice the influence of Juniper and five times that of Extreme. A decision to push SD-WAN features and enhancements through CSP/MSP channels would undermine their own as-a-service plans and reduce the impact of their enterprise account control. I think their positioning of the ThousandEyes deal shows that tension.

The SDxCentral story characterizes this as “Cisco’s WAN-on-Demand Strategy”, which sure sounds like an as-a-service strategy. Is it a coincidence that this is also how Cisco described Cisco Plus…as a NaaS? What the Cisco strategy with ThousandEyes does is improve visibility across all clouds and networks. The WAN-on-Demand stuff is really a Cisco initiative involving a bunch of cloud relationships for SD-WAN routing, which ThousandEyes can provide visibility into and which Cisco’s management products can then act on. It’s not directly comparable to the Juniper or Extreme session/application awareness stuff.

WAN-on-demand does raise the question of whether Cisco is subducting IP and IP networks, promoting an overlay SD-WAN approach to respond to the fact that the cloud is becoming the application front-end of choice, and thus the thing that customers, partners, and employees have to connect with. By promoting cloud interconnect, Cisco is promoting the use of cloud provider backbones, and if all IP networks are just the plumbing underneath a Cisco NaaS vision, then a Cisco managed service strategy for SD-WAN could become the go-to managed-service solution, or so Cisco hopes.

Cisco could obviously make their offerings available to CSPs and MSPs, promoting a managed-service and SD-WAN vision competitive with other vendors. Or they may have decided they want to get into the space on their own, and are starting to position the “NaaS” term as a placeholder for their own strategy, a way of avoiding saying outright that they’re going to offer managed SD-WAN and other network services, a statement that would surely raise the risk of channel conflict with CSPs and MSPs. We’ll have to watch how Cisco positions itself over the next few months, because service provider fall planning cycles are only roughly a month away.

How Can We Data-Model Commercial Terms, Settlements, and Optimization?

In the two earlier blogs in this series on as-a-service and runtime modeling, I looked at how the lifecycle modeling of services could facilitate the handling of service events and support lifecycle automation. I then looked at expanding the modeling to include the orchestration of event-driven applications, and at how the two models could be integrated to improve application and service utility.

Paying for services (like paying for anything) is a higher priority for the seller than for the buyer, but buyers are interested in accounting for what they’ve been asked to pay for. Sellers are interested in cost accounting, profit management, and integrating third-party elements into their own service offerings. There’s obviously a commercial dimension to be considered here.

Let’s start by recapping (quickly) the approach to unified runtime and lifecycle modeling I advocated in the last two blogs. I suggested that the runtime model and the lifecycle model be connected by creating a parallel lifecycle element that was linked to each runtime element by an event exchange that was confined to a change in the in-service state. The runtime element and the lifecycle element would each be bound up in their own models as well, so there would be two model structures, bound by in-service event exchanges at the level of the runtime elements. If an event-driven application had a dozen component elements, then there would be a dozen parallel lifecycle model elements.
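To make that parallel-structure point concrete, here’s a minimal sketch (the class and state names are mine alone, not drawn from any product or standard) of a runtime element and its shadow lifecycle element, bound only by that in-service event exchange:

```python
# Minimal sketch: each runtime element has a parallel lifecycle element, and
# the only event the two exchange is a change in the "in-service" state.

from dataclasses import dataclass

@dataclass
class LifecycleElement:
    name: str
    state: str = "ordered"           # ordered, deploying, in-service, failed...
    peer: "RuntimeElement" = None    # the runtime element this one shadows

    def set_state(self, new_state: str):
        entering = new_state == "in-service"
        leaving = self.state == "in-service" and not entering
        self.state = new_state
        if self.peer and (entering or leaving):
            # the single cross-model event: in-service or not
            self.peer.on_lifecycle_event("in-service" if entering else "out-of-service")

@dataclass
class RuntimeElement:
    name: str
    available: bool = False
    lifecycle: LifecycleElement = None

    def on_lifecycle_event(self, event: str):
        self.available = (event == "in-service")

    def report_fault(self):
        # runtime logic can push the same single event the other way
        if self.lifecycle:
            self.lifecycle.set_state("failed")

# one parallel lifecycle element per runtime component
runtime = RuntimeElement("order-validator")
lifecycle = LifecycleElement("order-validator-lifecycle", peer=runtime)
runtime.lifecycle = lifecycle
lifecycle.set_state("in-service")
print(runtime.available)   # True
```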

Commercial terms, in the TMF approach, are assigned to Customer-Facing Services (CFSs), which to me implies that a service element is always represented by a pairing of a CFS and a Resource-Facing Service or RFS, because obviously it’s the way a service is bound to resources that creates costs. This is logical in many ways, but to me it encourages rigid relationships between the CFS and RFS, and that could have negative consequences when a broken resource has to be replaced.

When I did my first, TMF-linked, ExperiaSphere project, I took a slightly different approach and suggested that the “resource domain” would advertise “behaviors” that were then bound at service setup to the “service domain”. This was to keep the service models from becoming specific to resources, something generally important but critical if some service components were to be acquired from third parties. Commercial terms, in my approach, were associated with the behaviors, and the dynamic binding could then consider commercial terms in selecting a resource set to fulfill service requirements.

If we could assume that commercial terms would always be associated with the committing of a service component to its corresponding service, we could make commercial terms management a part of lifecycle management, and it could be applied when the service lifecycle enters “in-service”. That would correspond to the way both the TMF and my first-generation ExperiaSphere worked, but it’s not a suitable solution for service elements that are usage-priced.

Usage pricing requires integration with the runtime event/information flows in order to determine usage. It would be highly undesirable to have usage information collected outside those flows, for performance and resource efficiency reasons, so we would have to assume that usage data would be collected by the runtime service element itself, or that it could be collected by counting events/messages directed to the element, as part of the workflow.

It seems to me that if it’s important to offer support for commercial terms collection and reporting, it would be better to include the collection of the usage data in the “middleware” associated with event/message flows, rather than trying to write it into each of the processes in the flow. The latter allows for too much variation in approach, and redundant logic. The good news here, I think, is that we have a technology available to do the message statistics-gathering, and it’s already in use in related areas. It’s the Envoy sidecar.

In service mesh applications (including the top two, Istio and Linkerd) a sidecar is used to represent a process element in the mesh. Since the sidecar sees the traffic, it can count it, and in service mesh applications the results can be obtained from the service mesh element (Istio, Linkerd) if there is one. If there’s no current sidecar, then I think there’s ample reason to employ Envoy where it’s necessary to monitor usage and cost. We could obtain the data, and integrate it with overall commercial-terms management, using what I called in 2013 “derived operations”.

What derived operations says is that you obtain the status of something not by polling it directly, but by reading it from a database that’s populated by polling. This principle was considered briefly by the IETF (i2aex, or “infrastructure to application information exposure”) as a means of providing application coupling to network MIBs, rather than having many applications polling the same MIB. If we assume that we have Envoy available for all our real-time processes, then we could say that either the lifecycle management or commercial terms data model included instructions to run a timer and collect the data (from Envoy or a service mesh) at a predetermined interval and use it to populate the data model.

If we assume that we have the “rate” price for a service element stored in either the commercial terms or lifecycle management data model, and the usage data is also stored there, we have all the information we need to calculate current costs while the service/application is active. If we are setting up the application, and if we can make selections of resources based on cost, we can use the “rate” prices to compare costs and pick what we like. Any references to the cost/price/usage would be made to the data model and not directly to the Envoy sidecar, service mesh, etc. That would decouple this from the specific implementation.
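As an illustration of how that might look in practice, here’s a rough sketch of derived-operations polling against the Envoy admin endpoint. The admin address, the counter name, the “rate”, and the polling interval are all assumptions made up for the example; only the /stats?format=json admin interface itself is standard Envoy.

```python
# Rough sketch of "derived operations" for a usage-priced element: poll the
# Envoy admin endpoint on a timer, store the counter in the element's data
# model, and derive cost from the stored rate. Consumers read the model,
# never Envoy directly.

import json, time, urllib.request

ENVOY_ADMIN = "http://127.0.0.1:9901"                    # assumed admin address
USAGE_STAT = "http.ingress_http.downstream_rq_total"    # assumed counter name

element_model = {
    "name": "payment-service",
    "commercial": {"rate_per_request": 0.0001, "usage": 0, "accrued_cost": 0.0},
}

def poll_usage(model: dict):
    """Populate the data model from Envoy's stats, per derived operations."""
    url = f"{ENVOY_ADMIN}/stats?format=json&filter={USAGE_STAT}"
    with urllib.request.urlopen(url, timeout=2) as resp:
        stats = json.load(resp)["stats"]
    total = next((s["value"] for s in stats if s["name"] == USAGE_STAT), 0)
    terms = model["commercial"]
    terms["usage"] = total
    terms["accrued_cost"] = total * terms["rate_per_request"]

for _ in range(3):                      # stand-in for the model-driven timer
    poll_usage(element_model)
    print(element_model["commercial"])
    time.sleep(60)
```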

The question is whether you need to have a separate commercial model or if the lifecycle model could serve both purposes. I’m of the view that adding another model dimension would be justified only if there was a clear benefit to be had, and I’m unable to see one so far. I’d determined in ExperiaSphere that the state/event tables should include a timer to be started before the indicated process was run, and that the timer event should then be activated if the timer expired. This also means that there should be a “stop-timer” indicator in the table, so that suitable events could stop the timer. That allows the timer to be used to activate periodic statistical polling and store the result in the data model, so there’s already a timer available to exploit for polling.
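A state/event table row with those timer additions might look something like this; the field names are illustrative, not taken from ExperiaSphere or any product:

```python
# Sketch of an "in-service" state/event table with timer support:
# each entry names the process to run, a timer to start (seconds), and
# whether the event should stop a running timer.

IN_SERVICE_TABLE = {
    # (state, event)          : (process,                   start_timer, stop_timer)
    ("in-service", "enter"):    ("record_commercial_terms",  60,          False),
    ("in-service", "timeout"):  ("poll_usage_statistics",    60,          False),  # re-arm
    ("in-service", "fault"):    ("initiate_remediation",     None,        True),   # stop polling
    ("in-service", "teardown"): ("final_usage_reconcile",    None,        True),
}
```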

Given all of this, it’s my view that collecting information associated with the commercial terms of a service/element, and creating price/cost data for either selection of components or for generation of bills or reconciliation, is best considered a function of the lifecycle model. Placed there, it matches the needs of those services that are not usage priced without any special accommodation; cost data is simply part of the model. For usage-priced services, the lifecycle model’s in-service state/event processes can include a timer that would, when it expired, indicate that the Envoy/service-mesh package should be polled for statistics, and they’d then also be stored in the lifecycle model.

All of this appears to be workable, but one big question is whether it’s valuable too. The service mesh technology developing for the cloud, combined with various deployment orchestration tools like Kubernetes, has the ability to manage deployment and lifecycle stages, including scaling, without a model. Edge computing might be supportable using just these tools, or it might not. It seems to me that it depends on what “edge computing” turns out to mean, and how 5G hosting might influence it. If edge services are created by composing “service atoms”, potentially from different sources and in different locations, then homogeneous development practices to exploit service mesh/orchestration tools will be more difficult to apply across the whole service, and the composing process will have to be modeled. If connection services and edge application services are combined, the same point applies. We may not need a combined modeling strategy for networks and hosting everywhere, for everything, but we’re likely to need it in some places, and for some things. That makes it important to know how it could be done.

Where Could “Metro Transformation” Take Us?

We tend to think of transformation as a proactive process, meaning that some technology shift has the effect of transforming things. It’s a nice picture for the industry, and it’s easy to understand, but the biggest transformation in networking may be happening a different way, and we may be missing it.

From their earliest days, networks have been focused on traffic aggregation for efficiency. The “access” network feeds the “metro” network, which feeds the “core” network. This structure makes a lot of sense when you assume that network traffic is created by the sum of exchanges between network users, because it ensures connectivity and a measure of network efficiency at the same time.

The problem now is that we have multiple forces that are changing what networks are really being asked to do. One force is the explosive growth in video content, growth that’s surely going to not only continue but likely accelerate as a result of greater demand for streaming video. Another is the advent of 5G, which relies more on hosted features/functions than prior generations of mobile technology, and 5G is a potential driver for our last force, edge computing. To relate these forces of change to the changes they’re forcing, we need to look at each.

Content delivery changes network dynamics because the majority of content traffic is focused on a relatively limited repertoire of material, and because the traffic levels and QoE requirements associated with video content in particular challenge effectiveness over long-haul pathways. Content delivery networks (CDNs) have for decades provided local hosting of content to enhance performance. A CDN is actually less a “network” than a hosting environment, and “local” will generally mean “within a major metro area” for best performance/economy tradeoffs.

5G, and feature hosting in general, changes the network dynamic too. In the past, “features” of networks were the result of cooperative device behavior. With mobile networks, the “devices” tended to be concentrated in the major metro areas because mobile traffic largely originates there, and the features those devices represent relate to the users of the network and the management of their mobility, which are largely metro functions. 5G also broadens the model of feature hosting, replacing the presumption of COTS servers as the host with a wider variety of devices, some placed where singular devices would be the only economical solution. A “pool” of resources then spans both metro and access networks.

Then there’s edge computing. As a model for hosting, edge computing and 5G’s notion of a pool of resources would coincide, with edge computing embracing even the inclusion of both privately owned resources and “cloud” resources in the same pool, something that 5G operator relationships with public cloud providers also encourages. What’s interesting about edge computing is that both the CDN mission and the 5G/network-feature-hosting missions are potentially subsets of edge computing, meaning that the edge is a unifier in at least some sense.

CDNs have largely insulated the IP core from massive traffic growth associated with content delivery. 5G and edge computing seem to insulate the core from feature-related, cloud-related, future missions as well. All those things now can be expected to live primarily within the metro area, and the “metro network” then becomes much more important, to the point where it’s fair to say that the future of networking overall is the future of metro networking.

The biggest piece of that metro-network future is that the metro becomes a giant virtual data center, meaning that the primary purpose of the metro network will be to connect the pooled resources associated with the broad and unified edge mission. That means that some organized and standardized form of data center networking will be the foundation of metro networking, and that data center interconnect (DCI) will focus on making DCI connections as low-latency and high-capacity as possible. There will obviously be “layers” to the virtual-data-center model, meaning that places with real server farms will be the best-connected, but other outlying locations will also be connected with enough capacity to ensure there’s no significant risk of congestion or added latency.

Operationally speaking, this demands a common model for lifecycle management, since any significant variability at the operations practices level would raise the risk of human error and delays in remediation that would, because of the growing focus on metro-hosted features and content, be a major service problem. Finally, there’s a question as to whether there needs to be a framework for building edge applications at the runtime level. Orchestration of independent components of software, particularly for event-driven applications, is a significant technology challenge, and how it’s done can reasonably be expected to impact the network requirements within the metro zone.

Competition is also certain to impact how metro networking evolves. On one hand, the metro is (as I’ve already noted) a giant virtual data center. The value of metro, and the whole reason why metro emerges as the critical opportunity point in all of networking, lies in what’s hosted rather than in how it’s connected. That would create a hosting bias to metro, and a data-center-interconnect bias to metro network planning. If we can build a data center network, could we not extend it via DCI, creating a giant fabric for metro? On the other hand, metro is also the inside of the access network, the major likely point of attack for hackers, and a place where Level 3 practices would surely scale better than Level 2. As a result, we could see a switch model of metro, a router model of metro, and perhaps some other models too.

One interesting possibility is the emergence of SDN as a unifying approach to metro. SDN is already more likely to be found in the data center than anywhere else, and SDN control of forwarding paths eliminates a lot of the issues of large L2 networks. Google, with Andromeda, has also demonstrated that SDN can be used “inside” a BGP emulator edge layer to create an IP core network, so it follows that SDN could be used to build a connectivity layer that could be purposed toward L2 or L3, or both.
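To show the kind of centralized path control SDN could bring to a metro fabric, here’s a toy sketch that computes a latency-weighted path across an invented DCI topology (using the networkx library) and prints the forwarding entries a controller would install; the topology and site names are purely illustrative:

```python
# Toy sketch of centralized (SDN-style) path computation over a metro DCI
# topology. A real controller would program L2 or L3 devices; here we just
# print the resulting forwarding entries.

import networkx as nx

metro = nx.Graph()
# (site A, site B, link latency in ms) for the illustrative metro DCI links
for a, b, ms in [("edge-pod-1", "metro-core", 1.0),
                 ("edge-pod-2", "metro-core", 1.2),
                 ("edge-pod-1", "edge-pod-2", 2.5),
                 ("metro-core", "cdn-cache", 0.5)]:
    metro.add_edge(a, b, weight=ms)

path = nx.shortest_path(metro, "edge-pod-1", "cdn-cache", weight="weight")

for hop, nxt in zip(path, path[1:]):
    print(f"{hop}: forward traffic for cdn-cache via {nxt}")
```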

Another possibility is that metro could be built out of some kind of extension to the “disaggregated” router/switch model that’s been popularized by DriveNets but is also supported by vendors like RtBrick. Obviously, a disaggregated virtual device is created by uniting multiple physical devices, and it might be possible to adapt that approach to the creation of metro-virtual devices united from multiple physical-device clusters.

The biggest change in metro, of course, is the fusion of network and hosting, which is what’s behind the “metro-as-a-giant-data-center” theme. That fusion not only changes the requirements and options for network-building, it changes the operational goals as well. It’s obvious that somebody who can’t watch video isn’t very interested in whether the CDN caching process is broken, whether the URL is redirecting wrong, or whether the network connection is bad. Edge computing, 5G feature hosting, IoT, and other stuff that will settle in the metro will have similar disregard for the exact problem source, and a big focus on optimizing the remediation. We may be creeping up on something totally new, a notion of “MetroOps”, and if we are, that notion could percolate into the cloud and the data center too.

Extending as-a-Service Modeling to Edge Event-Driven Applications

In the first part of this series, we looked at the possibility of modeling “as-a-service” offerings using a data model, with the goal of deciding whether a common approach to modeling all manner of services could be created. That could facilitate the development of a generalized way of handling edge computing applications, both in terms of lifecycle management and in terms of presenting services to users. It’s the need to represent the service in runtime terms, not just in lifecycle terms, that we’ll now explore.

We’ll kick this off by returning to software-as-a-service. SaaS has to present a “runtime” interface/API for access. The resource-oriented aaS offerings could be viewed as being related to lifecycle management, and what gets run on a resource-as-a-service offering of any sort is then responsible for making its runtime visible. SaaS has to present the “service” in the way that an application would present it, as a runtime interface. This is interesting and important because many of the 5G applications of edge computing would relate to lifecycle management, while edge computing overall is likely driven by IoT applications where models of an application/service would be modeling runtime execution.

Most applications run in or via an as-a-service offering are represented by an IP address and an API. If the application is loaded by the user onto IaaS or PaaS or onto a container service, then the user’s loading of the application creates the API. In SaaS, the API is provided as the on-ramp by the cloud provider. In either case, you can see that there is an “administrative” API that represents the platform for non-SaaS, and a runtime API that represents the application or service.

One complication to this can be seen in the “serverless” or function-as-a-service applications. These are almost always event-driven, and because the functions are stateless, there has to be a means of state control provided, which is another way of saying that event flows have to be orchestrated. In AWS Lambda, this is usually done via Step Functions, but as I’ve been saying in my blogs on event-driven systems, the general approach would be to use finite- or hierarchical-state-machine design. That same design approach was used in my ExperiaSphere project to manage events associated with service lifecycle automation.
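For readers who want to see the shape of that state-control layer, here’s a toy finite-state machine orchestrating stateless event handlers. It’s not Step Functions and not ExperiaSphere code, just an illustration of events, a table, and stateless processes:

```python
# Toy FSM: the handler functions stay stateless; state lives only in the
# table lookup and the single "state" variable the orchestrator carries.

def validate(evt): print("validating order", evt["id"])
def fulfill(evt):  print("fulfilling order", evt["id"])
def reject(evt):   print("rejecting order", evt["id"])

# (current state, event type) -> (handler, next state)
FSM = {
    ("new",        "order"): (validate, "validating"),
    ("validating", "ok"):    (fulfill,  "complete"),
    ("validating", "error"): (reject,   "failed"),
}

def dispatch(state: str, event: dict) -> str:
    handler, next_state = FSM[(state, event["type"])]
    handler(event)
    return next_state

state = "new"
state = dispatch(state, {"type": "order", "id": 42})
state = dispatch(state, {"type": "ok", "id": 42})
print(state)   # complete
```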

Given that we can use FSM/HSM for lifecycle automation and for the orchestration of event-driven applications, wouldn’t it be nice if we could somehow tie the processes together? There would seem to be three general ways that could be done.

The first way would be to simply extend our two-API model for SaaS, and say that the administrative API represents the exposed lifecycle automation HSM, and the service API the service HSM. We have two independent models, the juncture of which would be the state of the service from the lifecycle HSM reflected into the service HSM. That means that if the lifecycle HSM says that the service is in the “run” state, then the service HSM is available.

The second approach would be to say that there is a lifecycle HSM linked to each element of the service, each individual component. We’d have a service HSM whose elements were the actual software components being orchestrated, and each of those elements would have its own lifecycle HSM. Those HSMs could still be “reflected” upward to the administrative API so you’d have a service lifecycle view available. This would make the lifecycle state of each component available as an event to that component, and also let component logic conditions generate a lifecycle event, so that logic faults not related to the hosting/running of the components could be reflected in lifecycle automation.

The final approach would be to fully integrate the two HSM sets, so that a single HSM contained both service event flow orchestration events and lifecycle events. The FSM/HSM tables would be integrated, which means that either lifecycle automation or service orchestration could easily influence the other, which might be a benefit. The problem is that if this is to be in any way different from the second approach above, the tables would have to be tightly coupled between lifecycle and service, which would mean there would be a risk of having a “brittle” relationship, one that might require reconfiguration of FSM/HSM process identities and even events if there were a change in how the components of an event-driven service deployed.

Selecting an option here starts easy and then gets complicated (you’re probably not surprised!). The easy part is dismissing the third option as adding more complexity than advantages. The harder part is deciding between the first two, and I propose that we assume that the first approach is simpler and more consistent with current practices, which wouldn’t mix runtime and lifecycle processes in any way. We should assume option one, then, unless we can define a good reason for option two.

The difference between our remaining options is the coupling between runtime behavior and lifecycle behavior, which means coupling between “logic conditions” detected by the actual application or service software and lifecycle decisions. Are there situations where such coupling is likely justified?

One such situation would be where load balancing and event steering decisions are made at runtime. Service meshes, load balancers, and so forth are all expected to act to optimize event pathways and component selection among available instances, including scaling. Those functions are also often part of lifecycle processes, where program logic doesn’t include the capabilities or where it’s more logical and efficient to view those things as arising from changes in resource behavior, visible to lifecycle processes.

This seems to be a valid use case for option two, but the idea would work only if the service mesh, API broker, load balancer, or whatever, had the ability to generate standard events into lifecycle management, or vice versa. You could argue that things like service meshes or load balancers should support event exchange with lifecycle processes because they’re a middleware layer that touches resource and application optimization, and it’s hard to separate that from responding to resource conditions that impact QoE or an SLA.

That point is likely to be more an argument for integration between lifecycle management and service-level meshing and load-balancing, than against our second option. One could make a very solid argument for requiring that any form of event communications or scalability needs to have some resource-level awareness. That doesn’t mean what I’ve characterized as “primitive” event awareness, because specific resource links of any sort create a brittle implementation. It might mean that we need to have a set of “standard” lifecycle events, and even lifecycle states, to allow lifecycle management to be integrated with service-layer behaviors.

That’s what I found with ExperiaSphere; it was possible to define both standard service/application element states and events outside the “ControlTalker” elements that actually controlled resources. Since those were defined in the resource layer anyway, they weren’t part of modeling lifecycle automation, only the way it mapped to specific management APIs. I think the case can be made for at least a form of option two, meaning a “connection” between service and lifecycle models at the service/application element level. The best approach seems to be borrowed from option one, though; the service layer can both report and receive an event that signals transition into or out of the “running” state, meaning an SLA violation or not. The refined state/event structure of both the service and lifecycle models is hidden from the other, because they don’t need to know.
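A sketch of what that small shared vocabulary might look like, purely as an illustration of the idea of standard states and events rather than a proposal for specific names:

```python
# Illustrative "standard" lifecycle states and events: the minimal shared
# vocabulary that would let a service mesh or load balancer talk to lifecycle
# management without binding to a specific implementation.

from enum import Enum

class LifecycleState(Enum):
    ORDERED = "ordered"
    DEPLOYING = "deploying"
    RUNNING = "running"        # the only state the service layer needs to see
    DEGRADED = "degraded"      # SLA at risk
    FAILED = "failed"
    RETIRED = "retired"

class LifecycleEvent(Enum):
    ENTER_RUNNING = "enter-running"    # reported down to the service layer
    LEAVE_RUNNING = "leave-running"
    SLA_VIOLATION = "sla-violation"    # reported up from the service layer
    SCALE_REQUEST = "scale-request"
```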

In this structure, though, there has to be some dependable way of relating the two models, which clearly has to be at the service/application element level. These would be the lowest level of the runtime service, the components that make up the event flows. Logically, they could also be the lowest service-layer model elements, and so there would be both a service logic and a lifecycle model for these, linked to allow for the exchange of events described above. Service logic modeling of the event flows wouldn’t require any hierarchy; it’s describing flows of real events. The lifecycle model could bring these service-layer bottom elements back through a hierarchy that could represent, for example, functional grouping (access, etc.), then administrative ownership (Operator A, B, etc.), upward to the service.

If we thought of this graphically, we would see a box representing the components of a real-time, event-driven application. The component, with its own internal state/event process, would have a parallel link to the lifecycle model, which would have lifecycle state/event processes. This same approach could be used to provide a “commercial” connection, for billing and journaling. That’s what we’ll talk about in the last blog in this series.

Extending Data-Modeled Services to Run-Time: Lessons from aaS Part 1

Abstraction in any form requires a form of modeling, a way of representing the not-real that allows it to be mapped to some resource reality and used as though it was. We have two very different but important abstraction goals in play today, one to support the automation of service and application lifecycles and the other to support the execution of applications built from distributed components. Since both of these end up being hosted in part at “the edge” it sure would be nice if we had some convergence of approach for the divergent missions. It may be that the “as-a-service” concept, which has elements of both missions already, can offer us some guidance, so we’ll explore modeling aaS here, in a two-blog series.

Everyone seems to love stuff-as-a-service, where “stuff” is anything from hardware to…well…anything else. As-a-service is an abstraction, a way of representing the important properties of something as though those properties were the “something” itself. When you buy infrastructure- or software-as-a-service, you get something that looks like infrastructure or software, but is actually the right to use the abstract thing as though it was real. For “aaS” to work, you have to be able to manage and use the abstraction in an appropriate way, which usually means in the way that you’d manage and use what the abstraction represents.

There could be multiple ways of doing that, but I think there’s a value in organizing how that would be done, and at the same time perhaps trying to converge the approach with modern intent-model concepts and with data-driven service management of the type the TMF has promoted. Automating service management, including applications management, is an event-driven process. Control-plane network exchanges are also event-driven, which means that most of what’s critical in 5G could be viewed through the lens of events. That’s a big chunk of the expected future of the cloud, distributed services, and telecom.

In the cloud, as-a-service means that the prefix term is offered just as the name suggests, meaning as a service. IaaS represents a hardware element, specifically a server, that can be real or virtual. SaaS represents an application, so while there is surely a provisioning or setup dimension to SaaS use, meaning a lifecycle dimension, the important interfaces are those that expose application functionality, which is the use of the service not the management of the service. PaaS is a set of tools or middleware elements added to basic hosting. The new container offerings are similar specializations.

Applied to hosting, most IaaS represents a virtual machine, which of course is supposed to look and act like a real server. Or does it? Actually, IaaS is a step removed from both the real server and the VM. A real server is bare metal, and a virtual machine is a partitioning of bare metal, meaning you have to load an operating system and do some setup. IaaS from most cloud providers already has the OS loaded, and so what you’re really getting is a kind of API, the administrative logon interface to the OS that’s been preloaded.

To avoid using network service model terms, TMF terms, before we’ve validated they could work, I’m going to call the representation of an aaS relationship a “token”. So, suppose we start our adventure in generalizing aaS by saying that in as-a-service relationships, the service is represented primarily by an “administrative token” that includes a network URL through which the service is controlled. You can generalize this a bit further by saying that there’s an account token, to which various service tokens are linked.

Suppose we had a true VM-as-a-service, with no preloaded OS? We could expect to have our administrative token that represented the VM’s “console”, or a configuration API from which the user could load the necessary OS and middleware. That would suggest that we might have another layer of hierarchy, a token representing the VM, another representing the OS admin login.

From this, it appears that we could not only represent any resource-oriented IT-aaS through a series of connected-hierarchical tokens, but also maintain the relationship among the elements of services. We could, for example, envision a third layer of hierarchy to my VMaaS above, representing containers or serverless or even individual applications. Because of the hierarchy, we could also tie issues together across the mixture.

If we were to rehost a container in such a VMaaS configuration, we would “rehost” the token in the new token hierarchy where it now belonged. At the time of the rehosting, we could create a history of places that particular token had been, too. That could facilitate better analysis of performance or fault data down the line, and even be of some help in training machine learning or AI tools aimed at automating lifecycle management.
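Here’s a minimal sketch of the token idea, including the rehosting history; the class, URLs, and token kinds are invented for illustration:

```python
# Sketch of the "token" hierarchy: an account token at the top, service and
# resource tokens beneath it, and a rehost operation that both re-parents a
# token and records where it has been.

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Token:
    kind: str                     # "account", "vm", "os-admin", "container"...
    url: str                      # the administrative URL the token fronts
    children: list = field(default_factory=list)
    history: list = field(default_factory=list)

    def attach(self, child: "Token"):
        self.children.append(child)

    def rehost(self, old_parent: "Token", new_parent: "Token"):
        old_parent.children.remove(self)
        new_parent.children.append(self)
        self.history.append((datetime.now(timezone.utc).isoformat(), new_parent.url))

account = Token("account", "https://provider.example/acct/123")
vm1 = Token("vm", "https://provider.example/vm/1")
vm2 = Token("vm", "https://provider.example/vm/2")
app = Token("container", "https://provider.example/vm/1/ctr/app")
account.attach(vm1); account.attach(vm2); vm1.attach(app)

app.rehost(vm1, vm2)              # the container moves; its history records it
print(app.history)
```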

What we can take from all of this is that it would be perfectly possible to create a data model to describe, and interact with, those aaS offerings that represent resources. That’s likely because what you do with resources is create “services”, meaning runtime behaviors, and the resources themselves are manipulated to meet the service-level agreements (SLAs), express or implied. That means lifecycle management.

Could the modeling be extended to the runtime services themselves? Since aaS includes runtime services (SaaS, NaaS), it would be essential that we include runtime model capabilities in the picture, just to accommodate current practices, but edge computing applications like IoT are likely to generate services, projected through APIs, to represent common activities. Why have every application for IoT do things like device management, or interpret the meaning of location-related events?

In my early ExperiaSphere project, in the Alpha test in particular, I created a model that represented not only lifecycle behavior but runtime behavior. The application used a search engine API to do a video search, retrieved a URL, and then initiated a NaaS request for delivery. The NaaS request was simulated by a combination of a “dummy” request for a priority ISP connection (with Net Neutrality, of course, there could be no such thing) and a request to a device vendor’s management system for local handling. What the Alpha proved was that you could create a runtime in-use service model, and merge lifecycle behavior into it, providing that you had intent-modeled the service components and had control interfaces available to influence them.

Could that approach work for SaaS and NaaS overall? That’s what we’ll explore in the next blog on our aaS-modeling topic.

The Infrastructure Bill Kicks the Broadband Can…Again

While the bipartisan infrastructure bill isn’t law (or even finished) at this point, we do have some reports on its content. Light Reading offered their take, for example. I downloaded and reviewed the bill, so let’s take a look at it, what it seems to get right, and what it may have missed.

The goal of the broadband piece of the bill should be familiar; it’s all about closing the often-cited “digital divide” that dooms consumers in many rural areas to broadband Internet service capacities far below the national average. However, all legislation is politics in action, and lobbying in action too. What emerged from all the lobbying and politicking was two distinct positions on what broadband should be. One group favored what could be called a “future-proof” vision, where the goal was to provide capacity that’s actually more than most US households have, or even want. Another group favored a broadband model that was aligned with the technology options and practices of the current market.

Fiber proponents, of course, wanted to see gigabit symmetrical broadband, something that can be delivered over fiber but is problematic with nearly every other technology option. What this approach would have done is to expand the digital divide into suburbs and metro areas, rather than closing it, because cable broadband wouldn’t fit the model. In addition, it would likely disqualify fixed and mobile wireless technology, which is the easiest of all our new options to deploy.

The operators themselves were generally in favor of a more relaxed 100 Mbps download and 20 Mbps upload, which some public advocacy groups feared would make two-way video for remote learning and remote work (something we needed and still need) less useful. Operators were also leery of mandated low-cost options and terms that could prevent them from cherry-picking high-value areas where their revenue could be expected to be better.

The notion of universal fiber is simply not realistic, because fiber costs in areas with low demand density would be so high that only massive subsidies could induce anyone to provide the service. Thus, it’s a win for the practical political and technical realities that the standards were set at 100/20 Mbps, though I think that 100/35 would have been almost as achievable and would offer better future-proofing for remote work and education.

What’s a bit disappointing here is that there’s no specificity with regard to how the broadband speed is measured. As I’ve pointed out, simply measuring the speed the interface is clocked at doesn’t reflect the actual performance of the connection. A company could feed a node or shared cable with (for example) 10 Mbps of capacity, clock the interface at 100/20 Mbps, and appear to comply.

This would seem to admit mesh WiFi, a technology that was used over a decade ago and that almost universally failed to meet its objectives. The problem is that WiFi range is short, and mesh technology loads the WiFi cells closest to the actual Internet backhaul, so performance is very unlikely to meet remote work or school objectives, yet it could meet the simple interface-speed test.

There is a provision in the bill that states “The term ‘reliable broadband service’ means broadband service that meets performance criteria for service availability, adaptability to changing end-user requirements, length of serviceable life, or other criteria, other than upload and download speeds, as determined by the Assistant Secretary in coordination with the Commission.” Here, the “Commission” is the FCC and the Assistant Secretary is the Assistant Secretary of Commerce for Communications and Information. It would seem that details of service criteria could be provided later on, which could address the interface-versus-real-speed issues.

Mandated low-cost options are required, but it’s not completely clear how they’d work. The big question is whether the same service would have to be offered (100/20 Mbps) at a lower price point. If that’s the case, then the mandate could result in some major operators (telcos and cablecos) selling off areas where the cost mandates would likely apply, which could mean lower overall economies of infrastructure in those areas and a need for more subsidies. Hopefully some clarity will emerge in the rest of the bill’s process.

The “digital redlining” issue is similar in impact. Network operators typically try to manage “first cost”, meaning the cost of deploying infrastructure when revenues are unlikely to build quickly enough to cover it. That would favor using higher-revenue-potential areas to pull through some of the deeper infrastructure needed, infrastructure that lower-revenue areas could then leverage. One thing I think could be hurt by some interpretations of digital redlining is millimeter-wave 5G. You have to be able to feed the nodes where mm-wave originates through fiber, and early and effective node deployments could naturally favor areas where the “revenue per node” is the highest. Those early nodes could then be the source of fiber fan-outs to other nodes, offering service to other areas whose return on investment couldn’t otherwise justify connection.

The bill seems to ignore another issue regarding neighborhoods, one that I’ve seen crop up again and again. In many states, builders of residential subdivisions, condos, and apartments can cut exclusive deals with an ISP. The ISP “pre-wires” the facility for their broadband in return for the exclusivity. That means that there’s no competition for broadband Internet in these complexes, and little incentive for the dominant ISP to improve services. Since ISPs seek these deals where they believe consumer spending on broadband is high, the same practice may result in having no deals offered in lower-income complexes, which would mean that the neighborhood would have to be wired to offer service. That can create an effective red-lining.

All of this is to be based on a significant data gathering and mapping of unserved (less than 25/3 Mbps) and underserved (less than 100/20 Mbps) broadband service areas. That process is likely to take time, and like all legislation the effect will depend on whether a change in administration results in a change in policy. Time may be important for another reason; there are funds for incentives and subsidies included in the bill, but most of them would help with initial deployments rather than cover costs down the line. The subsidies provided per user aren’t funded in the long term, so there is no assurance that either incentives or subsidies will have a major impact beyond the next election.

As I said at the start of this blog, legislation is politics, and both the overall infrastructure bill and the broadband portion are examples of politics in action. Since the Telecom Act was passed about 25 years ago, I’ve recognized that Congress isn’t interested in getting the right answer, but rather the politically optimum answer. It’s easy to fob off details to the executive branch, particularly to a Federal Commission like the FCC, but the Telecom Act was mired in shifts in party in power, legal fights, and changes in interpretation. The biggest problem with the broadband terms in the infrastructure bill is that they follow the path that’s failed in the past, and that I think is likely to fail again.

What Would “Success” for 5G Mean?

One of my recent blogs on 5G generated enough LinkedIn buzz to demonstrate that the question of 5G and hype is important, and that there are different interpretations of what constitutes 5G success. To me, that means I’ve not explained my position as well as I could have, which means I need to take a stab at the issue again, specifically addressing a key point of mine.

My basic position on issues relating to 5G (or any other technology) is that there is a major difference between what you can use a technology for, and what justifies the technology. As I said in the referenced blog, there is not now, nor has there ever been, any realistic chance that 5G would not deploy. It’s a logical generational evolution of mobile network technology, designed to accommodate the growing and evolving market. In fact, one of the most important facts about 5G is that it will deploy, which means that having a connection to it offers vendors an inroad into something that’s budgeted. This, at a time when budget constraints on network operator spending are an ongoing problem for vendors.

The question with 5G, then, isn’t whether it will happen, but rather what will drive it, and how far the driver(s) will take it. Putting this in very simple terms, we have two polar positions we could cite. The first is that 5G is really nothing more than the evolution of LTE to 5G New Radio (NR), and that little or no real impact can be expected beyond the RAN. This is the “Non Stand-Alone” or NSA vision; 5G rides on what’s an evolved/expanded form of 4G Evolved Packet Core (EPC). The second is that 5G concepts, contained in 5G’s own Core, will end up transforming not only mobile networks but even wireline infrastructure, particularly the access/metro networks. Obviously, we could fall into either extreme position or something in between.

Where we end up on my scale of Radio-to-Everything-Impacted will depend not on what you could do with 5G, but on what incremental benefit to operator profits 5G could create. If 5G offered a lot of really new applications that would justify additional spending on 5G services, and in particular if operators could expect some of those new applications to be services they’d offer and get revenue from, then 5G gets pushed toward the “Everything” impact side of my scale. If 5G could offer a significant improvement in opex overall, then it would be pushed toward “Everything” as far as the scope of improvements justified. If neither happens, then 5G stays close to the “Radio” side of the scale, because there’s no ROI to move the needle.

If 5G does in fact end up meaning little more than a higher-capacity, faster RAN, it doesn’t mean that 5G Core would not deploy, but it would mean that the features of 5G Core that were actually used, the features that could differentiate one 5G network (or vendor product) from another, would be of less value and less differentiating. In fact, they might not even be offered as part of a service at all, in which case there would be no chance the market could eventually figure out how to build applications/services that would move my needle toward the “Everything” end of the scale.

My view of the possible drivers to move 5G toward the “Everything” end of the scale has been that they relate to applications of 5G beyond calling, texting, and simple Internet access. That, to me, means that there has to be a set of service features that are valuable to users, deliverable to a community of devices, and profitable for the operators to deploy. I doubt that anyone believes that something that met these requirements could be anything but software-based, and so I believe that exploiting 5G means developing software. Software has to 1) run somewhere, and 2) leverage some easy (low-on-my-scale) property of 5G to exploit low-apple opportunities and get something going.

Software that’s designed to be edge-hosted seems to fit these criteria. One of 5G’s properties is lower latency at the radio-connection level, which is meaningful if you can pair it with low latency in connecting to the hosting point for the software, the edge. Further, 5G itself mandates function hosting, which means that it would presumably justify some deployment of edge hosting resources, and those might be exploitable for other 5G services/features/applications. However, that’s less likely to be true if the software architecture, the middleware if you like, deployed to support 5G hosting doesn’t work well for general feature hosting. 5G can drive its own edge, but it has to be designed to drive a general edge to really move my needle.

There’s been no shortage of 5G missions cited that would drive 5G. Autonomous vehicles are one, robots and robotic surgery are another. All of this reminds me of the old days of ISDN, when “medical imaging” was the killer app (that, it turns out, killed only itself). All these hypothetical 5G applications have two basic problems. First, they require a significant parallel deployment of technology besides 5G, and so have a very complicated business case. Second, it’s difficult to frame a business model for them in any quantity at all.

If anyone believes that self-driving cars would rely on a network-connected driving intelligence to avoid hitting pedestrians or each other, I’d gently suggest they disabuse themselves of that thought. Collision avoidance is an onboard function, as we have already seen, and it’s the low-latency piece of driving. What’s left for the network is more traffic management and route management, which could be handled as public cloud applications.

Robots and robotic surgery fit a similar model, in my view. The latency-critical piece of robotics would surely be onboarded to the robot, as it is today. Would robotic surgery, done by a surgeon distant from the patient, be easily accepted by patients, surgeons, and insurance companies? And even if it were, how many network-connected robotic surgeries would be needed to create a business case for a global network change?

Why have we focused on 5G “drivers” that have little or no objective chance of actually driving 5G anywhere? Part of it is that it’s hard to make news, and get clicks, with dry technology stories. Something with user impact is much better. But why focus on user impacts that aren’t real? In part, because what could be real is going to require a big implementation task that ends up with another of those dry technology stories. In part, because the real applications can’t be called upon for quick impact because they do require big implementation tasks, and vendors and operators want instant gratification.

How do we get out of this mess? Two possible routes exist. First, network operators could create new services, composing them from edge-hosted features, and target service areas that would be symbiotic with full 5G NR and Core. Second, edge-computing aspirants could frame a software model that would facilitate the development of these applications by OTTs.

The first option, which is the “carrier cloud” strategy, would be best financially for operators, but the recent relationships between operators and public cloud providers demonstrate that operators aren’t going to drive the bus themselves here. Whether it’s because of a lack of cloud skills or a desire to control “first cost” for carrier cloud, they’re not going to do it, right or wrong though the decision might be.

The second option is the only option by default, then, and it raises two of its own questions. The first is who does the heavy lifting on the software model, and the second is just what capabilities the model includes. The answers to the two questions may end up being tightly coupled.

If we go back to the Internet as an example of a technology revolution created by a new service, we see that until Tim Berners-Lee, in 1990, defined the HTML/HTTP combination that created the World Wide Web, we had nothing world-shaking. A toolkit opened an information service opportunity. Imagine what would have happened if every website and content source had to invent their own architecture. We’d need a different client for everything we wanted to access. Unified tools are important.

Relevant tools are also important. Berners-Lee was solving a problem, not creating an abstract framework, and so his solution was relevant as soon as the problem was, which was immediately. The biggest problem with our habit of creating specious drivers for 5G is that it delays considering what real drivers might be, or at least what they might have in common.

Public cloud giants Amazon, Google, and Microsoft have a track record of building “middleware” in the form of web-service APIs, to support both specific application types (IoT) and generalized application requirements (event processing). So do software giants like IBM/Red Hat, Dell, VMware, HPE, and more. Arguably, the offerings of the cloud providers are better today, more cohesive, and of course “the edge” is almost certainly a special case of “the cloud”. There’s a better chance the cloud providers will win this point.

The thing that relates the two questions of “who” and “what” is the fact that we don’t have a solid answer to the “what”. I have proposed that the largest number of edge and/or 5G apps would fit what I call a contextual computing model. Contextual computing says that we have a general need to integrate services into real-world activity, meaning that applications have to model real-world systems and be aware of the context of things. I’ve called this a “digital twin” process. However, I don’t get to define the industry, only to suggest things that could perhaps define it. If we could get some definition of the basic framework of edge applications, we could create tools that took developers closer to the ultimate application missions with less work. Focus innovation on what can be done, not on the details of how to do it.
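To give a flavor of what I mean by contextual computing, here’s a toy digital-twin sketch; the class and the events are invented for illustration, not a proposal for specific middleware:

```python
# Toy digital twin: a software object that shadows a real-world element,
# absorbs events about it, and exposes context that applications query
# instead of the raw event stream.

class DigitalTwin:
    def __init__(self, asset_id: str):
        self.asset_id = asset_id
        self.context = {"location": None, "status": "unknown", "last_seen": None}

    def absorb(self, event: dict):
        """IoT/5G events update the twin; applications never see them directly."""
        self.context.update({k: v for k, v in event.items() if k in self.context})

    def near(self, place: str) -> bool:
        return self.context["location"] == place

truck = DigitalTwin("truck-17")
truck.absorb({"location": "dock-4", "status": "idle", "last_seen": "09:14"})
if truck.near("dock-4"):
    print("dispatch loading crew")    # contextual, not event-by-event, logic
```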

And that’s the 5G story, IMHO. 5G proponents can either wait and hope, or try to induce action from a credible player. I’ve always felt that doing something is a better choice than hoping someone else will, so I’m endorsing the “action” path, and that’s the goal of sharing my 5G thoughts with you.

Evolving Principles for Service and Application Lifecycle Modeling and Automation

Applications aren’t possible without application development, and in today’s hosted-feature age, neither are advanced services. That makes the question of how to implement edge and telecom applications critical, but it’s a difficult question to answer. Applications typically have an optimum architectural model, set by the way the application relates to the real world. That model can normally be codified using a combination of middleware tools, which together create a programming/development model. Obviously, there are many possible implementation options even once you’ve settled on that optimum architectural model, and I want to open an exploration of what some of those options look like, or how they might develop.

I have proposed that edge computing, IoT, telecom, and many other modern tech initiatives are all about processing events, and that the event-centricity of these initiatives differs significantly from the transaction-centricity of traditional application development. Most recently, I suggested that the best approach to implementing an event-centric application was to view it as a hierarchical state machine or HSM. That could be the “optimum architectural model”, but what about the rest?
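To anchor the terminology, here’s a minimal, hypothetical Java sketch of the state/event-table idea that underlies that approach. The state and event names are invented for illustration and aren’t drawn from any real implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Illustrative states and events for a single lifecycle element.
enum ElemState { ORDERED, ACTIVATING, ACTIVE, FAULT }
enum ElemEvent { ACTIVATE, ACTIVATED, ERROR, RESTORED }

/** A minimal finite-state machine driven by a state/event table. */
class StateEventTable {
    // (current state, event) -> process to run
    private final Map<ElemState, Map<ElemEvent, Consumer<ElemEvent>>> table = new HashMap<>();
    private ElemState current = ElemState.ORDERED;

    /** Define which process handles a given event in a given state. */
    void define(ElemState s, ElemEvent e, Consumer<ElemEvent> process) {
        table.computeIfAbsent(s, k -> new HashMap<>()).put(e, process);
    }

    void setState(ElemState s) { current = s; }

    /** The core of event-centric handling: look up the process and run it. */
    void dispatch(ElemEvent e) {
        Map<ElemEvent, Consumer<ElemEvent>> row = table.get(current);
        Consumer<ElemEvent> process = (row == null) ? null : row.get(e);
        if (process != null) process.accept(e);
        // An unmatched (state, event) pair would normally be logged or escalated.
    }
}
```

Each (state, event) pair selects a process to run, and that lookup-and-run step is what separates event-centric handling from a transaction flow that marches through a fixed sequence of steps.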

Let’s start by saying that network (and most IT) services are a composition, a cooperative grouping of “sub-services” or “service elements”. In most cases, these service elements are differentiated/delineated by administrative and/or control boundaries, which means that there is generally a control point associated with each collection of technology, and managing that collection is done through its control point. This is the structure the TMF’s SID data model envisions, where a “product” (what most of us would simply call a service) is divided into Customer-Facing Services (CFSs) and Resource-Facing Services (RFSs).
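One way to picture that decomposition is sketched below. The class names are my simplifications of the SID terminology, purely for illustration; the essential point is that the resource-facing pieces are the ones tied to control points.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified picture of a SID-style decomposition, with
// each resource-facing piece tied to the control point that manages its
// technology. Class names are mine, not TMF's.
class ResourceFacingService {
    final String name;
    final String controlPoint;   // the management/control interface for this piece of technology
    ResourceFacingService(String name, String controlPoint) {
        this.name = name;
        this.controlPoint = controlPoint;
    }
}

class CustomerFacingService {
    final String name;
    final List<ResourceFacingService> realizedBy = new ArrayList<>();
    CustomerFacingService(String name) { this.name = name; }
}

class RetailProduct {            // what the customer actually buys
    final String name;
    final List<CustomerFacingService> composedOf = new ArrayList<>();
    RetailProduct(String name) { this.name = name; }
}
```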

Data models don’t implement anything by themselves, but the TMF, in the mid-2000s, launched another project, the Service Delivery Framework or SDF. I was involved in that project as a (then) TMF member, and it was there that I first encountered the NGOSS Contract work of John Reilly that I often mention. However, SDF was also an architecture, not an implementation, and at one point in the project a European Tier One, representing a group of five, contacted me. They were concerned about how SDF was developing, not as a concept but as a model for software.

“I want you to understand that we stand 100% behind implementable standards,” the Tier One told me, “but we’re not sure this is one of them.” I launched the first phase of ExperiaSphere in 2007 to prove that an implementation could be done, and I made a couple of presentations of the result to the group of five Tier Ones and to the TMF. I never said (and still wouldn’t say) that ExperiaSphere was the only way to do composable services with lifecycle automation, but it proved that there was at least one way, and validated an expansion to the NGOSS Contract approach, which was my goal.

What I envisioned in my ExperiaSphere model was a distributable service model that included a “service element” for each composable piece of the service, meaning those sub-services. These service elements would be what today we call intent models, and each would have a state/event table that associated events with processes. My presumption was that the service model would be centrally administered, meaning a master copy would be kept in some convenient place, but I later hoped to also support a distributed model, where each element’s model component was kept in a place optimized for the location of the service components that made it up. The distributed approach would require that each element “know” the locations of the superior and subordinate elements associated with it, since those elements can exchange events.
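Here’s a hypothetical Java sketch of what one of those service elements might carry, reusing the StateEventTable and ElemEvent types from the earlier sketch. The field and method names are mine, not taken from the actual ExperiaSphere code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a service element in a distributable service model.
// Each element is an intent model with its own state/event table, and it
// records where its superior and subordinates are processed so events can be
// exchanged across element boundaries.
class ServiceElementModel {
    final String elementId;
    final StateEventTable lifecycle;                   // this element's state/event table
    final List<ServiceElementModel> subordinates = new ArrayList<>();

    ServiceElementModel superior;                      // null for the top-level service
    String processingLocation;                         // where this element's model is handled

    ServiceElementModel(String elementId, StateEventTable lifecycle) {
        this.elementId = elementId;
        this.lifecycle = lifecycle;
    }

    /** Compose the model: add a subordinate element and link it back to this one. */
    ServiceElementModel add(ServiceElementModel child) {
        child.superior = this;
        subordinates.add(child);
        return this;
    }

    // Events are the only things that cross element boundaries: state changes
    // flow upward to the superior, commands flow downward to subordinates.
    void eventToSuperior(ElemEvent e) {
        // Deliver e to superior.processingLocation; transport is out of scope here.
    }
    void eventToSubordinates(ElemEvent e) {
        // Deliver e to each subordinate's processingLocation.
    }
}
```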

Generally speaking, the control points associated with each service element would be connected to the resources committed to that element’s piece of the service, meaning they would likely receive the events generated by conditions or changes in those resources. If we envisioned a service made up of a single element, then, in ExperiaSphere, that element would process its own internal events, and only one kind of event would be exchanged: the element’s state changes, communicated up to the top service-model level.

What happens inside one of the service elements, then, is opaque except for any state change events it generates, upward to its superior element, or commands issued downward in event form, to subordinate service elements. ExperiaSphere doesn’t dictate that its state/event approach be followed inside the intent-modeled service elements, only that it be followed for events that are exchanged among the service element processes. However, the state/event model could be applied within a service element to handle locally generated events.

When an event occurs, it would be processed by the specific element responsible for it, in either the centralized or the distributed location assigned. The event and the current state would identify the process to be invoked, and that process could then be run in a place that’s optimized for what it’s going to do. Thus, even a centralized service-model process could invoke a process hosted out where it acts on local (or optimally located) resources. Given that, it would also make sense to distribute the processing of the service elements themselves, not just the execution of the state/event-defined processes. A service data model would then be dissected when instantiated, with the service-element pieces distributed to process points proximate to the control points for the resources being committed. If a service had a core VPN element and five access-network elements, its data model would have six service elements, and each would be dispatched to a host close to where the resources it represented could be controlled.
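A rough sketch of that dissection-and-dispatch step, again with invented names and a placeholder placement policy, might look like this (it builds on the ServiceElementModel and StateEventTable sketches above):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: walk an instantiated service model and decide where each
// element's state/event processing should run. The placement policy here is a
// placeholder; a real one would pick hosts proximate to each control point.
class ModelDispatcher {

    /** Returns elementId -> the model host chosen for that element. */
    Map<String, String> placeElements(ServiceElementModel root) {
        Map<String, String> placement = new LinkedHashMap<>();
        placeRecursively(root, placement);
        return placement;
    }

    private void placeRecursively(ServiceElementModel element, Map<String, String> placement) {
        String host = "model-host-near-" + element.elementId;   // hypothetical choice
        element.processingLocation = host;
        placement.put(element.elementId, host);
        for (ServiceElementModel child : element.subordinates) {
            placeRecursively(child, placement);
        }
    }
}

class VpnPlacementDemo {
    static Map<String, String> demo() {
        // The VPN example from the text: a core VPN element plus five access
        // elements, six service elements in all, under the top-level service.
        ServiceElementModel vpn = new ServiceElementModel("EnterpriseVPN", new StateEventTable());
        vpn.add(new ServiceElementModel("CoreVPN", new StateEventTable()));
        for (int i = 1; i <= 5; i++) {
            vpn.add(new ServiceElementModel("AccessSite-" + i, new StateEventTable()));
        }
        return new ModelDispatcher().placeElements(vpn);
    }
}
```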

When running in these distributed locations, each service-element process would both generate and field events, to and from its superior element (the one above it in the hierarchy, which in my VPN example would be the service itself) and its subordinate elements. These events would be handled by each element’s state/event table, so each element is a finite-state machine (FSM); the service overall is a hierarchical state machine (HSM), because events couple the service elements.

In the original ExperiaSphere project, where the concept was actually implemented in Java, I implemented a specific class of service element, called the “ControlTalker”, which represented the actual control point interface and was responsible for commanding actions and receiving events. ControlTalkers were designed to be distributed to the area of the control points, and to be linked to services using “late binding” so the service data model wasn’t bound to a specific resource set. The ControlTalker had a specialized state/event table, necessitated by the fact that the events and states vary depending on the control point interface. All other service elements were represented by “Experiam” objects, and these all had a common event structure (a Java enumerated type) that ControlTalkers also had to support in addition to their specialized events. For Java programmers, the ControlTalker class extended the Experiam class.
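In rough Java terms, the relationship described above looked something like the sketch below. This is a reconstruction of the shape of the design from the description, not the original source; the event names and method signatures are illustrative.

```java
// Sketch of the Experiam/ControlTalker relationship; a reconstruction, not original code.

/** The common events all service elements understood (illustrative values). */
enum ExperiamEvent { ORDER, ACTIVATE, ACTIVATED, FAULT, RESTORED, TEARDOWN }

/** Base class for all service elements: a common event interface and FSM hook. */
class Experiam {
    protected String elementName;

    Experiam(String elementName) { this.elementName = elementName; }

    /** Every Experiam fields the common enumerated events. */
    void onEvent(ExperiamEvent event) {
        // Look up (current state, event) in this element's state/event table and
        // run the associated process; omitted here for brevity.
    }
}

/**
 * A ControlTalker is a specialized Experiam bound ("late binding") to a real
 * control point. It must handle the control point's own events and states in
 * addition to the common ExperiamEvent set.
 */
class ControlTalker extends Experiam {
    private String controlPointUri;      // resolved at service-instantiation time

    ControlTalker(String elementName) { super(elementName); }

    /** Late binding: attach this talker to an actual control point interface. */
    void bind(String controlPointUri) { this.controlPointUri = controlPointUri; }

    /** Field a raw event from the control point (a trap, an API callback, etc.). */
    void onControlPointEvent(String rawEvent) {
        // Translate the device- or API-specific event into the common event
        // vocabulary, then run it through the inherited state/event handling.
        onEvent(ExperiamEvent.FAULT /* illustrative mapping */);
    }

    /** Issue a command to the bound control point. */
    void command(String action) {
        // Send 'action' to controlPointUri; protocol details are out of scope here.
    }
}
```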

Model-handling in this project was the responsibility of a “Service Factory” (which of course was another class). A Service Factory was capable of supporting a set of services, and any Service Factory instance could process events for any service type it supported. In this implementation, though, there was no attempt to break up the service data model and dispatch pieces of it to different factories; a central factory controlled all the service actions for a given service. What I was working toward with ExperiaSphere would add that model-handling distributability, so that any given service element, along with any number of its subordinate elements, could be split off and handled locally, close to the control point.
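A comparable sketch of the Service Factory role, again reconstructed and simplified, reuses the ServiceElementModel and ElemEvent types from the earlier sketches.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Service Factory role: a factory instance advertises the service
// types it can support and processes events for any service of those types,
// using the service's data model. A reconstruction, not original code.
class ServiceFactory {
    // serviceId -> the root of that service's data model
    private final Map<String, ServiceElementModel> supportedServices = new HashMap<>();

    /** Register a service model this factory instance is able to handle. */
    void support(String serviceId, ServiceElementModel model) {
        supportedServices.put(serviceId, model);
    }

    /**
     * Process an event against a service this factory supports. In the original
     * implementation one factory handled the whole model; a distributed version
     * would hand the event to whichever host holds that element's model piece.
     */
    void processEvent(String serviceId, String elementId, ElemEvent event) {
        ServiceElementModel root = supportedServices.get(serviceId);
        if (root == null) return;            // not a service type this factory supports
        ServiceElementModel target = find(root, elementId);
        if (target != null) target.lifecycle.dispatch(event);
    }

    private ServiceElementModel find(ServiceElementModel node, String elementId) {
        if (node.elementId.equals(elementId)) return node;
        for (ServiceElementModel child : node.subordinates) {
            ServiceElementModel hit = find(child, elementId);
            if (hit != null) return hit;
        }
        return null;
    }
}
```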

It was clear from the testing of my Java implementation that the ControlTalkers were the most implementation-critical elements of the whole application. State synchronization and command/response exchanges among the service elements, representing as they did service-level commands or conditions, were relatively rare and could in many cases probably even be handled as serverless cloud events. ControlTalkers, on the other hand, had to deal with real-time events from real-world devices, and their performance had to keep up with that real world. Still, it seems likely that a platform capable of implementing ControlTalkers would also be fine for implementing the other Experiams, which operated on service-level events.

Obviously it was possible to write Java code for all of this without relying on any specific tools, but that’s unnecessarily complicated when tools that facilitate the job are available. In the second phase of ExperiaSphere, I generalized the approach away from a specific implementation and looked at using the OASIS TOSCA model as the service model. TOSCA now has some features that would support an implementation, but it doesn’t have the explicit state/event facilities I wanted. A better approach would be to find one tool that could implement the distributed service-element model-handling (the state/event processing) and another (the same, related, or entirely different) to author and host the processes identified in the state/event tables.

That’s what I’m exploring now, and I’d like the approach to be suitable for generalized edge applications too. I’d like to round out the concept of event-driven applications by identifying specific, optimized implementation tools, making lifecycle operations automation, 5G, and so forth applications of edge computing. When I’ve lined up an approach that conforms (or can conform) to my ExperiaSphere framework, looks sound, and is open, I’ll do a blog on it.