Anyone should be able to pick up a smartphone or tablet, or walk into a room with a video screen, and make a video call to anyone anywhere around the globe. We have all the technology to make this happen, and yet this goal always seems to be just around the corner. Are we there yet? Are we there yet? I am getting impatient. Will the world of video conferencing/chat protocols ever converge to make widespread video calling a reality?

Before we tackle that question, let's level set on all the different languages (protocols) in the video conferencing and video chat worlds that make seamless video calling a challenge in today's environment. Video protocols are numerous and have evolved over the years. The very early days of video saw many proprietary protocols for carrying audio/video over analog phone lines. It wasn't until the 80s that open protocols came into use. The earliest popular standards for video were the H.32x series. The early 90s saw the advent of H.320 (video over ISDN) and H.324 (video over analog telephone lines). These were followed by H.323 (video over IP), which to date is the most popular enterprise video conferencing protocol. All of these protocols came from the ITU, the standards body that controls most of the traditional telco world.

In the late 90s, SIP (Session Initiation Protocol) was born, and it came out of the IETF, the standards body that brought us the Internet. SIP has become the de facto standard for Voice over IP calls. A little-known fact: until just a few years ago, H.323 was still the predominant VoIP protocol used by large telcos and consumer VoIP operators. Nevertheless, SIP is taking over in all newer deployments. Enterprise video conferencing systems from major vendors like Cisco/Tandberg, Polycom, LifeSize and Sony are now all dual-stacked with both H.323 and SIP. Nearly all of these same vendors' software clients on laptops and mobile/tablet devices use SIP. Microsoft Lync (formerly known as OCS, Office Communications Server) also uses SIP with some published proprietary extensions. So as you can see, SIP is here to stay.

In the early 2000s, Instant Messaging became popular and with it came a few different protocols. The most popular of these is Jabber, also known as XMPP (eXtensible Messaging and Presence Protocol). While XMPP is mostly used for IM, there are extensions to the protocol that handle audio/video calling. These extensions are called XMPP/Jingle and have primarily been used by Google for its popular GoogleTalk, GoogleVoice and Hangouts services.

There are many more proprietary video calling protocols, too many to count, but only a few have had widespread adoption, namely Skype and FaceTime. The Skype protocol is used by millions of users with Skype clients and is making its way into mobile devices, TVs and set-top devices. FaceTime, while it uses some of the open protocols mentioned above like SIP and XMPP, is somewhat proprietary due to the unpublished extensions that it uses.

So far we've only talked about protocols. There is also a plethora of audio and video compression (codec) standards that matter when we are discussing a common language between video callers. This could take up a whole new blog post, and I will write that one of these days. Suffice it to say for now that Cisco, Polycom, Skype, Microsoft, Google, Apple, etc. all use different audio and video codecs, each with varying video resolutions and audio bands depending on the device being used.

As you can see, there are many protocols, many audio/video codecs and, last but not least, many software and hardware clients and devices. Will this world of video conferencing protocols ever converge to make seamless video calling a reality? Will these islands ever communicate? Humans crossed the oceans tens of thousands of years ago and found ways to communicate and trade despite huge differences. Surely we can find a way to cross from one video island to another in this day and age.

Now let's get a dose of reality. The answer to the question of whether these video protocols will ever converge is: No. You should really be asking, do these protocols need to converge to make seamless video calling a reality? The answer again is: No. As long as there is a way to communicate between these video islands they can each use their own protocol. The 2 key areas that matter are protocol interoperability and device/user addressing. Interoperability does not mean that we need to converge to one protocol. It only means that protocols need to be published as open standards or licensed freely in the cases where they are proprietary and in widespread use. Once this is done the rest can be solved.

The global phone network has been around for a long time. Phone calls today use many protocols: analog, SS7, ISDN, H.323, SIP, GSM, CDMA. They use many audio codecs, some of which are the same ones used by the video systems listed above. They have not converged on one protocol or codec in more than 100 years, and yet I can pick up my phone and call anyone anywhere in the world. Why should video calls be any different? The key in the voice world is that it is possible to translate between all of these protocols and codecs. I would conjecture that if I picked up my mobile phone right now and called you, the call would be translated 3 to 5 times before reaching you. At each of these translation points there could be protocol translation as well as codec transcoding. But for the average user none of this matters. It is for us engineers to figure out how all this magic happens. Protocol translation and transcoding are here to stay. They are a necessary evil to make seamless video calling a reality.
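To make the idea of translation points concrete, here is a minimal sketch that walks a call path and counts how often the signaling protocol changes (translation) and how often the audio codec changes (transcoding). The `Leg` type, the `count_conversions` helper and the example path are all hypothetical and purely illustrative; they do not describe any real carrier's routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leg:
    """One segment of a call path (hypothetical model)."""
    protocol: str
    codec: str


def count_conversions(path):
    """Return (protocol_translations, codec_transcodes) along a call path."""
    translations = 0
    transcodes = 0
    # Compare each leg with the one that follows it.
    for prev, nxt in zip(path, path[1:]):
        if prev.protocol != nxt.protocol:
            translations += 1
        if prev.codec != nxt.codec:
            transcodes += 1
    return translations, transcodes


# Illustrative path only: mobile network, carrier core, VoIP trunk, enterprise.
example_path = [
    Leg("GSM", "AMR"),
    Leg("SS7", "G.711"),
    Leg("SIP", "G.711"),
    Leg("H.323", "G.729"),
]

print(count_conversions(example_path))  # (3, 2)
```

In this made-up path the call is translated three times and transcoded twice, and the caller never notices any of it.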

Which then brings me to the next key area that matters, namely addressing. The reason the average user can pick up the phone and call someone else is that we have a universally accepted addressing scheme in the voice world: the global phone numbering plan, known as E.164 (these numbers are often loosely called ENUM numbers, after the scheme that maps them into DNS). I can call 1-800-555-1212 and every phone knows how to interpret and initiate that call (with a few caveats around prefixes). In the video calling world this does not exist. Some video systems use IP addresses. Others use ENUM numbers, but these mostly work only within an enterprise or within a closed group of users, as in FaceTime. Many of the video software clients like Microsoft Lync use URI addresses, i.e. email address-style, which are easy to type with a keyboard or smartphone but not easy to enter with a traditional number keypad or remote control. And yet others, like Skype, GoogleTalk and Facebook, use the buddy/friend approach, which makes you exchange buddy information even if you call someone just once. So as you can see, there is no universal way to even address the video caller, let alone initiate the call.
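The four addressing styles above can be told apart with a few simple checks. The following is a rough sketch, not how any real client does it: `classify_address`, the regular expressions and the returned labels are assumptions made for illustration, and anything that is not an IP address, a phone number or a URI is simply treated as a buddy name.

```python
import ipaddress
import re

# Up to 15 digits with an optional leading "+", per the E.164 length limit.
E164_RE = re.compile(r"^\+?[1-9]\d{1,14}$")
# Email-style address with an optional sip:/sips:/xmpp: scheme (simplified).
URI_RE = re.compile(r"^(?:sip:|sips:|xmpp:)?[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_address(address):
    """Return 'ip', 'e164', 'uri' or 'buddy' for a dialed string."""
    text = address.strip()
    try:
        ipaddress.ip_address(text)
        return "ip"
    except ValueError:
        pass
    # Strip common phone-number punctuation before testing for digits.
    digits = re.sub(r"[\s\-().]", "", text)
    if E164_RE.match(digits):
        return "e164"
    if URI_RE.match(text):
        return "uri"
    return "buddy"


for dialed in ["192.0.2.10", "1-800-555-1212", "alice@example.com", "alice_1984"]:
    print(dialed, "->", classify_address(dialed))
```

Note how only the first two styles can be entered on a numeric keypad or remote control, which is exactly the usability gap described above.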

This addressing issue can be fixed. At Blue Jeans, we have achieved video calling between many different video systems by side-stepping the addressing issue. We allow each caller to use the addressing that is native to the system he is using as his video calling client. Skype users call a Blue Jeans buddy, video room systems call a Blue Jeans IP address/hostname, Lync users call a URI and so on. And they all converge on a meet-me bridge. This works well for business multi-party conferences. When it comes to point-to-point calling, the industry should start to converge on URI/email-style addressing. This can easily allow universal addressing across systems. Email and IM have shown that it is possible to have many different organizations with their own email/IM systems and yet be able to address users and communicate across organizations - techies call this Federation. ENUM numbers can co-exist with URIs, since phone numbers are popular in the voice world and will not go away for devices that only have keypads. There are ways to map ENUM numbers onto URIs, and the VoIP world has shown that this is possible.
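As a sketch of how a phone number can be mapped onto a URI: ENUM (RFC 6116) turns an E.164 number into a DNS name by reversing its digits, separating them with dots and appending `e164.arpa`; NAPTR records published at that name then point to URIs. The code below only builds the DNS name and resolves it against a hypothetical in-memory table instead of real DNS. The function names, the table and the `sip:support@example.com` URI are assumptions made for illustration.

```python
def e164_to_enum_domain(number, suffix="e164.arpa"):
    """Build the ENUM DNS name for an E.164 number, e.g. '+1-800-555-1212'."""
    digits = [ch for ch in number if ch in "0123456789"]
    if not digits or len(digits) > 15:
        raise ValueError("expected 1 to 15 digits in an E.164 number")
    # Reverse the digits and join them with dots, then add the ENUM suffix.
    return ".".join(reversed(digits)) + "." + suffix


# Hypothetical stand-in for the NAPTR records a real DNS lookup would return.
FAKE_ENUM_RECORDS = {
    "2.1.2.1.5.5.5.0.0.8.1.e164.arpa": "sip:support@example.com",
}


def resolve_number(number, records=FAKE_ENUM_RECORDS):
    """Return the URI registered for a number, or None if nothing is published."""
    return records.get(e164_to_enum_domain(number))


print(e164_to_enum_domain("+1-800-555-1212"))  # 2.1.2.1.5.5.5.0.0.8.1.e164.arpa
print(resolve_number("+1-800-555-1212"))       # sip:support@example.com
```

With a mapping like this, a keypad-only device can keep dialing digits while the network routes the call to an email-style URI.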

So the bottom line is: as an industry, let's focus on interoperability and addressing. Let's quit whining about the many protocols and about how to avoid translating and transcoding. Instead, let's publish or freely license the protocols and find ways to do efficient global-scale translation/transcoding. On the addressing front, let's converge on a way to open up all the islands with URI/email-style addressing with optional ENUM aliases. If we do both of these, we can easily get to the reality of picking up a smartphone or tablet, or walking into a conference room with a video screen, and being able to make a video call to anyone anywhere around the globe.