Video Conferencing

Introduction

In the following we discuss video conferencing tools using the layers defined in the public stack model. First, we focus on the main technological features of video conferencing tools. We then offer a perspective from the users' point of view (the citizen perspective), showing how the different technological choices relate to design decisions and trade-offs. To prioritize the different design options, it can be useful to look at the foundational layer of the public stack.

Questions such as who the tool should be suited for, who defines what a successful tool is, whether user privacy is guaranteed, and what the business model of the tool provider is can help to understand which characteristics are important and which tools should be chosen.

The technology stack

A generic video conferencing solution can be implemented by several technical components that together form a rather complex system. To simplify the discussion of the properties of such systems, we can group the functionality into two main components, whose interaction provides the video conferencing service to users:

The client: this is the application with which the user interacts, and it is a required component for video conferencing. An example is WhatsApp installed on a user's mobile phone.
The server: this can be thought of as an application that runs on a machine different from the one where the client runs. The client application may interact with the server to use particular services for video conferencing (more on this later). A server is not always required, and it is normally not explicitly visible to the user, since the client application takes care of the interaction with it. It is important to stress the role of the server, since users sometimes think that they are directly connected to another user when video conferencing, while their communications often go through a server.

Users can clearly distinguish two main classes of client applications: video conferencing that runs in a web browser, and video conferencing that requires installing an application.

One clear difference is that the former has a lower threshold for use, as the extra step of installing software is not required. This makes the in-browser option more convenient, but it also means that such tools have less "freedom" to operate, since they are constrained by the functionality offered by existing browsers. Usually this functionality is the product of a standardisation process, which has the advantage that it is uniform across different browsers, but the disadvantage that adoption of new features can be slow.

In contrast, an installed application can use all the functionality that an operating system (such as macOS, Windows or Linux) can offer. This can be exploited for "good" (e.g. to offer stronger encryption) or for "bad" (e.g. to perform operations that are more invasive for the privacy of the user).

The Infrastructure layer

The infrastructure layer is particularly interesting for this use case as it reveals what happens behind the scenes. We consider this layer to encompass the network and the servers that make video conferencing possible.

User devices such as mobile phones need to establish a connection in order to communicate with each other. These connections are carried over a network, such as the mobile telephone network for mobile calls, or the internet for services such as Skype. If we abstract from the physical devices implementing these networks and apply some degree of simplification, there are two main types of network configuration:

In the peer-to-peer configuration, client applications are connected directly to each other. This means that the data (i.e. the audio and the video) is exchanged directly between the participants in the video call, e.g. A can talk directly to B.

In the client-server configuration, the data goes through a server. This means that the data transits through a third party before reaching its final destination. In this case A cannot talk directly to B without passing through server C.

In theory, peer-to-peer offers more privacy as there is no third party involved in the communication, but each client has to perform more tasks as it cannot rely on the services offered by a server. We explain what the role of a server can be in the following section.

Context layers

We now take a look at video conferencing as a service that involves different aspects.

There are two main phases in video conferencing:

Discovering whom you can talk to and establishing a connection (we call this signaling).
Communicating, i.e. exchanging video and audio data in real time.

The first phase can be thought of as "looking somebody up in an address book and calling them", while the second phase starts after the called party replies and the conversation can start. We explain how these two phases differ in the peer-to-peer and client-server network configurations.

Peer-to-peer signaling

In a peer-to-peer configuration each client needs to keep track of who the other clients on the network are, and continuously listen for incoming calls. There is no central server to which clients can report their presence and which they can ask who else is online (as happens with Skype, for example). There is also no central location to which users can connect at the time of an appointment they have previously made.

This implies that each client continuously sends and receives data in order to be aware of who is online. Clients need to perform more work and this scenario is generally more difficult for resource-constrained devices, such as mobile phones.
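For contrast, the central directory that a pure peer-to-peer configuration has to do without can be sketched in a few lines. This is only an illustrative toy under stated assumptions: the class name PresenceDirectory, its methods and the example addresses are hypothetical and do not describe how Skype or any other real service is implemented.

```python
from typing import Dict, List, Optional


class PresenceDirectory:
    """Toy in-memory stand-in for a central signaling server (hypothetical)."""

    def __init__(self) -> None:
        # Maps a user name to the network address the client reported.
        self._online: Dict[str, str] = {}

    def report_presence(self, user: str, address: str) -> None:
        """A client tells the server that it is online and where to reach it."""
        self._online[user] = address

    def go_offline(self, user: str) -> None:
        """A client tells the server that it is leaving."""
        self._online.pop(user, None)

    def who_is_online(self) -> List[str]:
        """Any client can ask the server who else is currently online."""
        return sorted(self._online)

    def lookup(self, user: str) -> Optional[str]:
        """Return the address of a user, or None if that user is not online."""
        return self._online.get(user)


directory = PresenceDirectory()
directory.report_presence("alice", "198.51.100.7:5000")
directory.report_presence("bob", "203.0.113.9:5000")
directory.go_offline("bob")
```

Without such a directory, every client has to maintain an equivalent table itself and keep it up to date by exchanging messages with the other clients, which is the extra work described above.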

Peer-to-peer communication

Once a connection is established, clients can communicate with each other in the communication phase. In a conversation each client transmits directly to every other client. Since there is no server involved, there is also no resource bottleneck due to a server's processing power or network bandwidth.

The limitations are just each client's bandwidth and processing power. Of the two, the main limitation is the bandwidth, since in a conversation with N parties each client receives N-1 audio/video streams from N-1 parties and sends out N-1 audio/video streams to as many parties. In the picture above with 6 parties, A needs to hear the voice and see the video of the other 5 parties, and send its own audio and video to these 5 parties.
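The stream count above can be written down as a small calculation. This is a minimal sketch: the function names are made up for illustration, and the bitrate of 1000 kbit/s per stream is an assumed figure, not a measurement of any particular tool.

```python
from typing import Tuple


def mesh_streams_per_client(n_parties: int) -> Tuple[int, int]:
    """Return (incoming, outgoing) audio/video streams for one client
    in a peer-to-peer call where everyone sends directly to everyone else."""
    if n_parties < 1:
        raise ValueError("a call needs at least one party")
    others = n_parties - 1
    return others, others


def mesh_upload_kbps(n_parties: int, stream_kbps: int) -> int:
    """Upload bandwidth one client needs, given an assumed bitrate per stream."""
    _, outgoing = mesh_streams_per_client(n_parties)
    return outgoing * stream_kbps


# Six parties, as in the example above: A receives 5 streams and sends 5.
incoming, outgoing = mesh_streams_per_client(6)

# With the assumed 1000 kbit/s per stream, A has to upload 5000 kbit/s.
upload_needed = mesh_upload_kbps(6, 1000)
```

The required bandwidth grows linearly with the number of parties for each client, which is why large peer-to-peer calls quickly become impractical on slow connections.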

Client-server configuration

It might seem that, although with some difficulties for the signaling part, complete peer-to-peer video conferencing is possible without any server. In practice, peer-to-peer communication is also difficult to achieve because of the way clients are connected to the network. Clients are often connected to the internet through a mechanism (network address translation, NAT) that hides their real address (their IP address). It is therefore often not possible to connect to them directly. This limitation requires the use of services provided by external servers, which for example allow clients to connect to them and exchange data.
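Why a client behind NAT cannot simply be called from outside can be illustrated with a toy model. The sketch below is a strong simplification under stated assumptions: the class ToyNat, the port numbering and the example addresses are hypothetical, and real NAT devices apply further rules (for example, filtering on the remote address) that are left out here.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class ToyNat:
    """Very simplified NAT: inbound traffic is only forwarded for mappings
    that an inside client created by sending something out first."""

    def __init__(self, public_ip: str) -> None:
        self.public_ip = public_ip
        self._next_port = 40000  # arbitrary starting port for this sketch
        self._by_public_port: Dict[int, Endpoint] = {}

    def outbound(self, private_src: Endpoint) -> Endpoint:
        """An inside client sends a packet out.
        Return the public endpoint that the outside world sees."""
        for port, src in self._by_public_port.items():
            if src == private_src:
                return (self.public_ip, port)
        port = self._next_port
        self._next_port += 1
        self._by_public_port[port] = private_src
        return (self.public_ip, port)

    def inbound(self, public_port: int) -> Optional[Endpoint]:
        """Return the inside client an inbound packet is delivered to,
        or None if there is no mapping and the packet is dropped."""
        return self._by_public_port.get(public_port)


nat = ToyNat("203.0.113.1")
alice = ("192.168.1.20", 5000)  # private address, not reachable from outside

# Somebody tries to call Alice directly: no mapping exists, packet is dropped.
unsolicited = nat.inbound(5000)

# Alice first contacts an external server, which creates a mapping.
seen_by_server = nat.outbound(alice)

# Traffic sent back to that public endpoint now reaches Alice.
delivered_to = nat.inbound(seen_by_server[1])
```

In this model the external server is the only party that learns a usable public endpoint for Alice, which is why some server is usually involved even when the rest of the call is peer-to-peer.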

So although peer-to-peer is theoretically possible, it is difficult to achieve for both the signaling and the communication phases. A server (likely run by a third party) is often needed. A solution can therefore usually be peer-to-peer only to a certain extent (see for example Jami), and relies on centralised means for the rest. Even the privacy-preserving Signal app requires servers for the signaling phase, and sometimes also for the communication phase.

Security

Here we interpret security as security of the communication, i.e. its privacy. More general security considerations are given in the citizen perspective section.

Given that in most cases video conferencing data needs to go through a server, this server should know as little as possible to preserve the privacy of the users. There are two types of threats:

A third party knows who talks to whom since it can observe the signaling phase
A third party knows the content of the communication since it can observe the communication phase.

The first threat can be mitigated with systems like Tor, but it is generally considered to be less sensitive. The second threat is usually tackled by encrypting the data. There are two types of methods:

Transport-level encryption: the data is encrypted between the client and the server.
End-to-end encryption: the data is encrypted from client to client.

The first method protects against snooping on the network traffic, but the server can still see the data. This is the same kind of protection as when visiting a website with a URL starting with https: the communication is protected in transit, but the server can of course see the content.

With the second method, only the clients can see the content of the communication, which makes it preferable for privacy. Again, this means more work for the clients, since they need to manage the encryption with their own resources; the server cannot help, as this would imply that it can see the content. For example, Zoom has in the past been reported to offer "fake" end-to-end encryption: the server chose the encryption keys and distributed them to every client.
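The difference between the two methods comes down to who holds the keys, which the following toy example tries to make concrete. It is a sketch for illustration only: it uses a one-time-pad style XOR instead of the vetted cryptography a real tool would use, all variable names are hypothetical, and how the parties agree on their keys is deliberately left out.

```python
import secrets


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy cipher: XOR the data with a random key of the same length.
    Applying it twice with the same key returns the original data.
    Illustration only, not something to use in a real tool."""
    if len(key) != len(data):
        raise ValueError("key must be as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"hello from A"

# Transport-level encryption: A shares one key with the server,
# and the server shares another key with B.
key_a_server = secrets.token_bytes(len(message))
key_server_b = secrets.token_bytes(len(message))

on_wire_a = xor_bytes(message, key_a_server)                   # A -> server
seen_by_server_transport = xor_bytes(on_wire_a, key_a_server)  # server decrypts
on_wire_b = xor_bytes(seen_by_server_transport, key_server_b)  # server -> B
received_transport = xor_bytes(on_wire_b, key_server_b)        # B decrypts

# End-to-end encryption: A and B share a key that the server never learns.
key_a_b = secrets.token_bytes(len(message))

seen_by_server_e2e = xor_bytes(message, key_a_b)       # server only relays this
received_e2e = xor_bytes(seen_by_server_e2e, key_a_b)  # B decrypts
```

In the transport-level case the server holds the plaintext in seen_by_server_transport, while in the end-to-end case it only ever handles bytes it cannot decrypt without key_a_b. A server that chooses and distributes the keys itself, as in the "fake" end-to-end case mentioned above, ends up in the first situation again.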

End-to-end encryption prevents the server from providing several services that would require access to the content, such as recording the meeting or allowing people to phone in. Intelligent functions, such as detecting who is talking in order to optimise bandwidth use, are not possible either.

At the time of writing there are few offerings with end-to-end encryption. Among the commercial ones, WhatsApp and Webex support it; among the open source ones, Jami and Signal. Of the best-known open source solutions, Jitsi, SylkServer and BigBlueButton, none supports end-to-end encryption (yet). When using these services you therefore need to trust whoever runs the server (if it is a third party) not to spy on you. In some cases you might trust the server more than the party you are talking to: Signal, for example, routes your communications with a party who is not in your address book via its servers, in order not to disclose your IP address to the other party.

On the other hand, trusting a well-intentioned organisation does not mean that the communication is secure, since that depends on how many resources the organisation running the service can dedicate to securing it against attackers.

Protocol and standards

Most browser-based solutions use a standard called WebRTC, which has allowed browsers to become platforms for real-time multimedia communication. WebRTC originated when Google acquired a videoconferencing software company and subsequently open-sourced its technology, with the intention of proposing it as a standard to bodies such as the W3C and the IETF. At the time of writing, WebRTC is a W3C Candidate Recommendation.

WebRTC has therefore acted as an enabler, allowing further applications to be developed for every platform on which a browser can run.

On the other hand, WebRTC does not yet support end-to-end encryption for calls whose media passes through a server, although there are plans to develop it. WebRTC-based video conferencing tools that rely on such a server therefore cannot be end-to-end encrypted using the standard alone. As already noted above, this is the downside of relying on standards, which can be slow to adopt innovations.

The citizen perspective

From the citizen's perspective, there are two groups of characteristics of video conferencing tools that have consequences for users.

The first group is directly noticeable for users as it contributes to the user experience:

Ease of use of the tool.
Richness of features, such as recording the session or allowing participants to dial in.
Accessibility, for example from resource-limited devices or for people with disabilities.

The second group still has consequences, but these are less noticeable:

Privacy
Security

The first two characteristics are the ones most often used to choose between solutions. Nevertheless, the third one should also be considered if the goal is to be as inclusive as possible. What is easy to use for a user without disabilities can be hard for a visually impaired one. For example, blind users have found Jitsi difficult to use, as the tool relies on many visual cues that are not readable by a screen reader.

Further, there can be a certain tension between the first group and the second one. Ease of use, features and accessibility for resource-limited devices tend to require a situation where the client is "thin" and the server is "fat": more tasks are delegated to the server, with the consequence that the server can observe more of the communication between the clients. For example, end-to-end encryption (more privacy preserving) might require more resources from the client than transport encryption (better for resource-constrained devices).

A server that has more control does not necessarily mean less privacy, as long as the users have control over the server. This is the case, for example, when the server is run by an organisation and used by its own employees, or when the organisation running it can be trusted. On the other hand, if the business model of the tool provider is (also) based on selling user data, then the server can be expected to behave in a privacy-invasive way (see for example past news reports about Zoom spying on its users). In any case, introducing a third party into the scenario brings a possible loss of privacy.

Security is a dimension of its own, as it cannot be categorised in terms of thin vs fat client or peer-to-peer vs centralised solutions. Security depends on the weakest part of the system, and each scenario has one (or more). As an example, peer-to-peer scenarios can be vulnerable if it is easy to impersonate one of the peers, and open source solutions like Jitsi can be vulnerable if the server running Jitsi is not secured and monitored. A high level of security (and privacy) requires careful examination of each possible solution.

Several potentially conflicting characteristics emerge from the discussion presented so far. These characteristics can be used to examine a particular solution or design a new one. The importance of each dimension should be carefully considered as privileging one might imply penalizing the others.

Related alternatives

Jitsi

Jitsi is a free and open source video conferencing tool.
