
Clock synchronisation near the Quantum: How to experience the world in 4 nanosecond increments

There is a lot of noise in the clock synchronisation space. It's a seemingly simple subject (adjusting a clock to be accurate) but in practice it's remarkably complicated to do well.


Lots of companies make lots of claims, but few document to any degree that their products deliver what is advertised. Sales people tend to exaggerate; indeed, as a subject matter expert, I am sometimes surprised by the sorts of nonsensical claims vendors make. Generally, the standard of solutions, in terms of both hardware and software, is low, and the market segment lacks innovation. One thing that particularly struck me is how the availability of cloud computing power had not previously been taken advantage of. It was for these reasons that my colleague Ian Gough and I started Timebeat.app. We thought that neither the free nor the commercial clock synchronisation solutions really delivered what customers in finance, telecom, power, gaming, security etc. were demanding, or what modern technology could be leveraged to create.


Today I am pleased to report that the Timebeat.app platform has developed market-leading performance in two areas in particular:

  • Our software has the best management and alerting platform bar none. We use Elasticsearch and Grafana hosted in Google Cloud to store and display data, and our software uses X.509 PKI to securely offload log data over the Internet. We use GKE, which provides effectively limitless scale, and the containerised nature of the management/alerting back-end makes it as easy to spin up in an on-prem cloud as it is when hosted in a public cloud.

  • Our versatile (PTP/NTP/PPS/GPS) clock synchronisation software has market-leading accuracy, and we are able to distribute this accuracy throughout a network while staying inside the ITU-T G.8273.2 Class C limit (+/-30 ns) when acting as a boundary clock and the Class B limit (+/-70 ns) when acting as an ordinary clock.

For me, though, what I am particularly proud of is the performance of our solution in less than ideal conditions (because most real-life IT estates are less than ideal), so in this article I will compare Timebeat.app's performance on state-of-the-art hardware and on cheap commodity hardware.



The Lab

To compare performance in ideal and less-than-ideal conditions I created the lab network below and installed Timebeat Enterprise.



The Grandmaster Role

GPS-PPS

To get a source of UTC into our lab I use a cheap GPS-disciplined oscillator I bought on eBay years ago. To get "minor" time (the fractional part of a second) into the system where Timebeat.app runs in a "grandmaster clock" role (testhost03 below), I connect a 1PPS signal to a Mellanox ConnectX-6 network card (which has two SMA ports for 1PPS in and out). Separately, to get "major" time (the non-fractional part of a particular point in time), we use a serial cable from the GPS receiver, and Timebeat.app reads the NMEA 0183 output. Perhaps it sounds complicated, but it's really not. The snippet below from the configuration is all that's required.
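For anyone curious what reading "major" time from the NMEA stream involves under the hood, here is a minimal sketch (illustration only, not Timebeat.app's code or configuration) that parses the date and time out of a $GPRMC sentence; the 1PPS edge then pins down exactly where that second starts.

```python
# Minimal sketch: extract "major" time (the whole seconds of UTC) from an
# NMEA 0183 $GPRMC sentence. Illustration only, not Timebeat.app code.
from datetime import datetime, timezone

def nmea_checksum_ok(sentence: str) -> bool:
    """Validate the XOR checksum between '$' and '*'."""
    body, _, expected = sentence.strip().lstrip("$").partition("*")
    checksum = 0
    for ch in body:
        checksum ^= ord(ch)
    return expected[:2].upper() == f"{checksum:02X}"

def gprmc_to_utc(sentence: str) -> datetime:
    """Return the UTC date/time carried by a $GPRMC sentence."""
    if not nmea_checksum_ok(sentence):
        raise ValueError("bad NMEA checksum")
    fields = sentence.split(",")
    hhmmss = fields[1]          # e.g. "123519"
    status = fields[2]          # "A" = valid fix
    ddmmyy = fields[9]          # e.g. "230394"
    if status != "A":
        raise ValueError("no valid GPS fix")
    return datetime.strptime(ddmmyy + hhmmss[:6], "%d%m%y%H%M%S").replace(
        tzinfo=timezone.utc)

# The 1PPS edge timestamped by the NIC marks the exact top of the second
# ("minor" time); the sentence tells you *which* second that edge belongs to.
print(gprmc_to_utc("$GPRMC,123519,A,4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W*6A"))
```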




About the Mellanox NIC, I will just mention that it is absolutely state-of-the-art. I like the hardware, I like the performance, and I like the people at Nvidia who developed it; they are really clued-up individuals. It works very well with Timebeat.app, and at 200Gbps, odds are it'll cover most requirements.

When we synchronise the clock on the Mellanox card to the 1PPS connection from the GPSDO we achieve the accuracy below:



Our Grafana-based front-end allows us to evaluate performance in real time. We observe two things from the scatter plot above (and indeed the histogram). Firstly, the PPS performance and synchronisation is so good that every offset measured (in any time frame) is within +/-20 ns of the Mellanox clock. This is a good indicator that performance is good and stable; the clock servo algorithm is doing its job well. Secondly, we can see that the Mellanox quantum is 4 ns. That is, we might measure an offset of 0, 4 or 8 ns etc., but never in between these values. The PHC advances in 4 ns increments. (I happen to know from conversations with the vendor that this is indeed the case.) We can also see that the servo algorithm is working hard to keep up with changes in temperature and other variables that affect the tick rate of the clock on the Mellanox card (and every other clock, for that matter), because our front-end also shows the sync algorithm output broken down into its constituent parts:
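The servo itself is Timebeat.app's own, but the breakdown into constituent parts follows the shape of a classic PI clock servo: a proportional term reacting to the latest measured offset and an integral term tracking the slowly drifting frequency error. A minimal sketch of that textbook structure (the gains are illustrative assumptions, not our tuned values):

```python
# Minimal sketch of a classic PI clock servo: NOT Timebeat.app's servo,
# just the textbook structure its output decomposes into. The gains
# KP and KI are illustrative assumptions.

KP = 0.7    # proportional gain: reacts to the latest measured offset
KI = 0.3    # integral gain: accumulates the residual frequency error

class PiServo:
    def __init__(self):
        self.freq_integral_ppb = 0.0   # running estimate of frequency error

    def sample(self, offset_ns: float, interval_s: float) -> float:
        """Return the frequency correction (in ppb) to apply to the clock
        for one measured offset (1 ns of offset per second equals 1 ppb)."""
        proportional_ppb = KP * offset_ns / interval_s
        self.freq_integral_ppb += KI * offset_ns / interval_s
        return proportional_ppb + self.freq_integral_ppb

servo = PiServo()
for offset in (120.0, 80.0, 30.0, 8.0, -4.0):   # offsets in ns, one second apart
    print(f"offset {offset:+7.1f} ns -> adjust {servo.sample(offset, 1.0):+8.2f} ppb")
```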


As is evident in the lab diagram above, each host has more than one clock. In fact there is typically one clock per network interface, in addition to a system clock (the kernel clock) and the RTC (the "BIOS" clock). Timebeat.app synchronises all of these, but we have to limit the complexity of the diagram somewhat. For the purposes of this article, what is important to consider is that time from the GPSDO is used to synchronise the clock on the Mellanox card. The time on the Mellanox card is in turn used to synchronise the system clock, and the system clock is used to synchronise the clock on another Intel-based NIC on the motherboard of testhost03. Every time one clock in this chain is used to synchronise another, we introduce the possibility of error. If we don't do it well, this error accumulates along the chain and clocks will no longer be accurate.
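As a rough illustration of how an offset between two neighbouring clocks in such a chain can be measured, here is a sketch of the "sandwiched reading" idea (the same principle the Linux PTP_SYS_OFFSET mechanism is built on). The read_nic_clock_ns() helper is a hypothetical placeholder, not a real Timebeat.app or kernel API:

```python
# Sketch: estimate the offset between a NIC clock and the system clock by
# sandwiching a NIC read between two system-clock reads and keeping the
# sample with the shortest read window. Illustration only.
import time

def read_nic_clock_ns() -> int:
    # Hypothetical placeholder: in reality this would read the NIC's PHC
    # (e.g. via /dev/ptpX). Here we just return the system clock.
    return time.clock_gettime_ns(time.CLOCK_REALTIME)

def phc_to_system_offset_ns(samples: int = 10) -> int:
    """Estimate (NIC clock - system clock), keeping the least uncertain sample."""
    best = None
    for _ in range(samples):
        sys_before = time.clock_gettime_ns(time.CLOCK_REALTIME)
        phc = read_nic_clock_ns()
        sys_after = time.clock_gettime_ns(time.CLOCK_REALTIME)
        window = sys_after - sys_before
        offset = phc - (sys_before + window // 2)
        if best is None or window < best[0]:
            best = (window, offset)
    return best[1]

print(f"estimated NIC - system offset: {phc_to_system_offset_ns()} ns")
```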


PTP


In our lab setup we connect the Mellanox card in testhost03 back-to-back with another identical Mellanox card in testhost04. The two cards are connected with a QSFP56 direct attach cable; there is no switch in between. It's pretty difficult to imagine better conditions for connecting two hosts with Ethernet, so when we evaluate the result we'll keep in mind that the conditions for clock synchronisation across this 200Gbps link are absolutely ideal. Conversely, testhost03 also serves time from another (not at all state-of-the-art) network interface on the motherboard. Here the hardware is cheap, the speed is 1Gbps, and rather than being connected back-to-back with another host there is a cheap desktop switch with very unpredictable latency and cut-through performance.


The relevant grandmaster clock configuration is shown below. As is evident, PTP is served on two different interfaces in two different PTP domains. Both unicast and multicast are available; for the purposes of this test multicast was used.


The Boundary Clock


Meanwhile, on testhost04 we receive time over the high-quality 200Gbps link from testhost03 on interface ens5f0. The clock on ens5f0 is used to synchronise the system clock, which in turn is used to synchronise the clock on enp1s0. In the scatter plots below we can see how accurately the synchronisation occurs through this chain of clocks. As testhost04 is acting as a boundary clock, this is extremely important, because any error accumulated here will be passed on.


As is evident, no more than 10-20 ns of error is introduced, and the servo ensures very good synchronisation across the three internal clocks relevant to our experiment on testhost04. It's handy that the chain is available, and that insight is given into what the source of time is and how time has been disseminated throughout the system.


We can also see the quantum of the clock on ens5f0, which is a good sign. Because of how the PTP offset calculation is performed, the computed offsets can take values in 2 ns increments even though the clock itself still advances in 4 ns increments. The histogram of offsets confirms the excellent performance.
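To see why 2 ns increments appear, recall the standard IEEE 1588 offset calculation: the offset is half the difference of two timestamp differences, so timestamps on a 4 ns grid can produce offsets on a 2 ns grid. A small worked sketch with assumed timestamp values (not lab data):

```python
# Standard IEEE 1588 two-step offset/delay calculation.
# t1 = Sync sent by master, t2 = Sync received by slave,
# t3 = Delay_Req sent by slave, t4 = Delay_Req received by master (all ns).

def ptp_offset_and_delay(t1: int, t2: int, t3: int, t4: int):
    master_to_slave = t2 - t1          # path delay + offset
    slave_to_master = t4 - t3          # path delay - offset
    offset = (master_to_slave - slave_to_master) / 2
    mean_path_delay = (master_to_slave + slave_to_master) / 2
    return offset, mean_path_delay

# All four timestamps sit on a 4 ns grid, yet the computed offset lands on
# a 2 ns grid because of the division by two.
offset, delay = ptp_offset_and_delay(t1=1000, t2=1112, t3=2000, t4=2100)
print(f"offset = {offset} ns, mean path delay = {delay} ns")   # offset = 6.0 ns
```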


Traffic load on high capacity links


I will add that we have run extensive load tests to see whether load affects the stability of our servo algorithm as network buffer queues build up. We find that it makes no difference, but there are several problems associated with generating enough traffic to saturate high-capacity links. Below is a typical iperf output.


As is the case with testhost03, enp1s0 in testhost04 is a low-cost LAN-on-motherboard Ethernet controller running at 1Gbps, and this interface is connected to the low-cost (unreliable, non-PTP-aware) desktop switch. As testhost04 acts as a boundary clock, we have configured it to receive time from testhost03 on one side and act as a source of time for testhost02 on the other. We give the config below.

The Ordinary Clock


We now arrive at the point where we have our two separate paths of time arriving in the same place on testhost02.


  1. Clock(b) -> Clock(c) -> Clock(d) -> Clock(h) (direct route from grandmaster clock to ordinary clock)

and

  2. Clock(b) -> Clock(e) -> Clock(f) -> Clock(g) -> Clock(h) (indirect route from grandmaster clock to ordinary clock via boundary clock)


In both cases the last hop, from Clock(g) -> (PTP domain 15) -> Clock(h) and from Clock(d) -> (PTP domain 17) -> Clock(h), travels across the low-end Ethernet controllers and via the cheap desktop switch. So what does the comparison of the two time sources look like on testhost02? If the Timebeat.app servo and associated components did a reasonable job, the two should be roughly the same. Below we can see the scatter plot.


As is evident, even time that has travelled through a boundary clock and across cheap NICs and a low-end desktop switch manages to stay within a few nanoseconds of the time that took the direct route. Both the mean and the variance of the offset distributions are very good. In the front-end we draw a line graph of the range between the exponential moving averages of the two offset distributions so we can compare them at a glance (we call this the "traceability" of the time sources).
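As a rough sketch of how such a traceability comparison could be computed (the smoothing factor and the sample offsets below are illustrative assumptions, not our production values):

```python
# Sketch of the "traceability" idea: keep an exponential moving average of
# the offsets measured via each path and watch the difference between them.

ALPHA = 0.1   # illustrative smoothing factor

def ema_update(current_ema: float, new_offset_ns: float) -> float:
    return (1 - ALPHA) * current_ema + ALPHA * new_offset_ns

ema_direct = 0.0      # grandmaster -> ordinary clock, direct path
ema_boundary = 0.0    # grandmaster -> boundary clock -> ordinary clock

# In practice these offsets would come from the two PTP domains in real time.
offsets_direct = [3.0, -5.0, 7.0, 1.0, -2.0]
offsets_boundary = [9.0, -1.0, 4.0, 6.0, -6.0]

for d, b in zip(offsets_direct, offsets_boundary):
    ema_direct = ema_update(ema_direct, d)
    ema_boundary = ema_update(ema_boundary, b)
    print(f"direct EMA {ema_direct:+.2f} ns, via boundary {ema_boundary:+.2f} ns, "
          f"difference {ema_direct - ema_boundary:+.2f} ns")
```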


It is evident from the traceability graph that the difference between the two EMAs stays within 20 ns. That means the two time sources vary from approximately -10 to +10 ns of each other. This is a very strong indication that very little error has been introduced into the two time sources as a result of taking different paths around our lab network. In fact, we also confirm this regularly by making electrical measurements of the 1PPS output on testhost02 against a GPS-derived 1PPS output. If they match (and they do), then we are good.

So what is the reason for this?


Why does it work down to a few nanoseconds of accuracy without PTP-aware switches and with cheap network cards?


The reason is that Timebeat.app filters outliers very effectively. We can see this if we take the plot above and display the outliers as well.


Same graph, same time frame. What is evident is how we've managed to filter out the outliers that would otherwise have significantly impacted the servo; we simply ignore those offset calculations. Of course Timebeat.app will play nicely with transparent clocks that make the residence (cut-through) time of switches available to PTP, but it's quite nice if the client PTP software is also able to act intelligently on its own. I think Timebeat.app does this particularly well.
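To make the idea concrete, here is a deliberately simple filter sketch. It is not Timebeat.app's actual algorithm; it just rejects any offset sample whose measured path delay sits well above the lowest delay seen recently, on the assumption that the excess delay came from queueing rather than from the clocks. The margin and window sizes are illustrative assumptions.

```python
# Simple outlier filter sketch (NOT Timebeat.app's actual algorithm):
# track the lowest path delay seen recently and ignore any offset sample
# whose path delay is much larger than that floor.
from collections import deque

DELAY_MARGIN_NS = 500          # illustrative threshold above the delay floor
WINDOW = 128                   # number of recent samples used for the floor

recent_delays = deque(maxlen=WINDOW)

def accept_sample(offset_ns: float, path_delay_ns: float) -> bool:
    """Return True if the offset sample should be fed to the servo."""
    recent_delays.append(path_delay_ns)
    delay_floor = min(recent_delays)
    return path_delay_ns <= delay_floor + DELAY_MARGIN_NS

# Example: the 25000 ns delay sample (a queued packet) is ignored.
for offset, delay in [(4.0, 1200.0), (2.0, 1180.0), (900.0, 25000.0), (-2.0, 1210.0)]:
    print(offset, delay, "accepted" if accept_sample(offset, delay) else "ignored")
```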


Traffic load on low capacity links


I'll give another example from the test environment discussed in this article to illustrate the point. Below I use iperf to saturate the 1Gbps link (which is far easier to saturate than a 200Gbps link) - in fact I do it twice.


As can be observed in the scatter plot below, the filter manages to catch every single outlier during both iperf runs, so we maintain accurate synchronisation by not steering the clock servos too hard as a result of network buffer queues etc. Strategies like this are sometimes referred to as "perfect packet", and if implemented correctly they significantly improve accuracy.

In the real world these periods of congestion occur for a few milliseconds here and there in a non-deterministic fashion: on the network side as clients make requests of servers, and internally in servers when data is transferred across the PCI bus, when the CPU is busy, and so on. But if you can identify these outliers correctly, then you can ignore them, and this improves accuracy greatly.

Conclusion

I think we have shown that we are good both at implementing synchronisation servos and at managing the error introduced by network latencies. Some people (switch vendors primarily) will have you believe that the accuracy of PTP has a straightforward positive linear correlation with how much money you spend on buying new PTP-aware switches. The reality, as we have shown, is that state-of-the-art hardware certainly has a positive impact on performance, but the quality of the clock synchronisation software is more important than anything else.

I also quite like the front-end we put on Timebeat.app, and it is certainly useful for providing insight into the quality of timekeeping, as we have seen. Not only is it useful to our customers, making alerts, reports and deployment easy, but the insight into what is actually going on hop by hop in the time chain makes improving the system easier for me, its creator. I suspect this is why this system is so much better than other systems available: the insight is complete and all the data is readily available.

I'm personally excited by the increasing number of NICs being launched with 1PPS in/out capabilities and the efforts under way to build open hardware in which the GPS receiver and a high-quality oscillator are commoditised. For us, a software vendor making what we could call a Swiss army knife of clock synchronisation for multiple operating systems with a cloud back-end, the future looks quite interesting.

In this lab we used Linux and hardware timestamping throughout. In the future I might write about how we improve clock synchronisation accuracy on the Windows platform and on virtualised platforms. Both are interesting and have their place in the wider ecosystem too.
