Using YAML over the network.


There are a number of popular text-based protocols for exchanging data over the network. These include XML, FIX, and JSON. Chronicle Engine uses YAML, which has some advantages and disadvantages.

Isn't text slower than binary?

Text protocols are slower than binary protocols. The cost of encoding numbers, and even Unicode strings, adds overhead for the CPU.

While text is slower, it has one major advantage over binary: human readability. This makes it much easier to describe the protocol and to implement a solution for the interface without using a framework.

While binary is faster than text, you may find that text is fast enough, in which case you want a format which is as easy to work with as possible.

The following lists the latency for serializing and deserializing an object with 6 fields of different types. The TextWire is in YAML format, the BinaryWire is a binary form, and RawWire and SBE are other binary formats.

For more details see Chronicle-Wire/microbenchmarks.
All times are in microseconds.

Wire Format 99.9 %tile 99.99 %tile 99.999 %tile
YAML (TextWire)
YAML (TextWire)
BinaryWire text fields
BinaryWire number fields
BinaryWire field less
RawWire UTF-8
RawWire 8-bit
BytesMarshallable + stop bit encoding

It is in these high percentiles that common Java libraries usually show much higher latencies, typically due to GC pauses. Even at modest throughput this latency jitter starts to matter. For example, if you are processing 10,000 messages per second, a pause of 1,405 μs (Jackson's 99.999 %tile, noted below) would delay 14-15 messages, not just one.

Format            Size in bytes   99.99 %tile latency
Jackson           100             8.3 μs
BSON + C-Bytes    96              15.1 μs
Snake YAML        88              4,067 μs
Boon JSON         99              32.5 μs
Externalizable    197             29.3 μs

"+ C-Bytes" means when used with Chronicle Bytes to recycle the buffer.

While Jackson had a good result at the 99.99 %tile, its 99.999 %tile was 1,405 μs.
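
To see why one pause of that size affects more than one message: at 10,000 messages per second a new message arrives every 100 microseconds, so a 1,405 microsecond pause overlaps the arrival of about 14 messages. A small sketch of that arithmetic follows; the class and method names are purely illustrative, and it assumes messages arrive at a steady rate.

```java
// Illustrative arithmetic only: how many messages arrive during one pause,
// assuming a perfectly steady arrival rate.
final class JitterEstimate {
    /** Number of messages that arrive while a pause of the given length is in progress. */
    static double messagesArrivingDuring(double pauseMicros, double messagesPerSecond) {
        double microsBetweenMessages = 1_000_000.0 / messagesPerSecond;
        return pauseMicros / microsBetweenMessages;
    }

    public static void main(String[] args) {
        // Jackson's 99.999 %tile from the text, at 10,000 messages per second.
        double delayed = messagesArrivingDuring(1_405.0, 10_000.0);
        System.out.printf("%.2f messages arrive during the pause%n", delayed);
    }
}
```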

What do these formats look like?


YAML (TextWire)

This format has a 4-byte size prefix, which is decoded in the first line.
--- !!data
price: 1234
flag: true
text: Hello World!
side: Sell
smallInt: 123
longInt: 1234567890
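
The idea of a size prefix in front of a text message can be sketched as below. This is not Chronicle Wire code: the class name is hypothetical, the length is assumed to be little-endian (as the binary dumps later in this post suggest), and any flags the real header may carry are ignored.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of length-prefixed text framing; not the Chronicle Wire API.
final class SizePrefixedText {
    /** Frames text as a 4-byte little-endian length followed by the UTF-8 bytes. */
    static byte[] frame(String text) {
        byte[] body = text.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(4 + body.length).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putInt(body.length);
        buffer.put(body);
        return buffer.array();
    }

    /** Reads the length prefix, then decodes exactly that many bytes as UTF-8 text. */
    static String unframe(byte[] framed) {
        ByteBuffer buffer = ByteBuffer.wrap(framed).order(ByteOrder.LITTLE_ENDIAN);
        int length = buffer.getInt();
        byte[] body = new byte[length];
        buffer.get(body);
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String yaml = "price: 1234\nflag: true\ntext: Hello World!\n";
        byte[] framed = frame(yaml);
        System.out.println((framed.length - 4) + " bytes of text follow the prefix");
        System.out.print(unframe(framed));
    }
}
```

A reader only needs the first four bytes to know how much to read before handing the rest to a YAML parser.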

BinaryWire with text fields

This is what the data looks like when automatically translated into text.
--- !!data #binary
price: 1234
flag: true
text: Hello World!
side: Sell
smallInt: 123
longInt: 1234567890

BinaryWire with number fields

This is what the data looks like when automatically translated into text.
--- !!data #binary
3: 1234
4: true
5: Hello World!
6: Sell
1: 123
2: 1234567890

RawWire without meta data

00000000 27 00 00 00 00 00 00 00  00 48 93 40 B1 0C 48 65 '······· ·H·@··He
00000010 6C 6C 6F 20 57 6F 72 6C  64 21 04 53 65 6C 6C 7B llo Worl d!·Sell{
00000020 00 00 00 D2 02 96 49 00  00 00 00                ······I· ···

BytesMarshallable with stop bit encoding

00000000 18 00 00 00 A0 A4 69 D2  85 D8 CC 04 7B 59 00 0C ······i· ····{Y··
00000010 48 65 6C 6C 6F 20 57 6F  72 6C 64 21             Hello Wo rld!
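
Stop bit encoding writes an integer 7 bits at a time, lowest bits first, and sets the top bit of every byte except the last. The sketch below is a minimal version for non-negative values only, with hypothetical names; it is not the Chronicle Bytes implementation. Under this scheme 1234567890 encodes to the bytes D2 85 D8 CC 04 and 123 to 7B, which appears to match the longInt and smallInt fields in the dump above.

```java
import java.io.ByteArrayOutputStream;

// Minimal sketch of stop bit encoding for non-negative longs; names are hypothetical.
final class StopBit {
    /** Encodes 7 bits per byte, low-order group first; the top bit means "more bytes follow". */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch handles non-negative values only");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (value > 0x7F) {
            out.write((int) (value & 0x7F) | 0x80); // top bit set: continue
            value >>>= 7;
        }
        out.write((int) value); // top bit clear: this byte stops the number
        return out.toByteArray();
    }

    /** Decodes a value produced by encode, stopping at the first byte with the top bit clear. */
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }

    /** Formats bytes as space-separated upper-case hex, like the dumps in this post. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode(1234567890L)));
        System.out.println(hex(encode(123L)));
    }
}
```

Small values take one byte and larger values grow as needed, which is why this message is shorter than the fixed-width forms.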

Simple Binary Encoding

00000000 29 00 7B 00 00 00 D2 02  96 49 00 00 00 00 00 00 )·{····· ·I······
00000010 00 00 00 48 93 40 01 0C  48 65 6C 6C 6F 20 57 6F ···H·@·· Hello Wo
00000020 72 6C 64 21 00 00 00 00  01 00 00                rld!···· ···   

For the full set of results and the different wire formats, see Chronicle-Wire/microbenchmarks.

Human readable

XML and JSON are derived text formats. They are reduced forms of SGML and JavaScript respectively. This means they can be read by humans but weren't designed for this purpose. As we will see, not being specifically designed for human readability has some advantages.

YAML: A format designed for human readability.

The advantage of YAML, as we see it, is that it was specifically designed for human readability. This means it is less verbose and has a richer set of constructs.

The main disadvantage is that it was designed for human readability (rather than machine readability). Its richer set of constructs means that you can arrange the data to taste, though writing a program to arrange data in a tasteful manner is much harder.

A related disadvantage is that different implementations can be incompatible with each other, as there are more options to support, some of which are left to interpretation. For example, symbols should be placed in quotes if needed, and different libraries have different ideas of whether a given symbol needs to be quoted. In the spec, there are examples where strings containing quotes are not themselves quoted. There are also two kinds of quotes, single and double.

So why use YAML?

YAML has the advantage that it was at least designed to be read by humans. It is not the fastest format, though it can be more than fast enough, and it is very readable. By comparison, XML, JSON, and FIX were not designed for speed either, nor are they particularly readable.

Protocol Documentation

Using YAML makes it easy to document what needs to be sent over the wire.  We have unit tests for different functionality where we log what is sent and received in text.  We add some meta data around those messages and have an output which can be directly included into our documentation.  This means we have confidence it is correct.

As text, we can include what the message is expected to look like and easily detect when even minor changes have altered the message contents. This is as simple as checking that the strings match. The IDE can then show you a multi-line comparison so you can see the exact field which has been altered.
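
A check of this kind can be sketched without any framework. The helper below is hypothetical (it is not part of Chronicle Wire or of any test library) and simply reports the first line on which the expected and actual message text differ, which is roughly what an IDE's comparison view highlights.

```java
// Hypothetical helper: compares expected and actual message text line by line.
final class WireTextCheck {
    /** Returns the 1-based number of the first differing line, or 0 if the texts are identical. */
    static int firstDifferingLine(String expected, String actual) {
        if (expected.equals(actual)) {
            return 0;
        }
        String[] e = expected.split("\n", -1);
        String[] a = actual.split("\n", -1);
        int common = Math.min(e.length, a.length);
        for (int i = 0; i < common; i++) {
            if (!e[i].equals(a[i])) {
                return i + 1;
            }
        }
        return common + 1; // one text is a line-by-line prefix of the other
    }

    public static void main(String[] args) {
        String expected = "--- !!data\nprice: 1234\nflag: true\n";
        String actual = "--- !!data\nprice: 1234\nflag: false\n";
        System.out.println("first differing line: " + firstDifferingLine(expected, actual));
    }
}
```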

What can we do about YAML being slow?

We use a high level API, Chronicle Wire, where you can choose the exact wire format as an independent concern. This means we can switch to using a binary protocol once we have checked that the text protocol works. We use a binary translation of YAML, but we can also use a raw data format which strips away all the meta data for maximum speed.
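
The general idea of keeping the wire format independent of the object being written can be sketched as follows. Every name here is hypothetical and the code does not use the Chronicle Wire API; it only illustrates an object describing its fields once while the chosen sink decides whether they become readable text or bare bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not the Chronicle Wire API.
final class WireChoice {
    /** The object writes named fields; the implementation decides the format. */
    interface FieldSink {
        void writeInt(String name, int value);
        void writeBoolean(String name, boolean value);
        byte[] toBytes();
    }

    /** Text form: "name: value" lines, easy to read and to compare in tests. */
    static final class TextSink implements FieldSink {
        private final StringBuilder sb = new StringBuilder();
        public void writeInt(String name, int value) {
            sb.append(name).append(": ").append(value).append('\n');
        }
        public void writeBoolean(String name, boolean value) {
            sb.append(name).append(": ").append(value).append('\n');
        }
        public byte[] toBytes() {
            return sb.toString().getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Raw form: field names are dropped and only the values are written. */
    static final class RawSink implements FieldSink {
        private final ByteBuffer buffer = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        public void writeInt(String name, int value) {
            buffer.putInt(value);
        }
        public void writeBoolean(String name, boolean value) {
            buffer.put((byte) (value ? 1 : 0));
        }
        public byte[] toBytes() {
            return Arrays.copyOf(buffer.array(), buffer.position());
        }
    }

    /** A message type that is written the same way whichever sink is used. */
    static final class Order {
        final int smallInt;
        final boolean flag;
        Order(int smallInt, boolean flag) {
            this.smallInt = smallInt;
            this.flag = flag;
        }
        void writeTo(FieldSink sink) {
            sink.writeInt("smallInt", smallInt);
            sink.writeBoolean("flag", flag);
        }
    }

    public static void main(String[] args) {
        Order order = new Order(123, true);
        FieldSink text = new TextSink();
        order.writeTo(text);
        System.out.print(new String(text.toBytes(), StandardCharsets.UTF_8));
        FieldSink raw = new RawSink();
        order.writeTo(raw);
        System.out.println(raw.toBytes().length + " bytes in raw form");
    }
}
```

Because the message class never mentions a format, switching from text during development to a raw form in production is a one-line change at the point where the sink is created.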

We also have tools to convert "Binary YAML" automatically to YAML for logging and debugging purposes.


Being able to use YAML for testing and development is a productive way to develop new solutions, and having the option to switch to a binary form for speed gives a good combination of readability and speed.

