Overview

Most file/serialization formats can be broadly divided into two categories: Human Readable Text and Machine Readable Binary. Human readable formats have the advantage of being easily understood by a person reading them. Machine readable formats are easier and faster for a machine to encode and decode.

There are formats which attempt to be a little of both. XML, JSON and CSV are examples of these. However, they do not come close to the performance a binary format can achieve.

Myth: Machine Readable Binary is always more compact than Human Readable Text

Binary can be more compact, but the obscurity of its format makes it difficult to ensure every byte counts, i.e. it's usually hard enough getting something to work, and making it compact as well is an added complication. With human readable formats, it is much easier to see how the format can be made more compact.

As text:  38 bytes long, [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
As binary: 290 bytes long, 
....sr..java.util.ArrayListx.....a....I..sizexp....w.....sr..java.lang.Long;
.....#....J..valuexr..java.lang.Number...........xp........sq.~..........sq.~
..........sq.~..........sq.~..........sq.~..........sq.~..........sq.~
..........sq.~..........sq.~..........sq.~..........sq.~..........x

Even though the first format is already more compact, you can immediately see that you could drop the [ ] and the spaces after the commas to make it more compact still. With binary formats, it is hard to know where to start.

ComparingHumanReadableToBinaryMain.java

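import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ComparingHumanReadableToBinaryMain {
    public static void main(String[] args) throws IOException {
        List<Long> longs = new ArrayList<>();
        for (long i = -1; i <= 10; i++)
            longs.add(i);
        // Text form: the list's toString() output.
        String asText = longs.toString();
        byte[] bytes1 = asText.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("As text:  " + bytes1.length + " bytes long, " + asText);

        // Binary form: standard Java serialization.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            oos.writeObject(longs);
        }
        byte[] bytes2 = baos.toByteArray();
        // Replace every non-printable character with '.' so the dump can be displayed.
        System.out.println("As binary: " + bytes2.length + " bytes long, "
                + new String(bytes2, StandardCharsets.ISO_8859_1).replaceAll("[^\\p{Graph}]", "."));
    }
}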
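
The saving suggested above (dropping the brackets and the spaces after the commas) can be sketched as follows. This is only an illustration: the class and method names (CompactTextExample, toCompactText) are hypothetical and not part of the original program.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class CompactTextExample {
    // Hypothetical helper: join the values with a bare comma, no brackets or spaces.
    static String toCompactText(List<Long> values) {
        return values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        List<Long> longs = new ArrayList<>();
        for (long i = -1; i <= 10; i++)
            longs.add(i);
        String verbose = longs.toString();
        String compact = toCompactText(longs);
        System.out.println("Verbose: "
                + verbose.getBytes(StandardCharsets.US_ASCII).length + " bytes, " + verbose);
        System.out.println("Compact: "
                + compact.getBytes(StandardCharsets.US_ASCII).length + " bytes, " + compact);
    }
}
```

For the same twelve values, removing the two brackets and eleven spaces takes the text from 38 bytes down to 25, and the result is still readable at a glance.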

Myth: Machine Readable Binary is always faster than Human Readable Text

It is often assumed that the cost of parsing a human readable format always makes it slower. However, machine readable formats have to deal with an issue that human readable formats take for granted: byte endianness. In a human readable format the order of the digits is obvious, but in a machine format the byte order of the data might not match the natural byte order of the CPU, which adds overhead because the bytes have to be swapped around. One example is using big-endian (e.g. TCP/network byte order) on a little-endian machine, such as an Intel/AMD machine running Windows or Linux.

Common classes with this issue are DataInputStream and DataOutputStream, which re-arrange the byte order (even if the native byte order matches). For this reason, a fast human readable parser can be as fast or faster. In an earlier article, "Writing human readable data faster than binary", I showed how a human readable format could be used to read/write integers 30% faster than using DataInput/DataOutput.
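
To make the byte order issue concrete, here is a minimal sketch that encodes the same int in both byte orders using java.nio.ByteBuffer. The class and helper names (ByteOrderExample, encode, toHex) are made up for this illustration, and it does not measure performance.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderExample {
    // Encode an int in the requested byte order and return the raw bytes.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    // Render bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes)
            sb.append(String.format("%02x ", b));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int value = 0x01020304;
        System.out.println("Native order:  " + ByteOrder.nativeOrder());
        System.out.println("Big-endian:    " + toHex(encode(value, ByteOrder.BIG_ENDIAN)));
        System.out.println("Little-endian: " + toHex(encode(value, ByteOrder.LITTLE_ENDIAN)));
        // The text form has only one possible digit order.
        System.out.println("As text:       " + Integer.toString(value));
    }
}
```

The big-endian encoding is 01 02 03 04 and the little-endian encoding is 04 03 02 01, so a reader and writer must agree on the order, whereas the decimal text is the same on every machine.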

Myth: Using a Human Readable Format makes it easy to read

Just using a human readable format doesn't mean the data will be easier to read than in a machine readable format. Being able to reuse existing tools as much as possible is what makes a human readable format preferable. However, machine readable formats can come with tools which decode the data and make maintaining it easier. If your data can only be managed with specialist tools, being human readable is not much of an advantage. Images are a good example of where a machine readable format is the best option. It is hard to imagine editing or viewing an image without a specialist tool. A practical human readable format would undoubtedly lower the quality of the image. ;)

________/.- ,’_______`-. \
_________\ /`__________\’/
_________ /___’a___a`___\
_________|____,’(_)`.____ |
_________\___( ._|_. )___ /
__________\___ .__,’___ /
__________.-`._______,’-.__
________,’__,’___`-’___`.__`.
_______/____/____V_____\___\_
_____,’____/_____o______\___`.__
___,’_____|______o_______|_____`.
__|_____,’|______o_______|`._____|
___`.__,’_.-\_____o______/-._`.__,’
__________/_`.___o____,’__\_
__.””-._,’_____`._:_,’_____`.,-””._
_/_,-._`_______)___(________’_,-.__\
(_(___`._____,’_____`.______,’___)_)
_\_\____\__,’________`.____/.___/_/

On the other hand, human readable formats can be almost as obscure. This is a piece of code written in a language I am not worthy of mentioning. ;) It is described as "used to list all of the prime numbers between 1 and R"

(!R)@&{&/x!/:2_!x}'!R

Conclusion

If you are designing a file format, start with a human readable one as it's much easier to understand. If this is not compact enough, consider compressing it. If it is not fast enough, consider making it a binary format, but make sure such a format really is faster. If you are going to use a binary format, make sure you have tools in place to support viewing (and possibly editing) the data, which you would get for free with a text format.
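
As a rough sketch of the "consider compressing it" step, the example below gzips a comma-separated list of numbers with java.util.zip and reads it back. The class and method names (CompressedTextExample, compress, decompress) and the sample data are assumptions for illustration only, and readAllBytes() requires Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class CompressedTextExample {
    // Gzip the UTF-8 bytes of the text.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(baos)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return baos.toByteArray();
    }

    // Reverse of compress(): gunzip and decode as UTF-8.
    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample data (an assumption for this sketch): the numbers 0..9999 as text.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10_000; i++)
            sb.append(i).append(',');
        String text = sb.toString();
        byte[] compressed = compress(text);
        System.out.println("Plain text: " + text.length() + " bytes");
        System.out.println("Gzipped:    " + compressed.length + " bytes");
        System.out.println("Round trip: " + text.equals(decompress(compressed)));
    }
}
```

Note that gzip adds a fixed header and trailer, so very short inputs such as the twelve-value list earlier can come out larger after compression. The benefit shows up on larger files, and the data can still be inspected with standard tools such as zcat.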

Vanilla Java

Human Readable vs Machine Readable Formats
