Following up on the previous posts: a few comments around the internet pointed out that my first tests weren't very fair to thrift and protocol buffers because they were mostly serializing strings. I gutted the test code and rewrote the IDL files to use this structure:
    message DnsRecord {
        required fixed32 sip = 1;
        required fixed32 dip = 2;
        required uint32 sport = 3;
        required uint32 dport = 4;
    }
Nothing fancy, basically the standard IPv4 4-tuple.
I also replaced the random record generation with this:
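    import random

    def get_random_records(num=10000):
        """Build num random connection records as plain dicts."""
        # 192.168.1.0 as a 32-bit integer; octets are base 256, not 255
        net = 192 * 256**3 + 168 * 256**2 + 1 * 256
        data = []
        for x in range(num):
            data.append({
                'sip': net + random.randrange(0, 256),
                'dip': random.randrange(1, 256**4),
                'sport': random.randrange(1024, 2048),
                'dport': random.choice([21, 22, 25, 80, 110, 443]),
            })
        return data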
This will generate 10000 records with:
- a random source IP on the 192.168.1.0/24 network
- a completely random destination IP
- a source port between 1024 and 2047
- a destination port chosen from six common ports.
The raw size of this data using fixed-length ints would be 10000*(4+4+4+4) = 160,000 bytes. The variable-length encoding that protocol buffers uses for the uint32 fields should be able to save some space when storing the smaller port numbers.
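To make the expected saving concrete, here is a rough back-of-the-envelope sketch of the protocol buffers wire size for one record. It assumes the standard wire format: a one-byte tag for field numbers 1 through 15, four bytes for each fixed32, and a base-128 varint for each uint32. The helper names `varint_size` and `record_size` are made up for illustration and are not part of the test code.

```python
def varint_size(n):
    """Bytes needed to encode a non-negative int as a base-128 varint."""
    size = 1
    while n >= 0x80:  # 7 payload bits per byte
        n >>= 7
        size += 1
    return size

TAG = 1            # assumption: field numbers 1-15 fit in a one-byte tag
FIXED32 = TAG + 4  # tag plus four data bytes

def record_size(sport, dport):
    """Estimated encoded size of one record, excluding any outer framing."""
    return (2 * FIXED32
            + TAG + varint_size(sport)
            + TAG + varint_size(dport))

DPORTS = [21, 22, 25, 80, 110, 443]

# Every sport in 1024..2047 needs a two-byte varint, so 1024 is representative.
avg = sum(record_size(1024, d) for d in DPORTS) / float(len(DPORTS))
print("average bytes per record: %.2f" % avg)
```

Under those assumptions a record averages a little over 15 bytes, just under the 16 bytes of the fixed-length layout. Any framing around each record (for example, a tag and length byte if the records sit in a repeated field) would come on top of that; how the test code wraps them is not something this sketch knows.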
Running the test code produces the following output:
    10000 total records (0.280s)

    get_thrift (0.060s)
    get_pb (0.950s)

    ser_thrift (0.560s) 370009 bytes
    ser_pb (4.850s) 171650 bytes
    ser_json (0.080s) 680680 bytes
    ser_cjson (0.120s) 680680 bytes
    ser_yaml (17.330s) 610680 bytes

    ser_thrift_compressed (0.620s) 111326 bytes
    ser_pb_compressed (3.980s) 98571 bytes
    ser_json_compressed (0.110s) 124919 bytes
    ser_cjson_compressed (0.120s) 124919 bytes
    ser_yaml_compressed (17.160s) 121065 bytes

    serde_thrift (2.130s)
    serde_pb (7.550s)
    serde_json (0.130s)
    serde_cjson (0.110s)
    serde_yaml (56.740s)
These results show that protocol buffers and thrift do indeed excel at serializing numeric values. The uncompressed output from protocol buffers (171,650 bytes) is considerably smaller than that of the other serialization methods, with thrift (370,009 bytes) ending up somewhere in the middle. In fact, the protocol buffers output is only about 7% larger than the 160,000 bytes the original data would take in compact binary form. Since JSON and YAML serialize numbers as text, their output ends up roughly 4 times bigger than the raw data.
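The size comparisons here are simple ratios of the byte counts reported in the output. A small sketch of that arithmetic, using only the numbers from the run above (the dictionary and variable names are just for illustration):

```python
# Raw size with four fixed-length 4-byte ints per record
RAW = 10000 * (4 + 4 + 4 + 4)

# Uncompressed sizes in bytes, copied from the test output
sizes = {
    'thrift': 370009,
    'pb': 171650,
    'json': 680680,
    'yaml': 610680,
}

# How many times larger each format is than the raw binary layout
ratio_to_raw = dict((name, size / float(RAW)) for name, size in sizes.items())

for name in sorted(ratio_to_raw, key=ratio_to_raw.get):
    print("%-6s %.2fx raw" % (name, ratio_to_raw[name]))
```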
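However, once you add in compression, all this fancy extra work to save space improves only modestly on JSON: 98,571 bytes for compressed protocol buffers versus 124,919 bytes for compressed JSON, about 21% smaller. The speed and simplicity of the JSON+zlib approach cannot be ignored...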
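For reference, the JSON+zlib approach amounts to very little code. This is a minimal sketch using only the standard library, not the actual test code from the tarball; `make_records`, `pack`, and `unpack` are hypothetical names, and the record layout simply mirrors the four fields described earlier.

```python
import json
import random
import zlib

def make_records(num=1000):
    """Hypothetical stand-in for the post's record generator."""
    net = 192 * 256**3 + 168 * 256**2 + 1 * 256  # 192.168.1.0
    return [{
        'sip': net + random.randrange(0, 256),
        'dip': random.randrange(1, 256**4),
        'sport': random.randrange(1024, 2048),
        'dport': random.choice([21, 22, 25, 80, 110, 443]),
    } for _ in range(num)]

def pack(records):
    """Serialize to JSON text, then compress the UTF-8 bytes with zlib."""
    return zlib.compress(json.dumps(records).encode('utf-8'))

def unpack(blob):
    """Reverse of pack(): decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode('utf-8'))

records = make_records()
blob = pack(records)
assert unpack(blob) == records  # round trip preserves the data

print("%d bytes of JSON, %d bytes compressed"
      % (len(json.dumps(records)), len(blob)))
```

The exact byte counts vary from run to run because the records are random, so treat the printed sizes as indicative only.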
The protocol buffers speed issues are still there, but I'm sure that over time things will improve. If the C extension for simplejson can speed up serialization by an order of magnitude, I have no doubt that similar improvements can be made to protocol buffers and thrift.
If you want to run these tests for yourself, the code is available in sertest2.tgz.
Another thing to try would be to set the default dport to 80 and see how that affects serialization size and speed.