I've been experimenting with thrift and protocol buffers recently. For the most part, when I need to serialize something, I use JSON or compressed JSON. Thrift and protocol buffers have a couple of advantages over JSON, and are also supposedly faster and produce smaller output.
The test data I've been using is a simple list of hashes, nothing too complicated. Here is the protocol buffers file. The thrift file is pretty much the same thing.
package passive_dns;

message DnsRecord {
  required string key = 1;
  required string value = 2;
  required string first = 3;
  required string last = 4;
  optional string type = 5 [default = "A"];
  optional int32 ttl = 6 [default = 86400];
}

message DnsResponse {
  repeated DnsRecord records = 1;
}
The optional and default values are one of the benefits of both serialization libraries. A field whose value matches the default does not need to be set, so it can be left out of the serialized output; the reader gets the default back when the field is missing.
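To make the size benefit of defaults concrete, here is a minimal sketch in plain Python. It is not how thrift or protocol buffers work internally; it only imitates the idea by dropping fields that equal the schema defaults before writing JSON and filling them back in on read. The helper names and the sample record are made up for illustration.

```python
import json

# Defaults copied from the optional fields in the schema above.
DEFAULTS = {"type": "A", "ttl": 86400}


def strip_defaults(record):
    """Return a copy of record without fields that equal their default."""
    return {
        name: value
        for name, value in record.items()
        if not (name in DEFAULTS and DEFAULTS[name] == value)
    }


def restore_defaults(record):
    """Return a copy of record with missing optional fields filled in."""
    return {**DEFAULTS, **record}


# Hypothetical sample record; the values are invented.
record = {
    "key": "www.example.com",
    "value": "192.0.2.1",
    "first": "2009-01-01 00:00:00",
    "last": "2009-02-01 00:00:00",
    "type": "A",
    "ttl": 86400,
}

compact = strip_defaults(record)
full_size = len(json.dumps(record))
compact_size = len(json.dumps(compact))
print("full: %d bytes, without defaults: %d bytes" % (full_size, compact_size))
```

The saving per record is small, but it repeats for every record that uses the common case, which for this data is most of them.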
I wrote up a simple test program to compare thrift, protocol buffers, json, and compressed json for size and speed. The results, at least for the type of data I use, are very interesting:
5000 total records (0.745s)
get_thrift (0.044s)
get_pb (0.608s)
ser_thrift (0.474s) 554953 bytes
ser_pb (3.087s) 414862 bytes
ser_json (0.273s) 718191 bytes
ser_yaml (13.121s) 623191 bytes
ser_thrift_compressed (0.545s) 287617 bytes
ser_pb_compressed (3.150s) 284297 bytes
ser_json_compressed (0.326s) 292904 bytes
ser_yaml_compressed (13.665s) 290993 bytes
serde_thrift (1.289s)
serde_pb (5.411s)
serde_json (1.474s)
serde_yaml (45.637s)
EDIT: Updated to include yaml results
The get_* functions are the times needed to convert the python data structure into the classes that the library needs.
The ser_* functions are the times needed to get and serialize the python data structure to a string.
The ser_*_compressed functions are the times needed to get, serialize, and compress the python data structure.
The serde_* functions are the times needed to get, serialize, and de-serialize the python data structure to and from a string.
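For readers who don't want to download the tarball, the JSON half of such a test can be sketched roughly as follows. This is not the code from sertest.tgz: the record generator, the timed helper, and the choice of zlib for compression are all assumptions, and it is written for Python 3. The function names just follow the naming scheme described above.

```python
import json
import time
import zlib


def make_records(count):
    """Build a list of dicts shaped like DnsRecord (values are invented)."""
    return [
        {
            "key": "host%d.example.com" % i,
            "value": "192.0.2.%d" % (i % 256),
            "first": "2009-01-01 00:00:00",
            "last": "2009-02-01 00:00:00",
            "type": "A",
            "ttl": 86400,
        }
        for i in range(count)
    ]


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def ser_json(records):
    """Serialize the records to a JSON byte string."""
    return json.dumps(records).encode("utf-8")


def ser_json_compressed(records):
    """Serialize to JSON, then compress with zlib (an assumed choice)."""
    return zlib.compress(ser_json(records))


def serde_json(records):
    """Serialize to JSON and parse the result back into python objects."""
    return json.loads(ser_json(records).decode("utf-8"))


records = make_records(5000)
for func in (ser_json, ser_json_compressed, serde_json):
    result, elapsed = timed(func, records)
    size = "%d bytes" % len(result) if isinstance(result, bytes) else ""
    print("%s (%.3fs) %s" % (func.__name__, elapsed, size))
```

The thrift and protocol buffers variants would have the same shape, with an extra get_* step that copies each dict into the generated classes before serializing.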
The results show that compressed JSON is both smaller and faster to produce than uncompressed thrift, and nearly as small as compressed thrift (292904 vs. 287617 bytes). Serializing plus de-serializing JSON is only slightly slower than thrift (1.474s vs. 1.289s). If I converted the python data to be (header, rows) like a csv file, rather than a flat list of dicts, the json output would be smaller, and likely faster to serialize.
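The (header, rows) idea could look something like the sketch below. I have not benchmarked it; the helper names are hypothetical, and it assumes every record has exactly the same keys. The point is only that the field names get written once instead of once per record.

```python
import json


def to_header_rows(records):
    """Convert a non-empty list of same-keyed dicts to (header, rows)."""
    header = sorted(records[0])
    rows = [[record[name] for name in header] for record in records]
    return header, rows


def from_header_rows(header, rows):
    """Rebuild the flat list of dicts from (header, rows)."""
    return [dict(zip(header, row)) for row in rows]


# Invented sample data shaped like DnsRecord.
records = [
    {
        "key": "host%d.example.com" % i,
        "value": "192.0.2.%d" % (i % 256),
        "first": "2009-01-01 00:00:00",
        "last": "2009-02-01 00:00:00",
        "type": "A",
        "ttl": 86400,
    }
    for i in range(100)
]

header, rows = to_header_rows(records)
flat_size = len(json.dumps(records))
packed_size = len(json.dumps([header, rows]))
print("list of dicts: %d bytes, header + rows: %d bytes" % (flat_size, packed_size))
```

This gives up the ability to omit fields per record, so it fits data like this where every record carries every field.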
The totally unexpected result was that protocol buffers clocked in at over 4 times slower than thrift (5.411s vs. 1.289s for serializing plus de-serializing). I find it hard to believe that protocol buffers could be that slow, so I will have to run some more tests to make sure that I am using the library correctly.
If you want to run my tests for yourself, the code is available from sertest.tgz