Serialization of structured data is a key step when transmitting information over networks or storing data, and it is a very CPU-intensive task. In fact, in many communication scenarios the bottleneck is data serialization and deserialization.
Developing a serialization mechanism involves designing a neutral, platform-independent format that allows data interchange across heterogeneous distributed systems.
Middleware alternatives based on verbose serialization formats such as XML or JSON, used in Web Services and REST, offer very poor performance. The emergence of cloud computing and service integration in large distributed systems is driving companies and developers to consider again fast binary formats and lightweight Remote Procedure Call (RPC) frameworks. In this article we compare Apache Thrift, Protocol Buffers, and Fast Buffers.
Protocol Buffers is an alternative developed by Google and designed to be smaller and faster than XML. Protocol Buffers is the basis for a custom RPC engine used in nearly all inter-machine communication at Google.
Apache Thrift is an RPC framework developed at Facebook aimed at “scalable cross-language services development”. Facebook uses Apache Thrift internally for service composition.
eProsima Fast Buffers is our alternative, an open source serialization engine optimized for performance and based on CDR (Common Data Representation), a standard serialization format from the OMG (Object Management Group).
All three alternatives generate serialization and deserialization code from a data structure definition: the developer defines the data structures in an Interface Definition Language (IDL) file, and a tool parses that file to generate the (de)serialization code.
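As an illustration of this workflow, a Protocol Buffers IDL file for a structure similar to the simple one used in the benchmark might look as follows (the message and field names here are hypothetical, not the ones used in the tests):

```protobuf
// Hypothetical IDL for a simple structure of long integer fields
// (proto2 syntax, as used by Protocol Buffers 2.5.0).
message SimpleStruct {
  required int64 field1 = 1;
  required int64 field2 = 2;
  required int64 field3 = 3;
}
```

Running `protoc --cpp_out=. simple.proto` generates the C++ classes with the (de)serialization code; the Thrift equivalent is `thrift --gen cpp simple.thrift` over a Thrift IDL file.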
We will also consider a variation of eProsima Fast Buffers: eProsima Dynamic Fast Buffers. In this case no IDL file is required; an API lets you describe your data types at runtime, generating the (de)serialization support dynamically.
The Goal: Performance comparison
Our goal is to measure the total time to serialize and deserialize a data structure for each alternative, using both simple and complex structures of different sizes, to obtain a complete performance report.
We will use two different data structures:
- Simple Structure: long integer fields.
- Complex Structure: fields covering the supported data types.
To get an accurate measurement, we execute the serialization and deserialization operations one million times, measure the total test time, and then derive the time for one complete cycle.
The testing environment:
- Windows 7 (64 bit)
- Linux – Fedora 19 (64 bit)
- Google Protocol Buffers 2.5.0 (optimize_for = SPEED)
- Apache Thrift 0.9.0 (TBinaryProtocol)
- eProsima Fast Buffers 0.2.0
- eProsima Dynamic Fast Buffers 0.2.0
- Intel Core i3-3240 (Ivy Bridge) 3.40GHz, 4 GB DDR3, HD 500 GB SATA2 7200
The results (Linux):
The results for the other kinds of structures are very similar to this case: the performance of eProsima Fast Buffers is better in all cases.
Even when dynamic serialization support is used, eProsima Fast Buffers shows better performance in most cases. This is remarkable because in that case the serialization library has to interpret the structure description at runtime.
A big surprise is the poor performance of the TBinaryProtocol of Apache Thrift. We analyzed the TBinaryProtocol and found that it is implemented as a design with several layers, resulting in many nested calls.
We see a similar result here: eProsima Fast Buffers is faster with static serialization support, and very close to Google Protocol Buffers in the dynamic case.
Apache Thrift vs Protocol Buffers vs Fast Buffers: Conclusions
This serialization format (CDR) is used in DDS, a middleware standard used in very demanding real-time applications.
Apache Thrift's TBinaryProtocol is really slow compared to the other technologies analyzed in this article. The Thrift format adds a number to identify each field, in order to support optional and required fields as well as type extensibility. It seems the price in terms of performance is very high.
Google Protocol Buffers is fast, but not as fast as eProsima Fast Buffers, and performs similarly to eProsima Dynamic Fast Buffers.
This is a surprise because eProsima Dynamic Fast Buffers does not require an IDL and generates the serialization logic at runtime, while with Google Protocol Buffers the serialization code is generated from the IDL and compiled ahead of time. We would expect Protocol Buffers to be faster.
We think the main reason is varints. Protocol Buffers encodes integers using variable-size integers (varints): the number of bytes used to encode an integer depends on its value. This leads to a smaller encoded size for structs containing integers with small values, but it impacts performance.
eProsima Dynamic Fast Buffers shows very good performance, taking into account that the serialization and deserialization logic is not precompiled but generated at runtime. The framework dynamically creates "(de)serialization bytecode" from the data structure definition and interprets this bytecode.
Interpreting this bytecode adds a performance overhead of only around 20-30% compared to traditional code generation. Dynamic serialization can be used when you don't know the structure of your data at compile time (user-defined structures, data visualization apps...), or when you simply don't want to maintain external IDL files.