|« Alchemy: Array / Vector Serialization||Alchemy: Arrays »|
A continuation of a series of blog entries that documents the design and implementation process of a library. The library is called, Network Alchemy[^]. Alchemy performs low-level data serialization with compile-time reflection. It is written in C++ using template meta-programming.
It's time to break some barriers that have existed within Alchemy since its inception, message with fixed sizes. While the storage policy concept that I use with the message buffer allows Alchemy to dynamically allocate memory for messages, the current structure of the library only allows messages whose size is known at compile-time.
There is already so much value in what Alchemy is capable of accomplishing, even with the static size limitation. However, the only way for Alchemy to expand and reach its potential, is to remove this limitation and provide support for dynamically sized messages. This entry will demonstrate the changes that were required to achieve this goal.
Why size matters
Size is a fundamental property in the realm of data serialization. This is especially true for wire-transfer protocols. All of the meta-functions originally developed for Alchemy depend upon the size of the field type in one way or another. Therefore, opening up the possibility for field lengths to differ from message to message means that basic assumptions about the amount of available space are no longer valid.
The size of fundamental types is known at compile-time due to the
sizeof operator. The offset of a field is known because all of the sizes of the fields located before the current field in the
TypeList are added together. The total message size is calculated by adding all of the messages field sizes together.
To be completely accurate, it is not the actual size of the message that we're concerned with, it is its ability to vary in size that concerns us.
These assumptions allow us to write the code in such a way that basic checks are placed up-front, and error handling can be largely ignored through the remainder of the processing. This has the effect of simplifying the code in the interior. Less clutter could mean more readable and maintainable code. Furthermore, less code would presumably mean faster execution times.
Finally, it is entirely possible for the compiler to take advantage of the fixed length and position of each field to perform optimizations. The only way one can be certain is to investigate what the compiler actually generates, which I have not done yet; at least with regards to optimizations. If a field inside the message could have a variable size, the compiler can no longer use fixed-length assumptions to optimize code. This also holds true for fields that are considered optional in a message; their presence is not required.
While working through this endeavor, I will remain completely cognizant of what needs to be altered, and what can remain unchanged. A task such as this can easily become a shotgun surgery operation, and that only invites chaos. Chaos is not welcomed in computing unless you're performing simulations, even then only pseudo-chaos is welcomed.
I would like to make a few brief comments regarding the
std::vector. This is the most flexible container in the standard library. If you are new to C++, or have had an aversion to angle-brackets through your C++ career, you owe it to yourself to learn this container.
std::vector has a very small memory foot-print. It is extremely efficient in practice due to its guarantee to be stored in contiguous memory, which improves cache coherence by spatial locality. It should already feel natural to anyone that is already familiar with basic arrays; in fact it can be used any place where an array can be used. Most of all, it can improve the reliability of your programs because it carries the added bonus of managing it's own dynamic memory allocation and can perform checked access operations.
Before I started my investigation for how to introduce a variable-sized field in Alchemy, I tried to consider how it would be used. I had only one or two simple use cases in mind because of questions colleagues had asked me about its capabilities. With very little known as what to expect, I quickly thought of the
std::vector and its versatility.
Although there were a few unknowns that still remain, I imagined that an interface modelled after the
std::vector would provide all of the flexibility that one would need for interacting with a variable-sized field. Modelling the
std::vector would also provide the additional bonus that most developers would already be familiar with the interface, which makes this decision seem all the more prudent.
When working with a native array or a
std::vector, the size of the structure is generally obtainable by some means associated with the data type itself. However, when a raw data buffer is received, there are very few assumptions that can be made about the contents of the data. In most cases, the data needs to be de-serialized and verified as it is read into useful data types.
How is Alchemy supposed to determine how many bytes to assign to a std::vector field before populating the fields that follow afterwards in the message?
I looked at many different message structure formats that I was familiar with. I recognized that there are generally two different methods the size of variable-length payloads are specified.
Explicit message field
Supplying a field that indicate the size of a payload or specific field in a message structure seems to be a method that appears frequently. This type of solution should be simple enough to accommodate.
To implement this, I imagined that I would supply a third field in the vector type declaration MACRO similar to the
HG_ARRAY MACRO. This third field would allow the user to specify the name of a field that has been defined earlier in the structure. Then rather using a fixed number like the array, a value could be read from a field at run-time. The value specified in the field would need to indicate the number of elements in the vector field, and not its raw size in bytes.
One possible complication that I can foresee, is what if the specified count is actually in a parent structure that contains a nested field with the vector type?
I have used many objects and libraries in my career where the developer implemented the bare minimum necessary to be able to make their library useful. However, they stopped short of providing a complete solution. To simply stop at the pre-defined field and state, "This is how the API must be used", would be succumbing to that same mentality. There are quite a few applications Alchemy could not support if it did not provide a way to determine the size of a dynamic buffer at run-time.
Well I don't want to create a half-assed API. There is no simple way to automatically handle this detail for the user; there are simply too many possibilities to attempt to provide a solution in this manner. While the original goal of Alchemy is to simplify and automate data serialization, that does not mean that every corner case must be handled automatically.
Therefore, I will not try to automatically solve all of the potential ways a user's buffer size may be specified. However, I will create a complete solution by providing a hook that the user can use to register a callback function if they want to provide this type of solution themselves. This allows Alchemy to still provide a complete solution. Furthermore, if the field size is actually specified in bytes, the user can supply a callback and convert the buffer size to an element count.
Dynamic-length declaration in Alchemy
In many ways the declaration for a vector in Alchemy is much simpler than the array, even though so much more is happening. The reason is because there is only one possible way to define a vector; by using the
HG_DYNAMIC MACRO. This MACRO has the same usage as the
HG_ARRAY, there are three fields: 1) type, 2) count and 3) name.
The only complication is the count field must either map to a previous field in the same structure that is an integer type, or it must be a callback function.
The vector field definition MACROs are used similarly to the
I have also provided a version of the MACRO that allows the user to specify an allocator to use with the dynamic field. The sample below depicts the user-define vector size callback being used to report the size.
This sample demonstrates extracting a collection of NULL terminated strings from a buffer, where the last string in the buffer is double-NULL terminated.
How does one MACRO handle both definitions?
A pair of private function templates are defined within the message structure that is created by the
HG_BEGIN_FORMAT definition. These functions are defined as template functions for two reasons:
- The correct function type will be instantiated based on the type of field that is declared.
- An instance of these functions will only be instantiated if a dynamic field is declared in the message.
Here are the definitions of the functions that are added to the end of the message declaration:
What could be so delicate?
In my experience, when facing a feature addition such as this, where an entirely new fundamental concept is added to existing software, it can quickly turn into shotgun surgery. I didn't want to jump in altering every file in order to weave the ability to manage dynamic allocation. As I mentioned earlier, I thought of a way that I could break my big problem into two smaller problems. One of which has already been solved, static-sized fields.
The question that remains, is how can I best differentiate between the different portions of my message. As much a possible, I still want to allow the compiler to resolve actions when the code is generated. Once again, a solution based on tag-dispatching seemed to be the most appropriate solution.
I could create a meta-function to determine if a field was a statically-sized field, or dynamically-sized. Furthermore, I could test each field in a message to determine if the entire message could have a variable size, or if it was always a fixed size. This solution would allow me to still take advantage of every possible optimization the compiler could muster for fixed-size messages, and still be able to accommodate dynamic messages.
Differentiate between static and dynamic sizes
If you recall my description of the tag-dispatch system I setup for the array type, I specified a
static_size_trait for the array type. I have also declared a similar type-trait for dynamic fields like the vector called,
dynamic_size_trait. I am not going to repeat all of the tag-dispatch definitions that I created for the vector, because they are very similar.
Instead, here is the collection of meta-functions that are used to determine the size type of a message. Here is the top-level meta function that reports the size trait for the message:
The previous meta-function is used in many locations where it is important for the selection of an algorithm or other meta-processing functions. Inside of the
std::conditional statement, you can see a meta-function called
has_dynamic. This function is simple enough. It actually calls another meta-function that I hide in the
namespace, and this function performs a recursive survey of the types in the
TypeList for dynamically-sized fields.
We are working with a diverse set of types. Some of the types a basic fundamental types, while others can be containers of types. Let's start with a simple default template that will handle all non-container types:
Let's handle the next simplest case, true for the vector. As the vector is currently the only type that is sized dynamically.
The last portion of this implementation requires a test for all other type containers. The only two type containers that remain are the
array and the
TypeList. The function below tests the type at the front of the list to determine if it is a
vector. If so, true is returned, and the function is complete. Otherwise, the first element in the list is removed, and the remainder of the list is tested for dynamically sized fields.
Calculate the size of a message
Now we have the ability to discriminate between fixed and variably sized messages. Previously, when a buffer was required for a fixed-size message, the
size() member function could be called to return the constant value that was pre-calculated from the
TypeList for which the message was instantiated. A new mechanism is required to incorporate dynamic size calculations, without disrupting the existing interface.
We can use the
has_dynamic meta-function to select the type of size calculation function to use when calculating the message size. The first step for integration is to replace the
size() member function to call a new function based on the result of
has_dynamic for the current message type.
Message::size() calls one of two external meta-functions that are implemented to calculate the size of a fixed-length message and a variable-length message. Here is the implementation for the fixed-length calculation. It is basically the implementation that was previously used in
Message::size(), when all messages were a fixed-length.
The variable-length function looks like it is quite a bit more complicated. However, it starts with the same implementation that is used to calculate a fixed-length message. A few
typedefs are created for clarity. Finally, a function designed to accumulate the sum of the length for all variable-length fields is called. This function is called
dynamic_size_of. The sum of the fixed-length and variable-length calculations is returned as the total size of this message.
dynamic_size_of is a function template, and not a meta-function. This function must be executed at run-time because it is used to inspect incoming data buffers to determine how much space to allocate for the vector fields.
Two implementations of this function exist. Both implementations end up calling the same internal worker function with the same set of parameters. Two versions of this function are required to be able to properly handle nested messages. A top-level message is instantiated inside of the
Hg::Message handles the memory management and provides the natural syntax interface for interacting with the child fields of a message.
When a message contains nested-messages, there is no need to wrap the nested version in its own
Hg::Message object. In fact, not only would it further complicate most of the logic to do so, it would also be less efficient because the internal memory buffer allocated by
Hg::Message would be a collection of smaller message buffers, rather than one continuous buffer shared by all message fields.
This concept is one of the trickier problems that I encounter in Alchemy. The problem did not appear until I started to really dig deep and create pathological unit-tests for the library. Furthermore, the solution was not immediately apparent. Because while I knew that a problem existed, I was not getting the calculated size results, I had a difficult time tracking down the cause of the problem.
The two definitions for
Hg::dynamic_size_of are listed below:
Finally we have reached the workhorse for calculating the length for all of the dynamically sized fields in a message. This function works very similarly to the byte-order conversion function that I demonstrated in a previous post. A functor that is designed to calculate the variable-length field size of a field is initialized. Then the meta-function
ForEachType is used to invoke this functor on each field in the message.
This function will only be called if a message is determined to have a dynamic-length field. This is another reason why there are two versions of
Hg::dynamic_size_of. A top-level
Hg::Message may have been determined to contain a variable-length field, and it may also contain nested fields. However, if there are no variable-length fields in any of the nested fields, it is not necessary to inspect them for the dynamic size calculation.
This layering approach allows the best function to be selected at every level of the calculation, based on it's type-traits. The extra calculations are only paid for where the feature is used. Because the compiler is setting up the dispatch of these calls, there is not even a compare and branch instruction to pay for due to a run-time
This is the worker function that invokes the
I am not going to demonstrate the contents of the functor because it is very similar in structure and content to the
One important difference to note, is that it is possible for a vector to contain a collection of nested message fields. This required me to introduce another layer of indirection to first detect this scenario, then properly process each nested message, only if they contained variable-length parameters themselves.
I am very pleased with the final results that I was able to achieve with the vector. The path to a robust solution was not immediately clear to me at first. With a little bit of thought I was able to discover a suitable approach. As always with solving programming problems, the problem must be broken down into a smaller set of problems.
The solution felt very attainable once I recognized that I would only need to create a break in my current implementation at each location in the message where a dynamically sized field occurred. Compare this to my original fears that I would need to convert the entire system to a run-time calculated solution.
Integrating memory support for a dynamically-sized field turned out to be simpler than I had feared. I was able to create an elegant solution that did not require any modifications to the other data types. This demonstrated to me that this portion of my library truly is orthogonal; a good sign for a maintainable code-base into the future.
There is one more step that I had to complete before I had full support for both arrays and vectors. This was the addition of
unpack serialization calls. The problems were not necessarily difficult to solve. The challenges started when I recognized the depth to which elements could be nested inside of both messages and the new sequence containers that I had created. Both containers required the same solution to this problem. Therefore, I will tackle the description for both the array and vector serialization logic in the next Alchemy entry.