« Alchemy: Array / Vector Serialization | Alchemy: Arrays » |
A continuation of a series of blog entries that documents the design and implementation process of a library. The library is called, Network Alchemy[^]. Alchemy performs low-level data serialization with compile-time reflection. It is written in C++ using template meta-programming.
It's time to break some barriers that have existed within Alchemy since its inception, message with fixed sizes. While the storage policy concept that I use with the message buffer allows Alchemy to dynamically allocate memory for messages, the current structure of the library only allows messages whose size is known at compile-time.
There is already so much value in what Alchemy is capable of accomplishing, even with the static size limitation. However, the only way for Alchemy to expand and reach its potential, is to remove this limitation and provide support for dynamically sized messages. This entry will demonstrate the changes that were required to achieve this goal.
Why size matters
Size is a fundamental property in the realm of data serialization. This is especially true for wire-transfer protocols. All of the meta-functions originally developed for Alchemy depend upon the size of the field type in one way or another. Therefore, opening up the possibility for field lengths to differ from message to message means that basic assumptions about the amount of available space are no longer valid.
The size of fundamental types is known at compile-time due to the sizeof
operator. The offset of a field is known because all of the sizes of the fields located before the current field in the TypeList
are added together. The total message size is calculated by adding all of the messages field sizes together.
To be completely accurate, it is not the actual size of the message that we're concerned with, it is its ability to vary in size that concerns us.
These assumptions allow us to write the code in such a way that basic checks are placed up-front, and error handling can be largely ignored through the remainder of the processing. This has the effect of simplifying the code in the interior. Less clutter could mean more readable and maintainable code. Furthermore, less code would presumably mean faster execution times.
Finally, it is entirely possible for the compiler to take advantage of the fixed length and position of each field to perform optimizations. The only way one can be certain is to investigate what the compiler actually generates, which I have not done yet; at least with regards to optimizations. If a field inside the message could have a variable size, the compiler can no longer use fixed-length assumptions to optimize code. This also holds true for fields that are considered optional in a message; their presence is not required.
While working through this endeavor, I will remain completely cognizant of what needs to be altered, and what can remain unchanged. A task such as this can easily become a shotgun surgery operation, and that only invites chaos. Chaos is not welcomed in computing unless you're performing simulations, even then only pseudo-chaos is welcomed.
std::vector
I would like to make a few brief comments regarding the std::vector
. This is the most flexible container in the standard library. If you are new to C++, or have had an aversion to angle-brackets through your C++ career, you owe it to yourself to learn this container.
The std::vector
has a very small memory foot-print. It is extremely efficient in practice due to its guarantee to be stored in contiguous memory, which improves cache coherence by spatial locality. It should already feel natural to anyone that is already familiar with basic arrays; in fact it can be used any place where an array can be used. Most of all, it can improve the reliability of your programs because it carries the added bonus of managing it's own dynamic memory allocation and can perform checked access operations.
Before I started my investigation for how to introduce a variable-sized field in Alchemy, I tried to consider how it would be used. I had only one or two simple use cases in mind because of questions colleagues had asked me about its capabilities. With very little known as what to expect, I quickly thought of the std::vector
and its versatility.
Although there were a few unknowns that still remain, I imagined that an interface modelled after the std::vector
would provide all of the flexibility that one would need for interacting with a variable-sized field. Modelling the std::vector
would also provide the additional bonus that most developers would already be familiar with the interface, which makes this decision seem all the more prudent.
Other considerations
When working with a native array or a std::vector
, the size of the structure is generally obtainable by some means associated with the data type itself. However, when a raw data buffer is received, there are very few assumptions that can be made about the contents of the data. In most cases, the data needs to be de-serialized and verified as it is read into useful data types.
How is Alchemy supposed to determine how many bytes to assign to a std::vector field before populating the fields that follow afterwards in the message?
I looked at many different message structure formats that I was familiar with. I recognized that there are generally two different methods the size of variable-length payloads are specified.
Explicit message field
Supplying a field that indicate the size of a payload or specific field in a message structure seems to be a method that appears frequently. This type of solution should be simple enough to accommodate.
To implement this, I imagined that I would supply a third field in the vector type declaration MACRO similar to the HG_ARRAY
MACRO. This third field would allow the user to specify the name of a field that has been defined earlier in the structure. Then rather using a fixed number like the array, a value could be read from a field at run-time. The value specified in the field would need to indicate the number of elements in the vector field, and not its raw size in bytes.
One possible complication that I can foresee, is what if the specified count is actually in a parent structure that contains a nested field with the vector type?
Run-time parsing
I have used many objects and libraries in my career where the developer implemented the bare minimum necessary to be able to make their library useful. However, they stopped short of providing a complete solution. To simply stop at the pre-defined field and state, "This is how the API must be used", would be succumbing to that same mentality. There are quite a few applications Alchemy could not support if it did not provide a way to determine the size of a dynamic buffer at run-time.
Well I don't want to create a half-assed API. There is no simple way to automatically handle this detail for the user; there are simply too many possibilities to attempt to provide a solution in this manner. While the original goal of Alchemy is to simplify and automate data serialization, that does not mean that every corner case must be handled automatically.
Therefore, I will not try to automatically solve all of the potential ways a user's buffer size may be specified. However, I will create a complete solution by providing a hook that the user can use to register a callback function if they want to provide this type of solution themselves. This allows Alchemy to still provide a complete solution. Furthermore, if the field size is actually specified in bytes, the user can supply a callback and convert the buffer size to an element count.
Dynamic-length declaration in Alchemy
In many ways the declaration for a vector in Alchemy is much simpler than the array, even though so much more is happening. The reason is because there is only one possible way to define a vector; by using the HG_DYNAMIC
MACRO. This MACRO has the same usage as the HG_ARRAY
, there are three fields: 1) type, 2) count and 3) name.
The only complication is the count field must either map to a previous field in the same structure that is an integer type, or it must be a callback function.
Definition MACROS
The vector field definition MACROs are used similarly to the HG_ARRAY
.
C++
HG_BEGIN_FORMAT(Msg_t) | |
... | |
HG_DATUM (size_t, len) | |
HG_DYNAMIC(char, len, path) | |
... | |
HG_END_FORMAT(Msg_t) |
I have also provided a version of the MACRO that allows the user to specify an allocator to use with the dynamic field. The sample below depicts the user-define vector size callback being used to report the size.
C++
// *************************************************** | |
// The function signature of the callback used to | |
// determine the size of a dynamic-length field. | |
// | |
// @param p The entire input message buffer. | |
// @param n The length of the input message. | |
// | |
// @return The calculated length of the dynamic field. | |
// | |
typedef | |
size_t (*pfnGetDatumSize)(const uint8_t* p, size_t n); |
This sample demonstrates extracting a collection of NULL terminated strings from a buffer, where the last string in the buffer is double-NULL terminated.
C++
// *************************************************** | |
size_t StringCount(const uint8_t* pBuffer, size_t length) | |
{ | |
size_t count = 0; | |
| |
char const* pCur = reinterpret_cast<char const*>(pBuffer); | |
char const* pLast = pCur; | |
std::advance(pLast, length); | |
| |
size_t len = 1; | |
while (pCur < pLast) | |
{ | |
len = std::char_traits<char>::length(pCur); | |
if (0 == len) | |
break; | |
| |
++count; | |
// Advance past this current item, including the NULL. | |
std::advance(pCur, len+1); | |
} | |
| |
return count; | |
} |
C++
typedef std::vector<char> char_str; | |
| |
HG_BEGIN_FORMAT(Msg_t) | |
... | |
HG_DYNAMIC(char_str, StringCount, items) | |
... | |
HG_END_FORMAT(Msg_t) |
MACRO implementation
How does one MACRO handle both definitions?
A pair of private function templates are defined within the message structure that is created by the HG_BEGIN_FORMAT
definition. These functions are defined as template functions for two reasons:
- The correct function type will be instantiated based on the type of field that is declared.
- An instance of these functions will only be instantiated if a dynamic field is declared in the message.
Here are the definitions of the functions that are added to the end of the message declaration:
C++
private: | |
template< typename T, | |
typename U | |
> | |
size_t DatumSize(T value, U*) | |
{ | |
return value; | |
} | |
| |
template< typename U > | |
size_t DatumSize(pfnGetDatumSize ftor, U* buffer) | |
{ | |
if (buffer->empty()) | |
{ | |
return 0; | |
} | |
| |
return ftor(buffer->data(), buffer->size()); | |
} |
Delicate integration
What could be so delicate?
In my experience, when facing a feature addition such as this, where an entirely new fundamental concept is added to existing software, it can quickly turn into shotgun surgery. I didn't want to jump in altering every file in order to weave the ability to manage dynamic allocation. As I mentioned earlier, I thought of a way that I could break my big problem into two smaller problems. One of which has already been solved, static-sized fields.
The question that remains, is how can I best differentiate between the different portions of my message. As much a possible, I still want to allow the compiler to resolve actions when the code is generated. Once again, a solution based on tag-dispatching seemed to be the most appropriate solution.
I could create a meta-function to determine if a field was a statically-sized field, or dynamically-sized. Furthermore, I could test each field in a message to determine if the entire message could have a variable size, or if it was always a fixed size. This solution would allow me to still take advantage of every possible optimization the compiler could muster for fixed-size messages, and still be able to accommodate dynamic messages.
Differentiate between static and dynamic sizes
If you recall my description of the tag-dispatch system I setup for the array type, I specified a static_size_trait
for the array type. I have also declared a similar type-trait for dynamic fields like the vector called, dynamic_size_trait
. I am not going to repeat all of the tag-dispatch definitions that I created for the vector, because they are very similar.
Instead, here is the collection of meta-functions that are used to determine the size type of a message. Here is the top-level meta function that reports the size trait for the message:
C++
// *************************************************** | |
/// Indicates the type size the specified | |
/// message is, static or dynamic. | |
/// | |
/// @paramt T A TypeList definition. | |
/// | |
/// @return A typedef called *type* is defined | |
/// to return the size trait. | |
/// - static_size_trait: a fixed-size message. | |
/// - dynamic_size_trait: Dynamically-sized | |
/// message. At least some portion | |
/// requires runtime processing to | |
/// determine the full size of the message. | |
/// | |
template< typename T > | |
struct message_size_trait | |
: std::conditional | |
< has_dynamic<T>::value, | |
dynamic_size_trait, | |
static_size_trait | |
> | |
{ }; |
The previous meta-function is used in many locations where it is important for the selection of an algorithm or other meta-processing functions. Inside of the std::conditional
statement, you can see a meta-function called has_dynamic
. This function is simple enough. It actually calls another meta-function that I hide in the Hg::detail
namespace
, and this function performs a recursive survey of the types in the TypeList
for dynamically-sized fields.
C++
// *************************************************** | |
template< typename T > | |
struct has_dynamic | |
: detail::HasDynamic_Impl<T, | |
type_container<T>::value | |
> | |
{ }; |
HasDynamic Implementation
We are working with a diverse set of types. Some of the types a basic fundamental types, while others can be containers of types. Let's start with a simple default template that will handle all non-container types:
C++
template< typename T, | |
bool isContainer = false | |
> | |
struct HasDynamic_Impl | |
: std::false_type | |
{ }; |
Let's handle the next simplest case, true for the vector. As the vector is currently the only type that is sized dynamically.
C++
// Explicit HasDynamic implementation for vectors | |
template< class T, | |
typename A, | |
bool isContainerT | |
> | |
struct HasDynamic_Impl< std::vector<T,A>, | |
isContainerT | |
> | |
: std::true_type | |
{ }; |
The last portion of this implementation requires a test for all other type containers. The only two type containers that remain are the array
and the TypeList
. The function below tests the type at the front of the list to determine if it is a vector
. If so, true is returned, and the function is complete. Otherwise, the first element in the list is removed, and the remainder of the list is tested for dynamically sized fields.
C++
// HasDynamic implementation for type_containers | |
template< typename T> | |
struct HasDynamic_Impl<T, true> | |
: value_if< vector_value< typename front<T>::type>::value, | |
bool, | |
true, | |
has_dynamic<typename pop_front<T>::type>::value | |
> | |
{ }; |
Calculate the size of a message
Now we have the ability to discriminate between fixed and variably sized messages. Previously, when a buffer was required for a fixed-size message, the size()
member function could be called to return the constant value that was pre-calculated from the TypeList
for which the message was instantiated. A new mechanism is required to incorporate dynamic size calculations, without disrupting the existing interface.
We can use the has_dynamic
meta-function to select the type of size calculation function to use when calculating the message size. The first step for integration is to replace the size()
member function to call a new function based on the result of has_dynamic
for the current message type.
C++
size_t Message::size() const | |
{ | |
return Msg_size<this_type, | |
k_has_dynamic>::calculate(*this); | |
} |
Message::size()
calls one of two external meta-functions that are implemented to calculate the size of a fixed-length message and a variable-length message. Here is the implementation for the fixed-length calculation. It is basically the implementation that was previously used in Message::size()
, when all messages were a fixed-length.
C++
template< class HgMessageT > | |
struct Msg_size<HgMessageT, false> | |
{ | |
typedef HgMessageT message_t; | |
| |
static size_t calculate(const message_t &msg) | |
{ | |
return Hg::SizeOf<typename HgMessageT::format_type>::value; | |
} | |
}; |
The variable-length function looks like it is quite a bit more complicated. However, it starts with the same implementation that is used to calculate a fixed-length message. A few typedef
s are created for clarity. Finally, a function designed to accumulate the sum of the length for all variable-length fields is called. This function is called dynamic_size_of
. The sum of the fixed-length and variable-length calculations is returned as the total size of this message.
C++
template< class HgMessageT, | |
bool has_dynamic | |
> | |
struct Msg_size | |
{ | |
typedef HgMessageT message_t; | |
| |
static size_t calculate(const message_t &msg) | |
{ | |
typedef typename | |
message_t::message_type message_type; | |
typedef typename | |
message_t::byte_order_type byte_order_type; | |
typedef typename | |
message_t::storage_type storage_type; | |
| |
size_t fixed_size = | |
Hg::SizeOf<typename HgMessageT::format_type>::value; | |
size_t dynamic_size = dynamic_size_of<message_type, | |
byte_order_type, | |
storage_type | |
>(msg); | |
return fixed_size + dynamic_size; | |
} | |
}; |
Hg::dynamic_size_of
dynamic_size_of
is a function template, and not a meta-function. This function must be executed at run-time because it is used to inspect incoming data buffers to determine how much space to allocate for the vector fields.
Two implementations of this function exist. Both implementations end up calling the same internal worker function with the same set of parameters. Two versions of this function are required to be able to properly handle nested messages. A top-level message is instantiated inside of the Hg::Message
class. Hg::Message
handles the memory management and provides the natural syntax interface for interacting with the child fields of a message.
When a message contains nested-messages, there is no need to wrap the nested version in its own Hg::Message
object. In fact, not only would it further complicate most of the logic to do so, it would also be less efficient because the internal memory buffer allocated by Hg::Message
would be a collection of smaller message buffers, rather than one continuous buffer shared by all message fields.
This concept is one of the trickier problems that I encounter in Alchemy. The problem did not appear until I started to really dig deep and create pathological unit-tests for the library. Furthermore, the solution was not immediately apparent. Because while I knew that a problem existed, I was not getting the calculated size results, I had a difficult time tracking down the cause of the problem.
The two definitions for Hg::dynamic_size_of
are listed below:
C++
// *************************************************** | |
template< typename T > | |
size_t dynamic_size_of(const T& msg) | |
{ | |
return detail::DynamicSizeWorker< T, | |
has_dynamic<T | |
>::value>().size(msg); | |
} | |
| |
// *************************************************** | |
template< typename MessageT, | |
typename ByteOrderT, | |
typename StorageT | |
> | |
size_t dynamic_size_of(const Message< MessageT, | |
ByteOrderT, | |
StorageT | |
>& msg) | |
{ | |
return detail::DynamicSizeWorker | |
< MessageT, | |
has_dynamic< | |
typename MessageT::format_type>::value | |
>().size(msg); | |
} |
DynamicSizeWorker
Finally we have reached the workhorse for calculating the length for all of the dynamically sized fields in a message. This function works very similarly to the byte-order conversion function that I demonstrated in a previous post. A functor that is designed to calculate the variable-length field size of a field is initialized. Then the meta-function ForEachType
is used to invoke this functor on each field in the message.
This function will only be called if a message is determined to have a dynamic-length field. This is another reason why there are two versions of Hg::dynamic_size_of
. A top-level Hg::Message
may have been determined to contain a variable-length field, and it may also contain nested fields. However, if there are no variable-length fields in any of the nested fields, it is not necessary to inspect them for the dynamic size calculation.
This layering approach allows the best function to be selected at every level of the calculation, based on it's type-traits. The extra calculations are only paid for where the feature is used. Because the compiler is setting up the dispatch of these calls, there is not even a compare and branch instruction to pay for due to a run-time if
statement.
This is the worker function that invokes the ForEachType
meta-function.
C++
template< typename MessageT, | |
bool IsDynamicT | |
> | |
struct DynamicSizeWorker | |
{ | |
static | |
size_t size(const MessageT& msg) | |
{ | |
typedef MessageT message_type; | |
typedef typename | |
message_type::format_type format_type; | |
| |
// Initialize a functor to query for the | |
// dynamic size of each field. | |
detail::DynamicSizeFunctor< message_type > ftor(msg); | |
Hg::ForEachType < 0, | |
Hg::length<format_type>::value - 1, | |
format_type | |
> (ftor); | |
| |
return ftor.size(); | |
} | |
}; |
I am not going to demonstrate the contents of the functor because it is very similar in structure and content to the ByteOrderCoverter
functor.
One important difference to note, is that it is possible for a vector to contain a collection of nested message fields. This required me to introduce another layer of indirection to first detect this scenario, then properly process each nested message, only if they contained variable-length parameters themselves.
Here is a link to the file that holds this logic if you would like to take a closer look for yourself: Alchemy/Hg/detail/message_dynamic_detail.h[^]
Results
I am very pleased with the final results that I was able to achieve with the vector. The path to a robust solution was not immediately clear to me at first. With a little bit of thought I was able to discover a suitable approach. As always with solving programming problems, the problem must be broken down into a smaller set of problems.
The solution felt very attainable once I recognized that I would only need to create a break in my current implementation at each location in the message where a dynamically sized field occurred. Compare this to my original fears that I would need to convert the entire system to a run-time calculated solution.
Summary
Integrating memory support for a dynamically-sized field turned out to be simpler than I had feared. I was able to create an elegant solution that did not require any modifications to the other data types. This demonstrated to me that this portion of my library truly is orthogonal; a good sign for a maintainable code-base into the future.
There is one more step that I had to complete before I had full support for both arrays and vectors. This was the addition of pack
and unpack
serialization calls. The problems were not necessarily difficult to solve. The challenges started when I recognized the depth to which elements could be nested inside of both messages and the new sequence containers that I had created. Both containers required the same solution to this problem. Therefore, I will tackle the description for both the array and vector serialization logic in the next Alchemy entry.
Recent Comments