A continuation of a series of blog entries that documents the design and implementation process of a library. The library is called Network Alchemy. Alchemy performs automated data serialization with compile-time reflection, and it is written in C++ using template meta-programming.
My previous entry was a condensed overview of rvalue references. I described the differences between value expressions and types, and I summarized as much wisdom as I could collect regarding how to effectively use move semantics and perfect forwarding. After I completed that essay, I was eager to integrate move semantics into the serialization objects in Alchemy. This entry is a journal of my experience optimizing my library with rvalue references.
Motivation
My initial motivation for learning and writing about rvalue references was that I was finding it difficult to improve the performance of Alchemy with them. In fact, I tended to make the performance of my library much worse than it was without move semantics.
The first thing this told me is that compilers do a pretty damn good job of optimizing my code even without the move constructors. The next conclusion was shrouded in mystery.
"Am I using this correctly? I followed the advice in Effective Modern C++, so why is it getting worse?"
"... Maybe I have done so well that all of the copies have been elided by the compiler?!"
After a few fruitless evening attempts to improve the performance, I decided I had better dig in and truly understand what was happening before revisiting Alchemy and move semantics. I learned a lot simply by trying to write a few compelling programs to demonstrate the concepts.
Now it was time to return and integrate move semantics into my library.
Measuring Success
I have developed a benchmark application that is included as part of the Alchemy source on GitHub. As part of my initial tuning of Alchemy, I was able to achieve a modicum of success in Classic C++ (C++98). If you are unfamiliar with the details of the benchmark application, you can learn more from this previous post: Alchemy: Benchmarks.
Here is a summary of the machine and compiler settings used for the results displayed in this post:

- Machine:
  - Intel Core i7-4790L @ 4.00GHz
  - 16 GB RAM
  - Windows 8.1
- Compiler - 64-bit Visual Studio 2013:
  - Maximize Speed: /O2
  - Enable Intrinsic Functions (Yes): /Oi
  - Favor Speed: /Ot
  - Omit Frame Pointers (Yes): /Oy
  - Whole Program Optimization (Yes): /GL
- Data set size: 512 MB
The numbers below reflect where I started when I began to add the move operations to Hg.
| Test | Control | Hg | Diff | Percent |
|-----------|----------|----------|-----------|---------|
| Basic | 0.4133 s | 0.3193 s | -0.0940 s | 22.74% |
| Packed | 0.3959 s | 0.3519 s | -0.0440 s | 11.12% |
| Unaligned | 0.4391 s | 0.4425 s | +0.0034 s | -0.773% |
| Complex | 0.7485 s | 0.7654 s | +0.0169 s | -2.261% |
| Array | 0.5141 s | 0.1409 s | -0.3732 s | 72.59% |
| Total | 2.5110 s | 2.0574 s | -0.4536 s | 18.06% |
The thing that had been most challenging to improve up to this point was the speed of the Complex test. This test uses a nested message structure that contains an array and all of the message structures from the other tests. I found and eliminated many temporary copies to reach this point. However, the relatively low performance of this test compared to the others indicated that other temporary copies were most likely still lurking within the code.
What did I discover?
The very first thing that I did when I returned to Alchemy was search every class for a destructor, copy constructor, or assignment operator. The rules the compiler uses to automatically generate special member functions will halt move semantics dead in its tracks if you are not paying attention: declaring any of those three members suppresses the implicitly generated move operations.
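As a reminder of how this trap works, here is a minimal illustration (these are hypothetical types for demonstration, not Alchemy classes):

```cpp
#include <utility>
#include <vector>

// Hypothetical types for illustration; these are not Alchemy classes.
struct HasDtor {
    std::vector<int> data;
    ~HasDtor() { }   // user-declared destructor: the compiler no longer
                     // generates move operations for this class
};

struct NoDtor {
    std::vector<int> data;  // all special members are implicitly generated,
                            // including the move constructor/assignment
};

// "Moving" a HasDtor silently falls back to the copy constructor,
// so the source keeps its contents.
inline bool move_fell_back_to_copy() {
    HasDtor c;
    c.data.assign(1000, 7);
    HasDtor d(std::move(c));          // the copy constructor runs here
    return c.data.size() == 1000 && d.data.size() == 1000;
}

// Moving a NoDtor transfers the buffer in constant time.
inline bool move_transferred_buffer() {
    NoDtor a;
    a.data.assign(1000, 7);
    NoDtor b(std::move(a));           // the real move constructor runs here
    return b.data.size() == 1000;
}
```

No compiler error or warning marks the first case; the code compiles and works, it is just quietly slower.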
There are very few destructors in Alchemy. The majority of the data is transient and actually managed by the caller. There is the Hg::MsgBuffer, which uses a std::vector to store serialized message data; this class has a destructor. All of the others manage a single field of data that is composited within the class.
On the other hand, Alchemy has plenty of copy constructors and assignment operators. Alchemy provides value semantics for all of the message sub-fields. These fields behave much like the fundamental types that they encapsulate. To provide a natural syntax, there are generally two types of assignment operators in each Datum and Proxy class. The first is the assignment operator for the object, and the second accepts the value_type the object represents.
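A sketch of these two assignment forms, plus their move counterparts, might look like this (StringDatum is a simplified stand-in, not the actual Hg::Datum interface, which is templated and more involved):

```cpp
#include <string>
#include <utility>

// Simplified stand-in for a Datum/Proxy-style field wrapper.
class StringDatum {
public:
    using value_type = std::string;

    StringDatum() = default;

    // (1) Object assignment, copy and move flavors.
    StringDatum& operator=(const StringDatum& rhs) {
        m_data = rhs.m_data;
        return *this;
    }
    StringDatum& operator=(StringDatum&& rhs) noexcept {
        m_data = std::move(rhs.m_data);
        return *this;
    }

    // (2) value_type assignment, so the field behaves like the
    //     type it encapsulates.
    StringDatum& operator=(const value_type& v) {
        m_data = v;
        return *this;
    }
    StringDatum& operator=(value_type&& v) noexcept {
        m_data = std::move(v);
        return *this;
    }

    const value_type& get() const { return m_data; }

private:
    value_type m_data;
};
```

With both flavors of each form, `datum = std::string("text")` moves the temporary in, while `datum = other_datum` copies as expected.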
I discovered places where I had neglected to supply a move constructor for value types that had some form of copy operation. I also found a few places where I had supplied copy and move constructors for sub-types of a class, but not for the class itself.
```cpp
// ******************************************************
/// Copy Constructor
Message(const message_type& rhs)
{
  *static_cast<message_type*>(this) = rhs;
}

// ******************************************************
/// Move Constructor
Message(message_type&& rhs)
  : base_type(std::move(rhs))
{ }

// ******************************************************
// Discovered that I was missing these versions:
//   Message(const Message& rhs)
//   Message(Message&& rhs)
```
Results
These changes had very little effect on the results. Setting breakpoints within my code showed that the move constructors were now being called. So why didn't the results change? I kept searching.
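Breakpoints are one way to verify this; another is a small probe type that counts which special members actually run along a code path. This is a generic debugging aid, not part of Alchemy:

```cpp
#include <cstddef>
#include <utility>

// A probe type for verifying which special members actually run.
// Push one through the code path under test and inspect the counters.
struct Probe {
    static std::size_t copies;
    static std::size_t moves;

    Probe() = default;
    Probe(const Probe&)                { ++copies; }
    Probe(Probe&&) noexcept            { ++moves;  }
    Probe& operator=(const Probe&)     { ++copies; return *this; }
    Probe& operator=(Probe&&) noexcept { ++moves;  return *this; }
};

std::size_t Probe::copies = 0;
std::size_t Probe::moves  = 0;
```

If the move counter stays at zero after an operation you expected to move, something in the chain is forcing a copy.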
FieldTypes
I started to scrutinize the inheritance hierarchy of the Hg::Message class and the Hg::Datum types that it contained. I verified that I was calling the proper base class operations and moving the rvalue expressions into these function calls.

Then I reached the end of the inheritance hierarchy, where a very simple class, FieldTypes, exists. FieldTypes can be found in ./Alchemy/Hg/Datum/basic_datum.h.
This class provides a generic structure that gives a common way to hold and access the actual data storage created for each message datum. The TraitT type allows for tag-dispatching for special cases such as a std::vector or a nested message type.
```cpp
template< typename FieldT,
          typename TraitT =
            typename Hg::detail::deduce_type_trait<FieldT>::type
        >
struct FieldTypes
{
  using index_type = FieldT;
  using value_type = typename field_data_t<index_type>::value_type;

  value_type& reference()
  {
    return m_data;
  }

  const value_type& data() const
  {
    return m_data;
  }

protected:
  value_type m_data;
};
```
I took a look at the specialization that I had created for the nested message type. Here is the declaration of that class:
```cpp
template< typename FieldT >
struct FieldTypes<FieldT, nested_trait>
  : public field_data_t<FieldT>::value_type
{
  using index_type = FieldT;
  using value_type = typename field_data_t<index_type>::value_type;

  // ...
};
```
There is one interesting line that caught my attention and helped me narrow down the cause of the issue:
```cpp
  : public field_data_t<FieldT>::value_type
```
The reason this is interesting is that I was searching for how the storage is represented and accessed in a nested type. In this case, rather than containing a member data field, the data is provided by derivation. For a nested type, the base class is the Alchemy format definition. This class then contains a ProxyDatum, which derives from a Datum, which derives from a FieldTypes, and that brings us back to the level we are currently inspecting.
It's not the nested type after all...
After looking at this, it occurred to me that the compiler-generated default move operations were clearly being used; I had not added any code to the message definitions and nested fields that would prevent their generation. However, that did not prevent entire sets of fields from being copied rather than moved.
I had created plenty of value constructors and assignment operators that accept value types, but I had not made any attempt to optimize the movement of these fields within the most basic structures that manage them. So I added copy and move constructor implementations to the FieldTypes class to allow every possible conversion to be considered when moving the fundamental and container data types.
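The shape of that change can be sketched as follows. FieldStorage here is a simplified, hypothetical stand-in; the real FieldTypes is templated on FieldT/TraitT as shown above:

```cpp
#include <utility>
#include <vector>

// Simplified sketch of adding copy/move constructors to a
// FieldTypes-style storage holder (hypothetical stand-in).
template <typename T>
struct FieldStorage {
    using value_type = T;

    FieldStorage() = default;

    // Copy and move from another holder.
    FieldStorage(const FieldStorage& rhs)
        : m_data(rhs.m_data) { }
    FieldStorage(FieldStorage&& rhs) noexcept
        : m_data(std::move(rhs.m_data)) { }

    // Copy and move directly from the underlying value_type, so
    // conversions from raw values can also move instead of copy.
    FieldStorage(const value_type& v)
        : m_data(v) { }
    FieldStorage(value_type&& v) noexcept
        : m_data(std::move(v)) { }

    value_type&       reference()  { return m_data; }
    const value_type& data() const { return m_data; }

protected:
    value_type m_data;
};
```

With the value_type&& overload in place, a container built elsewhere can be moved into the holder instead of being deep-copied.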
This brought me my first big gain in performance.
... or maybe it is the nested type?
Unfortunately, there still seemed to be a major problem with the nested types.
I followed the execution path of the benchmark program in the debugger. I was looking for any clues where a copy was being performed rather than a move. Then I located the next big gain in performance.
In the byte-order processing code, I discovered a difference in processing for the nested field-types compared to all of the other types of data. The implementation constructs a Hg::Message object from the raw Alchemy message format that is used to represent the nested field type. This repackaging makes it possible to recursively reuse the same convert_byte_order logic that processes the other field-types.
```cpp
template< typename T, typename StorageT >
struct ConvertEndianess<T, StorageT, nested_trait>
{
  template <typename NestedValueT>
  void operator()(const NestedValueT& input,
                        NestedValueT& output)
  {
    // Construct a shallow message wrapper around the nested data.
    from_type from(input);
    to_type   to;

    // Pass this message to be byte-order swapped.
    output = convert_byte_order<from_type,
                                to_order>(from, to).values();
  }
};
```
You can see that I pass the output field as an input parameter to this function call. To improve the speed of this function, I needed to figure out how to construct this temporary message more efficiently, or alter the processing logic to accept the nested field-type directly.
Moving local resources... don't do it
My first attempt to improve the performance of this temporary instance of the message looked like this:
```cpp
template <typename NestedValueT>
void operator()(const NestedValueT& input,
                      NestedValueT& output)
{
  // Pass this message to be byte-order swapped.
  output = convert_byte_order<from_type,
                              to_order>(from_type(input), to_type()).values();
}
```
That helped, but not as much as I had hoped. Then I made this adjustment:
```cpp
output = std::move(convert_byte_order(/*params*/));
```
That was what I needed!
Then I ran the unit-tests, and a number of them failed. The tests that contained a vector actually crashed. The output of the byte-order conversion showed me that the data was being destroyed. I followed the logic in the debugger and discovered my problem.
I was creating the instance of to_type locally. It is passed into the conversion function and all is well. Then the data is moved into output, and what appears is garbage. I was confused at first; then I watched the addresses of all of the data items.

to_type is created on the stack, and that portion of the stack is destroyed before the data has a chance to be moved into output. I tried many different variations of this approach. My final conclusion was that I would not be able to achieve what I was after without moving the creation of the to_type object outside of the function call.
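The failure mode is easy to reproduce outside of Alchemy. In this minimal sketch (hypothetical names, not Alchemy code), a reference handed back through a function call must not outlive the local it refers to; the safe version moves the contents out while the local is still alive:

```cpp
#include <string>
#include <utility>

// A function that hands back a reference to the buffer it was given,
// standing in for a conversion routine that returns its target.
std::string& passthrough(std::string& s) { return s; }

// DANGEROUS (do not do this): the returned reference is tied to a
// local that is destroyed when the function returns.
//
// std::string& broken() {
//     std::string tmp("temp");
//     return passthrough(tmp);   // tmp dies here: dangling reference
// }

// SAFE: move the contents out while the local is still in scope,
// so a fresh object leaves the function by value.
std::string take_safely() {
    std::string tmp("payload");
    return std::move(passthrough(tmp)); // tmp is still alive here
}
```

The move itself is cheap; what matters is that it happens before the stack frame that owns the data is torn down.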
However, I could not do that, because it would change the usage of the library, and I want to keep the interaction as simple as possible for the user. Therefore, I reworked the structure of the code just a little, and this is the final implementation:
```cpp
template <typename NestedValueT>
void operator()(const NestedValueT& input,
                      NestedValueT& output)
{
  // Pass this message to be byte-order swapped.
  to_type to(output);
  convert_byte_order<from_type, to_order>(from_type(input),
                                          to);
  output = std::move(to);
}
```
convert_byte_order no longer returns a value, because the value it would have returned is the to_type object created here, and we still have access to that instance since it is passed by reference. Therefore, when I am done with it, it is moved into the output parameter, which is also passed by reference into the current function.
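The shape of this final pattern can be shown in isolation. This is a hedged sketch with hypothetical names (convert, process, byteswap16); the real convert_byte_order is templated and far more involved:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for the per-field byte-order logic.
inline std::uint16_t byteswap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

// The conversion no longer returns a value; it fills the
// caller-owned buffer passed by reference.
void convert(const std::vector<std::uint16_t>& in,
             std::vector<std::uint16_t>& to) {
    to.resize(in.size());
    std::transform(in.begin(), in.end(), to.begin(), byteswap16);
}

// Caller creates the scratch object, lets the conversion fill it,
// then moves it into the output parameter. The scratch object is
// alive for the entire sequence, so nothing dangles.
void process(const std::vector<std::uint16_t>& input,
             std::vector<std::uint16_t>& output) {
    std::vector<std::uint16_t> to;   // local scratch, like to_type
    convert(input, to);              // passed by reference, stays alive
    output = std::move(to);          // move, not copy, into the out-param
}
```

The cost of the final std::move is a pointer swap, not a buffer copy, which is where the gain comes from.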
Final Results
Here is where the current performance results stand after the adjustments I made to optimize with move semantics:
| Test | Control | Hg | Diff | Percent |
|-----------|----------|----------|-----------|---------|
| Basic | 0.4133 s | 0.2423 s | -0.1710 s | 41.37% |
| Packed | 0.3959 s | 0.3403 s | -0.0556 s | 14.05% |
| Unaligned | 0.4391 s | 0.2295 s | -0.2096 s | 47.74% |
| Complex | 0.7485 s | 0.5573 s | -0.1912 s | 25.54% |
| Array | 0.5141 s | 0.1376 s | -0.3765 s | 73.23% |
| Total | 2.5109 s | 1.5071 s | -1.0038 s | 39.98% |
Here is a table that compares the final times for each test with the previous implementation of Alchemy, as well as the results after adding move semantics. I think you will agree that the small amount of effort required to add move semantics is well worth spending.
| Test | Alchemy | w/ Move Semantics | Diff | Overall Change |
|-----------|----------|-------------------|-----------|----------------|
| Basic | 0.3193 s | 0.2423 s | -0.0770 s | 24% |
| Packed | 0.3519 s | 0.3403 s | -0.0116 s | 3% |
| Unaligned | 0.4425 s | 0.2295 s | -0.2130 s | 48% |
| Complex | 0.7654 s | 0.5573 s | -0.2081 s | 27% |
| Array | 0.1409 s | 0.1376 s | -0.0033 s | 2% |
| Total | 2.0574 s | 1.5071 s | -0.5503 s | 27% |
Summary
So far I have learned that move semantics is a somewhat finicky feature of Modern C++. However, the rules are not difficult to learn. After an afternoon's worth of effort applying the principles of rvalue references to Alchemy, I was able to realize better than a 25% increase in performance. This was time well spent.