In the previous posts we have been discussing about enormous memory overhead of std::string for short strings. To recap, if we have to store 1 character string using std::string, it will occupy at least 40 bytes. Of course this is not a bug, but a feature. Additional internal data such as pointer, size are required to perform string manipulation operations efficiently and in convenient way.

The most common approach is presented in this short codelet:

Take a look, that we do not need to manipulate the strings. We just need to store them, and later maybe read or compare. Another example are keys in maps. Of course we do not want to change them. They are constant, so do we really need that additional internal data, useful  only for string manipulations? Of course not.

When we have to work on millions of short strings, then the memory consumption may be an issue. Using 40 MB to store 1MB of data is quite a significant waste of memory. Therefore, it is worth considering if there are another ways of storing strings in memory.

First of all, let’s take a look what does make the overhead when using std::string class  (GNU ISO C++, G++ 4.6.2 -O2 optimizations, 32 bit binary, Linux Fedora 16 x86)::

  1. pointer to the text data (4 bytes),
  2. length of the data (4 bytes),
  3. length of the allocated data (4 bytes),
  4. reference counter (4 bytes)
Additional data: 16 bytes. If created on the heap, additional 4 bytes will be occupied to store size of allocated memory. In total – 24 bytes. Also, memory alignment may be applied. In my case, allocated memory size is rounded to 8 bytes, but it is at least 16 bytes. Take into account that it is OS and compiler specific, though.

How about size of the actual text data? Let’s consider the smallest non-empty string, eq. “a”. It will consume 1 byte + null terminating character. Take a look that it is always allocated on the heap, therefore extra size data will have to be stored, as well as alignment applied. In total, in my environment it is 16 bytes.

Let’s sum up. The “j” string stored as std::string object, when allocated using new operator will consume 40 bytes. It is 40 times more than our one-byte “j” character. It is 3900 MB of overhead per 100 MB of actual data. Pretty much of overhead, isn’t it?

So, how to do it better?

We have to limit amount of additional data. Let’s go through the list of additional data above and consider what can we remove, and what will be the consequences.

  1. Pointer to the text data (4 bytes).
    • If removed, then size of the data will have to be a member of the class. Therefore it’s size will have to be known in compile-time. It is a very strict assumption.
  2. Length of the data (4 bytes).
    • Can be removed, but we getting length of the array will be O(n) instead of O(1) in term of complexity.
  3. Length of the allocated data (4 bytes).
    • If we don’t want to perform string manipulation, then we can remove this member.
  4. reference counter (4 bytes)
    • This is just used for a speed optimization. If we do not want to manipulate the string very often, then this member is unnecessary.

One more important thing is that if we decide to remove the pointer to data, then no additional 4 bytes of memory will be used to store size of allocated data. If array of such lightweight strings will be created, then each element will not be aligned. Therefore we will save at least 4 bytes per each object. Take into account, that we can still create that lightweight-class object on the heap and additional 4 bytes + alignment will occur, but only once in contrast to creating std::string (twice).

So, it seems that we can get rid of all the additional data, but as a result constant string size will have to be assumed. Of course, if we use null as a string terminator, then our strings can be shorter, but the memory requirement for each string object will be constant and equal to the size declared in the code.

In this article we will use boost::array to solve this problem. We could also use std::array from c++11, however it is not widely implemented, yet.

Using boost::array in std::vector, std::list, std::set, etc, but not maps.

This is actually very simple. The boost::array wrapper contains elems public field which can be accessed directly. It is declared as:  Type boost::array::elems[size];. In std::array this member is called data.

First benchmark.

dyerware.com


As you can see, pretty good optimization. The small but existing overhead is in my opinion completely acceptable.

Using boost::array in maps, as values and keys.

Situation is quite similar. If we use boost::array as values only, then there are no additional requirements, it will work out-of-the-box. However, in most cases the string is a key. In such case, we have to implement hashing support for boost::array. By the way I have no idea, why it is not yet implemented in boost/functional/hash/extensions.hpp, where hashing functions for every common container are provided. In this case, I am convinced that we are eligible to define it in boost namespace:

Hashing functions for other hash map implementations are provided in the code listing on the end of this article.

The final benchmark.

Couple of days ago I have published article about comparison of several hash map implementations, when used with std::string as keys and values. There was no winner, because of huge overhead coming from std::string. In this benchmark let’s do the comparison one more time.

dyerware.com


dyerware.com


dyerware.com


dyerware.com


As  you can see, the memory improvement by using boost::array over std::string is very significant. For lots of small strings, where memory consumption is an issue it is worth considering. We can’t use it every time, though. The maximum size of each string must be declared a priori. If too long, then the memory consumption may be higher than by using std::string.

How about the speed, is it decreased? Nope. Comparing with my previous benchmark (check my post - http://jovislab.com/blog/?p=46) it is even faster – around 2 times.

When to use boost::array / std::array:

  1. Memory consumption is an issue and
  2. Every string is shorter than ~100 bytes (the smaller the better) and
  3. Strings’ lengths distribution looks like exponential or (better) is constant.

When to use std::string:

  1. In any other case.

 

Share and Enjoy

Share →

6 Responses to Memory efficient way to store strings in hash maps using boost::array

  1. [...] http://jovislab.com/blog/?p=89#.T43CQGMSaOU.reddit [...]
  2. fileoffset says:
    Rather than: std::copy(example.begin(), example.end(), new_str.begin()); Is it possible to assign to new_str? new_str.assign(example.begin(), example.end()); Also: s/threw/through/g
    • asanoki says:
      Hi fileoffset, thanks for your comment. Unfortunately it is not possible. boost::array::assign fills the array with one value, works like memset. As for std::array, on Visual Studio it works like boost, on G++ 4.6.2 such method is not available at all. "Threw" fixed. Thanks.
      • fileoffset says:
        Ah interesting, thanks!Keep up the empirical work ;) Also you don't mention the compiler flags you used to generate your tests, it would be good to see them in the article as they can make a large difference to execution speed.
  3. [...] Memory efficient way to store strings in hash maps using boost::array | Jovislab dev blog. [...]