In the previous posts we have been discussing about enormous memory overhead of std::string for short strings. To recap, if we have to store 1 character string using std::string, it will occupy at least 40 bytes. Of course this is not a bug, but a feature. Additional internal data such as pointer, size are required to perform string manipulation operations efficiently and in convenient way.
The most common approach is presented in this short codelet:
|
1 2 3 4 5 6 7 8 9 10 11 |
#include <vector>
#include <string>
int main() {
std::vector<std::string> data;
for (int index = 0; index < 1000000; index++) {
std::string example = "jovislab";
data.push_back(example);
}
return 0;
} |
Take a look, that we do not need to manipulate the strings. We just need to store them, and later maybe read or compare. Another example are keys in maps. Of course we do not want to change them. They are constant, so do we really need that additional internal data, useful only for string manipulations? Of course not.
When we have to work on millions of short strings, then the memory consumption may be an issue. Using 40 MB to store 1MB of data is quite a significant waste of memory. Therefore, it is worth considering if there are another ways of storing strings in memory.
First of all, let’s take a look what does make the overhead when using std::string class (GNU ISO C++, G++ 4.6.2 -O2 optimizations, 32 bit binary, Linux Fedora 16 x86)::
- pointer to the text data (4 bytes),
- length of the data (4 bytes),
- length of the allocated data (4 bytes),
- reference counter (4 bytes)
How about size of the actual text data? Let’s consider the smallest non-empty string, eq. “a”. It will consume 1 byte + null terminating character. Take a look that it is always allocated on the heap, therefore extra size data will have to be stored, as well as alignment applied. In total, in my environment it is 16 bytes.
Let’s sum up. The “j” string stored as std::string object, when allocated using new operator will consume 40 bytes. It is 40 times more than our one-byte “j” character. It is 3900 MB of overhead per 100 MB of actual data. Pretty much of overhead, isn’t it?
So, how to do it better?
We have to limit amount of additional data. Let’s go through the list of additional data above and consider what can we remove, and what will be the consequences.
- Pointer to the text data (4 bytes).
- If removed, then size of the data will have to be a member of the class. Therefore it’s size will have to be known in compile-time. It is a very strict assumption.
- Length of the data (4 bytes).
- Can be removed, but we getting length of the array will be O(n) instead of O(1) in term of complexity.
- Length of the allocated data (4 bytes).
- If we don’t want to perform string manipulation, then we can remove this member.
- reference counter (4 bytes)
- This is just used for a speed optimization. If we do not want to manipulate the string very often, then this member is unnecessary.
One more important thing is that if we decide to remove the pointer to data, then no additional 4 bytes of memory will be used to store size of allocated data. If array of such lightweight strings will be created, then each element will not be aligned. Therefore we will save at least 4 bytes per each object. Take into account, that we can still create that lightweight-class object on the heap and additional 4 bytes + alignment will occur, but only once in contrast to creating std::string (twice).
So, it seems that we can get rid of all the additional data, but as a result constant string size will have to be assumed. Of course, if we use null as a string terminator, then our strings can be shorter, but the memory requirement for each string object will be constant and equal to the size declared in the code.
In this article we will use boost::array to solve this problem. We could also use std::array from c++11, however it is not widely implemented, yet.
Using boost::array in std::vector, std::list, std::set, etc, but not maps.
This is actually very simple. The boost::array wrapper contains elems public field which can be accessed directly. It is declared as: Type boost::array::elems[size];. In std::array this member is called data.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
#include <vector>
#include <string>
#include <boost/array.hpp>
int main() {
std::vector<boost::array<char, 8> > data;
std::string example = "jovislab";
boost::array<char, 8> new_str;
// Call new_str.fill(0) if necessary!
std::copy(example.begin(), example.end(), new_str.begin());
for (int index = 0; index < 1000000; index++) {
data.push_back(new_str);
}
return 0;
} |
First benchmark.
As you can see, pretty good optimization. The small but existing overhead is in my opinion completely acceptable.
Using boost::array in maps, as values and keys.
Situation is quite similar. If we use boost::array as values only, then there are no additional requirements, it will work out-of-the-box. However, in most cases the string is a key. In such case, we have to implement hashing support for boost::array. By the way I have no idea, why it is not yet implemented in boost/functional/hash/extensions.hpp, where hashing functions for every common container are provided. In this case, I am convinced that we are eligible to define it in boost namespace:
|
1 2 3 4 5 6 7 8 |
namespace boost {
template<class T, std::size_t size>
inline std::size_t hash_value(boost::array<T, size> const& a) {
return hash_range(a.begin(), a.end());
}
} |
Hashing functions for other hash map implementations are provided in the code listing on the end of this article.
The final benchmark.
Couple of days ago I have published article about comparison of several hash map implementations, when used with std::string as keys and values. There was no winner, because of huge overhead coming from std::string. In this benchmark let’s do the comparison one more time.
As you can see, the memory improvement by using boost::array over std::string is very significant. For lots of small strings, where memory consumption is an issue it is worth considering. We can’t use it every time, though. The maximum size of each string must be declared a priori. If too long, then the memory consumption may be higher than by using std::string.
When to use boost::array / std::array:
- Memory consumption is an issue and
- Every string is shorter than ~100 bytes (the smaller the better) and
- Strings’ lengths distribution looks like exponential or (better) is constant.
When to use std::string:
- In any other case.
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 |
/*
* maps.cpp
*
* Created on: Apr 16, 2012
* Author: asanoki
*/
#include <iostream>
#include <cstddef>
#include <boost/progress.hpp>
#include <boost/array.hpp>
#include <array>
#include <unordered_map>
#include <ext/hash_map>
#include <boost/unordered_map.hpp>
#include <google/sparse_hash_map>
#include <google/dense_hash_map>
#include <map>
#include <boost/functional/hash/extensions.hpp>
#define MAP_CLASS std::map
#define KEY_SIZE 3
#define VALUE_SIZE 3
#define KEY_TYPE boost::array<char, KEY_SIZE>
#define VALUE_TYPE boost::array<char, VALUE_SIZE>
#define HASH_TYPE
#define SET_SIZE 1000000
// Only for __gnu_cxx::hash_map
namespace __gnu_cxx {
template<> struct hash<KEY_TYPE> {
size_t operator()(const KEY_TYPE& a) const {
return boost::hash_range(a.begin(), a.end());
}
};
}
// Only for std::unordered_map
namespace std {
template<> struct hash<KEY_TYPE> {
size_t operator()(const KEY_TYPE& a) const {
return boost::hash_range(a.begin(), a.end());
}
};
}
// Only for boost::unordered_map
namespace boost {
template<class T, std::size_t size>
inline std::size_t hash_value(boost::array<T, size> const& a) {
return hash_range(a.begin(), a.end());
}
}
// Only for google::sparse_hash_map and google::dense_hash_map
namespace std {
namespace tr1 {
template<> struct hash<KEY_TYPE> {
size_t operator()(const KEY_TYPE& a) const {
return boost::hash_range(a.begin(), a.end());
}
};
}
}
int main() {
// Check memory
std::cout << "Check memory consumption and hit Enter." << std::endl;
{
std::string dummy;
std::getline(std::cin, dummy);
}
MAP_CLASS <KEY_TYPE, VALUE_TYPE > map_memory;
MAP_CLASS <KEY_TYPE, VALUE_TYPE > map;
std::vector<KEY_TYPE > keys;
std::vector<VALUE_TYPE > values;
KEY_TYPE key_buffer;
VALUE_TYPE value_buffer;
srand(time(NULL));
// Only for google::sparse_hash_map
// key_buffer[0] = 0;
// map.set_deleted_key(key_buffer);
// Only for google::dense_hash_map
// key_buffer[0] = 0;
// key_buffer[1] = 1;
// map_memory.set_empty_key(key_buffer);
// map.set_empty_key(key_buffer);
// key_buffer[0] = 0;
// key_buffer[1] = 2;
// map.set_deleted_key(key_buffer);
// Insert random. Slow, but we are checking memory.
std::cout << "Memory" << std::endl;
{
boost::progress_timer t;
for (int index = 0; index < SET_SIZE; index++) {
for (int k = 0; k < KEY_SIZE; k++) {
key_buffer[k] = 1 + (rand() % 0xff - 1);
}
for (int k = 0; k < VALUE_SIZE; k++) {
value_buffer[k] = 1 + (rand() % 0xff - 1);
}
map_memory[key_buffer] = value_buffer;
}
}
// Check memory
std::cout << "Check memory consumption and hit Enter." << std::endl;
{
std::string dummy;
std::getline(std::cin, dummy);
}
// Prepare keys, because we will measure speed.
std::cout << "Preparing" << std::endl;
{
boost::progress_timer t;
for (int index = 0; index < SET_SIZE; index++) {
for (int k = 0; k < KEY_SIZE; k++) {
key_buffer[k] = 1 + (rand() % 0xff - 1);
}
keys.push_back(key_buffer);
for (int k = 0; k < VALUE_SIZE; k++) {
value_buffer[k] = 1 + (rand() % 0xff - 1);
}
values.push_back(value_buffer);
}
}
// Find
std::cout << "Insert" << std::endl;
{
boost::progress_timer t;
for (int index = 0; index < SET_SIZE; index++) {
map[keys[index]] = values[index];
}
}
// Find
std::cout << "Find" << std::endl;
{
boost::progress_timer t;
for (int index = 0; index < SET_SIZE; index++) {
VALUE_TYPE &result = map[keys[index]];
}
}
// Delete
std::cout << "Delete" << std::endl;
{
boost::progress_timer t;
for (int index = 0; index < SET_SIZE; index++) {
map.erase(keys[index]);
}
}
return 0;
} |


6 Responses to Memory efficient way to store strings in hash maps using boost::array