Memory-mapped BLOBs in action (C++/Python)

> Coding, hacking, computer graphics, game dev, and such...
User avatar
fips
Site Admin
Posts: 166
Joined: Wed Nov 12, 2008 9:49 pm
Location: Prague
Contact:

Memory-mapped BLOBs in action (C++/Python)

Post by fips »

I've spent some more time playing with memory-mapped BLOBs with very satisfactory results. Not only It's a great win in terms of performance, but it also helps one to appreciate pure data rather than various code-driven dynamic data structures and other complicated abstractions. This change in mindset seems to be crucial in these days of highly concurrent systems. Moreover, carefully designed data will surely outlive all that fragile code around. A nice pack of data is also easier to visualize and reason about. So to me, it seems like a pretty good long-term investment too.

As I'm just implementing an asynchronous read-only file system or file cache, which I call just Bundle, I've come up with a BLOB to represent its directory structure. Using a Python script similar to the one in my previous post I get a BLOB like this:

Image

Which corresponds to this formal definition:

Code: Select all

/---------------------\
| [blob]              |
+ - - - - - - - - - - +
| tag        (4 * u8) | ... BLOB signature ~ '4DIR'
| size          (u32) | ... size of the whole BLOB incl. [blob]
>---------------------<
| [header_block]      |
+ - - - - - - - - - - +
| size          (u32) | ... size of [header]
| offset        (u32) | ... offset of [header] from [blob]
>---------------------<
| [filelist_block]    |
+ - - - - - - - - - - +
| size          (u32) | ... size of [filelist]
| offset        (u32) | ... offset of [filelist] form [blob]
>---------------------<
| [filenames_block]   |
+ - - - - - - - - - - +
| size          (u32) | ... size of [filenames]
| offset        (u32) | ... offset of [filenames] from [blob]
>---------------------<
| [header]            |
+ - - - - - - - - - - +
| version        (u8) | ... 1
| reserved0  (7 * u8) | ... 0, 0, 0, 0, 0, 0, 0
>---------------------<
| [filelist]          |      >---------------------<
+ - - - - - - - - - - +   ,->| [file]              |       
| num_files     (u32) |  /   + - - - - - - - - - - +
| files    (n * File) |-'    | fname_offset  (u32) | ... offset from [filenames]
>---------------------<      | fname_length  (u32) | ... length of a fname string
| [filenames]         |      >---------------------<
+ - - - - - - - - - - +      >---------------------<
| <data>              |------|"foo0bar/baz0..."    | ... zero-terminated
\---------------------/      >---------------------<     fname strings
*note that all that structures and attributes are carefully aligned to 64-bits

Which translates directly to C++ like this:
VIEW THE CODE BELOW IN FULL-SCREEN (bu_directory_blob.h)

Code: Select all

/*---{fips-meta:file-header:public}-------------------------------------*\
| o     _______ __         _______  _   ______           __            o |
|      |    ___|__|.-----.|     __|| | |      |.-----.--|  |.-----.      |
|      |    ___|  ||  _  ||__     ||_| |   ---||  _  |  _  ||  -__|      |
|      |___|   |__||   __||_______|    |______||_____|_____||_____|      |
|       FipS' CODE |__| > > > > > > > > > > > > > > > > > > > > > >      |
|    (c) 2004-13 +++ Filip Stoklas, aka FipS, http://www.4FipS.com +++   |
| o         THIS CODE IS FREE - LICENSED UNDER THE MIT LICENSE         o |
\*----------------------------------------------------------------------*/
#ifndef FS_HEADER__sources__fs_2_bundle__bu_directory_blob_h__GUARD
#define FS_HEADER__sources__fs_2_bundle__bu_directory_blob_h__GUARD

#include "bu_config.h"

namespace fs { namespace bundle { namespace blob { namespace directory {

using namespace fo::blob;

struct Header
{
    Uint8 version;
    Uint8 reserved0[7];
};

struct File
{
    Uint32 fname_offset;
    Uint32 fname_length;
};

struct Filelist
{
    Uint32 num_files;
    File files[1];
};

struct Filenames
{
    char data[1];
};

struct Directory
{
    Blob blob;
    Block header_block;
    Block filelist_block;
    Block filenames_block;
    // [header]
    // [filelist]
    // [filenames]
};

}}}} // namespace fs::bundle::blob::directory

#endif // FS_HEADER__sources__fs_2_bundle__bu_directory_blob_h__GUARD
Than I use the following lightweight C++ 'reference type' to do the mappings:
VIEW THE CODE BELOW IN FULL-SCREEN (bu_directory_ref.h)

Code: Select all

/*---{fips-meta:file-header:public}-------------------------------------*\
| o     _______ __         _______  _   ______           __            o |
|      |    ___|__|.-----.|     __|| | |      |.-----.--|  |.-----.      |
|      |    ___|  ||  _  ||__     ||_| |   ---||  _  |  _  ||  -__|      |
|      |___|   |__||   __||_______|    |______||_____|_____||_____|      |
|       FipS' CODE |__| > > > > > > > > > > > > > > > > > > > > > >      |
|    (c) 2004-13 +++ Filip Stoklas, aka FipS, http://www.4FipS.com +++   |
| o         THIS CODE IS FREE - LICENSED UNDER THE MIT LICENSE         o |
\*----------------------------------------------------------------------*/
#ifndef FS_HEADER__sources__fs_2_bundle__bu_directory_ref_h__GUARD
#define FS_HEADER__sources__fs_2_bundle__bu_directory_ref_h__GUARD

#include "bu_directory_blob.h"

namespace fs { namespace bundle {

/// A lightweight directory reference type, which maps a directory BLOB.
class Directory_ref
{
 public:

    typedef fo::Table_index_pod<size_t, Directory_ref> Index;

    Directory_ref(fo::Bytes_ref dir_blob);

    size_t num_files() const;
    fo::String_ref file(Index index) const;
    bool has_file(fo::String_ref fname) const;
    Index find_file(fo::String_ref fname) const;

 private:

    // external references to 'dir_blob'
    const blob::directory::Filelist *_filelist;
    const blob::directory::Filenames *_filenames;
};

}} // namespace fs::bundle

#endif // FS_HEADER__sources__fs_2_bundle__bu_directory_ref_h__GUARD
VIEW THE CODE BELOW IN FULL-SCREEN (bu_directory_ref.cpp)

Code: Select all

/*---{fips-meta:file-header:public}-------------------------------------*\
| o     _______ __         _______  _   ______           __            o |
|      |    ___|__|.-----.|     __|| | |      |.-----.--|  |.-----.      |
|      |    ___|  ||  _  ||__     ||_| |   ---||  _  |  _  ||  -__|      |
|      |___|   |__||   __||_______|    |______||_____|_____||_____|      |
|       FipS' CODE |__| > > > > > > > > > > > > > > > > > > > > > >      |
|    (c) 2004-13 +++ Filip Stoklas, aka FipS, http://www.4FipS.com +++   |
| o         THIS CODE IS FREE - LICENSED UNDER THE MIT LICENSE         o |
\*----------------------------------------------------------------------*/
#include "bu_directory_ref.h"

namespace fs { namespace bundle {

Directory_ref::Directory_ref(fo::Bytes_ref dir_blob):
_filelist(nullptr),
_filenames(nullptr)
{
    using namespace blob::directory;

    FS_ASSERT(!dir_blob.empty());
    const Directory *dir = reinterpret_cast<const Directory *>(dir_blob.data());
    const Blob &blob = dir->blob;
    FS_ASSERT(tag(blob) == Tag('4', 'D', 'I', 'R'));
    FS_ASSERT(blob.size == dir_blob.size());
    const Block &header_block = dir->header_block;
    const Header *header = reinterpret_cast<const Header *>(dir_blob.data() + header_block.offset);
    FS_ASSERT(header->version == 1);
    const Block &filelist_block = dir->filelist_block;
    const Filelist *filelist = reinterpret_cast<const Filelist *>(dir_blob.data() + filelist_block.offset);
    const Block &filenames_block = dir->filenames_block;
    const Filenames *filenames = reinterpret_cast<const Filenames *>(dir_blob.data() + filenames_block.offset);

    _filelist = filelist;
    _filenames = filenames;
}

size_t Directory_ref::num_files() const
{
    return _filelist->num_files;
}

fo::String_ref Directory_ref::file(Index index) const
{
    FS_ASSERT(index.in_range(0, num_files()));
    const size_t fname_offset = _filelist->files[index.pos()].fname_offset;
    const char *fname = _filenames->data + fname_offset;
    return fname;
}

bool Directory_ref::has_file(fo::String_ref fname) const
{
    return find_file(fname).valid();
}

Directory_ref::Index Directory_ref::find_file(fo::String_ref fname) const
{
    using namespace blob::directory;

    // the linear search below, will be later replaced by a binary search,
    // using std::lower_bound(), but we need to implement an iterator for
    // 'blob::File' first; the sequence is always sorted...

    const char *fnames = _filenames->data;
    for(size_t i = 0, n = _filelist->num_files; i < n; ++i)
    {
        const File &cur_file = _filelist->files[i];
        const char *cur_fname = fnames + cur_file.fname_offset;
        if(strcmp(cur_fname, fname.cstr()) == 0)
            return Index::make(i);
    }

    return Index::invalid();
}

}} // namespace fs::bundle
So as you can see, with this approach a pretty complicated data structure can be fetched in one run, using just a single memory allocation. We can completely get rid of allocation-hungry STL containers, but still preserve the ability to use STL algorithms (on condition that custom iterators are provided, which is not that difficult). A carefully 64-bit aligned BLOB is perfectly suited for both 32 and 64-bit code running over it. Data can be easily arranged into Structure of Arrays (SOA) to separate hot and cold pieces in order to provide better cache locality. All this offers high-performance concurrent access, especially when dealing with read-only data.