Internals¶

mmapdict files are pickle files containing a dictionary, but laid out in a special format. The main idea is to give the file a predictable structure, so that the offsets for the memory maps can be computed. In addition, there must be a way to disable a specific key, either to replace it or to delete it, without changing the offsets of the rest of the file.

For example, for the following dictionary:

{'key': 'value', 'test': array([1, 2, 3], dtype=uint8)}

The normal pickle module would output:

\x80 PROTO      4
\x95 FRAME      172
}    EMPTY_DICT
\x94 MEMOIZE
(    MARK
\x8c     SHORT_BINUNICODE 'test'
\x94     MEMOIZE
\x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
\x94     MEMOIZE
\x8c     SHORT_BINUNICODE '_reconstruct'
\x94     MEMOIZE
\x93     STACK_GLOBAL
\x94     MEMOIZE
\x8c     SHORT_BINUNICODE 'numpy'
\x94     MEMOIZE
\x8c     SHORT_BINUNICODE 'ndarray'
\x94     MEMOIZE
\x93     STACK_GLOBAL
\x94     MEMOIZE
K        BININT1    0
\x85     TUPLE1
\x94     MEMOIZE
C        SHORT_BINBYTES b'b'
\x94     MEMOIZE
\x87     TUPLE3
\x94     MEMOIZE
R        REDUCE
\x94     MEMOIZE
(        MARK
K            BININT1    1
K            BININT1    3
\x85         TUPLE1
\x94         MEMOIZE
\x8c         SHORT_BINUNICODE 'numpy'
\x94         MEMOIZE
\x8c         SHORT_BINUNICODE 'dtype'
\x94         MEMOIZE
\x93         STACK_GLOBAL
\x94         MEMOIZE
\x8c         SHORT_BINUNICODE 'u1'
\x94         MEMOIZE
K            BININT1    0
K            BININT1    1
\x87         TUPLE3
\x94         MEMOIZE
R            REDUCE
\x94         MEMOIZE
(            MARK
K                BININT1    3
\x8c             SHORT_BINUNICODE '|'
\x94             MEMOIZE
N                NONE
N                NONE
N                NONE
J                BININT     -1
J                BININT     -1
K                BININT1    0
t                TUPLE      (MARK at 132)
\x94         MEMOIZE
b            BUILD
\x89         NEWFALSE
C            SHORT_BINBYTES b'\x01\x02\x03'
\x94         MEMOIZE
t            TUPLE      (MARK at 94)
\x94     MEMOIZE
b        BUILD
\x8c     SHORT_BINUNICODE 'key'
\x94     MEMOIZE
\x8c     SHORT_BINUNICODE 'value'
\x94     MEMOIZE
u        SETITEMS   (MARK at 13)
.    STOP
highest protocol among opcodes = 4

This works fine, but doesn’t allow random access.
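A listing like the one above can be produced with the standard library's pickletools module. The sketch below uses a plain string value rather than a numpy array so that it has no third-party dependency; the variable names are illustrative only.

```python
import io
import pickle
import pickletools

data = {'key': 'value'}
raw = pickle.dumps(data, protocol=4)

# pickletools.dis writes a symbolic disassembly of the pickle to a text stream
out = io.StringIO()
pickletools.dis(raw, out=out)
listing = out.getvalue()
print(listing)
```

The opcodes form one sequential stream, so reaching a given value requires replaying everything before it.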

Let’s look at what a mmappickle.dict file looks like, for the same data:

\x80 PROTO      4
\x95 FRAME      13
J    BININT     1
0    POP
J    BININT     2
0    POP
(    MARK
\x95     FRAME      20
\x8c     SHORT_BINUNICODE 'key'
\x8c     SHORT_BINUNICODE 'value'
J        BININT     1
0        POP
\x88     NEWTRUE
0        POP
\x95     FRAME      110
\x8c     SHORT_BINUNICODE 'test'
\x8c     SHORT_BINUNICODE 'numpy.core.fromnumeric'
\x8c     SHORT_BINUNICODE 'reshape'
\x93     STACK_GLOBAL
\x8c     SHORT_BINUNICODE 'numpy.core.multiarray'
\x8c     SHORT_BINUNICODE 'fromstring'
\x93     STACK_GLOBAL
\x8e     BINBYTES8  b'\x01\x02\x03'
\x8c     SHORT_BINUNICODE 'uint8'
\x86     TUPLE2
R        REDUCE
K        BININT1    3
\x85     TUPLE1
\x86     TUPLE2
R        REDUCE
J        BININT     0
0        POP
\x88     NEWTRUE
0        POP
\x95     FRAME      2
d        DICT       (MARK at 23)
.    STOP
highest protocol among opcodes = 4

We can note the following changes:

There are hidden values at the beginning (version = 1, file revision = 2)

Each key-value pair is in its own frame, which also contains a hidden int (the memo max index) followed by a hidden TRUE.

The numpy array is created using numpy.core.fromnumeric.reshape(numpy.core.multiarray.fromstring(data, dtype), shape) instead of the “traditional” way

The version field allows for future changes to the format, and is currently fixed to 1. The file revision is incremented each time a key of the dictionary is changed, so that caches can be validated when there is concurrent access. The memo max index is stored because MEMOIZE/GET/PUT opcodes may need to be renumbered when pickling values; caching it avoids having to parse the whole file.

Finally, the hidden TRUE is a “hack” that allows removing a key. Data cannot be moved while it is memory-mapped, so instead of removing the entry from the file, the TRUE is overwritten with a POP when the key is deleted. In summary, the stack works as follows:

Key exists: KEY, VALUE, memo max index, POP, TRUE, POP. (reduced as KEY, VALUE)

Key doesn’t exist: KEY, VALUE, memo max index, POP, POP, POP. (disappears when reduced)
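The two cases above can be checked with the standard pickle module alone. The following is a minimal sketch that builds a file in the layout shown earlier by hand and loads it with pickle.loads; the helper names (frame, short_unicode, hidden_int, entry) are hypothetical and are not part of mmappickle, and the memo max index is simply set to 0.

```python
import pickle
import struct

def frame(payload):
    # FRAME opcode followed by the payload length as 8 little-endian bytes
    return pickle.FRAME + struct.pack("<Q", len(payload)) + payload

def short_unicode(text):
    raw = text.encode("utf-8")
    return pickle.SHORT_BINUNICODE + bytes([len(raw)]) + raw

def hidden_int(value):
    # Push a 32-bit int, then POP it immediately: invisible to the reader
    return pickle.BININT + struct.pack("<i", value) + pickle.POP

def entry(key, value, exists=True):
    # NEWTRUE POP keeps key and value on the stack; POP POP removes both
    flag = pickle.NEWTRUE if exists else pickle.POP
    body = short_unicode(key) + short_unicode(value) + hidden_int(0)
    return frame(body + flag + pickle.POP)

# Header: version = 1, file revision = 2, then MARK
header = pickle.PROTO + b"\x04" + frame(hidden_int(1) + hidden_int(2) + pickle.MARK)
terminator = frame(pickle.DICT + pickle.STOP)

blob = header + entry("key", "value") + entry("gone", "x", exists=False) + terminator
result = pickle.loads(blob)
print(result)  # {'key': 'value'}
```

The deactivated entry still occupies its bytes in the file, but leaves nothing on the stack when DICT is executed.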

We can see that the file is composed of three different parts, which are documented below:

The header (mmappickle.dict._header)

Storage of each key-value pair (mmappickle.dict._kvdata)

A terminator (mmappickle.dict._terminator)

Extending `mmappickle`¶

To add support for a new memory-mapped value type, create a new subclass of mmappickle.picklers.base.BasePickler.

This requires some knowledge of the Python internal pickle format, but should be straightforward, using the numpy picklers as inspiration. Feel free to open an issue if more details are required.
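As a rough illustration of the shape such a subclass might take, the sketch below implements the four documented methods (is_picklable, is_valid, read, write) for bytes objects. It is deliberately self-contained: StandInBasePickler is a local stand-in and not the real mmappickle class, the `_file` attribute on the parent object is an assumption, and the priority value is arbitrary. A real pickler would also need to return a memory-mapped value, which this sketch does not attempt.

```python
import io
import pickle
import struct
import types

class StandInBasePickler:
    """Local stand-in; NOT mmappickle.picklers.base.BasePickler."""
    priority = 0

    def __init__(self, parent_object):
        self._parent_object = parent_object

class BytesPicklerSketch(StandInBasePickler):
    """Hypothetical pickler storing bytes as a single BINBYTES8 opcode."""
    priority = 10  # arbitrary value chosen for this sketch

    @property
    def _file(self):
        # Assumption: the parent object exposes its stream as `_file`
        return self._parent_object._file

    def is_picklable(self, obj):
        return isinstance(obj, bytes)

    def is_valid(self, offset, length):
        pos = self._file.tell()
        self._file.seek(offset)
        opcode = self._file.read(1)
        self._file.seek(pos)  # keep the file position
        return length >= 9 and opcode == pickle.BINBYTES8

    def write(self, obj, offset, memo_start_idx=0):
        payload = pickle.BINBYTES8 + struct.pack("<Q", len(obj)) + obj
        pos = self._file.tell()
        self._file.seek(offset)
        self._file.write(payload)
        self._file.seek(pos)  # keep the file position
        # No MEMOIZE/GET/PUT is emitted, so the memo index is returned unchanged
        return len(payload), memo_start_idx

    def read(self, offset, length):
        pos = self._file.tell()
        self._file.seek(offset)
        raw = self._file.read(length)
        self._file.seek(pos)  # keep the file position
        # Append STOP so the lone opcode forms a complete pickle stream
        return pickle.loads(raw + pickle.STOP), length

parent = types.SimpleNamespace(_file=io.BytesIO())
pickler = BytesPicklerSketch(parent)
written, last_memo = pickler.write(b"\x01\x02\x03", 0)
value, consumed = pickler.read(0, written)
```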

Internal API Documentation¶

class mmappickle.dict._header(mmapdict, _real_header_starts_at=0)[source]¶

The file header is at the beginning of the file.

It consists of the following pickle ops:

PROTO 4                                (pickle version 4 header)
FRAME <length>
BININT <_file_version_number:32> POP   (version of the pickle dict, 1)
BININT <_file_commit_number:32> POP    (commit id of the pickle dict, incremented every time something changes)
<additional data depending on the _file_version_number> (none, for version 1)
MARK                                   (start of the dictionary)
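Because every field in this header has a fixed width, the version and commit number sit at constant byte offsets. The sketch below builds a version 1 header and reads both fields back with struct; build_header and parse_header are hypothetical helpers written for this illustration, not functions provided by mmappickle.

```python
import pickle
import struct

def build_header(version=1, commit=0):
    body = (pickle.BININT + struct.pack("<i", version) + pickle.POP
            + pickle.BININT + struct.pack("<i", commit) + pickle.POP
            + pickle.MARK)
    # PROTO 4, then a FRAME whose 8-byte length covers the body
    return (pickle.PROTO + b"\x04"
            + pickle.FRAME + struct.pack("<Q", len(body)) + body)

def parse_header(buf):
    # Layout: PROTO (2 bytes), FRAME (9), BININT opcode (1), version (4),
    # POP (1), BININT opcode (1), commit (4), POP (1), MARK (1)
    (version,) = struct.unpack_from("<i", buf, 12)
    (commit,) = struct.unpack_from("<i", buf, 18)
    return version, commit

header = build_header(version=1, commit=2)
print(len(header), parse_header(header))  # 24 (1, 2)
```

The MARK opcode lands at offset 23, which matches the "MARK at 23" annotation in the disassembly of the example file.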

__init__(mmapdict, _real_header_starts_at=0)[source]¶

Parameters:
mmapdict – mmapdict object containing the data
_real_header_starts_at – Offset of the header (normally not used)

exists¶

Returns:	True if file contains something

write_initial()[source]¶: Write the initial header to the file

is_valid()[source]¶

Returns:	True if file has a valid mmapdict pickle header, False otherwise.

commit_number¶: Commit number (revision) in the file

__len__()[source]¶

Returns:	the total length of the header.

__weakref__¶: list of weak references to the object (if defined)

class mmappickle.dict._terminator(mmapdict)[source]¶

Terminator is the suffix at the end of the mmapdict file.

It consists of the following pickle ops:

FRAME 2
DICT (make the dictionary)
STOP (end of the file)

__init__(mmapdict)[source]¶

Parameters:	mmapdict – mmapdict object containing the data

__len__()[source]¶

Returns:	the length of the terminator

exists¶

Returns:	True if the file ends with the terminator, False otherwise

write()[source]¶: Write the terminator at the end of the file, if it doesn’t exist

__weakref__¶: list of weak references to the object (if defined)

class mmappickle.dict._kvdata(mmapdict, offset)[source]¶

kvdata is the structure holding a key-value data entry.

The trick is that it must unpickle to either two values (key and value), or to nothing if the entry was deleted.

To do this, we put the key and the value on the stack. Then we emit either NEWTRUE+POP (which amounts to a no-op), or POP+POP (which removes both the key and the value). Since NEWTRUE and POP are both one byte long, the substitution can be made in place.

Another trick is to cache the maximum value of the memoization index (for GET and PUT), to ensure that we have no duplicates.

The _kvdata structure has the following pickle ops:

FRAME <length>
SHORT_BINUNICODE <length> <key bytes>
<<< data >>>
BININT <max memo idx> POP (max memo index of this part)
NEWTRUE|POP POP (if NEWTRUE POP: entry is valid, else entry is deactivated.)
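The in-place substitution can be sketched with a bytearray standing in for the memory-mapped file. The helper names are hypothetical and the buffer is built by hand following the layout above; this is an illustration of the idea, not mmappickle's own code.

```python
import pickle
import struct

def frame(payload):
    return pickle.FRAME + struct.pack("<Q", len(payload)) + payload

def short_unicode(text):
    raw = text.encode("utf-8")
    return pickle.SHORT_BINUNICODE + bytes([len(raw)]) + raw

# One valid entry: key, value, hidden max memo index, then NEWTRUE POP
kv = frame(
    short_unicode("key") + short_unicode("value")
    + pickle.BININT + struct.pack("<i", 0) + pickle.POP
    + pickle.NEWTRUE + pickle.POP
)
header = pickle.PROTO + b"\x04" + frame(
    pickle.BININT + struct.pack("<i", 1) + pickle.POP
    + pickle.BININT + struct.pack("<i", 2) + pickle.POP
    + pickle.MARK
)
terminator = frame(pickle.DICT + pickle.STOP)

buf = bytearray(header + kv + terminator)
size_before = len(buf)
before = pickle.loads(bytes(buf))

# The validity byte is the second-to-last byte of the entry
valid_offset = len(header) + len(kv) - 2
assert buf[valid_offset:valid_offset + 1] == pickle.NEWTRUE
buf[valid_offset:valid_offset + 1] = pickle.POP  # one byte for one byte

after = pickle.loads(bytes(buf))
print(before, after)  # {'key': 'value'} {}
```

Only a single byte changes, so every other offset in the file stays where it was.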

__init__(mmapdict, offset)[source]¶

Parameters:
mmapdict – mmapdict object containing the data
offset – Offset of the key-value data

__len__()[source]¶

Returns:	the length of the key-value data

offset¶

Returns:	the offset in the file of the key-value data

end_offset¶

Returns:	the end-offset in the file of the key-value data

_frame_length¶

Returns:	the frame length for this _kvdata.

This is done either by reading it in the file, or by computing it if it doesn’t exist

_exists_initial¶

Returns:	True if the file contains the header of the frame

data_offset¶

Returns:	the offset of the pickled data

key_length¶

Returns:	the binary length of the key

_valid_offset¶

Returns:	the offset of the valid byte

_memomaxidx_offset¶

Returns:	the offset of the max memo index

data_length¶

Returns:	the length of the pickled data

key¶

Returns:	the key as a Unicode string

memomaxidx¶

Returns:	the (cached) max memo index

valid¶

Returns:	True if the key-value pair is valid, False otherwise (i.e. key was deleted)

_write_if_allowed()[source]¶: Write to file, if it is possible to do so

__weakref__¶: list of weak references to the object (if defined)

class mmappickle.picklers.base.BasePickler(parent_object)[source]¶

Bases: object

Picklers will be attempted in decreasing priority order

__init__(parent_object)[source]¶: Initialize self. See help(type(self)) for accurate signature.

is_valid(offset, length)[source]¶

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)[source]¶: Return True if object can be pickled with this pickler

read(offset, length)[source]¶: Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)[source]¶

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

__weakref__¶: list of weak references to the object (if defined)

class mmappickle.picklers.base.GenericPickler(parent_object)[source]¶

Bases: mmappickle.picklers.base.BasePickler

priority¶

Integer priority of this pickler. Picklers are attempted in decreasing priority order.

is_valid(offset, length)[source]¶

Return True if object starting at offset in f is valid.

File position is kept.

is_picklable(obj)[source]¶: Return True if object can be pickled with this pickler

read(offset, length)[source]¶: Return the unpickled object read from offset, and the length read. The file position is kept.

write(obj, offset, memo_start_idx=0)[source]¶

Write the pickled object to the file stream, the file position is kept.

Returns a tuple (number of bytes, last memo index)

mmappickle.utils.save_file_position(f)[source]¶: Decorator to save the object._file stream position before calling the method
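A decorator of this kind could look roughly like the sketch below. It assumes that the decorated method belongs to an object with a `_file` stream attribute and that the saved position is restored once the method returns; save_file_position_sketch and the Reader class are hypothetical names used only for this illustration, and the actual implementation may differ.

```python
import functools
import io

def save_file_position_sketch(method):
    """Remember self._file's position and restore it after the call."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        pos = self._file.tell()
        try:
            return method(self, *args, **kwargs)
        finally:
            # Restore the position even if the method raised
            self._file.seek(pos)
    return wrapper

class Reader:
    """Toy object with a _file stream, used only for this example."""
    def __init__(self, data):
        self._file = io.BytesIO(data)

    @save_file_position_sketch
    def peek(self, offset, length):
        self._file.seek(offset)
        return self._file.read(length)

reader = Reader(b"0123456789")
reader._file.seek(3)
chunk = reader.peek(6, 2)
print(chunk, reader._file.tell())  # b'67' 3
```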

mmappickle.utils.require_writable(f)[source]¶: Require the object’s _file to be writable, otherwise raise an exception.

mmappickle.utils.lock(f)[source]¶: Lock the file during the execution of this method. This is a re-entrant lock.
