Internals
mmapdict files are pickle files containing a dictionary, but with a special format. The main idea is to give the file a predictable structure, so that the offsets for the memory maps can be computed. In addition, a way to disable a specific key is required, so that the key can be replaced or deleted without changing the offsets in the file.
For example, for the following dictionary:
{'key': 'value', 'test': array([1, 2, 3], dtype=uint8)}
The normal pickle module would output:
0: \x80 PROTO 4
2: \x95 FRAME 172
11: } EMPTY_DICT
12: \x94 MEMOIZE
13: ( MARK
14: \x8c SHORT_BINUNICODE 'test'
20: \x94 MEMOIZE
21: \x8c SHORT_BINUNICODE 'numpy.core.multiarray'
44: \x94 MEMOIZE
45: \x8c SHORT_BINUNICODE '_reconstruct'
59: \x94 MEMOIZE
60: \x93 STACK_GLOBAL
61: \x94 MEMOIZE
62: \x8c SHORT_BINUNICODE 'numpy'
69: \x94 MEMOIZE
70: \x8c SHORT_BINUNICODE 'ndarray'
79: \x94 MEMOIZE
80: \x93 STACK_GLOBAL
81: \x94 MEMOIZE
82: K BININT1 0
84: \x85 TUPLE1
85: \x94 MEMOIZE
86: C SHORT_BINBYTES b'b'
89: \x94 MEMOIZE
90: \x87 TUPLE3
91: \x94 MEMOIZE
92: R REDUCE
93: \x94 MEMOIZE
94: ( MARK
95: K BININT1 1
97: K BININT1 3
99: \x85 TUPLE1
100: \x94 MEMOIZE
101: \x8c SHORT_BINUNICODE 'numpy'
108: \x94 MEMOIZE
109: \x8c SHORT_BINUNICODE 'dtype'
116: \x94 MEMOIZE
117: \x93 STACK_GLOBAL
118: \x94 MEMOIZE
119: \x8c SHORT_BINUNICODE 'u1'
123: \x94 MEMOIZE
124: K BININT1 0
126: K BININT1 1
128: \x87 TUPLE3
129: \x94 MEMOIZE
130: R REDUCE
131: \x94 MEMOIZE
132: ( MARK
133: K BININT1 3
135: \x8c SHORT_BINUNICODE '|'
138: \x94 MEMOIZE
139: N NONE
140: N NONE
141: N NONE
142: J BININT -1
147: J BININT -1
152: K BININT1 0
154: t TUPLE (MARK at 132)
155: \x94 MEMOIZE
156: b BUILD
157: \x89 NEWFALSE
158: C SHORT_BINBYTES b'\x01\x02\x03'
163: \x94 MEMOIZE
164: t TUPLE (MARK at 94)
165: \x94 MEMOIZE
166: b BUILD
167: \x8c SHORT_BINUNICODE 'key'
172: \x94 MEMOIZE
173: \x8c SHORT_BINUNICODE 'value'
180: \x94 MEMOIZE
181: u SETITEMS (MARK at 13)
182: . STOP
highest protocol among opcodes = 4
This works fine, but doesn’t allow random access.
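A listing in this style can be produced with the standard library's pickletools module. The following is a minimal sketch on a smaller, numpy-free dictionary (the dictionary contents are only an illustration), so the offsets differ from the listing above:

```python
import io
import pickle
import pickletools

# A smaller dictionary than in the text, so that no third-party package is needed.
data = {"key": "value"}
payload = pickle.dumps(data, protocol=4)

# pickletools.dis writes one line per opcode: offset, opcode byte, name, argument.
out = io.StringIO()
pickletools.dis(payload, out=out)
listing = out.getvalue()
print(listing)
```

Because a regular pickle is one opaque stream, reading a single value still means unpickling everything that precedes it.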
Let’s look at what a mmappickle.dict file looks like for the same data:
0: \x80 PROTO 4
2: \x95 FRAME 13
11: J BININT 1
16: 0 POP
17: J BININT 2
22: 0 POP
23: ( MARK
24: \x95 FRAME 20
33: \x8c SHORT_BINUNICODE 'key'
38: \x8c SHORT_BINUNICODE 'value'
45: J BININT 1
50: 0 POP
51: \x88 NEWTRUE
52: 0 POP
53: \x95 FRAME 110
62: \x8c SHORT_BINUNICODE 'test'
68: \x8c SHORT_BINUNICODE 'numpy.core.fromnumeric'
92: \x8c SHORT_BINUNICODE 'reshape'
101: \x93 STACK_GLOBAL
102: \x8c SHORT_BINUNICODE 'numpy.core.multiarray'
125: \x8c SHORT_BINUNICODE 'fromstring'
137: \x93 STACK_GLOBAL
138: \x8e BINBYTES8 b'\x01\x02\x03'
150: \x8c SHORT_BINUNICODE 'uint8'
157: \x86 TUPLE2
158: R REDUCE
159: K BININT1 3
161: \x85 TUPLE1
162: \x86 TUPLE2
163: R REDUCE
164: J BININT 0
169: 0 POP
170: \x88 NEWTRUE
171: 0 POP
172: \x95 FRAME 2
181: d DICT (MARK at 23)
182: . STOP
highest protocol among opcodes = 4
We can note the following changes:
- There are hidden values at the beginning (version = 1, file revision = 2).
- Each key-value pair is in its own frame, which also contains a hidden int (the memo max index) followed by a hidden TRUE.
- The numpy array is created using numpy.core.fromnumeric.reshape(numpy.core.multiarray.fromstring(data, dtype), shape) instead of the “traditional” way.
The version field is reserved for future developments, and is currently fixed to 1.
The file revision is increased each time a key of the dictionary is changed, to allow caching when there is concurrent access.
The memo max index is stored because MEMOIZE/GET/PUT opcodes may have to be renumbered when pickling values. It acts as a cache that avoids having to parse the whole file.
Finally, the hidden TRUE is a “hack” that allows a key to be removed. Data cannot be moved while it is memory-mapped, so instead the hidden TRUE is replaced by a POP when the key is deleted. In summary, the stack works as follows:
- Key exists: KEY, VALUE, memo max index, POP, TRUE, POP (reduces to KEY, VALUE).
- Key doesn’t exist: KEY, VALUE, memo max index, POP, POP, POP (disappears when reduced).
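The effect of swapping the hidden TRUE for a POP can be checked with the standard pickle module alone. The sketch below hand-assembles a simplified stream (no frames, a single string value); the helper names short_unicode and hidden_int are made up for this illustration and are not part of mmappickle:

```python
import pickle
import struct

def short_unicode(text):
    # SHORT_BINUNICODE opcode, one length byte, then the UTF-8 bytes.
    raw = text.encode("utf-8")
    return pickle.SHORT_BINUNICODE + bytes([len(raw)]) + raw

def hidden_int(value):
    # BININT pushes a 32-bit int, POP removes it again: invisible to the reader.
    return pickle.BININT + struct.pack("<i", value) + pickle.POP

# KEY, VALUE, memo max index, POP, TRUE, POP
entry = (short_unicode("key") + short_unicode("value")
         + hidden_int(0) + pickle.NEWTRUE + pickle.POP)
prefix = pickle.PROTO + b"\x04" + pickle.MARK
stream = bytearray(prefix + entry + pickle.DICT + pickle.STOP)

before = pickle.loads(bytes(stream))   # the entry is visible

# "Delete" the key in place: overwrite the one-byte NEWTRUE with a one-byte POP.
valid_offset = len(prefix) + len(entry) - 2
assert stream[valid_offset:valid_offset + 1] == pickle.NEWTRUE
stream[valid_offset:valid_offset + 1] = pickle.POP

after = pickle.loads(bytes(stream))    # key and value are popped off the stack
print(before, after)
```

The stream has the same length before and after the change, which is the property that matters for memory-mapped data.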
We can see that the file is composed of three different parts, which are documented below:
- The header (mmappickle.dict._header)
- The storage of each key-value pair (mmappickle.dict._kvdata)
- A terminator (mmappickle.dict._terminator)
Extending mmappickle
To add support for a new memory-mapped value type, create a new subclass of mmappickle.picklers.base.BasePickler.
This requires some knowledge of the Python internal pickle format, but should be straightforward, using the numpy picklers as inspiration. Feel free to open an issue if more details are required.
Internal API Documentation
class mmappickle.dict._header(mmapdict, _real_header_starts_at=0)
The file header is at the beginning of the file.
It consists of the following pickle ops:
    PROTO 4 (pickle version 4 header)
    FRAME <length>
    BININT <_file_version_number:32> POP (version of the pickle dict, 1)
    BININT <_file_commit_number:32> POP (commit id of the pickle dict, incremented every time something changes)
    <additional data depending on the _file_version_number> (none, for version 1)
    MARK (start of the dictionary)
- __init__(mmapdict, _real_header_starts_at=0)
  Parameters:
  - mmapdict – mmapdict object containing the data
  - _real_header_starts_at – Offset of the header (normally not used)
- exists
  Returns: True if file contains something
- commit_number
  Commit number (revision) in the file
- __weakref__
  list of weak references to the object (if defined)
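Since every header opcode has a fixed size, the version and commit number sit at fixed offsets. The following sketch rebuilds a version-1 header with the standard library and reads both fields back; build_header and parse_header are hypothetical helpers written for this illustration, not functions of mmappickle:

```python
import pickle
import struct

def build_header(version, commit):
    # BININT <version> POP, BININT <commit> POP, MARK: 13 bytes in total.
    body = (pickle.BININT + struct.pack("<i", version) + pickle.POP
            + pickle.BININT + struct.pack("<i", commit) + pickle.POP
            + pickle.MARK)
    # PROTO 4 (2 bytes), then FRAME with an 8-byte little-endian length.
    return pickle.PROTO + b"\x04" + pickle.FRAME + struct.pack("<Q", len(body)) + body

def parse_header(buf):
    # Offsets follow from the sizes above: 2 (PROTO) + 9 (FRAME) + 1 (BININT opcode) = 12.
    (version,) = struct.unpack_from("<i", buf, 12)
    (commit,) = struct.unpack_from("<i", buf, 18)
    return version, commit

header = build_header(version=1, commit=2)
fields = parse_header(header)

# Closing the stream directly after the header gives an empty dictionary.
empty = pickle.loads(header + pickle.DICT + pickle.STOP)
print(len(header), fields, empty)
```

The 24-byte length agrees with the disassembly earlier in this page, where MARK is at offset 23 and the first key-value frame starts at offset 24.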
class mmappickle.dict._terminator(mmapdict)
Terminator is the suffix at the end of the mmapdict file.
It consists of the following pickle ops:
    FRAME 2
    DICT (make the dictionary)
    STOP (end of the file)
- exists
  Returns: True if the file ends with the terminator, False otherwise
- __weakref__
  list of weak references to the object (if defined)
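The terminator is a constant 11-byte suffix, so testing for it amounts to a comparison of the last bytes. A minimal sketch, assuming the file content is available as a bytes-like object (ends_with_terminator is a hypothetical helper, not the library's implementation of exists):

```python
import pickle
import struct

# FRAME opcode + 8-byte little-endian length (2) + DICT + STOP = 11 bytes.
TERMINATOR = pickle.FRAME + struct.pack("<Q", 2) + pickle.DICT + pickle.STOP

def ends_with_terminator(buf):
    # True only if the buffer is long enough and its tail matches exactly.
    return len(buf) >= len(TERMINATOR) and bytes(buf[-len(TERMINATOR):]) == TERMINATOR

# Smallest stream that uses the terminator: PROTO 4, MARK, then the suffix.
complete = pickle.PROTO + b"\x04" + pickle.MARK + TERMINATOR
truncated = complete[:-1]

print(ends_with_terminator(complete), ends_with_terminator(truncated))
print(pickle.loads(complete))
```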
class mmappickle.dict._kvdata(mmapdict, offset)
kvdata is the structure holding a key-value data entry.
The trick is that it should be either two values, key and value, or nothing, if the value is deleted.
To do this, we put the key and the value on the stack. Then we either push a NEWTRUE+POP (which results in a NO-OP), or we push a POP+POP (which removes both the key and the value). Since NEWTRUE and POP both have length 1, it is easy to make the substitution.
Another trick is to cache the maximum value of the memoization index (for GET and PUT), to ensure that we have no duplicates.
The _kvdata structure has the following pickle ops:
    FRAME <length>
    SHORT_BINUNICODE <length> <key bytes>
    <<< data >>>
    BININT <max memo idx> POP (max memo index of this part)
    NEWTRUE|POP POP (if NEWTRUE POP: entry is valid, else entry is deactivated.)
- __init__(mmapdict, offset)
  Parameters:
  - mmapdict – mmapdict object containing the data
  - offset – Offset of the key-value data
- offset
  Returns: the offset in the file of the key-value data
- end_offset
  Returns: the end-offset in the file of the key-value data
- _frame_length
  Returns: the frame length for this _kvdata. This is done either by reading it in the file, or by computing it if it doesn’t exist
- _exists_initial
  Returns: True if the file contains the header of the frame
- data_offset
  Returns: the offset of the pickled data
- key_length
  Returns: the binary length of the key
- _valid_offset
  Returns: the offset of the valid byte
- _memomaxidx_offset
  Returns: the offset of the max memo index
- data_length
  Returns: the length of the pickled data
- key
  Returns: the key as a Unicode string
- memomaxidx
  Returns: the (cached) max memo index
- valid
  Returns: True if the key-value pair is valid, False otherwise (i.e. key was deleted)
- __weakref__
  list of weak references to the object (if defined)
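Given this layout, every offset of an entry can be derived from the frame start and the frame length. The sketch below rebuilds the 'key'/'value' entry from the disassembly earlier in this page and computes its offsets; build_kv and describe_kv are hypothetical helpers, and the dictionary keys only mirror the attribute names above without claiming to match what the library returns:

```python
import pickle
import struct

FRAME_HEADER = 9  # FRAME opcode (1 byte) + 8-byte little-endian length

def build_kv(key, pickled_value, memo_max_idx=0, valid=True):
    raw_key = key.encode("utf-8")
    body = (pickle.SHORT_BINUNICODE + bytes([len(raw_key)]) + raw_key
            + pickled_value
            + pickle.BININT + struct.pack("<i", memo_max_idx) + pickle.POP
            + (pickle.NEWTRUE if valid else pickle.POP) + pickle.POP)
    return pickle.FRAME + struct.pack("<Q", len(body)) + body

def describe_kv(buf, offset):
    (frame_length,) = struct.unpack_from("<Q", buf, offset + 1)
    end_offset = offset + FRAME_HEADER + frame_length
    key_length = buf[offset + FRAME_HEADER + 1]   # byte after SHORT_BINUNICODE
    key_start = offset + FRAME_HEADER + 2
    data_offset = key_start + key_length
    memo_offset = end_offset - 7                  # 4-byte int, then POP, flag, POP
    valid_offset = end_offset - 2                 # NEWTRUE or POP
    (memo_max_idx,) = struct.unpack_from("<i", buf, memo_offset)
    return {
        "key": buf[key_start:data_offset].decode("utf-8"),
        "end_offset": end_offset,
        "data_offset": data_offset,
        "data_length": (memo_offset - 1) - data_offset,  # stop before BININT opcode
        "memomaxidx": memo_max_idx,
        "valid_offset": valid_offset,
        "valid": buf[valid_offset:valid_offset + 1] == pickle.NEWTRUE,
    }

value_ops = pickle.SHORT_BINUNICODE + b"\x05value"
# 24 placeholder bytes stand in for the header, so the offsets match the listing.
buf = bytes(24) + build_kv("key", value_ops, memo_max_idx=1)
info = describe_kv(buf, 24)
print(info)
```

For this entry the computed values agree with the listing: data at offset 38, the validity byte at 51, and the next frame starting at 53.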
class mmappickle.picklers.base.BasePickler(parent_object)
Bases: object
Picklers will be attempted in decreasing priority order.
- is_valid(offset, length)
  Return True if object starting at offset in f is valid. File position is kept.
- read(offset, length)
  Return the unpickled object read from offset, and the length read. The file position is kept.
- write(obj, offset, memo_start_idx=0)
  Write the pickled object to the file stream; the file position is kept. Returns a tuple (number of bytes, last memo index).
- __weakref__
  list of weak references to the object (if defined)
class mmappickle.picklers.base.GenericPickler(parent_object)
Bases: mmappickle.picklers.base.BasePickler
- priority
  Priority of this pickler, an integer; picklers are attempted in decreasing priority order.
- is_valid(offset, length)
  Return True if object starting at offset in f is valid. File position is kept.
mmappickle.utils.save_file_position(f)
Decorator to save the object._file stream position before calling the method
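A decorator of this kind can be sketched as follows. This is not the library's code: keep_file_position and the Reader class are stand-ins, and restoring the position after the call is an assumption based on the "file position is kept" remarks in the pickler methods above:

```python
import functools
import io

def keep_file_position(method):
    """Stand-in decorator: remember self._file's position and restore it afterwards."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        position = self._file.tell()
        try:
            return method(self, *args, **kwargs)
        finally:
            # Restore the position even if the wrapped method raised.
            self._file.seek(position)
    return wrapper

class Reader:
    # Minimal object exposing a _file attribute, as the decorator expects.
    def __init__(self, payload):
        self._file = io.BytesIO(payload)

    @keep_file_position
    def peek(self, offset, length):
        self._file.seek(offset)
        return self._file.read(length)

reader = Reader(b"0123456789")
reader._file.seek(3)
chunk = reader.peek(5, 2)
position_after = reader._file.tell()
print(chunk, position_after)
```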