Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

decoder: Tuple cannot be used as dict key #50

Open
navytux opened this issue Sep 21, 2018 · 2 comments
Open

decoder: Tuple cannot be used as dict key #50

navytux opened this issue Sep 21, 2018 · 2 comments

Comments

@navytux
Copy link
Collaborator

navytux commented Sep 21, 2018

If tuple is used directly, for example:

In [11]: s = "((tNd."

In [19]: dis(s)
    0: (    MARK
    1: (        MARK
    2: t            TUPLE      (MARK at 1)
    3: N        NONE
    4: d        DICT       (MARK at 0)
    5: .    STOP
highest protocol among opcodes = 0

In [20]: loads(s)
Out[20]: {(): None}

our Decoder gives:

pickle: loadDict: invalid key type ogórek.Tuple

However the pickle is valid by Python rules because there tuple is immutable (and thus hashable).

However the check inside Decoder is not complete - for example if we hook Tuple into another type whos is comparable (e.g. Ref here):

In [11]: s = "((tQNd."

In [15]: dis(s)
    0: (    MARK
    1: (        MARK
    2: t            TUPLE      (MARK at 1)
    3: Q        BINPERSID
    4: N        NONE
    5: d        DICT       (MARK at 0)
    6: .    STOP
highest protocol among opcodes = 1

our Decoder crashes:

panic: runtime error: hash of unhashable type ogórek.Tuple
  
goroutine 1 [running]:
github.com/kisielk/og-rek.(*Decoder).loadDict(0xc00009efd0, 0xc00000a264, 0x0)
        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/ogorek.go:819 +0x32b
github.com/kisielk/og-rek.(*Decoder).Decode(0xc00009efd0, 0xc00009aa10, 0xc00009efd0, 0x464f39, 0x4)
        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/ogorek.go:242 +0x14fa
github.com/kisielk/og-rek.Fuzz(0x7f62339cc000, 0x6, 0x200000, 0x3)
        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/fuzz.go:15 +0xdf
go-fuzz-dep.Main(0x524df8)
        /tmp/go-fuzz-build914098789/goroot/src/go-fuzz-dep/main.go:49 +0xad
main.main()
        /tmp/go-fuzz-build914098789/gopath/src/github.com/kisielk/og-rek/go.fuzz.main/main.go:10 +0x2d
exit status 2
@navytux
Copy link
Collaborator Author

navytux commented Sep 23, 2018

About the second crash -- see #30 (comment) -- it seems likely we will have to rework our "value can be used as dict key" logic from reflect.Type.Comparable into using panic/recover.

navytux added a commit to navytux/og-rek that referenced this issue Sep 24, 2018
Even though we tried to catch whether dict keys are ok to be used via
reflect.TypeOf(key).Comparable() (see da5f034 "decoder: Fix crashes
found by fuzzer (kisielk#32)"), that turned out to be not enough. For example
if key is a struct, e.g. of the following type

	type Ref struct {
		Pid interface{}
	}

it will be comparable. But the comparision, depending on dynamic .Pid
type, might panic. This is what was actually cauht by fuzz-testing
recently:

	kisielk#50 (second part of the report)

So instead of recursively walking a key type and checking each subfield
with reflect.TypeOf().Comparable(), switch for using panic/recover for
detecting the "unhashable key" situation.

This slows down decoding a bit (only cumulative figure for all-test-vectors decoding):

	name          old time/op    new time/op    delta
	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)

but that is the price of correctness. And with manually recursively walking key
type I doubt it would be faster.

The defer overhead should be less once golang/go#14939 is fixed.

Updates: kisielk#30
kisielk pushed a commit that referenced this issue Sep 25, 2018
Even though we tried to catch whether dict keys are ok to be used via
reflect.TypeOf(key).Comparable() (see da5f034 "decoder: Fix crashes
found by fuzzer (#32)"), that turned out to be not enough. For example
if key is a struct, e.g. of the following type

	type Ref struct {
		Pid interface{}
	}

it will be comparable. But the comparision, depending on dynamic .Pid
type, might panic. This is what was actually cauht by fuzz-testing
recently:

	#50 (second part of the report)

So instead of recursively walking a key type and checking each subfield
with reflect.TypeOf().Comparable(), switch for using panic/recover for
detecting the "unhashable key" situation.

This slows down decoding a bit (only cumulative figure for all-test-vectors decoding):

	name          old time/op    new time/op    delta
	DecodeLong-4     361ns ± 0%     362ns ± 0%    ~     (p=0.238 n=5+4)
	Decode-4        93.2µs ± 0%    95.6µs ± 0%  +2.54%  (p=0.008 n=5+5)
	Encode-4        16.5µs ± 0%    16.6µs ± 0%    ~     (p=0.841 n=5+5)

but that is the price of correctness. And with manually recursively walking key
type I doubt it would be faster.

The defer overhead should be less once golang/go#14939 is fixed.

Updates: #30
navytux added a commit to navytux/og-rek that referenced this issue Aug 19, 2024
Ogórek currently represents unpickled dict via map[any]any, which is
logical, but also exhibits issues because builtin Go map behaviour is
different from Python's dict behaviour. For example:

- Python's dict allows tuples to be used in keys, while Go map
  does not (kisielk#50),

- Python's dict allows both long and int to be used interchangeable as
  keys, while Go map does not handle *big.Int as key with the same
  semantic (kisielk#55)

- Python's dict allows to use numbers interchangeable in keys - all int
  and float, but on Go side int(1) and float64(1.0) are considered by
  builtin map as different keys.

- In Python world bytestring (str from py2) is considered to be related
  to both unicode (str on py3) and bytes, but builtin map considers all
  string, Bytes and ByteString as different keys.

- etc...

All in all there are many differences in behaviour in builtin Python
dict and Go map that result in generally different semantics when
decoding pickled data. Those differences can be fixed only if we add
custom dict implementation that mirrors what Python does.

-> Do that: add custom Dict that implements key -> value mapping with
   mirroring Python behaviour.

For now we are only adding the Dict class itself and its tests.
Later we will use this new Dict to handle decoding dictionaries from the pickles.

For the implementation we use github.com/aristanetworks/gomap which
provides extraction of builtin go map code wrapped into generic type
Map[Key,Value] that accepts custom equal and hash functions. And it is
those equal and hash functions via which we define equality behaviour to
be the same as on Python side.

Please see documentation in the added code and tests for details.
navytux added a commit to navytux/og-rek that referenced this issue Aug 19, 2024
Similarly to StrictUnicode mode (see b28613c) add new opt-in mode that
requests to decode Python dicts as ogórek.Dict instead of builtin map.
As explained in recent patch "Add custom Dict that mirrors Python dict
behaviour" this is needed to fix decoding issues that can be there due
to different behaviour of Python dict and builtin Go map:

    ---- 8< ----
    Ogórek currently represents unpickled dict via map[any]any, which is
    logical, but also exhibits issues because builtin Go map behaviour is
    different from Python's dict behaviour. For example:

    - Python's dict allows tuples to be used in keys, while Go map
      does not (kisielk#50),

    - Python's dict allows both long and int to be used interchangeable as
      keys, while Go map does not handle *big.Int as key with the same
      semantic (kisielk#55)

    - Python's dict allows to use numbers interchangeable in keys - all int
      and float, but on Go side int(1) and float64(1.0) are considered by
      builtin map as different keys.

    - In Python world bytestring (str from py2) is considered to be related
      to both unicode (str on py3) and bytes, but builtin map considers all
      string, Bytes and ByteString as different keys.

    - etc...

    All in all there are many differences in behaviour in builtin Python
    dict and Go map that result in generally different semantics when
    decoding pickled data. Those differences can be fixed only if we add
    custom dict implementation that mirrors what Python does.

    -> Do that: add custom Dict that implements key -> value mapping with
       mirroring Python behaviour.

    For now we are only adding the Dict class itself and its tests.
    Later we will use this new Dict to handle decoding dictionaries from the pickles.
    ---- 8< ----

In this patch we add new Decoder option to activate PyDict mode
decoding, teach encoder to also support encoding of Dict and adjust
tests.

The behaviour of new system is explained by the following doc.go
excerpt:

    For dicts there are two modes. In the first, default, mode Python dicts are
    decoded into standard Go map. This mode tries to use builtin Go type, but
    cannot mirror py behaviour fully because e.g. int(1), big.Int(1) and
    float64(1.0) are all treated as different keys by Go, while Python treats
    them as being equal. It also does not support decoding dicts with tuple
    used in keys:

         dict    ↔  map[any]any                       PyDict=n mode, default
                 ←  ogórek.Dict

    With PyDict=y mode, however, Python dicts are decoded as ogórek.Dict which
    mirrors behaviour of Python dict with respect to keys equality, and with
    respect to which types are allowed to be used as keys.

         dict    ↔  ogórek.Dict                       PyDict=y mode
                 ←  map[any]any
@navytux navytux mentioned this issue Aug 19, 2024
navytux added a commit to navytux/og-rek that referenced this issue Sep 19, 2024
Ogórek currently represents unpickled dict via map[any]any, which is
logical, but also exhibits issues because builtin Go map behaviour is
different from Python's dict behaviour. For example:

- Python's dict allows tuples to be used in keys, while Go map
  does not (kisielk#50),

- Python's dict allows both long and int to be used interchangeable as
  keys, while Go map does not handle *big.Int as key with the same
  semantic (kisielk#55)

- Python's dict allows to use numbers interchangeable in keys - all int
  and float, but on Go side int(1) and float64(1.0) are considered by
  builtin map as different keys.

- In Python world bytestring (str from py2) is considered to be related
  to both unicode (str on py3) and bytes, but builtin map considers all
  string, Bytes and ByteString as different keys.

- etc...

All in all there are many differences in behaviour in builtin Python
dict and Go map that result in generally different semantics when
decoding pickled data. Those differences can be fixed only if we add
custom dict implementation that mirrors what Python does.

-> Do that: add custom Dict that implements key -> value mapping with
   mirroring Python behaviour.

For now we are only adding the Dict class itself and its tests.
Later we will use this new Dict to handle decoding dictionaries from the pickles.

For the implementation we use github.com/aristanetworks/gomap which
provides extraction of builtin go map code wrapped into generic type
Map[Key,Value] that accepts custom equal and hash functions. And it is
those equal and hash functions via which we define equality behaviour to
be the same as on Python side.

Please see documentation in the added code and tests for details.
navytux added a commit to navytux/og-rek that referenced this issue Sep 19, 2024
Similarly to StrictUnicode mode (see b28613c) add new opt-in mode that
requests to decode Python dicts as ogórek.Dict instead of builtin map.
As explained in recent patch "Add custom Dict that mirrors Python dict
behaviour" this is needed to fix decoding issues that can be there due
to different behaviour of Python dict and builtin Go map:

    ---- 8< ----
    Ogórek currently represents unpickled dict via map[any]any, which is
    logical, but also exhibits issues because builtin Go map behaviour is
    different from Python's dict behaviour. For example:

    - Python's dict allows tuples to be used in keys, while Go map
      does not (kisielk#50),

    - Python's dict allows both long and int to be used interchangeable as
      keys, while Go map does not handle *big.Int as key with the same
      semantic (kisielk#55)

    - Python's dict allows to use numbers interchangeable in keys - all int
      and float, but on Go side int(1) and float64(1.0) are considered by
      builtin map as different keys.

    - In Python world bytestring (str from py2) is considered to be related
      to both unicode (str on py3) and bytes, but builtin map considers all
      string, Bytes and ByteString as different keys.

    - etc...

    All in all there are many differences in behaviour in builtin Python
    dict and Go map that result in generally different semantics when
    decoding pickled data. Those differences can be fixed only if we add
    custom dict implementation that mirrors what Python does.

    -> Do that: add custom Dict that implements key -> value mapping with
       mirroring Python behaviour.

    For now we are only adding the Dict class itself and its tests.
    Later we will use this new Dict to handle decoding dictionaries from the pickles.
    ---- 8< ----

In this patch we add new Decoder option to activate PyDict mode
decoding, teach encoder to also support encoding of Dict and adjust
tests.

The behaviour of new system is explained by the following doc.go
excerpt:

    For dicts there are two modes. In the first, default, mode Python dicts are
    decoded into standard Go map. This mode tries to use builtin Go type, but
    cannot mirror py behaviour fully because e.g. int(1), big.Int(1) and
    float64(1.0) are all treated as different keys by Go, while Python treats
    them as being equal. It also does not support decoding dicts with tuple
    used in keys:

         dict    ↔  map[any]any                       PyDict=n mode, default
                 ←  ogórek.Dict

    With PyDict=y mode, however, Python dicts are decoded as ogórek.Dict which
    mirrors behaviour of Python dict with respect to keys equality, and with
    respect to which types are allowed to be used as keys.

         dict    ↔  ogórek.Dict                       PyDict=y mode
                 ←  map[any]any
@navytux
Copy link
Collaborator Author

navytux commented Sep 23, 2024

I think to the possible extent this issue is fixed by PyDict mode: 3bf6c92 (#75).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant