Skip to content

bpo-23493: json: Change sort_keys in Python encoder same to C#8131

Merged
methane merged 3 commits into
python:masterfrom
methane:json-sort-keys
Jul 6, 2018
Merged

bpo-23493: json: Change sort_keys in Python encoder same to C#8131
methane merged 3 commits into
python:masterfrom
methane:json-sort-keys

Conversation

@methane

@methane methane commented Jul 6, 2018

Copy link
Copy Markdown
Member

Stop using key=lambda idiom. This behavior is same to
C version encoder.

https://bugs.python.org/issue23493

methane added 2 commits July 6, 2018 16:48
Stop using key=lambda idiom.  This behavior is same to
C version encoder.
@methane methane added the performance Performance or resource usage label Jul 6, 2018

@serhiy-storchaka serhiy-storchaka left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirm that there is a benefit from this change.

@vstinner

vstinner commented Jul 6, 2018

Copy link
Copy Markdown
Member

Is it possible that sorted() compares two values and that the comparison raises a type error with this change? (Is it possible to have a dictionary with two keys with have the same order?)

@vstinner vstinner left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM since the C code calls list(dct.items()).sort() (no key argument).

items = list(dct.items())
items.sort()

and

items = sorted(dct.items())

are equivalent and sorted() is shorter, so LGTM :-)

@methane

methane commented Jul 6, 2018

Copy link
Copy Markdown
Member Author

Is it possible that sorted() compares two values and that the comparison raises a type error with this change? (Is it possible to have a dictionary with two keys with have the same order?)

It happens with C encoder.

$ python3
Python 3.7.0 (default, Jun 28 2018, 15:11:21)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class D(dict):
...     def items(self):
...         return [('foo', 42), ('foo', 'bar')]
...
>>> d = D({'foo': None})
>>> d
{'foo': None}
>>> import json
>>> json.dumps(d)
'{"foo": 42, "foo": "bar"}'
>>> json.dumps(d, sort_keys=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/__init__.py", line 238, in dumps
    **kw).encode(obj)
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/inada-n/pyenv/versions/3.7.0/lib/python3.7/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: '<' not supported between instances of 'str' and 'int'

>>> json.encoder.c_make_encoder = None
>>> json.dumps(d)
'{"foo": 42, "foo": "bar"}'

@tirkarthi

Copy link
Copy Markdown
Member

Some numbers with ./configure --enable-optimizations

Before patch :

cpython git:(master)   rlwrap ./python
Python 3.8.0a0 (heads/master:c929df3, Jul  6 2018, 11:38:33)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3216385352425277
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.38790418207645416
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3015676769427955
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3441896280273795

After patch :

➜  cpython git:(master) ✗ rlwrap ./python
Python 3.8.0a0 (heads/master-dirty:c929df3, Jul  6 2018, 11:44:57)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3076137569732964
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3111478630453348
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3163580968976021
>>> timeit.timeit('sorted(d.items())', setup='d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3175586462020874

>>> # itemgetter
>>> import operator
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.37424366315826774
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.37700869189575315
>>> timeit.timeit('sorted(d.items(), key=operator.itemgetter(0))', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.3674369710497558

Little confused that the original issue proposed operator.itemgetter which was the fastest while here it's the slowest. Any information on the difference between the proposed change vs the actual one ? Maybe things have changed from Python 3.4 when it was initially posted and key was removed in the later versions?

Thanks

@methane

methane commented Jul 6, 2018

Copy link
Copy Markdown
Member Author

https://tools.ietf.org/html/rfc7159#section-4
RFC says "The names within an object SHOULD be unique."

@methane

methane commented Jul 6, 2018

Copy link
Copy Markdown
Member Author

@tirkarthi You compared sorted(d.items()) with sorted(d.items(), key=itemgetter(0)).
Original issue compared sorted(d.items(), key=lambda x: x[0]) with sorted(d.items(), key=itemgetter(0))

@vstinner

vstinner commented Jul 6, 2018

Copy link
Copy Markdown
Member
>>> class D(dict):
...     def items(self):
...         return [('foo', 42), ('foo', 'bar')]

Honestly, if you use such weird dictionary, you are going to get worse issues than TypeError on sorting, no?

Ignore my question, your PR is fine.

@tirkarthi

Copy link
Copy Markdown
Member

Thanks @methane missed it. Makes sense now

➜  cpython git:(master) ✗ rlwrap ./python
Python 3.8.0a0 (heads/master-dirty:c929df3, Jul  6 2018, 11:44:57)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import timeit
>>> import json
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.41919384989887476
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.4452404109761119
>>> timeit.timeit('sorted(d.items(), key=lambda x: x[0])', setup='import operator; d = dict.fromkeys(map(str, range(1000)))', number=1000)
0.42263348307460546

@vstinner

vstinner commented Jul 6, 2018

Copy link
Copy Markdown
Member

Some numbers with ./configure --enable-optimizations (...) import timeit

Noooo, please! If you spend hours to compile Python with PGO, don't use the timeit module which is not reliable! Use perf timeit:
http://perf.readthedocs.io/en/latest/cli.html#timeit-versus-perf-timeit

Your results have a large deviation:

0.3216385352425277
0.38790418207645416
0.3015676769427955
0.3441896280273795

perf timeit should help to reduce the deviation. My doc to get reliable benchmarks:
http://perf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproductible-benchmark-results

@tirkarthi

tirkarthi commented Jul 6, 2018

Copy link
Copy Markdown
Member

Thanks @vstinner . The after patch doesn't make sense and I misread the posted benchmark that it's related to sorted function instead of related to JSON module and was comparing sorted function. Sorry for the noise.

@methane

methane commented Jul 6, 2018

Copy link
Copy Markdown
Member Author

Even with perf, performance of json.dumps(d, sort_keys=True, indent=" ") is unstable.

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
Mean +- std dev: 64.7 us +- 4.5 us

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (8.89 us) is 13% of the mean (68.1 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m perf system tune' command to reduce the system jitter.
Use perf stats, perf dump and perf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 68.1 us +- 8.9 us

$ ./python -m perf timeit -s 'import json; d = {"k%d" % i: True for i in range(100)}' -- 'json.dumps(d, indent=" ", sort_keys=True)'
.....................
WARNING: the benchmark result may be unstable
* the standard deviation (8.02 us) is 12% of the mean (67.7 us)

Try to rerun the benchmark with more runs, values and/or loops.
Run 'python -m perf system tune' command to reduce the system jitter.
Use perf stats, perf dump and perf hist to analyze results.
Use --quiet option to hide these warnings.

Mean +- std dev: 67.7 us +- 8.0 us

It seems sorting is not bottleneck. We can close the issue without merging this.

@methane methane closed this Jul 6, 2018
@methane methane deleted the json-sort-keys branch July 6, 2018 12:26
@vstinner

vstinner commented Jul 6, 2018

Copy link
Copy Markdown
Member

It seems sorting is not bottleneck. We can close the issue without merging this.

I suggest to merge this change anyway, to respect the PEP 399: Python and C implementation must behave the same. Currently, the Python implementation is not the default but behaves differently :-(

@methane methane restored the json-sort-keys branch July 6, 2018 17:15
@methane methane added skip news and removed performance Performance or resource usage labels Jul 6, 2018
@methane methane reopened this Jul 6, 2018
@methane methane changed the title bpo-23493: json: Optimize sort_keys in Python encoder bpo-23493: json: Make sort_keys in Python encoder same to C Jul 6, 2018
@methane methane changed the title bpo-23493: json: Make sort_keys in Python encoder same to C bpo-23493: json: Change sort_keys in Python encoder same to C Jul 6, 2018
@methane methane merged commit e25399b into python:master Jul 6, 2018
@methane methane deleted the json-sort-keys branch July 6, 2018 23:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants