Skip to content

bpo-40688: Use the correct parser in the peg_generator scripts#20235

Merged
pablogsal merged 8 commits into
python:masterfrom
lysnikolaou:fix-peg-scripts
May 25, 2020
Merged

bpo-40688: Use the correct parser in the peg_generator scripts#20235
pablogsal merged 8 commits into
python:masterfrom
lysnikolaou:fix-peg-scripts

Conversation

@lysnikolaou

@lysnikolaou lysnikolaou commented May 19, 2020

Copy link
Copy Markdown
Member

The scripts in Tools/peg_generator/scripts mostly assume that
ast.parse and compile use the old parser, since this was the
state of things, while we were developing them. They need to be
updated to always use the correct parser. _peg_parser is being
extended to support both parsing and compiling with both parsers.

Also deleted the parse_file function from the _peg_parser extension
module, since it wasn't up-to-date with all the things parse_string
supported.

https://bugs.python.org/issue40688

The scripts in `Tools/peg_generator/scripts` mostly assume that
`ast.parse` and `compile` use the old parser, since this was the
state of things, while we were developing them. They need to be
updated to always use the correct parser. `_peg_parser` is being
extended to support both parsing and compiling with both parsers.

@gvanrossum gvanrossum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I don’t love that the return type varies based on a flag value.

@lysnikolaou

Copy link
Copy Markdown
Member Author

Hm, I don’t love that the return type varies based on a flag value.

Would you prefer having parse_string and compile_string?

@gvanrossum

Copy link
Copy Markdown
Member

Yes, that’s nice.

@gvanrossum gvanrossum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bad feeling about the logic in test_parse_directory.py -- it seems to be getting worse every time we think of something new... :-(

Comment thread Modules/_peg_parser.c Outdated
Comment thread Modules/_peg_parser.c Outdated
Comment thread Modules/_peg_parser.c Outdated
Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated
Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated
Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated
@lysnikolaou

Copy link
Copy Markdown
Member Author

I have a bad feeling about the logic in test_parse_directory.py -- it seems to be getting worse every time we think of something new... :-(

That's definitely true. Would rewriting it to be specific to testing again the stdlib be a good idea in your opinion? And maybe some of the arguments can also be removed, since they're no longer really relevant and do not need to be tested.

@lysnikolaou lysnikolaou reopened this May 20, 2020
@lysnikolaou

Copy link
Copy Markdown
Member Author

Closed and reopened for the CI to re-run.

Comment thread Tools/peg_generator/Makefile Outdated
Comment on lines 78 to 84
time_stdlib: time_stdlib_compile

time_stdlib_compile: venv peg_extension/parse.c data/xxl.py
time_stdlib_compile: venv data/xxl.py
$(VENVPYTHON) scripts/benchmark.py --parser=cpython --target=xxl compile

time_stdlib_parse: venv peg_extension/parse.c data/xxl.py
time_stdlib_parse: venv data/xxl.py
$(VENVPYTHON) scripts/benchmark.py --parser=cpython --target=xxl parse

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the name stdlib in these targets horribly confusing. Maybe we can change this to old? Assuming these targets can be made to always use the old parser, and the default targets can be made to always use the new parser. (The perf results I reported last week may have been wrong due to this confusion.)

FWIW I'm not too keen on test_local and test_global below either. (I had somehow thought that time_stdlib ran what is actually called test_global.)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed time_stdlib* to time_old*, test_local to time_peg_dir and test_global to time_stdlib. What do you think?

Comment on lines -44 to -46
command_check = subcommands.add_parser(
"check", help="Benchmark parsing and throwing the tree away"
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting this broke the make bench target.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I deleted the make bench target, since if we were to change it to something like $(VENVPYTHON) scripts/benchmark.py --parser=pegen --target=stdlib parse, it would be the same thing with time_stdlib (the new one).

Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated
Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated
Comment thread Tools/peg_generator/scripts/test_pypi_packages.py Outdated
@lysnikolaou lysnikolaou requested a review from gvanrossum May 23, 2020 10:19
@lysnikolaou

lysnikolaou commented May 24, 2020

Copy link
Copy Markdown
Member Author

I pushed fixes for bugs in test_parse_directory, which prevented it from timing parsing correctly. In my local machine, it now needs 24.544 seconds (mean from three trials), while the all-time minimum was around 5.5 seconds. Could you maybe run the timings as well? Should we be alarmed?

@gvanrossum

Copy link
Copy Markdown
Member

Let me investigate later today.

@pablogsal

pablogsal commented May 24, 2020

Copy link
Copy Markdown
Member

I pushed fixes for bugs in test_parse_directory, which prevented it from timing parsing correctly. In my local machine, it now needs 24.544 seconds (mean from three trials), while the all-time minimum was around 5.5 seconds. Could you maybe run the timings as well? Should we be alarmed?

I did some quick experiment comparing commit c5fc156 with the current master (I copied the version of test_parse_directory.py in this PR to the root to run always the same code):

Commit c5fc156

$ ./python test_parse_directory.py --grammar-file Grammar/python.gram --tokens-file Grammar/Tokens -d Lib

Checked 1,689 files, 772,894 lines, 28,464,113 bytes in 4.559 seconds.
That's 169,532 lines/sec, or 6,243,503 bytes/sec.

Current master

$ ./python test_parse_directory.py --grammar-file Grammar/python.gram --tokens-file Grammar/Tokens -d Lib

Checked 1,707 files, 779,893 lines, 28,702,650 bytes in 4.837 seconds.
That's 161,223 lines/sec, or 5,933,523 bytes/sec.

So is a bit slower than it used to be but certainly not that slow as the readings you are getting. Are you by any change using a debug build of Python?

@lysnikolaou

lysnikolaou commented May 24, 2020

Copy link
Copy Markdown
Member Author

Are you by any change using a debug build of Python?

That must have been it. I'm not back down to around 6 seconds. I was quite sure I was using the correct Python, but it turns out I was using a debug build.

FWIW I think the minor slow-down is reasonable because of all the invalid_* rules and that's why I've been trying to add cut points to the rules, whose last alternative is invalid_*.

Sorry for the noise.

@pablogsal

Copy link
Copy Markdown
Member

FWIW I think the minor slow-down is reasonable because of all the invalid_* rules and that's why I've been trying to add cut points to the rules, whose last alternative is invalid_*.

It may also be related to the more correct error handing we have been adding since them. Btw, last time I benchmarked the parser a couple of weeks ago I learned that significant time is spent on the linked list for inspecting the memorization cache (checking if a token is currently memorized).

@gvanrossum

Copy link
Copy Markdown
Member

There's no reason to expect the parser to be slower due to this PR -- if it reports different numbers it must be due to a change in the test code.

I happen to have a debug build, and I do see significant slower performance reported with this PR than without it. Current master:

$ time make test_global
../../python.exe scripts/test_parse_directory.py \
		--grammar-file ../../Grammar/python.gram \
		--tokens-file ../../Grammar/Tokens \
		-d ../../Lib \
		--short \
		--exclude "*/test2to3/*" \
		--exclude "*/test2to3/**/*" \
		--exclude "*/bad*" \
		--exclude "*/lib2to3/tests/data/*"
../../Lib/test/test_modulefinder.py:281: DeprecationWarning: invalid escape sequence \u
  b"""\
Checked 1,687 files, 775,179 lines, 28,545,376 bytes in 12.791 seconds.
That's 60,604 lines/sec, or 2,231,688 bytes/sec.

real	0m17.172s
user	0m16.196s
sys	0m0.514s

With this PR, and current master merged back in:

$ time make time_stdlib
./venv/bin/python scripts/test_parse_directory.py \
		--grammar-file ../../Grammar/python.gram \
		--tokens-file ../../Grammar/Tokens \
		-d ../../Lib \
		--short \
		--exclude "*/test2to3/*" \
		--exclude "*/test2to3/**/*" \
		--exclude "*/bad*" \
		--exclude "*/lib2to3/tests/data/*"
../../Lib/test/test_modulefinder.py:281: DeprecationWarning: invalid escape sequence \u
  b"""\
Checked 1,687 files, 775,179 lines, 28,545,376 bytes in 23.936 seconds.
That's 32,386 lines/sec, or 1,192,577 bytes/sec.
Memory stats:
  rss         :         65 MiB
  vms         :       4251 MiB
  maxrss      :         74 MiB

real	0m26.537s
user	0m25.388s
sys	0m0.457s

I haven't looked in the code to try and explain the difference yet. I'll see what the difference is for a non-debug build next.

@gvanrossum

Copy link
Copy Markdown
Member

Hm, without debug mode it's still a significant different with the old measuring tool.

Master:

$ time make test_global
../../python.exe scripts/test_parse_directory.py \
		--grammar-file ../../Grammar/python.gram \
		--tokens-file ../../Grammar/Tokens \
		-d ../../Lib \
		--short \
		--exclude "*/test2to3/*" \
		--exclude "*/test2to3/**/*" \
		--exclude "*/bad*" \
		--exclude "*/lib2to3/tests/data/*"
Checked 1,687 files, 775,179 lines, 28,545,376 bytes in 2.969 seconds.
That's 261,064 lines/sec, or 9,613,474 bytes/sec.

real	0m10.069s
user	0m9.451s
sys	0m0.430s

This PR, with master merged:

$ time make time_stdlib
./venv/bin/python scripts/test_parse_directory.py \
		--grammar-file ../../Grammar/python.gram \
		--tokens-file ../../Grammar/Tokens \
		-d ../../Lib \
		--short \
		--exclude "*/test2to3/*" \
		--exclude "*/test2to3/**/*" \
		--exclude "*/bad*" \
		--exclude "*/lib2to3/tests/data/*"
Checked 1,687 files, 775,179 lines, 28,545,376 bytes in 5.705 seconds.
That's 135,877 lines/sec, or 5,003,569 bytes/sec.
Memory stats:
  rss         :         75 MiB
  vms         :       4234 MiB
  maxrss      :         77 MiB

real	0m7.006s
user	0m6.089s
sys	0m0.343s

So nearly double.

Now I'll go read the source.

@gvanrossum gvanrossum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh.

short,
None,
0,
1,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. This seems the problem: by passing mode=1 here, we will call parse_string() (on line 174) which returns an AST. This requires converting the C-level AST (which is allocated using an arena) to a Python-usable AST (built out of Python objects) and that's known to be much slower. Originally this passed mode=0, which meant "don't return anything". It seems we no longer have that option, which is fine, but then we should at least pass mode=2.

Again, we should refactor a bunch of stuff here so it's more readable.

@lysnikolaou lysnikolaou May 24, 2020

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, we should refactor a bunch of stuff here so it's more readable.

I've been working on this. I'll open a new PR for the refactorings, after we've merged this one, if you're okay with it.

Comment thread Tools/peg_generator/scripts/test_parse_directory.py Outdated

@gvanrossum gvanrossum left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let 'er rip!

@lysnikolaou

Copy link
Copy Markdown
Member Author

Can we merge this, so that I can merge it into the branch with the script refactorings?

@pablogsal pablogsal merged commit 9645930 into python:master May 25, 2020
@miss-islington

Copy link
Copy Markdown
Contributor

Thanks @lysnikolaou for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.9.
🐍🍒⛏🤖

@pablogsal

Copy link
Copy Markdown
Member

Can we merge this, so that I can merge it into the branch with the script refactorings?

Granted ✨

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request May 25, 2020
…nGH-20235)

The scripts in `Tools/peg_generator/scripts` mostly assume that
`ast.parse` and `compile` use the old parser, since this was the
state of things, while we were developing them. They need to be
updated to always use the correct parser. `_peg_parser` is being
extended to support both parsing and compiling with both parsers.
(cherry picked from commit 9645930)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
@bedevere-bot

Copy link
Copy Markdown

GH-20396 is a backport of this pull request to the 3.9 branch.

@lysnikolaou lysnikolaou deleted the fix-peg-scripts branch May 25, 2020 19:54
miss-islington added a commit that referenced this pull request May 25, 2020
)

The scripts in `Tools/peg_generator/scripts` mostly assume that
`ast.parse` and `compile` use the old parser, since this was the
state of things, while we were developing them. They need to be
updated to always use the correct parser. `_peg_parser` is being
extended to support both parsing and compiling with both parsers.
(cherry picked from commit 9645930)

Co-authored-by: Lysandros Nikolaou <lisandrosnik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants