From 96d2c9115158c7cd4d7635ae93515f5855f76791 Mon Sep 17 00:00:00 2001 From: Nick Coghlan Date: Fri, 8 Jun 2018 23:46:24 +1000 Subject: [PATCH 1/5] bpo-33409: Clarify PEP 538/540 relationship While locale coercion and UTF-8 mode turned out to be complementary ideas rather than competing ones, it isn't immediately obvious why it's useful to have both, or how they interact at runtime. This updates both the Python 3.7 What's New doc and the PYTHONCOERCECLOCALE and PYTHONUTF8 documentation in an attempt to clarify that relationship. --- Doc/using/cmdline.rst | 29 +++++++++++++++++-- Doc/whatsnew/3.7.rst | 29 +++++++++++++++---- .../2018-06-08-23-46-01.bpo-33409.r4z9MM.rst | 2 ++ 3 files changed, 52 insertions(+), 8 deletions(-) create mode 100644 Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index e72dea907580270..c8b7b57b5c8686c 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -439,7 +439,7 @@ Miscellaneous options ``True`` * ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the - UTF-8 mode. + UTF-8 mode. See :envvar:`PYTHONUTF8` for more details. It also allows passing arbitrary values and retrieving them through the :data:`sys._xoptions` dictionary. @@ -819,6 +819,16 @@ conflict. activates, or else if a locale that *would* have triggered coercion is still active when the Python runtime is initialized. + Note that setting ``LC_ALL`` also implicitly turns of locale coercion, as + that setting will always override any ``LC_CTYPE`` setting the interpreter + may provide. + + Also note that even when locale coercion is disabled, or when it fails to + find a suitable target locale, :envvar:`PYTHONUTF8` will still activate by + default in legacy ASCII-based C locales. Both features must be disabled in + order to force the interpreter to use ``ASCII`` instead of ``UTF-8`` for + system interfaces. + Availability: \*nix .. 
versionadded:: 3.7 @@ -834,10 +844,23 @@ conflict. .. envvar:: PYTHONUTF8 - If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8 - mode. Any other non-empty string cause an error. + If set to ``1``, enable the UTF-8 mode, such that the interpreter uses + ``UTF-8`` as the text encoding for system interfaces, regardless of the + current locale setting. If set to ``0``, disable the UTF-8 mode. Any + other non-empty string causes an error during interpreter initialisation. + + If this environment variable is not set, then the interpreter defaults to + using the current locale settings, *unless* that setting is identified as + a legacy C locale (as descibed for :envvar:`PYTHONCOERCECLOCALE`), and locale + coercion is either disabled or fails. In such legacy ``ASCII``-based locales, + the interpreter will default to enabling UTF-8 mode. + + Also available as the :option:`-X` ``utf8`` option. + + Availability: \*nix .. versionadded:: 3.7 + See :pep:`540` for more details. Debug-mode variables diff --git a/Doc/whatsnew/3.7.rst b/Doc/whatsnew/3.7.rst index 9a6f542ec479ec9..5d739b1da288663 100644 --- a/Doc/whatsnew/3.7.rst +++ b/Doc/whatsnew/3.7.rst @@ -97,9 +97,10 @@ Significant improvements in the standard library: CPython implementation improvements: +* Avoiding the use of ASCII as a default text encoding: + * :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion + * :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode * :ref:`PEP 552 `, deterministic .pycs -* :ref:`PEP 538 <whatsnew37-pep538>`, legacy C locale coercion -* :ref:`PEP 540 <whatsnew37-pep540>`, forced UTF-8 runtime mode * :ref:`the new development runtime mode ` * :ref:`PEP 565 `, improved :exc:`DeprecationWarning` handling @@ -184,7 +185,8 @@ PEP 538: Legacy C Locale Coercion An ongoing challenge within the Python 3 series has been determining a sensible default strategy for handling the "7-bit ASCII" text encoding assumption -currently implied by the use of the default C locale on non-Windows platforms. 
+currently implied by the use of the default C or POSIX locale on non-Windows +platforms. :pep:`538` updates the default interpreter command line interface to automatically coerce that locale to an available UTF-8 based locale as @@ -209,6 +211,14 @@ locale related integration problems, explicit warnings (emitted directly on This setting will also cause the Python runtime to emit a warning if the legacy C locale remains active when the core interpreter is initialized. +While :pep:`538`'s locale coercion has the benefit of also affecting extension +modules (such as GNU ``readline``), as well as child processes (including those +running non-Python applications and older versions of Python), it has the +downside of requiring that a suitable target locale be present on the running +system. To better handle the case where no suitable target locale is available +(as occurs on RHEL/CentOS 7, for example), Python 3.7 also implements +:ref:`whatsnew37-pep540`. + .. seealso:: :pep:`538` -- Coercing the legacy C locale to a UTF-8 based locale @@ -231,8 +241,17 @@ The forced UTF-8 mode can be used to change the text handling behavior in an embedded Python interpreter without changing the locale settings of an embedding application. -The UTF-8 mode is enabled by default when the locale is "C". See -:ref:`whatsnew37-pep538` for details. +While :pep:`540`'s UTF-8 mode has the benefit of working regardless of which +locales are available on the running system, it has the downside of having no +effect on extension modules, child processes running non-Python applications, +and child processes running older versions of Python. To reduce the risk of +corrupting text data when communicating with such components, Python 3.7 also +implements :ref:`whatsnew37-pep540`). 
+ +The UTF-8 mode is enabled by default when the locale is "C", and the :pep:`538` +locale coercion feature fails to change it to a UTF-8 based alternative +(whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set, +``LC_ALL`` being set, or the lack of a suitable target locale). .. seealso:: diff --git a/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst b/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst new file mode 100644 index 000000000000000..5b1a018df55ae14 --- /dev/null +++ b/Misc/NEWS.d/next/Documentation/2018-06-08-23-46-01.bpo-33409.r4z9MM.rst @@ -0,0 +1,2 @@ +Clarified the relationship between PEP 538's PYTHONCOERCECLOCALE and PEP +540's PYTHONUTF8 mode. From 9e893261a040b15883946ffe795fd95b9bc3a1a7 Mon Sep 17 00:00:00 2001 From: Nick Coghlan Date: Sat, 9 Jun 2018 12:20:02 +1000 Subject: [PATCH 2/5] Further doc updates - mention GNU readline in the PEP 540 What's New section - mention that UTF-8 mode changes the stdin and stdout error handlers - improve wording consistency between the PYTHONCOERCECLOCALE and PYTHONUTF8 docs when they cover the same thing (mostly related to legacy locale detection and setting the standard stream error handler) - move the reference to LC_ALL turning off locale coercion into the description of how locale coercion activates in the first place - port the full description of the UTF-8 mode behaviour changes from PEP 540 into the PYTHONUTF8 documentation --- Doc/using/cmdline.rst | 84 +++++++++++++++++++++++++++++-------------- Doc/whatsnew/3.7.rst | 18 +++++----- 2 files changed, 67 insertions(+), 35 deletions(-) diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index c8b7b57b5c8686c..40580590df065ce 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -789,14 +789,16 @@ conflict. .. 
envvar:: PYTHONCOERCECLOCALE If set to the value ``0``, causes the main Python command line application - to skip coercing the legacy ASCII-based C locale to a more capable UTF-8 - based alternative. + to skip coercing the legacy ASCII-based C and POSIX locales to a more + capable UTF-8 based alternative. - If this variable is *not* set, or is set to a value other than ``0``, and - the current locale reported for the ``LC_CTYPE`` category is the default - ``C`` locale, then the Python CLI will attempt to configure the following - locales for the ``LC_CTYPE`` category in the order listed before loading the - interpreter runtime: + If this variable is *not* set (or is set to a value other than ``0``) the + ``LC_ALL`` locale override environment variable is also not set, and the + current locale reported for the ``LC_CTYPE`` category is either the default + ``C`` locale, or else the explicitly ASCII-based ``POSIX`` locale, then the + Python CLI will attempt to configure the following locales for the + ``LC_CTYPE`` category in the order listed before loading the interpreter + runtime: * ``C.UTF-8`` * ``C.utf8`` @@ -807,25 +809,23 @@ conflict. environment before the Python runtime is initialized. This ensures the updated setting is seen in subprocesses, as well as in operations that query the environment rather than the current C locale (such as Python's - own :func:`locale.getdefaultlocale`). + own :func:`locale.getdefaultlocale`, or the GNU `readline` library). Configuring one of these locales (either explicitly or via the above - implicit locale coercion) will automatically set the error handler for - :data:`sys.stdin` and :data:`sys.stdout` to ``surrogateescape``. This - behavior can be overridden using :envvar:`PYTHONIOENCODING` as usual. 
+ implicit locale coercion) automatically enables the ``surrogateescape`` + :ref:`error handler ` for :data:`sys.stdin` and + :data:`sys.stdout` (:data:`sys.stderr` continues to use ``backslashreplace`` + as it does in any other locale). This stream handling behavior can be + overridden using :envvar:`PYTHONIOENCODING` as usual. For debugging purposes, setting ``PYTHONCOERCECLOCALE=warn`` will cause Python to emit warning messages on ``stderr`` if either the locale coercion activates, or else if a locale that *would* have triggered coercion is still active when the Python runtime is initialized. - Note that setting ``LC_ALL`` also implicitly turns of locale coercion, as - that setting will always override any ``LC_CTYPE`` setting the interpreter - may provide. - Also note that even when locale coercion is disabled, or when it fails to find a suitable target locale, :envvar:`PYTHONUTF8` will still activate by - default in legacy ASCII-based C locales. Both features must be disabled in + default in legacy ASCII-based locales. Both features must be disabled in order to force the interpreter to use ``ASCII`` instead of ``UTF-8`` for system interfaces. @@ -844,16 +844,48 @@ conflict. .. envvar:: PYTHONUTF8 - If set to ``1``, enable the UTF-8 mode, such that the interpreter uses - ``UTF-8`` as the text encoding for system interfaces, regardless of the - current locale setting. If set to ``0``, disable the UTF-8 mode. Any - other non-empty string causes an error during interpreter initialisation. - - If this environment variable is not set, then the interpreter defaults to - using the current locale settings, *unless* that setting is identified as - a legacy C locale (as descibed for :envvar:`PYTHONCOERCECLOCALE`), and locale - coercion is either disabled or fails. In such legacy ``ASCII``-based locales, - the interpreter will default to enabling UTF-8 mode. 
+ If set to ``1``, enables the interpreter's UTF-8 mode, where ``UTF-8`` is + used as the text encoding for system interfaces, regardless of the + current locale setting. + + This means that: + + * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'``. + * :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale + encoding is ignored, and the function's ``do_setlocale`` parameter has no + effect). + * :data:`sys.stdin`, :data:`sys.stdout`, and :data:`sys.stderr` all use + UTF-8 as their text encoding, with the ``surrogateescape`` + :ref:`error handler ` being enabled for :data:`sys.stdin` + and :data:`sys.stdout` (:data:`sys.stderr` continues to use + ``backslashreplace`` as it does in the default locale-aware mode). + + As a consequence of the changes in those lower level APIs, other higher + level APIs also exhibit different default behaviours: + + * Command line arguments, environment variables and filenames are decoded + to text using the UTF-8 encoding. + * :func:`os.fsdecode()` and :func:`os.fsencode()` use the UTF-8 encoding. + * :func:`open()`, :func:`io.open()`, and :func:`codecs.open()` use the UTF-8 + encoding by default. However, they still use the strict error handler by + default so that attempting to open a binary file in text mode is likely + to raise an exception rather than producing nonsense data. + + Note that the standard stream settings in UTF-8 mode can be overridden by + :envvar:`PYTHONIOENCODING` (just as they can be in the default locale-aware + mode). + + If set to ``0``, the interpreter runs in its default locale-aware mode. + + Setting any other non-empty string causes an error during interpreter + initialisation. + + If this environment variable is not set at all, then the interpreter defaults + to using the current locale settings, *unless* the current locale is + identified as a legacy ASCII-based locale + (as described for :envvar:`PYTHONCOERCECLOCALE`), and locale coercion is + either disabled or fails. 
In such legacy locales, the interpreter will + default to enabling UTF-8 mode unless explicitly instructed not to do so. Also available as the :option:`-X` ``utf8`` option. diff --git a/Doc/whatsnew/3.7.rst b/Doc/whatsnew/3.7.rst index 5d739b1da288663..2bf807110d53044 100644 --- a/Doc/whatsnew/3.7.rst +++ b/Doc/whatsnew/3.7.rst @@ -207,7 +207,7 @@ continues to be ``backslashreplace``, regardless of locale. Locale coercion is silent by default, but to assist in debugging potentially locale related integration problems, explicit warnings (emitted directly on -:data:`~sys.stderr` can be requested by setting ``PYTHONCOERCECLOCALE=warn``. +:data:`~sys.stderr`) can be requested by setting ``PYTHONCOERCECLOCALE=warn``. This setting will also cause the Python runtime to emit a warning if the legacy C locale remains active when the core interpreter is initialized. @@ -243,14 +243,14 @@ an embedding application. While :pep:`540`'s UTF-8 mode has the benefit of working regardless of which locales are available on the running system, it has the downside of having no -effect on extension modules, child processes running non-Python applications, -and child processes running older versions of Python. To reduce the risk of -corrupting text data when communicating with such components, Python 3.7 also -implements :ref:`whatsnew37-pep540`). - -The UTF-8 mode is enabled by default when the locale is "C", and the :pep:`538` -locale coercion feature fails to change it to a UTF-8 based alternative -(whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set, +effect on extension modules (such as GNU ``readline``), child processes running +non-Python applications, and child processes running older versions of Python. +To reduce the risk of corrupting text data when communicating with such +components, Python 3.7 also implements :ref:`whatsnew37-pep538`. 
+ +The UTF-8 mode is enabled by default when the locale is ``C`` or ``POSIX``, and +the :pep:`538` locale coercion feature fails to change it to a UTF-8 based +alternative (whether that failure is due to ``PYTHONCOERCECLOCALE=0`` being set, ``LC_ALL`` being set, or the lack of a suitable target locale). .. seealso:: From 1771b6293dc1e9ba66ae86a5cf405069707b81de Mon Sep 17 00:00:00 2001 From: Nick Coghlan Date: Sat, 9 Jun 2018 13:33:43 +1000 Subject: [PATCH 3/5] Fix markup --- Doc/using/cmdline.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index 40580590df065ce..f5a7e1b0ddac6bb 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -809,7 +809,7 @@ conflict. environment before the Python runtime is initialized. This ensures the updated setting is seen in subprocesses, as well as in operations that query the environment rather than the current C locale (such as Python's - own :func:`locale.getdefaultlocale`, or the GNU `readline` library). + own :func:`locale.getdefaultlocale`, or the GNU ``readline`` library). 
Configuring one of these locales (either explicitly or via the above implicit locale coercion) automatically enables the ``surrogateescape`` From aa87e1c7f08d4d4c14deceec071ae800b84415ce Mon Sep 17 00:00:00 2001 From: Nick Coghlan Date: Sat, 9 Jun 2018 15:23:17 +1000 Subject: [PATCH 4/5] Further cleanups - more explicit -X utf8 description - GNU readline reads the current locale, not the environment - other minor tweaks --- Doc/using/cmdline.rst | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index f5a7e1b0ddac6bb..80cd161ae9a608a 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -438,8 +438,10 @@ Miscellaneous options * Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to ``True`` - * ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the - UTF-8 mode. See :envvar:`PYTHONUTF8` for more details. + * ``-X utf8`` enables UTF-8 mode for operating system interfaces, overriding + the default locale-aware mode. ``-X utf8=0`` explicitly disables UTF-8 + mode (even when it would otherwise activate automatically). + See :envvar:`PYTHONUTF8` for more details. It also allows passing arbitrary values and retrieving them through the :data:`sys._xoptions` dictionary. @@ -792,7 +794,7 @@ conflict. to skip coercing the legacy ASCII-based C and POSIX locales to a more capable UTF-8 based alternative. - If this variable is *not* set (or is set to a value other than ``0``) the + If this variable is *not* set (or is set to a value other than ``0``), the ``LC_ALL`` locale override environment variable is also not set, and the current locale reported for the ``LC_CTYPE`` category is either the default ``C`` locale, or else the explicitly ASCII-based ``POSIX`` locale, then the @@ -806,10 +808,13 @@ conflict. 
If setting one of these locale categories succeeds, then the ``LC_CTYPE`` environment variable will also be set accordingly in the current process - environment before the Python runtime is initialized. This ensures the - updated setting is seen in subprocesses, as well as in operations that - query the environment rather than the current C locale (such as Python's - own :func:`locale.getdefaultlocale`, or the GNU ``readline`` library). + environment before the Python runtime is initialized. This ensures that in + addition to being seen by both the interpreter itself and other locale-aware + components running in the same process (such as the GNU ``readline`` + library), the updated setting is also seen in subprocesses (regardless of + whether or not those processes are running a Python interpreter), as well as + in operations that query the environment rather than the current C locale + (such as Python's own :func:`locale.getdefaultlocale`). Configuring one of these locales (either explicitly or via the above implicit locale coercion) automatically enables the ``surrogateescape`` @@ -850,7 +855,8 @@ conflict. This means that: - * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'``. + * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'``(the locale + encoding is ignored). * :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale encoding is ignored, and the function's ``do_setlocale`` parameter has no effect). From 0e8ce62a918d17d68ff3782bca4c7b49cd484857 Mon Sep 17 00:00:00 2001 From: Nick Coghlan Date: Sat, 9 Jun 2018 15:54:55 +1000 Subject: [PATCH 5/5] Markup fix --- Doc/using/cmdline.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/using/cmdline.rst b/Doc/using/cmdline.rst index 80cd161ae9a608a..c6bb0be6bc4cf96 100644 --- a/Doc/using/cmdline.rst +++ b/Doc/using/cmdline.rst @@ -855,7 +855,7 @@ conflict. 
This means that: - * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'``(the locale + * :func:`sys.getfilesystemencoding()` returns ``'UTF-8'`` (the locale encoding is ignored). * :func:`locale.getpreferredencoding()` returns ``'UTF-8'`` (the locale encoding is ignored, and the function's ``do_setlocale`` parameter has no