Skip to content

Enhance zipfile to support decoding comments using metadata_encoding #152929

Description

@danny0838

Feature or enhancement

Proposal:

In the current implementation, ZipFile accepts a metadata_encoding parameter to decode member filenames automatically into strings. However, {ZipFile,ZipInfo}.comment remains as raw bytes. This creates an API inconsistency and leaves the burden of proper decoding to the user.

According to the ZIP specification, a member's comment must be decoded using UTF-8 if its EFS flag bit is set. Otherwise, it should fallback to cp437 (per spec) or the local encoding (in practice), which typically aligns with the encoding used for filenames.

Currently, to properly read comments, users must write redundant boilerplate to manually check the EFS flag and apply the appropriate codec. This severely defeats the convenience introduced by the metadata_encoding parameter.

Proposed Solution

Prerequisite

For ZipInfo objects to handle comment encoding/decoding properly, there is a need to access the specified encoding context from the parent archive.

  • Approach A: Pass down the context as an internal attribute, such as _metadata_encoding. This aligns with the strategy used in PR gh-152845: Keep EFS flag for a file loaded from the archive #152846, which preserves the EFS flag during archive modification.
  • Approach B: Bind the parent ZipFile object via an internal attribute like _zipfile. This dynamically links the active encoding and enables future structural guardrails, such as detecting and warning against unsafe cross-archive metadata mutations, though care must be taken regarding pickling behavior.

Implementation

While updating {ZipFile,ZipInfo}.comment to be string-based would be the cleanest API, it is likely undesired due to breaking backward compatibility.

Alternatively, we can introduce a comment_text property (with getter, setter, and deleter) for both ZipFile and ZipInfo to manage string-based comments safely:

1. ZipFile.comment_text
  • get: Unlike member comments, the archive-level comment has no EFS flag in the ZIP specification. The getter should try decoding with UTF-8 first, then fallback to metadata_encoding (or 'cp437' if not provided).
  • set: Encodes with metadata_encoding (or 'cp437' if not provided). Falls back to UTF-8 if encoding fails, optionally issuing a warning.
  • delete: Clears the comment by resetting the underlying archive comment bytes to b''.
2. ZipInfo.comment_text
  • get: Takes the Unicode comment extra field (0x6375) if viable. Otherwise, decodes with UTF-8 if the EFS flag is set. Otherwise, uses the bound metadata_encoding (or 'cp437' if not provided).
  • set: Encodes with UTF-8 if the EFS flag is set. Otherwise, uses the bound metadata_encoding (or 'cp437' if not provided). If it fails, it can force the EFS flag and use UTF-8. An even simpler alternative is to always enforce EFS and UTF-8 when modifying via this property. Optionally clear or update the Unicode comment extra field.
  • delete: Clears the comment by resetting .comment to b''. Optionally clear the associated Unicode comment extra field.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    stdlibStandard Library Python modules in the Lib/ directorytype-featureA feature request or enhancement

    Fields

    No fields configured for issues without a type.

    Projects

    Status
    No status

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions