You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In the current implementation, ZipFile accepts a metadata_encoding parameter to decode member filenames automatically into strings. However, {ZipFile,ZipInfo}.comment remains as raw bytes. This creates an API inconsistency and leaves the burden of proper decoding to the user.
According to the ZIP specification, a member's comment must be decoded using UTF-8 if its EFS flag bit is set. Otherwise, it should fallback to cp437 (per spec) or the local encoding (in practice), which typically aligns with the encoding used for filenames.
Currently, to properly read comments, users must write redundant boilerplate to manually check the EFS flag and apply the appropriate codec. This severely defeats the convenience introduced by the metadata_encoding parameter.
Proposed Solution
Prerequisite
For ZipInfo objects to handle comment encoding/decoding properly, there is a need to access the specified encoding context from the parent archive.
Approach B: Bind the parent ZipFile object via an internal attribute like _zipfile. This dynamically links the active encoding and enables future structural guardrails, such as detecting and warning against unsafe cross-archive metadata mutations, though care must be taken regarding pickling behavior.
Implementation
While updating {ZipFile,ZipInfo}.comment to be string-based would be the cleanest API, it is likely undesired due to breaking backward compatibility.
Alternatively, we can introduce a comment_text property (with getter, setter, and deleter) for both ZipFile and ZipInfo to manage string-based comments safely:
1. ZipFile.comment_text
get: Unlike member comments, the archive-level comment has no EFS flag in the ZIP specification. The getter should try decoding with UTF-8 first, then fallback to metadata_encoding (or 'cp437' if not provided).
set: Encodes with metadata_encoding (or 'cp437' if not provided). Falls back to UTF-8 if encoding fails, optionally issuing a warning.
delete: Clears the comment by resetting the underlying archive comment bytes to b''.
2. ZipInfo.comment_text
get: Takes the Unicode comment extra field (0x6375) if viable. Otherwise, decodes with UTF-8 if the EFS flag is set. Otherwise, uses the bound metadata_encoding (or 'cp437' if not provided).
set: Encodes with UTF-8 if the EFS flag is set. Otherwise, uses the bound metadata_encoding (or 'cp437' if not provided). If it fails, it can force the EFS flag and use UTF-8. An even simpler alternative is to always enforce EFS and UTF-8 when modifying via this property. Optionally clear or update the Unicode comment extra field.
delete: Clears the comment by resetting .comment to b''. Optionally clear the associated Unicode comment extra field.
Feature or enhancement
Proposal:
In the current implementation,
ZipFileaccepts ametadata_encodingparameter to decode member filenames automatically into strings. However,{ZipFile,ZipInfo}.commentremains as raw bytes. This creates an API inconsistency and leaves the burden of proper decoding to the user.According to the ZIP specification, a member's comment must be decoded using UTF-8 if its EFS flag bit is set. Otherwise, it should fallback to cp437 (per spec) or the local encoding (in practice), which typically aligns with the encoding used for filenames.
Currently, to properly read comments, users must write redundant boilerplate to manually check the EFS flag and apply the appropriate codec. This severely defeats the convenience introduced by the
metadata_encodingparameter.Proposed Solution
Prerequisite
For
ZipInfoobjects to handle comment encoding/decoding properly, there is a need to access the specified encoding context from the parent archive._metadata_encoding. This aligns with the strategy used in PR gh-152845: Keep EFS flag for a file loaded from the archive #152846, which preserves the EFS flag during archive modification.ZipFileobject via an internal attribute like_zipfile. This dynamically links the active encoding and enables future structural guardrails, such as detecting and warning against unsafe cross-archive metadata mutations, though care must be taken regarding pickling behavior.Implementation
While updating
{ZipFile,ZipInfo}.commentto be string-based would be the cleanest API, it is likely undesired due to breaking backward compatibility.Alternatively, we can introduce a
comment_textproperty (with getter, setter, and deleter) for bothZipFileandZipInfoto manage string-based comments safely:1.
ZipFile.comment_textmetadata_encoding(or 'cp437' if not provided).metadata_encoding(or 'cp437' if not provided). Falls back to UTF-8 if encoding fails, optionally issuing a warning.b''.2.
ZipInfo.comment_text0x6375) if viable. Otherwise, decodes with UTF-8 if the EFS flag is set. Otherwise, uses the boundmetadata_encoding(or 'cp437' if not provided).metadata_encoding(or 'cp437' if not provided). If it fails, it can force the EFS flag and use UTF-8. An even simpler alternative is to always enforce EFS and UTF-8 when modifying via this property. Optionally clear or update the Unicode comment extra field..commenttob''. Optionally clear the associated Unicode comment extra field.Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response