public inbox for devel@edk2.groups.io
* [PATCH] BaseTools: fix decoding issue in file operation
@ 2020-10-16  7:41 Wang, Jian J
  2020-10-19  8:55 ` Reply: [edk2-devel] " fengyunhua
From: Wang, Jian J @ 2020-10-16  7:41 UTC
  To: devel; +Cc: Bob Feng, Liming Gao, Yuwei Chen

The build tool reports a failure on file read, such as when calling
trim to clean preprocessed source files, if the tool is running on an
OS with a non-Western code page and the source file contains non-ASCII
characters.

Even utf-8 has problems when it encounters some characters encoded
in cp1252 (such as 0x92, 0x96, 0xa0, etc.), because these bytes are
not valid utf-8 sequences on their own.
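
The utf-8 failure can be reproduced with a short sketch like the one
below. It is not part of the patch; the helper name utf8_can_decode is
made up for illustration. It tries to decode each of the bytes listed
above on its own, first as utf-8 and then as cp1252.

```python
# Bytes cited above as cp1252-encoded characters.
CP1252_BYTES = [0x92, 0x96, 0xA0]

def utf8_can_decode(byte_value):
    """Return True if the single byte decodes as utf-8 on its own."""
    try:
        bytes([byte_value]).decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Map each byte to whether utf-8 accepts it in isolation.
results = {b: utf8_can_decode(b) for b in CP1252_BYTES}

for b, ok in results.items():
    # cp1252 defines all three bytes, so this decode does not raise.
    as_cp1252 = bytes([b]).decode('cp1252')
    print("0x%02x: utf-8 %s, cp1252 gives %r"
          % (b, "ok" if ok else "fails", as_cp1252))
```

All three values are utf-8 continuation bytes, so a strict utf-8
decoder rejects them when they appear without a lead byte.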

Currently, the safest way to read a file in Python code is to use
'latin-1' (iso-8859-1), because it maps every byte in 00-FF to a
character and therefore never causes an encoding/decoding error. It
behaves almost the same as reading the file in binary mode.
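
A minimal sketch of that property, independent of BaseTools: every
byte value decodes to the code point with the same number, and
encoding the result gives back the original bytes. The variable names
are illustrative only.

```python
# All 256 possible byte values, in order.
all_bytes = bytes(range(0x100))

# Decoding with latin-1 never raises: byte N becomes code point N.
text = all_bytes.decode('latin-1')
same_values = all(ord(ch) == b for ch, b in zip(text, all_bytes))

# Encoding the decoded text restores the exact original bytes.
round_trip = text.encode('latin-1')

print("one character per byte:", len(text) == len(all_bytes))
print("code point equals byte value:", same_values)
print("lossless round trip:", round_trip == all_bytes)
```

The "almost" matters because a file opened in text mode still goes
through Python's newline translation, which binary mode skips.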

cp1252 is similar to latin-1, but it doesn't support encoding the
code points '\x80' to '\x9f' and doesn't support decoding the
following bytes:

  '\x81', '\x8d', '\x8f', '\x90', '\x9d'

So if the file contains utf-8/16 encoded characters, reading it as
cp1252 will sometimes fail.
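
The "sometimes" can be illustrated with two utf-8 encoded characters
chosen for this sketch (they do not come from any EDK II source file).
Only one of them contains a byte that cp1252 cannot decode; latin-1
accepts both. The helper name try_decode is hypothetical.

```python
def try_decode(data, codec):
    """Return the decoded text, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None

# U+00C1 is C3 81 in utf-8; 0x81 is undefined in cp1252.
failing = '\u00c1'.encode('utf-8')
# U+00E9 is C3 A9 in utf-8; both bytes are defined in cp1252.
passing = '\u00e9'.encode('utf-8')

cp1252_failing = try_decode(failing, 'cp1252')
cp1252_passing = try_decode(passing, 'cp1252')
latin1_failing = try_decode(failing, 'latin-1')

print("cp1252 on C3 81:", cp1252_failing)
print("cp1252 on C3 A9:", repr(cp1252_passing))
print("latin-1 on C3 81:", repr(latin1_failing))
```

Note that a successful decode here only means no exception is raised.
The resulting text is not the original character, which is acceptable
for tools that just need to pass the bytes through.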

Refer to the following links for details:
  https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
  https://en.wikipedia.org/wiki/Windows-1252
  https://kb.iu.edu/d/aepu
  https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

One can use the following Python code to verify this.

# Change codec to 'cp1252' or 'utf-8' to compare; 'latin-1' prints nothing.
codec = 'latin-1'

for i in range(0x100):
    try:
        chr(i).encode(codec)
    except UnicodeEncodeError:
        print("    %s cannot encode %02x" % (codec, i))

for i in range(0x100):
    try:
        bytes([i]).decode(codec)
    except UnicodeDecodeError:
        print("    %s cannot decode %02x" % (codec, i))

This patch adds code to enforce 'latin-1' as the encoding argument
of open() in function OpenLongFilePath() when the file is opened in
text mode. Binary mode is left unchanged. This solves the file
decoding issue completely.
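
The effect of this rule can be sketched outside BaseTools as follows.
The function open_with_forced_encoding is a hypothetical stand-in that
applies the same mode check but omits LongFilePath(); the file name
and contents are made up for the example.

```python
import os
import tempfile

def open_with_forced_encoding(file_name, mode='r', buffering=-1):
    # Text modes get latin-1; binary modes must not pass an encoding.
    encoding = None if 'b' in mode else 'latin-1'
    return open(file_name, mode, buffering, encoding)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.i')

    # Write a line containing 0x92, a cp1252 right single quote.
    with open(path, 'wb') as f:
        f.write(b'// it\x92s a comment\n')

    # Text mode: decodes without error regardless of the OS code page.
    with open_with_forced_encoding(path) as f:
        text = f.read()

    # Binary mode: behaves exactly as before.
    with open_with_forced_encoding(path, 'rb') as f:
        raw = f.read()

print(repr(text))
print(repr(raw))
```

Because latin-1 is lossless, text.encode('latin-1') gives back the raw
bytes, so content read this way can be written out unchanged.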

Possibly related BZs:
    https://bugzilla.tianocore.org/show_bug.cgi?id=1434
    https://bugzilla.tianocore.org/show_bug.cgi?id=1637
    https://bugzilla.tianocore.org/show_bug.cgi?id=2578
    https://bugzilla.tianocore.org/show_bug.cgi?id=2709
    https://bugzilla.tianocore.org/show_bug.cgi?id=2829

Cc: Bob Feng <bob.c.feng@intel.com>
Cc: Liming Gao <gaoliming@byosoft.com.cn>
Cc: Yuwei Chen <yuwei.chen@intel.com>
Signed-off-by: Jian J Wang <jian.j.wang@intel.com>
---
 BaseTools/Source/Python/Common/LongFilePathSupport.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/LongFilePathSupport.py b/BaseTools/Source/Python/Common/LongFilePathSupport.py
index 38c4396544..c8dce077f2 100644
--- a/BaseTools/Source/Python/Common/LongFilePathSupport.py
+++ b/BaseTools/Source/Python/Common/LongFilePathSupport.py
@@ -30,7 +30,8 @@ def LongFilePath(FileName):
 # wrap open to support opening a long file path
 #
 def OpenLongFilePath(FileName, Mode='r', Buffer= -1):
-    return open(LongFilePath(FileName), Mode, Buffer)
+    Encoding = None if 'b' in Mode else 'latin-1'
+    return open(LongFilePath(FileName), Mode, Buffer, Encoding)
 
 def CodecOpenLongFilePath(Filename, Mode='rb', Encoding=None, Errors='strict', Buffering=1):
     return codecs.open(LongFilePath(Filename), Mode, Encoding, Errors, Buffering)
-- 
2.24.0.windows.2

