public inbox for devel@edk2.groups.io
From: "Wang, Jian J" <jian.j.wang@intel.com>
To: devel@edk2.groups.io
Cc: Bob Feng <bob.c.feng@intel.com>,
	Liming Gao <gaoliming@byosoft.com.cn>,
	Yuwei Chen <yuwei.chen@intel.com>
Subject: [PATCH] BaseTools: fix decoding issue in file operation
Date: Fri, 16 Oct 2020 15:41:24 +0800
Message-ID: <20201016074124.831-1-jian.j.wang@intel.com>

The build tool reports a failure on file read, for example when calling
trim to clean preprocessed source files, if the tool is running on an OS
with a non-Western code page and the source file contains non-ASCII
characters.

Even utf-8 has problems when it encounters some characters encoded in
cp1252 (such as 0x92, 0x96, 0xa0, etc).
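
As a minimal illustration (not part of the patch), the bytes named above
are valid cp1252 but are not valid stand-alone UTF-8, so a strict utf-8
decode raises UnicodeDecodeError. The helper name utf8_decodes is
hypothetical and exists only for this sketch.

```python
# Bytes that cp1252 can decode but that are invalid as stand-alone UTF-8.
samples = [0x92, 0x96, 0xA0]

def utf8_decodes(byte_value):
    """Return True if the single byte decodes as strict UTF-8."""
    try:
        bytes([byte_value]).decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

for value in samples:
    # cp1252 decodes each of these bytes; utf-8 rejects them.
    print("%02x: utf-8 ok=%s, cp1252=%r" % (
        value, utf8_decodes(value), bytes([value]).decode('cp1252')))
```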

Currently, the safest way to read a file in Python code is to use
'latin-1' (iso-8859-1), because it maps every byte in 00-FF to a
character and so cannot cause encoding/decoding errors. It behaves
almost the same as reading the file in binary mode.

cp1252 is similar to latin-1, but it cannot encode the characters
'\x80' to '\x9f' and cannot decode the following bytes:

  '\x81', '\x8d', '\x8f', '\x90', '\x9d'

So if there are utf-8/16 encoded characters in the file, reading it
will sometimes fail.
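
A rough sketch for checking the cp1252 gaps (not part of the patch): it
collects the byte values a codec cannot decode and the code points below
U+0100 it cannot encode, using Python's built-in codecs. The function
names are hypothetical.

```python
def undecodable_bytes(codec):
    """Byte values in 0x00-0xFF that the codec fails to decode."""
    bad = []
    for i in range(0x100):
        try:
            bytes([i]).decode(codec)
        except UnicodeDecodeError:
            bad.append(i)
    return bad

def unencodable_code_points(codec):
    """Code points in U+0000-U+00FF that the codec fails to encode."""
    bad = []
    for i in range(0x100):
        try:
            chr(i).encode(codec)
        except UnicodeEncodeError:
            bad.append(i)
    return bad

for codec in ('latin-1', 'cp1252'):
    print(codec,
          "cannot decode:", ["%02x" % b for b in undecodable_bytes(codec)],
          "cannot encode:", ["%02x" % c for c in unencodable_code_points(codec)])
```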

Refer to following links for details:
  https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
  https://en.wikipedia.org/wiki/Windows-1252
  https://kb.iu.edu/d/aepu
  https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

One can use the following Python code to verify this.

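# Check that every code point 00-FF can be encoded.
for i in range(0x100):
    try:
        chr(i).encode('latin-1')
    except UnicodeEncodeError:
        print("    %s cannot encode %02x" % ('latin-1', i))

# Check that every byte value 00-FF can be decoded.
for i in range(0x100):
    try:
        b = bytes([i])
        b.decode('latin-1')
    except UnicodeDecodeError:
        print("    %s cannot decode %02x" % ('latin-1', i))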

This patch adds code to enforce 'latin-1' as the encoding argument of
open() in function OpenLongFilePath() when the file is opened in text
mode. This resolves the file decoding issue.
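
The effect can be sketched as follows (an illustration only, not the
patched code): open_with_latin1 is a hypothetical stand-in for
OpenLongFilePath() with the long-path handling left out, and the sample
bytes are made up for the demonstration.

```python
import os
import tempfile

def open_with_latin1(file_name, mode='r', buffering=-1):
    # Hypothetical stand-in for OpenLongFilePath(): binary modes get no
    # encoding, text modes are forced to 'latin-1'.
    encoding = None if 'b' in mode else 'latin-1'
    return open(file_name, mode, buffering, encoding)

# Sample content with bytes that utf-8 and cp1252 cannot both decode.
raw = b'caf\xe9 \x92quoted\x92 \x81\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')
    with open_with_latin1(path, 'wb') as f:
        f.write(raw)
    with open_with_latin1(path, 'rb') as f:
        data = f.read()   # bytes, untouched
    with open_with_latin1(path, 'r') as f:
        text = f.read()   # str, one character per byte

# Each byte maps to the code point of the same value, so nothing is lost.
print(data == raw, text.encode('latin-1') == raw)
```

Note that text mode still applies newline translation, which is one
reason the behavior is only almost the same as binary mode.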

Possibly related BZs:
    https://bugzilla.tianocore.org/show_bug.cgi?id=1434
    https://bugzilla.tianocore.org/show_bug.cgi?id=1637
    https://bugzilla.tianocore.org/show_bug.cgi?id=2578
    https://bugzilla.tianocore.org/show_bug.cgi?id=2709
    https://bugzilla.tianocore.org/show_bug.cgi?id=2829

Cc: Bob Feng <bob.c.feng@intel.com>
Cc: Liming Gao <gaoliming@byosoft.com.cn>
Cc: Yuwei Chen <yuwei.chen@intel.com>
Signed-off-by: Jian J Wang <jian.j.wang@intel.com>
---
 BaseTools/Source/Python/Common/LongFilePathSupport.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/LongFilePathSupport.py b/BaseTools/Source/Python/Common/LongFilePathSupport.py
index 38c4396544..c8dce077f2 100644
--- a/BaseTools/Source/Python/Common/LongFilePathSupport.py
+++ b/BaseTools/Source/Python/Common/LongFilePathSupport.py
@@ -30,7 +30,8 @@ def LongFilePath(FileName):
 # wrap open to support opening a long file path
 #
 def OpenLongFilePath(FileName, Mode='r', Buffer= -1):
-    return open(LongFilePath(FileName), Mode, Buffer)
+    Encoding = None if 'b' in Mode else 'latin-1'
+    return open(LongFilePath(FileName), Mode, Buffer, Encoding)
 
 def CodecOpenLongFilePath(Filename, Mode='rb', Encoding=None, Errors='strict', Buffering=1):
     return codecs.open(LongFilePath(Filename), Mode, Encoding, Errors, Buffering)
-- 
2.24.0.windows.2

