From: "Wang, Jian J"
To: devel@edk2.groups.io
Cc: Bob Feng, Liming Gao, Yuwei Chen
Subject: [PATCH] BaseTools: fix decoding issue in file operation
Date: Fri, 16 Oct 2020 15:41:24 +0800
Message-Id: <20201016074124.831-1-jian.j.wang@intel.com>

The build tool reports a failure on file read, for example when calling
trim to clean preprocessed source files, if the tool is running on an OS
with a non-Western code page and the source file contains non-ASCII
characters. Even utf-8 has problems when it encounters some characters
encoded in cp1252 (such as 0x92, 0x96, 0xa0, etc.).

Currently, the safest way to read a file in Python code is to use
'latin-1' (iso-8859-1), because it maps every byte in the range 00-FF to
a character and therefore cannot cause an encoding/decoding error.
It behaves almost the same as reading the file in binary mode.

cp1252 is similar to latin-1, but it cannot encode the characters '\x80'
to '\x9f' and cannot decode the following bytes:

    '\x81', '\x8d', '\x8f', '\x90', '\x9d'

So if there are utf-8/16 encoded characters in the file, reading it with
cp1252 will sometimes fail.

Refer to the following links for details:
https://en.wikipedia.org/wiki/Latin-1_Supplement_(Unicode_block)
https://en.wikipedia.org/wiki/Windows-1252
https://kb.iu.edu/d/aepu
https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

One can use the following Python code to verify this.

    for i in range(0x100):
        try:
            chr(i).encode('latin-1')
        except:
            print(" %s cannot encode %02x" % ('latin-1', i))

    for i in range(0x100):
        try:
            b = bytes([i])
            b.decode('latin-1')
        except:
            print(" %s cannot decode %02x" % ('latin-1', i))

This patch adds code that forces 'latin-1' as the encoding argument of
open() in OpenLongFilePath() whenever the file is opened in text mode
(binary modes are left untouched). This removes the decoding failures
described above.
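
As a rough illustration of the codec comparison above, the sketch below runs the same kind of loop against latin-1, cp1252 and utf-8. The helper names `undecodable_bytes` and `unencodable_code_points` are hypothetical and are not part of BaseTools; the results reflect the codecs shipped with CPython 3.

```python
def undecodable_bytes(codec):
    """Return the byte values that `codec` cannot decode as a lone byte."""
    bad = []
    for i in range(0x100):
        try:
            bytes([i]).decode(codec)
        except UnicodeDecodeError:
            bad.append(i)
    return bad


def unencodable_code_points(codec):
    """Return the code points U+0000..U+00FF that `codec` cannot encode."""
    bad = []
    for i in range(0x100):
        try:
            chr(i).encode(codec)
        except UnicodeEncodeError:
            bad.append(i)
    return bad


if __name__ == "__main__":
    for codec in ("latin-1", "cp1252", "utf-8"):
        dec = " ".join("%02x" % b for b in undecodable_bytes(codec))
        enc = " ".join("%02x" % c for c in unencodable_code_points(codec))
        print("%-8s cannot decode: %s" % (codec, dec or "(none)"))
        print("%-8s cannot encode: %s" % (codec, enc or "(none)"))
```

Only latin-1 comes back empty in both directions, which is why it is the one text encoding that never raises on arbitrary file content.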

The possible related BZs:
https://bugzilla.tianocore.org/show_bug.cgi?id=1434
https://bugzilla.tianocore.org/show_bug.cgi?id=1637
https://bugzilla.tianocore.org/show_bug.cgi?id=2578
https://bugzilla.tianocore.org/show_bug.cgi?id=2709
https://bugzilla.tianocore.org/show_bug.cgi?id=2829

Cc: Bob Feng
Cc: Liming Gao
Cc: Yuwei Chen
Signed-off-by: Jian J Wang
---
 BaseTools/Source/Python/Common/LongFilePathSupport.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/BaseTools/Source/Python/Common/LongFilePathSupport.py b/BaseTools/Source/Python/Common/LongFilePathSupport.py
index 38c4396544..c8dce077f2 100644
--- a/BaseTools/Source/Python/Common/LongFilePathSupport.py
+++ b/BaseTools/Source/Python/Common/LongFilePathSupport.py
@@ -30,7 +30,8 @@ def LongFilePath(FileName):
 # wrap open to support opening a long file path
 #
 def OpenLongFilePath(FileName, Mode='r', Buffer= -1):
-    return open(LongFilePath(FileName), Mode, Buffer)
+    Encoding = None if 'b' in Mode else 'latin-1'
+    return open(LongFilePath(FileName), Mode, Buffer, Encoding)
 
 def CodecOpenLongFilePath(Filename, Mode='rb', Encoding=None, Errors='strict', Buffering=1):
     return codecs.open(LongFilePath(Filename), Mode, Encoding, Errors, Buffering)
-- 
2.24.0.windows.2
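
The effect of the OpenLongFilePath() change can be tried outside BaseTools with a minimal sketch like the one below. `open_text_as_latin1` and `round_trip` are hypothetical names, and the long-path handling done by LongFilePath() is deliberately left out; only the encoding selection mirrors the patch.

```python
import os
import tempfile


def open_text_as_latin1(file_name, mode="r", buffering=-1):
    """Hypothetical stand-in for the patched OpenLongFilePath()."""
    # Binary modes must not be given an encoding; text modes get latin-1.
    encoding = None if "b" in mode else "latin-1"
    return open(file_name, mode, buffering, encoding)


def round_trip(raw):
    """Write raw bytes, read them back in text mode, then re-encode the text."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open_text_as_latin1(path, "wb") as f:
            f.write(raw)
        with open_text_as_latin1(path, "r") as f:
            text = f.read()
    finally:
        os.remove(path)
    return text, text.encode("latin-1")


if __name__ == "__main__":
    # Bytes that cp1252 or utf-8 would reject when decoded on their own.
    sample = bytes([0x41, 0x81, 0x92, 0x96, 0xA0, 0x42])
    text, back = round_trip(sample)
    print(repr(text), back == sample)
```

The sample avoids '\r' on purpose: text mode still applies newline translation, which is one way reading with latin-1 differs from reading in binary mode.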