From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: mx.groups.io; dkim=missing; spf=pass (domain: redhat.com, ip: 209.132.183.28, mailfrom: lersek@redhat.com) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by groups.io with SMTP; Fri, 02 Aug 2019 16:57:48 -0700 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 1FE3436899; Fri, 2 Aug 2019 23:57:48 +0000 (UTC) Received: from lacos-laptop-7.usersys.redhat.com (ovpn-116-36.ams2.redhat.com [10.36.116.36]) by smtp.corp.redhat.com (Postfix) with ESMTP id B4C6F5D9E2; Fri, 2 Aug 2019 23:57:46 +0000 (UTC) Subject: Re: [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show' To: Jordan Justen , Michael D Kinney , devel@edk2.groups.io Cc: Bob Feng , Liming Gao References: <20190802001314.25980-1-michael.d.kinney@intel.com> <20190802001314.25980-4-michael.d.kinney@intel.com> <156477054101.20290.17514635538979393303@jljusten-skl> From: "Laszlo Ersek" Message-ID: <8d511f89-f2a8-4d29-65e3-cc76fee12017@redhat.com> Date: Sat, 3 Aug 2019 01:57:45 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: <156477054101.20290.17514635538979393303@jljusten-skl> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.30]); Fri, 02 Aug 2019 23:57:48 +0000 (UTC) Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit On 08/02/19 20:29, Jordan Justen wrote: > First, I hope we don't add lots of large .pdf files into the source > tree. I see two duplicated > 200k .pdf files in edk2, which seems like > a waste of space in the edk2 tree. > > BaseTools/Source/C/BrotliCompress/docs/brotli-comparison-study-2015-09-22.pdf > MdeModulePkg/Library/BrotliCustomDecompressLib/docs/brotli-comparison-study-2015-09-22.pdf > > Second, I'm not too sure about this change. It seems like it might > have unintended consequences. > > One thought I had is that it might break UTF-16 unicode files diffs, > but luckily I think we've pretty much gotten rid of UTF-16 files at > this point. > > I also wonder if adding a .pdf attribute like Laszlo recommends for > .efi might be an alternative solution. > > https://github.com/tianocore/tianocore.github.io/wiki/Laszlo's-unkempt-git-guide-for-edk2-contributors-and-maintainers#contrib-09 > > We could actually add these setting in the git tree in a > .gitattributes file, right? Has this been suggested? I think Laszlo > documented this before we had converted edk2 to git. This is a complex question. There are files which are binary for all intents and purposes, so you never want to see a "textualized" version of them. Git can attempt recognizing these files, but it can take hints too. For these files, once they are safely classified as "binary", it is a separate question whether you want a changeset shown (or email-formatted) for these files to contain base64-like binary patches, or just a statement that "binary files differ". Then there are files that are binary, but can be "textualized" for the purposes of diffing and even merging. Automated UCS2<->UTF8 conversion is an example, and I used that while edk2 still had UCS2-encoded UNI files. Some examples (not claiming this is going to be an exhaustive list -- for that, see gitattributes(5)), to be placed in .git/info/attributes (local clone only) or .gitattributes (tracked in the repo): *.efi -diff This classifies *.efi as binary (turns off the heuristics). Consider gitattributes(5) -- excerpt: diff Unset A path to which the diff attribute is unset will generate Binary files differ (or a binary patch, if binary patches are enabled). Unspecified A path to which the diff attribute is unspecified first gets its contents inspected, and if it looks like text and is smaller than core.bigFileThreshold, it is treated as text. Otherwise it would generate Binary files differ. git-show will only report "binary files differ", unless we pass --binary to it; in which case it will offer a base64-like binary patch. git-format-patch will produce that binary patch by default, unless we pass --no-binary to it. Another example, in .git/info/attributes or .gitattributes: *.uni diff=uni merge=uni The manual says: diff String Diff is shown using the specified diff driver. Each driver may specify one or more options, as described in the following section. The options for the diff driver "foo" are defined by the configuration variables in the "diff.foo" section of the Git config file. merge String 3-way merge is performed using the specified custom merge driver. The built-in 3-way merge driver can be explicitly specified by asking for "text" driver; the built-in "take the current branch" driver can be requested with "binary". Accordingly, from .git/config: [diff "uni"] textconv = iconv -f UCS-2 -t UTF-8 binary = true [merge "uni"] name = merge driver for UCS-2 encoded text files driver = uni-merge %O %A %B And my uni-merge script: --------- #!/bin/bash # Merge UCS-2 encoded text files. # This is a git-merge driver. It will overwrite "$MINE". # # Exit status: # 0 -- no conflicts # 1 -- conflicts # 2 -- trouble OLD="$1" MINE="$2" YOURS="$3" # Set safe defaults for the trap handler. OLD_UTF8= MY_UTF8= YOUR_UTF8= NEW_UTF8= EX=2 trap 'rm -f -- "$OLD_UTF8" "$MY_UTF8" "$YOUR_UTF8" "$NEW_UTF8"; exit $EX' EXIT set -e -u -C OLD_UTF8=$(mktemp) iconv --from UCS-2 --to UTF-8 <"$OLD" >>"$OLD_UTF8" MY_UTF8=$(mktemp) iconv --from UCS-2 --to UTF-8 <"$MINE" >>"$MY_UTF8" YOUR_UTF8=$(mktemp) iconv --from UCS-2 --to UTF-8 <"$YOURS" >>"$YOUR_UTF8" NEW_UTF8=$(mktemp) set +e diff3 --merge --label=mine --label=old --label=yours -- \ "$MY_UTF8" "$OLD_UTF8" "$YOUR_UTF8" >>"$NEW_UTF8" DIFFRET=$? set -e if [ 0 -ne "$DIFFRET" ] && [ 1 -ne "$DIFFRET" ]; then exit fi iconv --from UTF-8 --to UCS-2 <"$NEW_UTF8" >|"$MINE" EX=$DIFFRET --------- My understanding is that the "--no-textconv" option would undo the "textconv" that I set under [diff "uni"]. In other words, it tells git that, even if the heuristics suggested *some* conversion to text (I can't say what that would be, in the general case), git should still treat the file as binary. (At which point the --no-binary / --binary options decide whether we get a base64-like encoding, or just a message saying "binary files differ"). Therefore I think (= agree with Jordan) that the effect is the same as marking the affected file types in .gitattributes as "-diff". I guess the relevant question is scope. "--no-textconv" doesn't require us to assign "-diff" to specific file suffixes, such as *.pdf, *.efi, and so on. But, it also doesn't allow us to specify *some* text conversions, even if we wanted to (it would override & disable them, such as my "iconv" conversion when diffing UCS2 encoded UNI files). Adding a .gitattributes file to the root of the repository, and updating it whenever necessary (for example when we add the first .pdf file, the first .efi file...), is likely the best. ( My recommendation at may have targeted the local ".git/info/attributes" file because people can have quite different opinions on how git should act on various files. In theory, someone having simplistic PDF files might even want: In the attributes: *.pdf diff=pdf In the config: [diff "pdf"] textconv = sh -c 'pdftotext -- "$1" -' pdftotext-wrapper binary = true ("pdftotext" converts PDF to text, but its command line is awkward -- the hyphen marking stdout must be explicit, and must come as second operand. Hence the wrapping into "sh" -- and then we need to spell out $0 as well, for which I picked "pdftotext-wrapper". The "--" separator is a safety measure against filenames starting with "-", which could be mis-interpreted as options. "--" terminates the option list and starts the operand list.) ) Thanks Laszlo > On 2019-08-01 17:13:14, Michael D Kinney wrote: >> https://bugzilla.tianocore.org/show_bug.cgi?id=2044 >> >> 'git show' is used to extrat the patch contents for analysis. >> Add the flag '--no-textconv' to the 'git show' command to >> disable the conversion from some binary file types to text >> content. >> >> Without this change, binary files such as .pdf files are >> converted to text in the show command and PatchCheck complains >> that the wrong line endings are used in the patch. >> >> Cc: Bob Feng >> Cc: Liming Gao >> Cc: Jordan Justen >> Signed-off-by: Michael D Kinney >> --- >> BaseTools/Scripts/PatchCheck.py | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/BaseTools/Scripts/PatchCheck.py b/BaseTools/Scripts/PatchCheck.py >> index 6b07241bfe..2a4e6f603e 100755 >> --- a/BaseTools/Scripts/PatchCheck.py >> +++ b/BaseTools/Scripts/PatchCheck.py >> @@ -543,7 +543,7 @@ class CheckGitCommits: >> >> def read_patch_from_git(self, commit): >> # Run git to get the commit patch >> - return self.run_git('show', '--pretty=email', commit) >> + return self.run_git('show', '--pretty=email', '--no-textconv', commit) >> >> def run_git(self, *args): >> cmd = [ 'git' ] >> -- >> 2.21.0.windows.1 >>