From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: mx.groups.io; dkim=missing; spf=pass (domain: redhat.com, ip: 209.132.183.28, mailfrom: lersek@redhat.com) Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by groups.io with SMTP; Wed, 07 Aug 2019 06:26:30 -0700 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 231563C936; Wed, 7 Aug 2019 13:26:30 +0000 (UTC) Received: from lacos-laptop-7.usersys.redhat.com (unknown [10.36.118.2]) by smtp.corp.redhat.com (Postfix) with ESMTP id 815975D70D; Wed, 7 Aug 2019 13:26:28 +0000 (UTC) Subject: Re: [edk2-devel] [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show' To: "Kinney, Michael D" , "devel@edk2.groups.io" , "Justen, Jordan L" Cc: "Feng, Bob C" , "Gao, Liming" References: <20190802001314.25980-1-michael.d.kinney@intel.com> <20190802001314.25980-4-michael.d.kinney@intel.com> <156477054101.20290.17514635538979393303@jljusten-skl> <8d511f89-f2a8-4d29-65e3-cc76fee12017@redhat.com> From: "Laszlo Ersek" Message-ID: <944f7f77-5500-aa76-824a-8e207e404ed0@redhat.com> Date: Wed, 7 Aug 2019 15:26:27 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.39]); Wed, 07 Aug 2019 13:26:30 +0000 (UTC) Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit On 08/06/19 00:01, Kinney, Michael D wrote: > Laszlo, > > The context of this change is only to the PatchCheck.py tool. > and how that tool uses git show. > > I agree with the summary of very flexible capabilities in git > to help developers review different types of files. All of > those settings that were added to support UNI files in UTF-16 > file format were very valuable when we had to review the > text changes to those binary files. We should be using UTF-8 > now, and we can even extend PatchCheck.py to flag an error if > a UNI file is in UTF-16 format. > > I still prefer we make this change only to PatchCheck.py to > prevent false positives and print lines of text that can > not be found in a developer's working directory. I prefer > this one time change to PatchCheck.py instead of having to > update .gitattributes whenever the git show features are > extended to convert more binary files to text files. More precisely, that would mean that you prefer this one time change to having to update .gitattributes every time a new type of binary file is introduced to edk2. I can see merits in both approaches. From my side, I don't intend to block this patch. Acked-by: Laszlo Ersek Thanks Laszlo > My expectation is that EDK II Maintainers need to review > if a binary file is ok or not for a repo. I would also be > ok with adding general rules to PatchCheck.py to generate > an error if a binary file is added/updated in one of the > text only repos (edk2, edk2-platforms) and not generate > an error if a binary file is added/updated in a repo that > supports binaries (edk-non-osi). > > Best regards, > > Mike > > >> -----Original Message----- >> From: devel@edk2.groups.io >> [mailto:devel@edk2.groups.io] On Behalf Of Laszlo Ersek >> Sent: Friday, August 2, 2019 4:58 PM >> To: Justen, Jordan L ; >> Kinney, Michael D ; >> devel@edk2.groups.io >> Cc: Feng, Bob C ; Gao, Liming >> >> Subject: Re: [edk2-devel] [Patch 3/3] >> BaseTools/PatchCheck: Disable text conversion in 'git >> show' >> >> On 08/02/19 20:29, Jordan Justen wrote: >>> First, I hope we don't add lots of large .pdf files >> into the source >>> tree. I see two duplicated > 200k .pdf files in edk2, >> which seems like >>> a waste of space in the edk2 tree. >>> >>> BaseTools/Source/C/BrotliCompress/docs/brotli- >> comparison-study-2015-09 >>> -22.pdf >>> >> MdeModulePkg/Library/BrotliCustomDecompressLib/docs/bro >> tli-comparison- >>> study-2015-09-22.pdf >>> >>> Second, I'm not too sure about this change. It seems >> like it might >>> have unintended consequences. >>> >>> One thought I had is that it might break UTF-16 >> unicode files diffs, >>> but luckily I think we've pretty much gotten rid of >> UTF-16 files at >>> this point. >>> >>> I also wonder if adding a .pdf attribute like Laszlo >> recommends for >>> .efi might be an alternative solution. >>> >>> >> https://github.com/tianocore/tianocore.github.io/wiki/L >> aszlo's-unkempt >>> -git-guide-for-edk2-contributors-and- >> maintainers#contrib-09 >>> >>> We could actually add these setting in the git tree >> in a >>> .gitattributes file, right? Has this been suggested? >> I think Laszlo >>> documented this before we had converted edk2 to git. >> >> This is a complex question. >> >> There are files which are binary for all intents and >> purposes, so you never want to see a "textualized" >> version of them. Git can attempt recognizing these >> files, but it can take hints too. >> >> For these files, once they are safely classified as >> "binary", it is a separate question whether you want a >> changeset shown (or >> email-formatted) for these files to contain base64-like >> binary patches, or just a statement that "binary files >> differ". >> >> Then there are files that are binary, but can be >> "textualized" for the purposes of diffing and even >> merging. Automated UCS2<->UTF8 conversion is an >> example, and I used that while edk2 still had UCS2- >> encoded UNI files. >> >> Some examples (not claiming this is going to be an >> exhaustive list -- for that, see gitattributes(5)), to >> be placed in .git/info/attributes (local clone only) or >> .gitattributes (tracked in the repo): >> >> *.efi -diff >> >> This classifies *.efi as binary (turns off the >> heuristics). >> >> Consider gitattributes(5) -- excerpt: >> >> diff >> Unset >> A path to which the diff attribute is >> unset will >> generate Binary files differ (or a >> binary patch, if >> binary patches are enabled). >> Unspecified >> A path to which the diff attribute is >> unspecified first >> gets its contents inspected, and if it >> looks like text >> and is smaller than >> core.bigFileThreshold, it is >> treated as text. Otherwise it would >> generate Binary >> files differ. >> >> git-show will only report "binary files differ", unless >> we pass --binary to it; in which case it will offer a >> base64-like binary patch. >> git-format-patch will produce that binary patch by >> default, unless we pass --no-binary to it. >> >> >> Another example, in .git/info/attributes or >> .gitattributes: >> >> *.uni diff=uni merge=uni >> >> The manual says: >> >> diff >> String >> Diff is shown using the specified diff >> driver. Each >> driver may specify one or more options, >> as described in >> the following section. The options for >> the diff driver >> "foo" are defined by the configuration >> variables in the >> "diff.foo" section of the Git config >> file. >> >> merge >> String >> 3-way merge is performed using the >> specified custom >> merge driver. The built-in 3-way merge >> driver can be >> explicitly specified by asking for >> "text" driver; the >> built-in "take the current branch" >> driver can be >> requested with "binary". >> >> Accordingly, from .git/config: >> >> [diff "uni"] >> textconv = iconv -f UCS-2 -t UTF-8 >> binary = true >> >> [merge "uni"] >> name = merge driver for UCS-2 encoded text files >> driver = uni-merge %O %A %B >> >> And my uni-merge script: >> >> --------- >> #!/bin/bash >> >> # Merge UCS-2 encoded text files. >> # This is a git-merge driver. It will overwrite >> "$MINE". >> # >> # Exit status: >> # 0 -- no conflicts >> # 1 -- conflicts >> # 2 -- trouble >> >> OLD="$1" >> MINE="$2" >> YOURS="$3" >> >> # Set safe defaults for the trap handler. >> OLD_UTF8= >> MY_UTF8= >> YOUR_UTF8= >> NEW_UTF8= >> EX=2 >> >> trap 'rm -f -- "$OLD_UTF8" "$MY_UTF8" "$YOUR_UTF8" >> "$NEW_UTF8"; exit $EX' EXIT set -e -u -C >> >> OLD_UTF8=$(mktemp) >> iconv --from UCS-2 --to UTF-8 <"$OLD" >>"$OLD_UTF8" >> >> MY_UTF8=$(mktemp) >> iconv --from UCS-2 --to UTF-8 <"$MINE" >>"$MY_UTF8" >> >> YOUR_UTF8=$(mktemp) >> iconv --from UCS-2 --to UTF-8 <"$YOURS" >>"$YOUR_UTF8" >> >> NEW_UTF8=$(mktemp) >> set +e >> diff3 --merge --label=mine --label=old --label=yours -- >> \ >> "$MY_UTF8" "$OLD_UTF8" "$YOUR_UTF8" >>"$NEW_UTF8" >> DIFFRET=$? >> set -e >> >> if [ 0 -ne "$DIFFRET" ] && [ 1 -ne "$DIFFRET" ]; then >> exit >> fi >> >> iconv --from UTF-8 --to UCS-2 <"$NEW_UTF8" >|"$MINE" >> EX=$DIFFRET >> --------- >> >> My understanding is that the "--no-textconv" option >> would undo the "textconv" that I set under [diff >> "uni"]. >> >> In other words, it tells git that, even if the >> heuristics suggested >> *some* conversion to text (I can't say what that would >> be, in the general case), git should still treat the >> file as binary. (At which point the --no-binary / -- >> binary options decide whether we get a base64-like >> encoding, or just a message saying "binary files >> differ"). >> Therefore I think (= agree with Jordan) that the effect >> is the same as marking the affected file types in >> .gitattributes as "-diff". >> >> I guess the relevant question is scope. "--no-textconv" >> doesn't require us to assign "-diff" to specific file >> suffixes, such as *.pdf, *.efi, and so on. But, it also >> doesn't allow us to specify *some* text conversions, >> even if we wanted to (it would override & disable them, >> such as my "iconv" conversion when diffing UCS2 encoded >> UNI files). >> >> Adding a .gitattributes file to the root of the >> repository, and updating it whenever necessary (for >> example when we add the first .pdf file, the first .efi >> file...), is likely the best. >> >> ( >> >> My recommendation at >> > Laszlo's-unkempt-git-guide-for-edk2-contributors-and- >> maintainers#contrib-09> >> may have targeted the local ".git/info/attributes" file >> because people can have quite different opinions on how >> git should act on various files. In theory, someone >> having simplistic PDF files might even want: >> >> In the attributes: >> >> *.pdf diff=pdf >> >> In the config: >> >> [diff "pdf"] >> textconv = sh -c 'pdftotext -- "$1" -' pdftotext- >> wrapper >> binary = true >> >> ("pdftotext" converts PDF to text, but its command line >> is awkward -- the hyphen marking stdout must be >> explicit, and must come as second operand. Hence the >> wrapping into "sh" -- and then we need to spell out >> $0 as well, for which I picked "pdftotext-wrapper". The >> "--" separator is a safety measure against filenames >> starting with "-", which could be mis-interpreted as >> options. "--" terminates the option list and starts the >> operand list.) >> >> ) >> >> Thanks >> Laszlo >> >> >>> On 2019-08-01 17:13:14, Michael D Kinney wrote: >>>> https://bugzilla.tianocore.org/show_bug.cgi?id=2044 >>>> >>>> 'git show' is used to extrat the patch contents for >> analysis. >>>> Add the flag '--no-textconv' to the 'git show' >> command to disable the >>>> conversion from some binary file types to text >> content. >>>> >>>> Without this change, binary files such as .pdf files >> are converted to >>>> text in the show command and PatchCheck complains >> that the wrong line >>>> endings are used in the patch. >>>> >>>> Cc: Bob Feng >>>> Cc: Liming Gao >>>> Cc: Jordan Justen >>>> Signed-off-by: Michael D Kinney >> >>>> --- >>>> BaseTools/Scripts/PatchCheck.py | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/BaseTools/Scripts/PatchCheck.py >>>> b/BaseTools/Scripts/PatchCheck.py index >> 6b07241bfe..2a4e6f603e 100755 >>>> --- a/BaseTools/Scripts/PatchCheck.py >>>> +++ b/BaseTools/Scripts/PatchCheck.py >>>> @@ -543,7 +543,7 @@ class CheckGitCommits: >>>> >>>> def read_patch_from_git(self, commit): >>>> # Run git to get the commit patch >>>> - return self.run_git('show', '-- >> pretty=email', commit) >>>> + return self.run_git('show', '-- >> pretty=email', >>>> + '--no-textconv', commit) >>>> >>>> def run_git(self, *args): >>>> cmd = [ 'git' ] >>>> -- >>>> 2.21.0.windows.1 >>>> >> >> >> >