public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: "Laszlo Ersek" <lersek@redhat.com>
To: "Kinney, Michael D" <michael.d.kinney@intel.com>,
	"devel@edk2.groups.io" <devel@edk2.groups.io>,
	"Justen, Jordan L" <jordan.l.justen@intel.com>
Cc: "Feng, Bob C" <bob.c.feng@intel.com>,
	"Gao, Liming" <liming.gao@intel.com>
Subject: Re: [edk2-devel] [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show'
Date: Wed, 7 Aug 2019 15:26:27 +0200	[thread overview]
Message-ID: <944f7f77-5500-aa76-824a-8e207e404ed0@redhat.com> (raw)
In-Reply-To: <E92EE9817A31E24EB0585FDF735412F5B9D83B8A@ORSMSX113.amr.corp.intel.com>

On 08/06/19 00:01, Kinney, Michael D wrote:
> Laszlo,
> 
> The context of this change is only to the PatchCheck.py tool.
> and how that tool uses git show.
> 
> I agree with the summary of very flexible capabilities in git
> to help developers review different types of files.  All of 
> those settings that were added to support UNI files in UTF-16
> file format were very valuable when we had to review the
> text changes to those binary files.  We should be using UTF-8
> now, and we can even extend PatchCheck.py to flag an error if
> a UNI file is in UTF-16 format.
> 
> I still prefer we make this change only to PatchCheck.py to
> prevent false positives and print lines of text that can 
> not be found in a developer's working directory.  I prefer
> this one time change to PatchCheck.py instead of having to
> update .gitattributes whenever the git show features are 
> extended to convert more binary files to text files.

More precisely, that would mean that you prefer this one time change to
having to update .gitattributes every time a new type of binary file is
introduced to edk2.

I can see merits in both approaches. From my side, I don't intend to
block this patch.

Acked-by: Laszlo Ersek <lersek@redhat.com>

Thanks
Laszlo

> My expectation is that EDK II Maintainers need to review 
> if a binary file is ok or not for a repo.  I would also be
> ok with adding general rules to PatchCheck.py to generate
> an error if a binary file is added/updated in one of the
> text only repos (edk2, edk2-platforms) and not generate
> an error if a binary file is added/updated in a repo that
> supports binaries (edk-non-osi).
> 
> Best regards,
> 
> Mike
> 
> 
>> -----Original Message-----
>> From: devel@edk2.groups.io
>> [mailto:devel@edk2.groups.io] On Behalf Of Laszlo Ersek
>> Sent: Friday, August 2, 2019 4:58 PM
>> To: Justen, Jordan L <jordan.l.justen@intel.com>;
>> Kinney, Michael D <michael.d.kinney@intel.com>;
>> devel@edk2.groups.io
>> Cc: Feng, Bob C <bob.c.feng@intel.com>; Gao, Liming
>> <liming.gao@intel.com>
>> Subject: Re: [edk2-devel] [Patch 3/3]
>> BaseTools/PatchCheck: Disable text conversion in 'git
>> show'
>>
>> On 08/02/19 20:29, Jordan Justen wrote:
>>> First, I hope we don't add lots of large .pdf files
>> into the source
>>> tree. I see two duplicated > 200k .pdf files in edk2,
>> which seems like
>>> a waste of space in the edk2 tree.
>>>
>>> BaseTools/Source/C/BrotliCompress/docs/brotli-
>> comparison-study-2015-09
>>> -22.pdf
>>>
>> MdeModulePkg/Library/BrotliCustomDecompressLib/docs/bro
>> tli-comparison-
>>> study-2015-09-22.pdf
>>>
>>> Second, I'm not too sure about this change. It seems
>> like it might
>>> have unintended consequences.
>>>
>>> One thought I had is that it might break UTF-16
>> unicode files diffs,
>>> but luckily I think we've pretty much gotten rid of
>> UTF-16 files at
>>> this point.
>>>
>>> I also wonder if adding a .pdf attribute like Laszlo
>> recommends for
>>> .efi might be an alternative solution.
>>>
>>>
>> https://github.com/tianocore/tianocore.github.io/wiki/L
>> aszlo's-unkempt
>>> -git-guide-for-edk2-contributors-and-
>> maintainers#contrib-09
>>>
>>> We could actually add these setting in the git tree
>> in a
>>> .gitattributes file, right? Has this been suggested?
>> I think Laszlo
>>> documented this before we had converted edk2 to git.
>>
>> This is a complex question.
>>
>> There are files which are binary for all intents and
>> purposes, so you never want to see a "textualized"
>> version of them. Git can attempt recognizing these
>> files, but it can take hints too.
>>
>> For these files, once they are safely classified as
>> "binary", it is a separate question whether you want a
>> changeset shown (or
>> email-formatted) for these files to contain base64-like
>> binary patches, or just a statement that "binary files
>> differ".
>>
>> Then there are files that are binary, but can be
>> "textualized" for the purposes of diffing and even
>> merging. Automated UCS2<->UTF8 conversion is an
>> example, and I used that while edk2 still had UCS2-
>> encoded UNI files.
>>
>> Some examples (not claiming this is going to be an
>> exhaustive list -- for that, see gitattributes(5)), to
>> be placed in .git/info/attributes (local clone only) or
>> .gitattributes (tracked in the repo):
>>
>> *.efi -diff
>>
>> This classifies *.efi as binary (turns off the
>> heuristics).
>>
>> Consider gitattributes(5) -- excerpt:
>>
>>        diff
>>            Unset
>>                A path to which the diff attribute is
>> unset will
>>                generate Binary files differ (or a
>> binary patch, if
>>                binary patches are enabled).
>>            Unspecified
>>                A path to which the diff attribute is
>> unspecified first
>>                gets its contents inspected, and if it
>> looks like text
>>                and is smaller than
>> core.bigFileThreshold, it is
>>                treated as text. Otherwise it would
>> generate Binary
>>                files differ.
>>
>> git-show will only report "binary files differ", unless
>> we pass --binary to it; in which case it will offer a
>> base64-like binary patch.
>> git-format-patch will produce that binary patch by
>> default, unless we pass --no-binary to it.
>>
>>
>> Another example, in .git/info/attributes or
>> .gitattributes:
>>
>> *.uni diff=uni merge=uni
>>
>> The manual says:
>>
>>        diff
>>            String
>>                Diff is shown using the specified diff
>> driver. Each
>>                driver may specify one or more options,
>> as described in
>>                the following section. The options for
>> the diff driver
>>                "foo" are defined by the configuration
>> variables in the
>>                "diff.foo" section of the Git config
>> file.
>>
>>        merge
>>            String
>>                3-way merge is performed using the
>> specified custom
>>                merge driver. The built-in 3-way merge
>> driver can be
>>                explicitly specified by asking for
>> "text" driver; the
>>                built-in "take the current branch"
>> driver can be
>>                requested with "binary".
>>
>> Accordingly, from .git/config:
>>
>> [diff "uni"]
>>       textconv = iconv -f UCS-2 -t UTF-8
>>       binary = true
>>
>> [merge "uni"]
>>       name = merge driver for UCS-2 encoded text files
>>       driver = uni-merge %O %A %B
>>
>> And my uni-merge script:
>>
>> ---------
>> #!/bin/bash
>>
>> # Merge UCS-2 encoded text files.
>> # This is a git-merge driver. It will overwrite
>> "$MINE".
>> #
>> # Exit status:
>> # 0 -- no conflicts
>> # 1 -- conflicts
>> # 2 -- trouble
>>
>> OLD="$1"
>> MINE="$2"
>> YOURS="$3"
>>
>> # Set safe defaults for the trap handler.
>> OLD_UTF8=
>> MY_UTF8=
>> YOUR_UTF8=
>> NEW_UTF8=
>> EX=2
>>
>> trap 'rm -f -- "$OLD_UTF8" "$MY_UTF8" "$YOUR_UTF8"
>> "$NEW_UTF8"; exit $EX' EXIT set -e -u -C
>>
>> OLD_UTF8=$(mktemp)
>> iconv --from UCS-2 --to UTF-8 <"$OLD" >>"$OLD_UTF8"
>>
>> MY_UTF8=$(mktemp)
>> iconv --from UCS-2 --to UTF-8 <"$MINE" >>"$MY_UTF8"
>>
>> YOUR_UTF8=$(mktemp)
>> iconv --from UCS-2 --to UTF-8 <"$YOURS" >>"$YOUR_UTF8"
>>
>> NEW_UTF8=$(mktemp)
>> set +e
>> diff3 --merge --label=mine --label=old --label=yours --
>> \
>>     "$MY_UTF8" "$OLD_UTF8" "$YOUR_UTF8" >>"$NEW_UTF8"
>> DIFFRET=$?
>> set -e
>>
>> if [ 0 -ne "$DIFFRET" ] && [ 1 -ne "$DIFFRET" ]; then
>>   exit
>> fi
>>
>> iconv --from UTF-8 --to UCS-2 <"$NEW_UTF8" >|"$MINE"
>> EX=$DIFFRET
>> ---------
>>
>> My understanding is that the "--no-textconv" option
>> would undo the "textconv" that I set under [diff
>> "uni"].
>>
>> In other words, it tells git that, even if the
>> heuristics suggested
>> *some* conversion to text (I can't say what that would
>> be, in the general case), git should still treat the
>> file as binary. (At which point the --no-binary / --
>> binary options decide whether we get a base64-like
>> encoding, or just a message saying "binary files
>> differ").
>> Therefore I think (= agree with Jordan) that the effect
>> is the same as marking the affected file types in
>> .gitattributes as "-diff".
>>
>> I guess the relevant question is scope. "--no-textconv"
>> doesn't require us to assign "-diff" to specific file
>> suffixes, such as *.pdf, *.efi, and so on. But, it also
>> doesn't allow us to specify *some* text conversions,
>> even if we wanted to (it would override & disable them,
>> such as my "iconv" conversion when diffing UCS2 encoded
>> UNI files).
>>
>> Adding a .gitattributes file to the root of the
>> repository, and updating it whenever necessary (for
>> example when we add the first .pdf file, the first .efi
>> file...), is likely the best.
>>
>> (
>>
>> My recommendation at
>> <https://github.com/tianocore/tianocore.github.io/wiki/
>> Laszlo's-unkempt-git-guide-for-edk2-contributors-and-
>> maintainers#contrib-09>
>> may have targeted the local ".git/info/attributes" file
>> because people can have quite different opinions on how
>> git should act on various files. In theory, someone
>> having simplistic PDF files might even want:
>>
>> In the attributes:
>>
>> *.pdf diff=pdf
>>
>> In the config:
>>
>> [diff "pdf"]
>>   textconv = sh -c 'pdftotext -- "$1" -' pdftotext-
>> wrapper
>>   binary = true
>>
>> ("pdftotext" converts PDF to text, but its command line
>> is awkward -- the hyphen marking stdout must be
>> explicit, and must come as second operand. Hence the
>> wrapping into "sh" -- and then we need to spell out
>> $0 as well, for which I picked "pdftotext-wrapper". The
>> "--" separator is a safety measure against filenames
>> starting with "-", which could be mis-interpreted as
>> options. "--" terminates the option list and starts the
>> operand list.)
>>
>> )
>>
>> Thanks
>> Laszlo
>>
>>
>>> On 2019-08-01 17:13:14, Michael D Kinney wrote:
>>>> https://bugzilla.tianocore.org/show_bug.cgi?id=2044
>>>>
>>>> 'git show' is used to extrat the patch contents for
>> analysis.
>>>> Add the flag '--no-textconv' to the 'git show'
>> command to disable the
>>>> conversion from some binary file types to text
>> content.
>>>>
>>>> Without this change, binary files such as .pdf files
>> are converted to
>>>> text in the show command and PatchCheck complains
>> that the wrong line
>>>> endings are used in the patch.
>>>>
>>>> Cc: Bob Feng <bob.c.feng@intel.com>
>>>> Cc: Liming Gao <liming.gao@intel.com>
>>>> Cc: Jordan Justen <jordan.l.justen@intel.com>
>>>> Signed-off-by: Michael D Kinney
>> <michael.d.kinney@intel.com>
>>>> ---
>>>>  BaseTools/Scripts/PatchCheck.py | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/BaseTools/Scripts/PatchCheck.py
>>>> b/BaseTools/Scripts/PatchCheck.py index
>> 6b07241bfe..2a4e6f603e 100755
>>>> --- a/BaseTools/Scripts/PatchCheck.py
>>>> +++ b/BaseTools/Scripts/PatchCheck.py
>>>> @@ -543,7 +543,7 @@ class CheckGitCommits:
>>>>
>>>>      def read_patch_from_git(self, commit):
>>>>          # Run git to get the commit patch
>>>> -        return self.run_git('show', '--
>> pretty=email', commit)
>>>> +        return self.run_git('show', '--
>> pretty=email',
>>>> + '--no-textconv', commit)
>>>>
>>>>      def run_git(self, *args):
>>>>          cmd = [ 'git' ]
>>>> --
>>>> 2.21.0.windows.1
>>>>
>>
>>
>> 
> 


  reply	other threads:[~2019-08-07 13:26 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-02  0:13 [Patch 0/3] BaseTools/PatchCheck: Fix false positives Michael D Kinney
2019-08-02  0:13 ` [Patch 1/3] BaseTools/PatchCheck: Ignore blank lines in diff Michael D Kinney
2019-08-02  0:13 ` [Patch 2/3] BaseTools/PatchCheck: Add copy from/to keywords Michael D Kinney
2019-08-02  0:13 ` [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show' Michael D Kinney
2019-08-02 18:29   ` Jordan Justen
2019-08-02 18:51     ` Michael D Kinney
2019-08-02 23:57     ` Laszlo Ersek
2019-08-05 22:01       ` [edk2-devel] " Michael D Kinney
2019-08-07 13:26         ` Laszlo Ersek [this message]
2019-08-07 19:42         ` Jordan Justen
2019-08-08 20:57           ` Laszlo Ersek
2019-08-08 21:08             ` Michael D Kinney
2019-08-08 21:15               ` Andrew Fish
2019-08-08 22:34                 ` Laszlo Ersek
2019-08-02  7:59 ` [edk2-devel] [Patch 0/3] BaseTools/PatchCheck: Fix false positives Bob Feng
2019-08-02 18:30 ` Jordan Justen
2019-08-07 11:41 ` Liming Gao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=944f7f77-5500-aa76-824a-8e207e404ed0@redhat.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox