public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: "Laszlo Ersek" <lersek@redhat.com>
To: Jordan Justen <jordan.l.justen@intel.com>,
	Michael D Kinney <michael.d.kinney@intel.com>,
	devel@edk2.groups.io
Cc: Bob Feng <bob.c.feng@intel.com>, Liming Gao <liming.gao@intel.com>
Subject: Re: [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show'
Date: Sat, 3 Aug 2019 01:57:45 +0200	[thread overview]
Message-ID: <8d511f89-f2a8-4d29-65e3-cc76fee12017@redhat.com> (raw)
In-Reply-To: <156477054101.20290.17514635538979393303@jljusten-skl>

On 08/02/19 20:29, Jordan Justen wrote:
> First, I hope we don't add lots of large .pdf files into the source
> tree. I see two duplicated > 200k .pdf files in edk2, which seems like
> a waste of space in the edk2 tree.
>
> BaseTools/Source/C/BrotliCompress/docs/brotli-comparison-study-2015-09-22.pdf
> MdeModulePkg/Library/BrotliCustomDecompressLib/docs/brotli-comparison-study-2015-09-22.pdf
>
> Second, I'm not too sure about this change. It seems like it might
> have unintended consequences.
>
> One thought I had is that it might break UTF-16 unicode files diffs,
> but luckily I think we've pretty much gotten rid of UTF-16 files at
> this point.
>
> I also wonder if adding a .pdf attribute like Laszlo recommends for
> .efi might be an alternative solution.
>
> https://github.com/tianocore/tianocore.github.io/wiki/Laszlo's-unkempt-git-guide-for-edk2-contributors-and-maintainers#contrib-09
>
> We could actually add these setting in the git tree in a
> .gitattributes file, right? Has this been suggested? I think Laszlo
> documented this before we had converted edk2 to git.

This is a complex question.

There are files which are binary for all intents and purposes, so you
never want to see a "textualized" version of them. Git can attempt
recognizing these files, but it can take hints too.

For these files, once they are safely classified as "binary", it is a
separate question whether you want a changeset shown (or
email-formatted) for these files to contain base64-like binary patches,
or just a statement that "binary files differ".

Then there are files that are binary, but can be "textualized" for the
purposes of diffing and even merging. Automated UCS2<->UTF8 conversion
is an example, and I used that while edk2 still had UCS2-encoded UNI
files.

Some examples (not claiming this is going to be an exhaustive list --
for that, see gitattributes(5)), to be placed in .git/info/attributes
(local clone only) or .gitattributes (tracked in the repo):

*.efi -diff

This classifies *.efi as binary (turns off the heuristics).

Consider gitattributes(5) -- excerpt:

       diff
           Unset
               A path to which the diff attribute is unset will
               generate Binary files differ (or a binary patch, if
               binary patches are enabled).
           Unspecified
               A path to which the diff attribute is unspecified first
               gets its contents inspected, and if it looks like text
               and is smaller than core.bigFileThreshold, it is
               treated as text. Otherwise it would generate Binary
               files differ.

git-show will only report "binary files differ", unless we pass --binary
to it; in which case it will offer a base64-like binary patch.
git-format-patch will produce that binary patch by default, unless we
pass --no-binary to it.


Another example, in .git/info/attributes or .gitattributes:

*.uni diff=uni merge=uni

The manual says:

       diff
           String
               Diff is shown using the specified diff driver. Each
               driver may specify one or more options, as described in
               the following section. The options for the diff driver
               "foo" are defined by the configuration variables in the
               "diff.foo" section of the Git config file.

       merge
           String
               3-way merge is performed using the specified custom
               merge driver. The built-in 3-way merge driver can be
               explicitly specified by asking for "text" driver; the
               built-in "take the current branch" driver can be
               requested with "binary".

Accordingly, from .git/config:

[diff "uni"]
      textconv = iconv -f UCS-2 -t UTF-8
      binary = true

[merge "uni"]
      name = merge driver for UCS-2 encoded text files
      driver = uni-merge %O %A %B

And my uni-merge script:

---------
#!/bin/bash

# Merge UCS-2 encoded text files.
# This is a git-merge driver. It will overwrite "$MINE".
#
# Exit status:
# 0 -- no conflicts
# 1 -- conflicts
# 2 -- trouble

OLD="$1"
MINE="$2"
YOURS="$3"

# Set safe defaults for the trap handler.
OLD_UTF8=
MY_UTF8=
YOUR_UTF8=
NEW_UTF8=
EX=2

trap 'rm -f -- "$OLD_UTF8" "$MY_UTF8" "$YOUR_UTF8" "$NEW_UTF8"; exit $EX' EXIT
set -e -u -C

OLD_UTF8=$(mktemp)
iconv --from UCS-2 --to UTF-8 <"$OLD" >>"$OLD_UTF8"

MY_UTF8=$(mktemp)
iconv --from UCS-2 --to UTF-8 <"$MINE" >>"$MY_UTF8"

YOUR_UTF8=$(mktemp)
iconv --from UCS-2 --to UTF-8 <"$YOURS" >>"$YOUR_UTF8"

NEW_UTF8=$(mktemp)
set +e
diff3 --merge --label=mine --label=old --label=yours -- \
    "$MY_UTF8" "$OLD_UTF8" "$YOUR_UTF8" >>"$NEW_UTF8"
DIFFRET=$?
set -e

if [ 0 -ne "$DIFFRET" ] && [ 1 -ne "$DIFFRET" ]; then
  exit
fi

iconv --from UTF-8 --to UCS-2 <"$NEW_UTF8" >|"$MINE"
EX=$DIFFRET
---------

My understanding is that the "--no-textconv" option would undo the
"textconv" that I set under [diff "uni"].

In other words, it tells git that, even if the heuristics suggested
*some* conversion to text (I can't say what that would be, in the
general case), git should still treat the file as binary. (At which
point the --no-binary / --binary options decide whether we get a
base64-like encoding, or just a message saying "binary files differ").
Therefore I think (= agree with Jordan) that the effect is the same as
marking the affected file types in .gitattributes as "-diff".

I guess the relevant question is scope. "--no-textconv" doesn't require
us to assign "-diff" to specific file suffixes, such as *.pdf, *.efi,
and so on. But, it also doesn't allow us to specify *some* text
conversions, even if we wanted to (it would override & disable them,
such as my "iconv" conversion when diffing UCS2 encoded UNI files).

Adding a .gitattributes file to the root of the repository, and updating
it whenever necessary (for example when we add the first .pdf file, the
first .efi file...), is likely the best.

(

My recommendation at
<https://github.com/tianocore/tianocore.github.io/wiki/Laszlo's-unkempt-git-guide-for-edk2-contributors-and-maintainers#contrib-09>
may have targeted the local ".git/info/attributes" file because people
can have quite different opinions on how git should act on various
files. In theory, someone having simplistic PDF files might even want:

In the attributes:

*.pdf diff=pdf

In the config:

[diff "pdf"]
  textconv = sh -c 'pdftotext -- "$1" -' pdftotext-wrapper
  binary = true

("pdftotext" converts PDF to text, but its command line is awkward --
the hyphen marking stdout must be explicit, and must come as second
operand. Hence the wrapping into "sh" -- and then we need to spell out
$0 as well, for which I picked "pdftotext-wrapper". The "--" separator
is a safety measure against filenames starting with "-", which could be
mis-interpreted as options. "--" terminates the option list and starts
the operand list.)

)

Thanks
Laszlo


> On 2019-08-01 17:13:14, Michael D Kinney wrote:
>> https://bugzilla.tianocore.org/show_bug.cgi?id=2044
>>
>> 'git show' is used to extrat the patch contents for analysis.
>> Add the flag '--no-textconv' to the 'git show' command to
>> disable the conversion from some binary file types to text
>> content.
>>
>> Without this change, binary files such as .pdf files are
>> converted to text in the show command and PatchCheck complains
>> that the wrong line endings are used in the patch.
>>
>> Cc: Bob Feng <bob.c.feng@intel.com>
>> Cc: Liming Gao <liming.gao@intel.com>
>> Cc: Jordan Justen <jordan.l.justen@intel.com>
>> Signed-off-by: Michael D Kinney <michael.d.kinney@intel.com>
>> ---
>>  BaseTools/Scripts/PatchCheck.py | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/BaseTools/Scripts/PatchCheck.py b/BaseTools/Scripts/PatchCheck.py
>> index 6b07241bfe..2a4e6f603e 100755
>> --- a/BaseTools/Scripts/PatchCheck.py
>> +++ b/BaseTools/Scripts/PatchCheck.py
>> @@ -543,7 +543,7 @@ class CheckGitCommits:
>>
>>      def read_patch_from_git(self, commit):
>>          # Run git to get the commit patch
>> -        return self.run_git('show', '--pretty=email', commit)
>> +        return self.run_git('show', '--pretty=email', '--no-textconv', commit)
>>
>>      def run_git(self, *args):
>>          cmd = [ 'git' ]
>> --
>> 2.21.0.windows.1
>>


  parent reply	other threads:[~2019-08-02 23:57 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-02  0:13 [Patch 0/3] BaseTools/PatchCheck: Fix false positives Michael D Kinney
2019-08-02  0:13 ` [Patch 1/3] BaseTools/PatchCheck: Ignore blank lines in diff Michael D Kinney
2019-08-02  0:13 ` [Patch 2/3] BaseTools/PatchCheck: Add copy from/to keywords Michael D Kinney
2019-08-02  0:13 ` [Patch 3/3] BaseTools/PatchCheck: Disable text conversion in 'git show' Michael D Kinney
2019-08-02 18:29   ` Jordan Justen
2019-08-02 18:51     ` Michael D Kinney
2019-08-02 23:57     ` Laszlo Ersek [this message]
2019-08-05 22:01       ` [edk2-devel] " Michael D Kinney
2019-08-07 13:26         ` Laszlo Ersek
2019-08-07 19:42         ` Jordan Justen
2019-08-08 20:57           ` Laszlo Ersek
2019-08-08 21:08             ` Michael D Kinney
2019-08-08 21:15               ` Andrew Fish
2019-08-08 22:34                 ` Laszlo Ersek
2019-08-02  7:59 ` [edk2-devel] [Patch 0/3] BaseTools/PatchCheck: Fix false positives Bob Feng
2019-08-02 18:30 ` Jordan Justen
2019-08-07 11:41 ` Liming Gao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8d511f89-f2a8-4d29-65e3-cc76fee12017@redhat.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox