public inbox for devel@edk2.groups.io
From: "Kinney, Michael D" <michael.d.kinney@intel.com>
To: Tim Lewis <tim.lewis@insyde.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>,
	"Kinney, Michael D" <michael.d.kinney@intel.com>
Cc: "Carsey, Jaben" <jaben.carsey@intel.com>,
	"Shaw, Kevin W" <kevin.w.shaw@intel.com>
Subject: Re: [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Date: Fri, 28 Apr 2017 17:22:54 +0000
Message-ID: <E92EE9817A31E24EB0585FDF735412F57D16E0F5@ORSMSX113.amr.corp.intel.com>
In-Reply-To: <7236196A5DF6C040855A6D96F556A53F5773D0@msmail.insydesw.com.tw>

Tim,

Thanks for the additional review on this topic.

I will push the UNI spec update.

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Friday, April 28, 2017 9:48 AM
> To: Tim Lewis <tim.lewis@insyde.com>; Kinney, Michael D <michael.d.kinney@intel.com>;
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Mike --
> 
> After an internal review, we have found that there are fewer files than previously
> thought affected by this change.
> 
> So we have no objections to updating the UNI Spec to match the current EDK2 tool
> behavior.
> 
> Thanks,
> 
> Tim
> 
> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Tim Lewis
> Sent: Wednesday, April 26, 2017 5:27 PM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: Re: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Mike --
> 
> No, the meta-data (in this case, file extension .uni) was used by tools to determine
> the format of the file contents, as described in section 2.6. Little-endian, UCS-2 was
> assumed.
> 
> "When a higher-level protocol supplies mechanisms for handling the endianness of
> integral data types, it is not necessary to use Unicode encoding schemes or the byte
> order mark. In those cases Unicode text is simply a sequence of integral data types."
> 
> Of course, the tools had to be updated to accommodate different build systems, and
> even alternate encodings. But this doesn't remove the previous behavior.
> 
> Tim
> 
> 
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 5:02 PM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D
> <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8
> without a BOM
> 
> Hi Tim,
> 
> For UTF-16 files on disk with no BOM, do you follow the big-endian assumption as
> documented in the Unicode Specification Section 3.10, D98?
> 
> http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 4:13 PM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be UTF-8 without a BOM
> >
> > Mike --
> >
> > I would prefer to update the docs to match actual industry practice.
> > EDK2 is not the universe.
> >
> > Insyde has been using UNI files well before my time here (> 5 years).
> > The fact that recent specifications or EDK2 tools (2 years) added BOM
> > support does not remove the backward compatibility issue.
> >
> > The Unicode specification usage of "not recommended" is referring
> > specifically to its usage for byte-order. The full sentence (from 2.6)
> > is: "Use of a BOM is neither required nor recommended [for byte order
> > determination] for UTF-8, but may be encountered in contexts where
> > UTF-8 data is converted from other encoding forms that use a BOM or
> > where the BOM is used as a UTF-8 signature" Editorial comment mine. In this
> > case, the BOM marker would appear as a UTF-8 signature.
> > This would distinguish it from ASCII or any of the multi-byte encoding
> > schemes used.
> >
> > Tim
> >
> > -----Original Message-----
> > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > Sent: Wednesday, April 26, 2017 3:47 PM
> > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney,
> > Michael D <michael.d.kinney@intel.com>
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > disk to be UTF-8 without a BOM
> >
> > Hi Tim,
> >
> > The recommendation for UTF-8 usage is to not use a BOM, which is why
> > no BOM for UTF-8 was selected for EDK II.
> >
> > The current task is to update docs to match the current tool behavior.
> >
> > The EDK II repos on GitHub have .uni files in UTF-8 format without a
> > BOM to support easier patch review.
> >
> > There are ways to use GIT features to auto-convert .uni files when
> > pulling content from EDK II repos and pushing commits.
> > That may or may not help with the specific issue you are raising.
> >
> > If you have ideas on a tool change request to EDK II that would
> > provide compatibility with current EDK II tool behavior and support
> > UTF-16LE without a BOM, then let's work that through in a Bugzilla
> > feature request.  If we find a solution, we can update the docs and tools again.
> >
> > Do you have any objections to updating the UNI Spec to match the
> > current tool behavior?
> >
> > Thanks,
> >
> > Mike
> >
> > > -----Original Message-----
> > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > Sent: Wednesday, April 26, 2017 11:54 AM
> > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > edk2-devel@lists.01.org
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > on disk to be UTF-8 without a BOM
> > >
> > > Mike --
> > >
> > > This is not about files in the EDK II repository. This is about
> > > files created based on the spec, and created with other sets of
> > > tools. Go back to early 2015, to the Build spec (1.22, etc.),
> > > Appendix G, which is where the UNI stuff used to live.
> > >
> > > The point is: files which worked before, and, at worst, generated a
> > > warning before, now are interpreted incorrectly even though they
> > > have correct data.
> > >
> > > Making ASCII (or UTF-8) the default without a BOM is the breaking change.
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > > Sent: Wednesday, April 26, 2017 11:47 AM
> > > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org;
> > > Kinney, Michael D <michael.d.kinney@intel.com>
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > on disk to be UTF-8 without a BOM
> > >
> > > Tim,
> > >
> > > If you look at the entire file history of the EDK II, you will see
> > > that the BOM has always been present in the UTF-16LE formatted files.
> > >
> > > The build tools were updated in 2015 to *add* support for UTF-8 files.
> > > The .uni files in the EDK II project were then converted from
> > > UTF-16LE with a BOM to UTF-8 without a BOM.  This provided an easier
> > > developer experience when using GIT to do email patch review of .uni files.
> > >
> > > It is possible I am missing something here.  Can you please provide
> > > a pointer to the EDK II commit(s) where BOMs were added to UTF-16LE .uni files?
> > >
> > > Thanks,
> > >
> > > Mike
> > >
> > > > -----Original Message-----
> > > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > > Sent: Wednesday, April 26, 2017 11:34 AM
> > > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > > edk2-devel@lists.01.org
> > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > on disk to be UTF-8 without a BOM
> > > >
> > > > Mike --
> > > >
> > > > I understand that EDK2 decided to add BOM markers two years ago.
> > > > Adding a BOM didn't change the default. The problem is (a) there
> > > > are still hundreds of files extant in our codebase which were
> > > > created prior to the 2015 changes and are still in use, and (b) this
> > > > change is not backward compatible for these files.
> > > >
> > > > Tim
> > > >
> > > > -----Original Message-----
> > > > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > > > Sent: Wednesday, April 26, 2017 11:11 AM
> > > > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org;
> > > > Kinney, Michael D <michael.d.kinney@intel.com>
> > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > <kevin.w.shaw@intel.com>
> > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > on disk to be UTF-8 without a BOM
> > > >
> > > > Hi Tim,
> > > >
> > > > This is not a request for a new change.  Instead, the intent of
> > > > this document change is to update the document to reflect the
> > > > implemented behavior of the EDK II tools.  The EDK II tool updates
> > > > to add UTF-8 file support were completed with the patches listed
> > > > below.  Notice that the main one for normal build support was checked in
> > > > almost 2 years ago.
> > > >
> > > > BaseTools - UniClassObject - 6/23/2015
> > > > * https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
> > > > * https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda
> > > >
> > > > BaseTools - ECC - 12/29/2015
> > > > * https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6
> > > >
> > > > BaseTools - UPT - 4/25/2016
> > > > * https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af
> > > > This was intended to be a 100% backwards compatible change.
> > > >
> > > > All .uni files in the EDK II project in UTF-16LE format have
> > > > always used a BOM.
> > > > Please check out UDK2015 or older UDKs and you will see all .uni
> > > > files start with 0xff 0xfe.
> > > >
> > > > Thanks,
> > > >
> > > > Mike
> > > >
> > > > > -----Original Message-----
> > > > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > > > Sent: Wednesday, April 26, 2017 9:15 AM
> > > > > To: Kinney, Michael D <michael.d.kinney@intel.com>;
> > > > > edk2-devel@lists.01.org
> > > > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W
> > > > > <kevin.w.shaw@intel.com>
> > > > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni
> > > > > files on disk to be UTF-8 without a BOM
> > > > >
> > > > > Mike --
> > > > >
> > > > > This breaks our existing build tools, which assume that a file
> > > > > without a BOM is UTF-16.
> > > > >
> > > > > Tim
> > > > >
> > > > > -----Original Message-----
> > > > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On
> > > > > Behalf Of Michael Kinney
> > > > > Sent: Tuesday, April 25, 2017 6:07 PM
> > > > > To: edk2-devel@lists.01.org
> > > > > Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw
> > > > > <kevin.w.shaw@intel.com>
> > > > > Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > > > on disk to be UTF-8 without a BOM
> > > > >
> > > > > https://bugzilla.tianocore.org/show_bug.cgi?id=507
> > > > >
> > > > > Cc: Jaben Carsey <jaben.carsey@intel.com>
> > > > > Cc: Yonghong Zhu <yonghong.zhu@intel.com>
> > > > > Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
> > > > > Contributed-under: TianoCore Contribution Agreement 1.1
> > > > > Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
> > > > > ---
> > > > >  2_unicode_strings_file_format.md |  9 ++++++---
> > > > >  README.md                        | 27 ++++++++++++++-------------
> > > > >  2 files changed, 20 insertions(+), 16 deletions(-)
> > > > >
> > > > > diff --git a/2_unicode_strings_file_format.md b/2_unicode_strings_file_format.md
> > > > > index 0150c85..7a4a019 100644
> > > > > --- a/2_unicode_strings_file_format.md
> > > > > +++ b/2_unicode_strings_file_format.md
> > > > > @@ -33,7 +33,8 @@
> > > > >
> > > > >  EDK II Unicode files are used for mapping token names to localized strings that
> > > > >  are identified by an RFC4646 language code. The format for storing EDK II
> > > > > -Unicode files is UTF-16LE. The character content must be UCS-2.
> > > > > +Unicode files on disk is UTF-8 (without a BOM character) or
> > > > > +UTF-16LE (with a BOM character). The character content must be UCS-2.
> > > > >
> > > > >  Strings ends are determined by the first of the following items found:
> > > > >
> > > > > @@ -44,11 +45,13 @@ Strings ends are determined by the first of the following items found:
> > > > >
> > > > >  Comments may appear anywhere within the string file.
> > > > >
> > > > > -All the files must begin with a Unicode BOM character.
> > > > > +All UTF-16LE files must begin with a Unicode BOM character.
> > > > > +All UTF-8 files must not begin with a Unicode BOM character.
> > > > >
> > > > >  **********
> > > > >  **NOTE:** Please make sure you select an editor that supports UCS-2 characters
> > > > > -that can be stored in a UTF-16LE file.
> > > > > +that can be stored in either a UTF-8 (without a BOM character)
> > > > > +or a UTF-16LE file (with a BOM character).
> > > > >  **********
> > > > >
> > > > >  ## 2.1 Common EBNF
> > > > > diff --git a/README.md b/README.md
> > > > > index 63842a1..015aef1 100644
> > > > > --- a/README.md
> > > > > +++ b/README.md
> > > > > @@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation. All rights reserved.
> > > > >
> > > > >  ### Revision History
> > > > >
> > > > > -| Revision          | Description                                                                              | Date            |
> > > > > -| ----------------- | ---------------------------------------------------------------------------------------- | --------------- |
> > > > > -| 1.0               | Initial Release.                                                                         | February 2014   |
> > > > > -| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                    | August 2014     |
> > > > > -|                   | Added content related to EDK II Meta-Data Unicode files.                                 |                 |
> > > > > -|                   | Restructured document.                                                                   |                 |
> > > > > -|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > > > -|                   | Removed invalid escape code sequences.                                                   |                 |
> > > > > -| 1.2               | Added optional font formatting                                                           | September 2014  |
> > > > > -| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                     | April 2015      |
> > > > > -| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                            | March 2016      |
> > > > > -|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                               |                 |
> > > > > -| 1.4               | Convert to GitBook format                                                                | March 2017      |
> > > > > +| Revision          | Description                                                                                                            | Date            |
> > > > > +| ----------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------- |
> > > > > +| 1.0               | Initial Release.                                                                                                       | February 2014   |
> > > > > +| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project.                                                  | August 2014     |
> > > > > +|                   | Added content related to EDK II Meta-Data Unicode files.                                                               |                 |
> > > > > +|                   | Restructured document.                                                                                                 |                 |
> > > > > +|                   | Removed security and C format GUID definitions, not required for HII or other UNI files.                               |                 |
> > > > > +|                   | Removed invalid escape code sequences.                                                                                 |                 |
> > > > > +| 1.2               | Added optional font formatting                                                                                         | September 2014  |
> > > > > +| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME`                                                                   | April 2015      |
> > > > > +| 1.3               | Added: Syntax for non-ascii characters inside quoted strings.                                                          | March 2016      |
> > > > > +|                   | Removed: Info on specific consumers (.INF & .DEC) removed.                                                             |                 |
> > > > > +| 1.4               | Convert to GitBook format                                                                                              | April 2017      |
> > > > > +|                   | [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507) UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
> > > > > --
> > > > > 2.6.3.windows.1
> > > > >
> > > > > _______________________________________________
> > > > > edk2-devel mailing list
> > > > > edk2-devel@lists.01.org
> > > > > https://lists.01.org/mailman/listinfo/edk2-devel
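
The encoding rule in the patch quoted above (UTF-16LE files must begin with a BOM, UTF-8 files must not) and the compatibility concern raised in the thread can be illustrated with a minimal Python sketch. This is not the BaseTools implementation; the function names `detect_uni_encoding` and `read_uni_text` are hypothetical, and rejecting a UTF-8 BOM outright is an assumption drawn from the patch wording rather than from observed tool behavior.

```python
import codecs


def detect_uni_encoding(data: bytes) -> str:
    """Classify raw .uni bytes using the rule stated in the patch (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xff 0xfe
        return "utf-16le-bom"
    if data.startswith(codecs.BOM_UTF8):  # bytes 0xef 0xbb 0xbf
        # Assumption: the patch says UTF-8 files must not begin with a BOM.
        raise ValueError("UTF-8 .uni files must not begin with a BOM")
    # No BOM: the content has to be valid UTF-8 (raises UnicodeDecodeError otherwise).
    data.decode("utf-8")
    return "utf-8"


def read_uni_text(data: bytes) -> str:
    """Decode .uni bytes and enforce the UCS-2 character-content restriction."""
    if detect_uni_encoding(data) == "utf-16le-bom":
        text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    else:
        text = data.decode("utf-8")
    for ch in text:
        if ord(ch) > 0xFFFF:  # outside the Basic Multilingual Plane, so not UCS-2
            raise ValueError(f"character U+{ord(ch):X} is not representable in UCS-2")
    return text


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF16_LE + "#langdef en-US".encode("utf-16-le")
    print(detect_uni_encoding(with_bom))
    print(detect_uni_encoding("#langdef en-US".encode("utf-8")))
    # A legacy UTF-16LE file without a BOM that holds only ASCII characters is
    # also valid UTF-8, so it is accepted but decoded with embedded NULs.
    print(repr(read_uni_text("abc".encode("utf-16-le"))))
```

Under these assumptions, a BOM-less UTF-16LE file containing only ASCII characters passes the UTF-8 check and decodes to text with embedded NUL characters, which is the silent misinterpretation described earlier in the thread.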

