public inbox for devel@edk2.groups.io
From: Tim Lewis <tim.lewis@insyde.com>
To: "Kinney, Michael D" <michael.d.kinney@intel.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Carsey, Jaben" <jaben.carsey@intel.com>,
	"Shaw, Kevin W" <kevin.w.shaw@intel.com>
Subject: Re: [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM
Date: Wed, 26 Apr 2017 23:13:06 +0000
Message-ID: <7236196A5DF6C040855A6D96F556A53F57683A@msmail.insydesw.com.tw>
In-Reply-To: <E92EE9817A31E24EB0585FDF735412F57D16D645@ORSMSX113.amr.corp.intel.com>

Mike --

I would prefer to update the docs to match actual industry practice. EDK2 is not the universe. 

Insyde has been using UNI files since well before my time here (> 5 years). The fact that recent specifications or EDK2 tools (2 years) added BOM support does not remove the backward compatibility issue.

The Unicode specification's use of "not recommended" refers specifically to using the BOM for byte-order determination. The full sentence (from 2.6) is: "Use of a BOM is neither required nor recommended [for byte order determination] for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." (Editorial comment mine.) In this case, the BOM would appear as a UTF-8 signature, which would distinguish the file from ASCII or any of the multi-byte encoding schemes in use.
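
A minimal sketch of how a tool could use a leading signature to tell the encodings apart, assuming Python; the names `sniff_uni_encoding`, `decode_uni`, and `bomless_default` are hypothetical and not taken from any existing build tool. The `bomless_default` parameter makes the point of contention explicit: the current EDK II tools treat a BOM-less file as UTF-8, while older tools treat it as UTF-16.

```python
import codecs


def sniff_uni_encoding(data: bytes, bomless_default: str = "utf-8") -> str:
    """Return a Python codec name for raw .uni bytes, based on a leading signature.

    bomless_default is an assumption of this sketch: it is the encoding
    chosen when no signature is present ("utf-8" or "utf-16-le").
    """
    if data.startswith(codecs.BOM_UTF8):      # EF BB BF: UTF-8 signature
        return "utf-8-sig"                    # codec that strips the signature
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE: UTF-16LE BOM
        return "utf-16"                       # codec that consumes the BOM
    return bomless_default


def decode_uni(data: bytes, bomless_default: str = "utf-8") -> str:
    """Decode raw .uni bytes to text using the sniffed encoding."""
    return data.decode(sniff_uni_encoding(data, bomless_default))
```

With a signature present, the result does not depend on the default at all; only BOM-less files are affected by which default a tool picks.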

Tim 

-----Original Message-----
From: Kinney, Michael D [mailto:michael.d.kinney@intel.com] 
Sent: Wednesday, April 26, 2017 3:47 PM
To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W <kevin.w.shaw@intel.com>
Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on disk to be UTF-8 without a BOM

Hi Tim,

The recommendation for UTF-8 usage is to not use a BOM, which is why no BOM for UTF-8 was selected for EDK II.

The current task is to update docs to match the current tool behavior.

The EDK II repos on GitHub have .uni files in UTF-8 format without a BOM to support easier patch review.

There are ways to use GIT features to auto-convert .uni files when pulling content from EDK II repos and pushing commits.  
That may or may not help with the specific issue you are raising.
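
The message does not say which GIT features are meant. One possibility is a clean/smudge filter pair configured through `.gitattributes`, which is an assumption here. The conversion such a pair would need can be sketched in Python as follows; the function names are hypothetical.

```python
import codecs


def utf16le_bom_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16LE with a BOM to UTF-8 without a BOM (repository form)."""
    if not data.startswith(codecs.BOM_UTF16_LE):
        raise ValueError("expected a UTF-16LE BOM (0xFF 0xFE)")
    text = data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    return text.encode("utf-8")


def utf8_to_utf16le_bom(data: bytes) -> bytes:
    """Convert UTF-8 without a BOM to UTF-16LE with a BOM (working-tree form)."""
    if data.startswith(codecs.BOM_UTF8):
        raise ValueError("UTF-8 input must not start with a BOM")
    return codecs.BOM_UTF16_LE + data.decode("utf-8").encode("utf-16-le")
```

The two functions are inverses of each other for valid input, so a round trip leaves the file content unchanged. Files that are UTF-16LE without a BOM are rejected by the first function, which is the case raised in this thread.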

If you have ideas on a tool change request to EDK II that would provide compatibility with current EDK II tool behavior and support UTF-16LE without a BOM, then let's work that through in a Bugzilla feature request.  If we find a solution, we can update the docs and tools again.
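
As an illustration only, and not a proposal made in this thread, one heuristic for BOM-less files relies on the fact that ASCII-range characters encoded as UTF-16LE have a zero high byte, while UTF-8 text files do not normally contain zero bytes. The name `guess_bomless_encoding` is hypothetical, and the sketch assumes the file contains at least one ASCII-range character (for example a `#string` token).

```python
def guess_bomless_encoding(data: bytes) -> str:
    """Guess the encoding of .uni bytes that carry no BOM.

    Heuristic (an assumption of this sketch): in UTF-16LE the high byte of
    each code unit sits at an odd offset and is 0x00 for ASCII-range
    characters, so a zero byte at an odd offset suggests UTF-16LE.
    """
    if b"\x00" in data[1::2]:
        return "utf-16-le"
    # Otherwise require the bytes to be valid UTF-8; this raises
    # UnicodeDecodeError if they are not.
    data.decode("utf-8")
    return "utf-8"
```

A UTF-16LE file made up entirely of characters above U+00FF would defeat the zero-byte check, so a real tool would need a fallback or an explicit override.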

Do you have any objections to updating the UNI Spec to match the current tool behavior?

Thanks,

Mike

> -----Original Message-----
> From: Tim Lewis [mailto:tim.lewis@insyde.com]
> Sent: Wednesday, April 26, 2017 11:54 AM
> To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> edk2-devel@lists.01.org
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> disk to be UTF-8 without a BOM
> 
> Mike --
> 
> This is not about files in the EDK II repository. This is about files 
> created based on the spec, and created with other sets of tools. Go 
> back to early 2015, to the Build spec (1.22, etc.), Appendix G, which 
> is where the UNI stuff used to live.
> 
> The point is: files which worked before, and, at worst, generated a 
> warning before, now are interpreted incorrectly even though they have correct data.
> 
> Making ASCII (or UTF-8) the default without a BOM is the breaking change.
> 
> Tim
> 
> -----Original Message-----
> From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> Sent: Wednesday, April 26, 2017 11:47 AM
> To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; Kinney, 
> Michael D <michael.d.kinney@intel.com>
> Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> <kevin.w.shaw@intel.com>
> Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> disk to be UTF-8 without a BOM
> 
> Tim,
> 
> If you look at the entire file history of the EDK II, you will see 
> that the BOM has always been present in the UTF-16LE formatted files.
> 
> The build tools were updated in 2015 to *add* support for UTF-8 files.
> The .uni files in the EDK II project were then converted from UTF-16LE 
> with a BOM to UTF-8 without a BOM.  This provided an easier developer 
> experience when using GIT to do email patch review of .uni files.
> 
> It is possible I am missing something here.  Can you please provide a
> pointer to the EDK II commit(s) where BOMs were added to UTF-16LE .uni files?
> 
> Thanks,
> 
> Mike
> 
> > -----Original Message-----
> > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > Sent: Wednesday, April 26, 2017 11:34 AM
> > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > edk2-devel@lists.01.org
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > on disk to be UTF-8 without a BOM
> >
> > Mike --
> >
> > I understand that EDK2 decided to add BOM markers two years ago.
> > Adding a BOM didn't change the default. The problem is that (a) there are
> > still hundreds of files extant in our codebase which were created
> > prior to the 2015 changes and are still in use, and (b) this change is
> > not backward compatible for these files.
> >
> > Tim
> >
> > -----Original Message-----
> > From: Kinney, Michael D [mailto:michael.d.kinney@intel.com]
> > Sent: Wednesday, April 26, 2017 11:11 AM
> > To: Tim Lewis <tim.lewis@insyde.com>; edk2-devel@lists.01.org; 
> > Kinney, Michael D <michael.d.kinney@intel.com>
> > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > <kevin.w.shaw@intel.com>
> > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > on disk to be UTF-8 without a BOM
> >
> > Hi Tim,
> >
> > This is not a request for a new change.  Instead, the intent of this 
> > document change is to update the document to reflect the implemented 
> > behavior of the EDK II tools.  The EDK II tool updates to add UTF-8 
> > file support were completed with the patches listed below.  Notice 
> > that the main one for normal build support was checked in almost 2 years ago.
> >
> > BaseTools - UniClassObject - 6/23/2015
> > * https://github.com/tianocore/edk2/commit/d80e451b187c9d33cbd771253fbd5119670f75c6
> > * https://github.com/tianocore/edk2/commit/be264422c95c781a345978f17b7e80b91f816eda
> >
> > BaseTools - ECC - 12/29/2015
> > * https://github.com/tianocore/edk2/commit/975889279df2eb3d3338cb88afb3faa71ddde4d6
> >
> > BaseTools - UPT - 4/25/2016
> > * https://github.com/tianocore/edk2/commit/4a21fb3b67a0ef1655b43e9368b6b697bbf327af
> >
> > This was intended to be a 100% backwards compatible change.
> >
> > All .uni files in the EDK II project in UTF-16LE format have always used a BOM.
> > Please check out UDK2015 or older UDKs and you will see all .uni
> > files start with 0xff 0xfe.
> >
> > Thanks,
> >
> > Mike
> >
> > > -----Original Message-----
> > > From: Tim Lewis [mailto:tim.lewis@insyde.com]
> > > Sent: Wednesday, April 26, 2017 9:15 AM
> > > To: Kinney, Michael D <michael.d.kinney@intel.com>; 
> > > edk2-devel@lists.01.org
> > > Cc: Carsey, Jaben <jaben.carsey@intel.com>; Shaw, Kevin W 
> > > <kevin.w.shaw@intel.com>
> > > Subject: RE: [edk2] [edk2-UniSpecification PATCH] Allow .uni files
> > > on disk to be UTF-8 without a BOM
> > >
> > > Mike --
> > >
> > > This breaks our existing build tools, which assume that a file 
> > > without a BOM is UTF-16.
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On 
> > > Behalf Of Michael Kinney
> > > Sent: Tuesday, April 25, 2017 6:07 PM
> > > To: edk2-devel@lists.01.org
> > > Cc: Jaben Carsey <jaben.carsey@intel.com>; Kevin W Shaw 
> > > <kevin.w.shaw@intel.com>
> > > Subject: [edk2] [edk2-UniSpecification PATCH] Allow .uni files on
> > > disk to be UTF-8 without a BOM
> > >
> > > https://bugzilla.tianocore.org/show_bug.cgi?id=507
> > >
> > > Cc: Jaben Carsey <jaben.carsey@intel.com>
> > > Cc: Yonghong Zhu <yonghong.zhu@intel.com>
> > > Cc: Kevin W Shaw <kevin.w.shaw@intel.com>
> > > Contributed-under: TianoCore Contribution Agreement 1.1
> > > Signed-off-by: Michael Kinney <michael.d.kinney@intel.com>
> > > ---
> > >  2_unicode_strings_file_format.md |  9 ++++++---
> > >  README.md                        | 27 ++++++++++++++-------------
> > >  2 files changed, 20 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/2_unicode_strings_file_format.md b/2_unicode_strings_file_format.md
> > > index 0150c85..7a4a019 100644
> > > --- a/2_unicode_strings_file_format.md
> > > +++ b/2_unicode_strings_file_format.md
> > > @@ -33,7 +33,8 @@
> > >
> > >  EDK II Unicode files are used for mapping token names to localized strings that
> > >  are identified by an RFC4646 language code. The format for storing EDK II
> > > -Unicode files is UTF-16LE. The character content must be UCS-2.
> > > +Unicode files on disk is UTF-8 (without a BOM character) or
> > > +UTF-16LE (with a BOM character). The character content must be UCS-2.
> > >
> > >  Strings ends are determined by the first of the following items found:
> > >
> > > @@ -44,11 +45,13 @@ Strings ends are determined by the first of the following items found:
> > >
> > >  Comments may appear anywhere within the string file.
> > >
> > > -All the files must begin with a Unicode BOM character.
> > > +All UTF-16LE files must begin with a Unicode BOM character.
> > > +All UTF-8 files must not begin with a Unicode BOM character.
> > >
> > >  **********
> > >  **NOTE:** Please make sure you select an editor that supports UCS-2 characters
> > > -that can be stored in a UTF-16LE file.
> > > +that can be stored in either a UTF-8 (without a BOM character) or
> > > +a UTF-16LE file (with a BOM character).
> > >  **********
> > >
> > >  ## 2.1 Common EBNF
> > > diff --git a/README.md b/README.md
> > > index 63842a1..015aef1 100644
> > > --- a/README.md
> > > +++ b/README.md
> > > @@ -77,16 +77,17 @@ Copyright (c) 2016-2017, Intel Corporation. All rights reserved.
> > >
> > >  ### Revision History
> > >
> > > -| Revision          | Description | Date            |
> > > -| ----------------- | ---------------------------------------------------------------------------------------- | --------------- |
> > > -| 1.0               | Initial Release. | February 2014   |
> > > -| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project. | August 2014     |
> > > -|                   | Added content related to EDK II Meta-Data Unicode files. |                 |
> > > -|                   | Restructured document. |                 |
> > > -|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > -|                   | Removed invalid escape code sequences. |                 |
> > > -| 1.2               | Added optional font formatting | September 2014  |
> > > -| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME` | April 2015      |
> > > -| 1.3               | Added: Syntax for non-ascii characters inside quoted strings. | March 2016      |
> > > -|                   | Removed: Info on specific consumers (.INF & .DEC) removed. |                 |
> > > -| 1.4               | Convert to GitBook format | March 2017      |
> > > +| Revision          | Description | Date            |
> > > +| ----------------- | ---------------------------------------------------------------------------------------------------------------------- | --------------- |
> > > +| 1.0               | Initial Release. | February 2014   |
> > > +| 1.1               | Updated EBNF to follow syntax specified in EBNF by the ANTLR project. | August 2014     |
> > > +|                   | Added content related to EDK II Meta-Data Unicode files. |                 |
> > > +|                   | Restructured document. |                 |
> > > +|                   | Removed security and C format GUID definitions, not required for HII or other UNI files. |                 |
> > > +|                   | Removed invalid escape code sequences. |                 |
> > > +| 1.2               | Added optional font formatting | September 2014  |
> > > +| 1.2 Errata A      | Correct misspelling of: `STR_PROPERTIES_MODULE_NAME` | April 2015      |
> > > +| 1.3               | Added: Syntax for non-ascii characters inside quoted strings. | March 2016      |
> > > +|                   | Removed: Info on specific consumers (.INF & .DEC) removed. |                 |
> > > +| 1.4               | Convert to GitBook format | April 2017      |
> > > +|                   | [#507](https://bugzilla.tianocore.org/show_bug.cgi?id=507) UNI Spec: Clarify that .uni files maybe UTF-8 without a BOM |                 |
> > > --
> > > 2.6.3.windows.1
> > >


