public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* [PATCH 0/3] MdePkg: Remove inline X86 assembly code
@ 2019-03-05  1:40 Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 1/3] MdePkg/BaseCpuLib: " Shenglei Zhang
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Shenglei Zhang @ 2019-03-05  1:40 UTC (permalink / raw)
  To: edk2-devel; +Cc: Michael D Kinney, Liming Gao

MdePkg BaseLib, BaseCpuLib, and BaseSynchronizationLib still use
inline X86 assembly code in C source files. They should be updated
to consume NASM only.
https://bugzilla.tianocore.org/show_bug.cgi?id=1163
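
For illustration, the kind of inline assembly in question looks like the
GCC-style sketch below (a representative example, not necessarily one of
the files changed here); the NASM-only replacement would be an ordinary
assembly procedure that every supported toolchain can build:

VOID
EFIAPI
CpuPause (
  VOID
  )
{
  //
  // GCC/Clang-only syntax; the MSFT toolchain needs a separate copy,
  // which is the duplication this series is meant to remove.
  //
  __asm__ __volatile__ ("pause");
}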

Cc: Michael D Kinney <michael.d.kinney@intel.com>
Cc: Liming Gao <liming.gao@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Shenglei Zhang <shenglei.zhang@intel.com>
Shenglei Zhang (3):
  MdePkg/BaseCpuLib: Remove inline X86 assembly code
  MdePkg/BaseLib: Remove inline X86 assembly code
  MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

 MdePkg/Library/BaseCpuLib/BaseCpuLib.inf      |  6 -----
 MdePkg/Library/BaseLib/BaseLib.inf            | 27 -------------------
 .../BaseSynchronizationLib.inf                |  2 --
 3 files changed, 35 deletions(-)

-- 
2.18.0.windows.1



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] MdePkg/BaseCpuLib: Remove inline X86 assembly code
  2019-03-05  1:40 [PATCH 0/3] MdePkg: Remove inline X86 assembly code Shenglei Zhang
@ 2019-03-05  1:40 ` Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 2/3] MdePkg/BaseLib: " Shenglei Zhang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Shenglei Zhang @ 2019-03-05  1:40 UTC (permalink / raw)
  To: edk2-devel; +Cc: Michael D Kinney, Liming Gao

MdePkg BaseCpuLib still uses inline X86 assembly code in C source
files. It should be updated to consume NASM only.
https://bugzilla.tianocore.org/show_bug.cgi?id=1163

Cc: Michael D Kinney <michael.d.kinney@intel.com>
Cc: Liming Gao <liming.gao@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Shenglei Zhang <shenglei.zhang@intel.com>
---
 MdePkg/Library/BaseCpuLib/BaseCpuLib.inf | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/MdePkg/Library/BaseCpuLib/BaseCpuLib.inf b/MdePkg/Library/BaseCpuLib/BaseCpuLib.inf
index af2f09617a..bc6585d8dd 100644
--- a/MdePkg/Library/BaseCpuLib/BaseCpuLib.inf
+++ b/MdePkg/Library/BaseCpuLib/BaseCpuLib.inf
@@ -55,18 +55,12 @@
   Ebc/CpuSleepFlushTlb.c
 
 [Sources.ARM]
-  Arm/CpuFlushTlb.asm | RVCT
-  Arm/CpuSleep.asm    | RVCT
-  Arm/CpuFlushTlb.asm | MSFT
-  Arm/CpuSleep.asm    | MSFT
   Arm/CpuFlushTlb.S   | GCC
   Arm/CpuSleep.S      | GCC
 
 [Sources.AARCH64]
   AArch64/CpuFlushTlb.S   | GCC
   AArch64/CpuSleep.S      | GCC
-  AArch64/CpuFlushTlb.asm | MSFT
-  AArch64/CpuSleep.asm    | MSFT
 
 [Packages]
   MdePkg/MdePkg.dec
-- 
2.18.0.windows.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/3] MdePkg/BaseLib: Remove inline X86 assembly code
  2019-03-05  1:40 [PATCH 0/3] MdePkg: Remove inline X86 assembly code Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 1/3] MdePkg/BaseCpuLib: " Shenglei Zhang
@ 2019-03-05  1:40 ` Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 3/3] MdePkg/BaseSynchronizationLib: " Shenglei Zhang
  2019-03-05  2:10 ` [PATCH 0/3] MdePkg: " Gao, Liming
  3 siblings, 0 replies; 21+ messages in thread
From: Shenglei Zhang @ 2019-03-05  1:40 UTC (permalink / raw)
  To: edk2-devel; +Cc: Michael D Kinney, Liming Gao

MdePkg BaseLib still uses inline X86 assembly code in C source
files. It should be updated to consume NASM only.
https://bugzilla.tianocore.org/show_bug.cgi?id=1163

Cc: Michael D Kinney <michael.d.kinney@intel.com>
Cc: Liming Gao <liming.gao@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Shenglei Zhang <shenglei.zhang@intel.com>
---
 MdePkg/Library/BaseLib/BaseLib.inf | 27 ---------------------------
 1 file changed, 27 deletions(-)

diff --git a/MdePkg/Library/BaseLib/BaseLib.inf b/MdePkg/Library/BaseLib/BaseLib.inf
index a0d6c372f9..80e312fb34 100644
--- a/MdePkg/Library/BaseLib/BaseLib.inf
+++ b/MdePkg/Library/BaseLib/BaseLib.inf
@@ -555,26 +555,8 @@
   Math64.c                   | RVCT
   Math64.c                   | MSFT
 
-  Arm/SwitchStack.asm        | RVCT
-  Arm/SetJumpLongJump.asm    | RVCT
-  Arm/DisableInterrupts.asm  | RVCT
-  Arm/EnableInterrupts.asm   | RVCT
-  Arm/GetInterruptsState.asm | RVCT
-  Arm/CpuPause.asm           | RVCT
-  Arm/CpuBreakpoint.asm      | RVCT
-  Arm/MemoryFence.asm        | RVCT
   Arm/SpeculationBarrier.S   | RVCT
 
-  Arm/SwitchStack.asm        | MSFT
-  Arm/SetJumpLongJump.asm    | MSFT
-  Arm/DisableInterrupts.asm  | MSFT
-  Arm/EnableInterrupts.asm   | MSFT
-  Arm/GetInterruptsState.asm | MSFT
-  Arm/CpuPause.asm           | MSFT
-  Arm/CpuBreakpoint.asm      | MSFT
-  Arm/MemoryFence.asm        | MSFT
-  Arm/SpeculationBarrier.asm | MSFT
-
   Arm/Math64.S                  | GCC
   Arm/SwitchStack.S             | GCC
   Arm/EnableInterrupts.S        | GCC
@@ -599,15 +581,6 @@
   AArch64/CpuBreakpoint.S           | GCC
   AArch64/SpeculationBarrier.S      | GCC
 
-  AArch64/MemoryFence.asm           | MSFT
-  AArch64/SwitchStack.asm           | MSFT
-  AArch64/EnableInterrupts.asm      | MSFT
-  AArch64/DisableInterrupts.asm     | MSFT
-  AArch64/GetInterruptsState.asm    | MSFT
-  AArch64/SetJumpLongJump.asm       | MSFT
-  AArch64/CpuBreakpoint.asm         | MSFT
-  AArch64/SpeculationBarrier.asm    | MSFT
-
 [Packages]
   MdePkg/MdePkg.dec
 
-- 
2.18.0.windows.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05  1:40 [PATCH 0/3] MdePkg: Remove inline X86 assembly code Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 1/3] MdePkg/BaseCpuLib: " Shenglei Zhang
  2019-03-05  1:40 ` [PATCH 2/3] MdePkg/BaseLib: " Shenglei Zhang
@ 2019-03-05  1:40 ` Shenglei Zhang
  2019-03-05 20:57   ` Andrew Fish
  2019-03-05  2:10 ` [PATCH 0/3] MdePkg: " Gao, Liming
  3 siblings, 1 reply; 21+ messages in thread
From: Shenglei Zhang @ 2019-03-05  1:40 UTC (permalink / raw)
  To: edk2-devel; +Cc: Michael D Kinney, Liming Gao

MdePkg BaseSynchronizationLib still uses inline X86 assembly code
in C source files. It should be updated to consume NASM only.
https://bugzilla.tianocore.org/show_bug.cgi?id=1163

Cc: Michael D Kinney <michael.d.kinney@intel.com>
Cc: Liming Gao <liming.gao@intel.com>
Contributed-under: TianoCore Contribution Agreement 1.1
Signed-off-by: Shenglei Zhang <shenglei.zhang@intel.com>
---
 .../Library/BaseSynchronizationLib/BaseSynchronizationLib.inf   | 2 --
 1 file changed, 2 deletions(-)

diff --git a/MdePkg/Library/BaseSynchronizationLib/BaseSynchronizationLib.inf b/MdePkg/Library/BaseSynchronizationLib/BaseSynchronizationLib.inf
index 32414b29fa..719dc1938d 100755
--- a/MdePkg/Library/BaseSynchronizationLib/BaseSynchronizationLib.inf
+++ b/MdePkg/Library/BaseSynchronizationLib/BaseSynchronizationLib.inf
@@ -75,13 +75,11 @@
 
 [Sources.ARM]
   Synchronization.c
-  Arm/Synchronization.asm       | RVCT
   Arm/Synchronization.S         | GCC
 
 [Sources.AARCH64]
   Synchronization.c
   AArch64/Synchronization.S     | GCC
-  AArch64/Synchronization.asm   | MSFT
 
 [Packages]
   MdePkg/MdePkg.dec
-- 
2.18.0.windows.1



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] MdePkg: Remove inline X86 assembly code
  2019-03-05  1:40 [PATCH 0/3] MdePkg: Remove inline X86 assembly code Shenglei Zhang
                   ` (2 preceding siblings ...)
  2019-03-05  1:40 ` [PATCH 3/3] MdePkg/BaseSynchronizationLib: " Shenglei Zhang
@ 2019-03-05  2:10 ` Gao, Liming
  3 siblings, 0 replies; 21+ messages in thread
From: Gao, Liming @ 2019-03-05  2:10 UTC (permalink / raw)
  To: Zhang, Shenglei, edk2-devel@lists.01.org; +Cc: Kinney, Michael D

Shenglei:
  The request is for X86 (IA32 and X64). Please don't change ARM or AARCH64.

Thanks
Liming



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05  1:40 ` [PATCH 3/3] MdePkg/BaseSynchronizationLib: " Shenglei Zhang
@ 2019-03-05 20:57   ` Andrew Fish
  2019-03-05 21:32     ` Gao, Liming
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-05 20:57 UTC (permalink / raw)
  To: Shenglei Zhang; +Cc: edk2-devel-01, Mike Kinney, Liming Gao

Shenglei,

I was confused by the names of these patches. From a C point of view this is inline assembly:

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

These patches seem to be removing GAS and MASM assembly files, but not the inline assembly *.c files?

We don't want to remove the inline assembly from BaseLib as that could have size, performance, and compiler optimization impacts.

For example, CpuBreakpoint() for clang with LTO would end up inlining a single byte. For IA32, a call to assembler would be 5 bytes at each call location plus the overhead of the function. So that is a size increase and a performance overhead. For a C compiler, calling an assembly function is a serializing event, and that can restrict optimizations. Thus having some limited inline assembly support is very useful.
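
For comparison, the MSFT-toolchain counterpart has to be a separate copy, since VC++ uses a different inline syntax and only supports it for IA32. A hypothetical sketch (not the actual BaseLib file):

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  //
  // Hypothetical MSVC-style variant. VC++ __asm blocks are IA32-only,
  // so an X64 MSFT build needs a compiler intrinsic or a .nasm file.
  //
  _asm {
    int 3
  }
}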

Thanks,

Andrew Fish




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05 20:57   ` Andrew Fish
@ 2019-03-05 21:32     ` Gao, Liming
  2019-03-05 21:45       ` Kinney, Michael D
  2019-03-05 22:43       ` Andrew Fish
  0 siblings, 2 replies; 21+ messages in thread
From: Gao, Liming @ 2019-03-05 21:32 UTC (permalink / raw)
  To: afish@apple.com, Zhang, Shenglei; +Cc: edk2-devel-01, Kinney, Michael D

Andrew:
  BZ 1163 is to remove inline X86 assembly code in C source files. But this patch is wrong. I have given my comments to update this patch.

  The change is to reduce the duplicated implementations, which will be good for code maintenance. Recently, one patch had to update both the .c and the .nasm versions. Please see https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html. Besides this change, I propose to remove all GAS assembly code for the IA32 and X64 arches. After those changes, the patch owner will collect the impact on the image size.

Thanks
Liming



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05 21:32     ` Gao, Liming
@ 2019-03-05 21:45       ` Kinney, Michael D
  2019-03-05 22:43       ` Andrew Fish
  1 sibling, 0 replies; 21+ messages in thread
From: Kinney, Michael D @ 2019-03-05 21:45 UTC (permalink / raw)
  To: Gao, Liming, afish@apple.com, Zhang, Shenglei, Kinney, Michael D
  Cc: edk2-devel-01

Liming,

I agree with .nasm files replacing .S/.asm files.  But the use of inline
assembly in C files for some primitives does produce much smaller/faster
code.

I would prefer that this BZ identify the specific functions that you
think would provide better maintainability with no impact on size
or performance.

Thanks,

Mike




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05 21:32     ` Gao, Liming
  2019-03-05 21:45       ` Kinney, Michael D
@ 2019-03-05 22:43       ` Andrew Fish
  2019-03-07  0:28         ` Gao, Liming
  1 sibling, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-05 22:43 UTC (permalink / raw)
  To: Gao, Liming; +Cc: Zhang, Shenglei, edk2-devel-01, Mike Kinney



> On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:
> 
> Andrew:
>  BZ 1163 is to remove inline X86 assembly code in C source files. But this patch is wrong. I have given my comments to update this patch.
> 

Why do we want to remove inline X86 assembly? As I mentioned, it will lead to larger binaries, slower binaries, and less optimized code.

For example, take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3   

But if we move that to NASM:

a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint

plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3   
a.out[0x1fb3] <+1>: c3     retl   

And there is also an extra push and pop on the stack. The other issue is that a call to a function that is unknown to the compiler will act like a _ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++ 2017). Is the deprecation of some of these intrinsics in VC++ driving the removal of inline assembly? For GCC, inline assembly works great for local compile, and for clang LTO it works in entire-program optimization.

The LTO bitcode includes the inline assembly and its constraints, so the optimizer knows what to do and the code gets optimized just like the abstract bitcode during Link Time Optimization.

This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}
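
For reference, the C inline source that produces that bitcode looks roughly like this (a sketch of the Ia32 GccInline.c pattern; the "=A" constraint ties the 64-bit result to the edx:eax pair that rdmsr writes, and "c" places Index in ecx):

UINT64
EFIAPI
AsmReadMsr64 (
  IN UINT32  Index
  )
{
  UINT64  Data;

  //
  // rdmsr reads the MSR selected by ecx into edx:eax.
  //
  __asm__ __volatile__ (
    "rdmsr"
    : "=A" (Data)    // 64-bit output in edx:eax
    : "c"  (Index)   // MSR index in ecx
    );

  return Data;
}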


I agree it makes sense to remove .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For the Xcode clang, since it emits frame pointers by default, we need to add the extra 4 bytes of frame-pointer save/restore to the assembly functions so the stack can get unwound.

a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3   
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl   

So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics. 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-05 22:43       ` Andrew Fish
@ 2019-03-07  0:28         ` Gao, Liming
  2019-03-07  0:41           ` Kinney, Michael D
  0 siblings, 1 reply; 21+ messages in thread
From: Gao, Liming @ 2019-03-07  0:28 UTC (permalink / raw)
  To: afish@apple.com; +Cc: Zhang, Shenglei, edk2-devel-01, Kinney, Michael D

Andrew:
  I want to keep only one implementation. If the inline assembly C source is preferred, I suggest removing its NASM implementation.

Thanks
Liming



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  0:28         ` Gao, Liming
@ 2019-03-07  0:41           ` Kinney, Michael D
  2019-03-07  1:31             ` Andrew Fish
  0 siblings, 1 reply; 21+ messages in thread
From: Kinney, Michael D @ 2019-03-07  0:41 UTC (permalink / raw)
  To: Gao, Liming, afish@apple.com, Kinney, Michael D
  Cc: Zhang, Shenglei, edk2-devel-01

Liming,

That does not work either because inline assembly uses compiler-specific syntax.

Please update with the specific list of functions where you think the inline assembly should be removed to improve maintenance with no impact on size/perf.

The issue you pointed to was around SetJump()/LongJump().  Can we limit this BZ to only those 2 functions?

Mike




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  0:41           ` Kinney, Michael D
@ 2019-03-07  1:31             ` Andrew Fish
  2019-03-07  5:06               ` Yao, Jiewen
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-07  1:31 UTC (permalink / raw)
  To: Mike Kinney; +Cc: Gao, Liming, Zhang, Shenglei, edk2-devel-01



> On Mar 6, 2019, at 4:41 PM, Kinney, Michael D <michael.d.kinney@intel.com> wrote:
> 
> Liming,
>  
> That does not work either because inline assembly uses compiler-specific syntax.
>  
> Please update with the specific list of functions where you think the inline assembly should be removed to improve maintenance with no impact on size/perf.
>  

Mike,

It is easy to find the gcc/clang inline assembler; just `git grep "__asm__ __volatile__"`.

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c

This is really just compiler optimizer control. 
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)

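That barrier is what keeps the optimizer from reordering memory accesses across a synchronization point. A minimal sketch of the idea (Publish(), mData, and mGuard are illustrative names, not BaseSynchronizationLib APIs):

UINT32           mData;
volatile UINT32  mGuard;

VOID
Publish (
  IN UINT32  Value
  )
{
  mData = Value;
  //
  // Compiler-only fence: forbids the optimizer from sinking the mData
  // store below the mGuard store. No CPU fence instruction is emitted.
  //
  _ReadWriteBarrier ();
  mGuard = 1;
}
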
Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic. 
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use inline assembly for any more complex code.

Thanks,

Andrew Fish




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  1:31             ` Andrew Fish
@ 2019-03-07  5:06               ` Yao, Jiewen
  2019-03-07  5:56                 ` Andrew Fish
  0 siblings, 1 reply; 21+ messages in thread
From: Yao, Jiewen @ 2019-03-07  5:06 UTC (permalink / raw)
  To: Andrew Fish, Kinney, Michael D; +Cc: Gao, Liming, edk2-devel-01

Hi Mike and Andrew,
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy when making an update. Before that, the developer has to be aware that multiple copies exist at all. That increases the complexity.

If we have everything, there MAY theoretically be 5 copies: ASM, NASM, S, GCC inline, and MS inline.
Now, we remove ASM. It is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked; the IA32 NASM version did not.
Then I noticed that there is a different copy in the IA32 Jump.c MS inline implementation that I also needed to update. That is frustrating.
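
To illustrate the duplication problem with a simpler, hypothetical example: the same one-instruction primitive can exist as a GCC inline copy and an MS inline copy, and both must change together with the .nasm file. The sketch below is illustrative, not the actual BaseLib source:

VOID
EFIAPI
EnableInterrupts (
  VOID
  )
{
#if defined (__GNUC__)
  //
  // Copy #1: GCC/Clang inline syntax.
  //
  __asm__ __volatile__ ("sti");
#else
  //
  // Copy #2: MSVC inline syntax (IA32 only). A third copy lives in a
  // .nasm file for toolchains that accept neither inline form.
  //
  _asm {
    sti
  }
#endif
}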

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get from these inline functions, at the whole-image level?
We can evaluate at the function level, case by case.
If we see a huge size benefit, I agree to keep the function.
If we see no size benefit, I suggest we clean the function up.


Thank you
Yao Jiewen




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  5:06               ` Yao, Jiewen
@ 2019-03-07  5:56                 ` Andrew Fish
  2019-03-07  6:09                   ` Yao, Jiewen
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-07  5:56 UTC (permalink / raw)
  To: Yao, Jiewen; +Cc: Mike Kinney, Gao, Liming, edk2-devel-01



> On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:
> 
> Hi Mike and Andrew,
> The problem is maintainability.
> 
> If we have multiple copies of an implementation, a developer needs to verify every copy when making an update. Before that, the developer has to be aware that multiple copies exist at all. That increases the complexity.
> 
> If we have everything, there MAY theoretically be 5 copies: ASM, NASM, S, GCC inline, and MS inline.
> Now, we remove ASM. It is a good first step.
> But we may still have 4 copies. I suggest we consider doing more.
> 

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c. Since that code mostly just abstracts a single x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
Ok, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.

I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, CpuBreakpoint() will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on calling CpuBreakpoint() looks like this from the callers point of view. 

a.out[0x1fa5] <+6>:  cc              int3   

But if we move that to NASM:

a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint

plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3   
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl   

For any compiler that emits the frame pointer if you move the INT 3 to assembly you need the frame pointer or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes. 

Thanks,

Andrew Fish

PS For clang LTO, the compiler compiles to bitcode, which is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen. 

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}

The alloca, store, load, etc. are the same bitcode instructions you would see with C code.

> A recently case that SetJump/LongJump, I updated the NASM file for both IA32 and X64.
> 
> Later, to my surprise, only X64 version NASM works, and IA32 version NASM does not work.
> Then I notice that there is a different copy if IA32 Jump.c MS inline - I also need update. That is frustrated.
> 
> I think there should be a balance between optimization and code readability/maintainability.
> 
> Do we have data on what size benefit we can get with these inline function, from whole image level?
> We can do evaluation at function level, case by case.
> If we see a huge size benefit, I agree to keep this function.
> If we see no size benefit, I suggest we do the cleanup for this function.
> 
> 
> Thank you
> Yao Jiewen
> 
> 
> 
>> -----Original Message-----
>> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
>> Andrew Fish via edk2-devel
>> Sent: Wednesday, March 6, 2019 5:31 PM
>> To: Kinney, Michael D <michael.d.kinney@intel.com>
>> Cc: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming
>> <liming.gao@intel.com>
>> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
>> inline X86 assembly code
>> 
>> 
>> 
>>> On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
>> <michael.d.kinney@intel.com> wrote:
>>> 
>>> Liming,
>>> 
>>> That does not work either because inline assembly uses compiler specific
>> syntax.
>>> 
>>> Please update with the specific list of functions that you think the inline
>> should be removed to improve maintenance with no impacts in size/perf.
>>> 
>> 
>> Mike,
>> 
>> It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
>> __volatile__"`
>> 
>> The main files are:
>> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLib
>> Intrinsic/IoLibGcc.c
>> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X
>> 64/GccInline.c
>> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/I
>> a32/GccInline.c
>> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
>> hronizationLib/Ia32/GccInline.c
>> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
>> hronizationLib/X64/GccInline.c
>> 
>> This is really just compiler optimizer control.
>> Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define
>> _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
>> 
>> Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
>> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__
>> ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
>> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__
>> ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );
>> 
>> It seems like we have a reasonable set and should not use the inline
>> assembly for any more complex code.
>> 
>> Thanks,
>> 
>> Andrew Fish
>> 
>>> The issue you pointed to was around SetJump()/LongJump().  Can we limit
>> this BZ to only those 2 functions?
>>> 
>>> Mike
>>>   <>
>>> From: Gao, Liming
>>> Sent: Wednesday, March 6, 2019 4:28 PM
>>> To: afish@apple.com
>>> Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
>> <edk2-devel@lists.01.org>; Kinney, Michael D
>> <michael.d.kinney@intel.com>
>>> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
>> inline X86 assembly code
>>> 
>>> Andrew:
>>>  I want to keep only one implementation. If inline assembly c source is
>> preferred, I suggest to remove its nasm implementation.
>>> 
>>> Thanks
>>> Liming
>>> <>From: afish@apple.com <mailto:afish@apple.com>
>> [mailto:afish@apple.com <mailto:afish@apple.com>]
>>> Sent: Tuesday, March 5, 2019 2:44 PM
>>> To: Gao, Liming <liming.gao@intel.com>
>>> Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
>> <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>
>>> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
>> inline X86 assembly code
>>> 
>>> 
>>> 
>>> 
>>> On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:
>>> 
>>> Andrew:
>>> BZ 1163 is to remove inline X86 assembly code in C source files. But this
>> patch is wrong. I have given my comments to update this patch.
>>> 
>>> 
>>> Why do we want to remove inline X86 assembly? As I mentioned, it will lead
>> to larger binaries, slower binaries, and less optimized code.
>>> 
>>> For example take this simple inline assembly function:
>>> 
>>> VOID
>>> EFIAPI
>>> CpuBreakpoint (
>>>  VOID
>>>  )
>>> {
>>>  __asm__ __volatile__ ("int $3");
>>> }
>>> 
>>> 
>>> Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from
>> the caller's point of view.
>>> 
>>> a.out[0x1fa5] <+6>:  cc              int3
>>> 
>>> 
>>> But if we move that to NASM:
>>> 
>>> 
>>> a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll
>> 0x1fb2                    ; CpuBreakpoint
>>> 
>>> 
>>> plus:
>>> a.out`CpuBreakpoint:
>>> a.out[0x1fb2] <+0>: cc     int3
>>> a.out[0x1fb3] <+1>: c3     retl
>>> 
>>> 
>>> And there is also an extra push and pop on the stack. The other issue is that a
>> call to a function that is unknown to the compiler will act like a
>> _ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++
>> 2017). Is the deprecation of some of these intrinsics in VC++ driving the
>> removal of inline assembly? For GCC, inline assembly works great for local
>> compilation, and for clang LTO it works in whole-program optimization.
>>> 
>>> The LTO bitcode includes inline assembly and constraints so that the
>> optimizer knows what to do so it can get optimized just like the abstract
>> bitcode during the Link Time Optimization.
>>> 
>>> This is CpuBreakpoint():
>>> ; Function Attrs: noinline nounwind optnone ssp uwtable
>>> define void @CpuBreakpoint() #0 {
>>>  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"()
>> #2, !srcloc !3
>>>  ret void
>>> }
>>> 
>>> 
>>> This is AsmReadMsr64():
>>> ; Function Attrs: noinline nounwind optnone ssp uwtable
>>> define i64 @AsmReadMsr64(i32) #0 {
>>>  %2 = alloca i32, align 4
>>>  %3 = alloca i64, align 8
>>>  store i32 %0, i32* %2, align 4
>>>  %4 = load i32, i32* %2, align 4
>>>  %5 = call i64 asm sideeffect "rdmsr",
>> "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>>>  store i64 %5, i64* %3, align 8
>>>  %6 = load i64, i64* %3, align 8
>>>  ret i64 %6
>>> }
>>> 
>>> 
>>> 
>>> I agree it makes sense to remove the .S and .asm files and only have .nasm files.
>>> 
>>> Thanks,
>>> 
>>> Andrew Fish
>>> 
>>> PS For Xcode clang, since it emits frame pointers by default, we need to
>> add the extra 4 bytes to the assembly functions so the stack can get
>> unwound.
>>> 
>>> a.out`CpuBreakpoint:
>>> a.out[0x1f99] <+0>: 55     pushl  %ebp
>>> a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
>>> a.out[0x1f9c] <+3>: cc     int3
>>> a.out[0x1f9d] <+4>: 5d     popl   %ebp
>>> a.out[0x1f9e] <+5>: c3     retl
>>> 
>>> 
>>> So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.
>>> 
>>> 
>>> The change is to reduce the duplicated implementations. It will be good
>> for code maintenance. Recently, one patch had to update both the .c and .nasm files.
>> Please see
>> https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
>> Besides this change, I propose to remove all GAS assembly code for the IA32 and
>> X64 arch. After those changes, the patch owner will collect the impact on the
>> image size.
>>> 
>>> Thanks
>>> Liming
>> 
>> _______________________________________________
>> edk2-devel mailing list
>> edk2-devel@lists.01.org
>> https://lists.01.org/mailman/listinfo/edk2-devel



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  5:56                 ` Andrew Fish
@ 2019-03-07  6:09                   ` Yao, Jiewen
  2019-03-07  6:45                     ` Andrew Fish
  0 siblings, 1 reply; 21+ messages in thread
From: Yao, Jiewen @ 2019-03-07  6:09 UTC (permalink / raw)
  To: afish@apple.com; +Cc: Kinney, Michael D, Gao, Liming, edk2-devel-01

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.

And I also agree with your case.

Let's discuss another case. The two functions below, SetJump/LongJump, were updated recently - by me, unfortunately.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
But obviously, it brings a maintenance issue for me.
And I don't believe it meets the criteria you mentioned above.

__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax        ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX to read original SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    mov     [edx + 0x24], eax      ; save SSP

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]         ; edx = JumpBuffer
    mov     edx, [edx + 24]        ; edx = target SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    sub     edx, eax               ; edx = delta
    mov     eax, edx               ; eax = delta

    shr     eax, 2                 ; eax = delta/sizeof(UINT32)
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX

CetDone:

    pop     eax                         ; skip return address
    pop     edx                         ; edx <- JumpBuffer
    pop     eax                         ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}



From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code




On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Hi Mike and Andrew
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist at all. - That increases the complexity.

If we have everything, there MAY be 5 copies - ASM, NASM, S, GCC inline, and MS inline, theoretically.
Now, we remove ASM. That is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts a single x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.
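For illustration, a minimal sketch of a wrapper in that style - CpuPause shown here as a representative single-instruction abstraction; whether this exact function lives in GccInline.c is beside the point:

VOID
EFIAPI
CpuPause (
  VOID
  )
{
  //
  // Single op-code abstraction: just emit the PAUSE instruction.
  //
  __asm__ __volatile__ ("pause");
}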

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
Ok so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeating an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have; if any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar for adding anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.
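As a sketch only (the function name, MSR index, and mask here are illustrative, not a claim about what is in the tree), the split looks like this: the op-code stays in the simple wrapper, and the logic stays in portable C.

UINT64
EFIAPI
GetLocalApicBaseAddress (
  VOID
  )
{
  //
  // AsmReadMsr64 is the simple single-instruction wrapper; everything else
  // is plain C. MSR 0x1B is IA32_APIC_BASE; its low 12 bits are flags,
  // so mask them off to recover the page-aligned base address.
  //
  return AsmReadMsr64 (0x1B) & ~(UINT64)0xFFF;
}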

I gave an example in this mail thread on how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}


Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint


plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


For any compiler that emits the frame pointer, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.


Thanks,

Andrew Fish

PS for clang LTO, the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen.

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}


Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}


The alloca, store, load, etc are the same bitcode instructions you would see with C code.


A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
Then I noticed that there is a separate copy of the IA32 MS inline implementation in Jump.c - I also needed to update that. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
We can do evaluation at function level, case by case.
If we see a huge size benefit, I agree to keep this function.
If we see no size benefit, I suggest we do the cleanup for this function.


Thank you
Yao Jiewen




-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D <michael.d.kinney@intel.com>
Cc: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming
<liming.gao@intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code




On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
<michael.d.kinney@intel.com> wrote:


Liming,

That does not work either because inline assembly uses compiler specific
syntax.


Please update with the specific list of functions for which you think the inline
assembly should be removed to improve maintenance with no impact on size/perf.



Mike,

It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
__volatile__"`

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLib
Intrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X
64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/I
a32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
hronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
hronizationLib/X64/GccInline.c

This is really just compiler optimizer control.
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define
_ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
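A minimal usage sketch (with hypothetical variable and function names) of why such a pure compiler barrier matters in synchronization code - it constrains only the compiler, not the CPU, so on x86 it relies on the hardware's store ordering:

UINT32  mSharedData;   // hypothetical shared value
UINT32  mDataReady;    // hypothetical "data is valid" flag

VOID
EFIAPI
PublishData (
  IN UINT32  Value
  )
{
  mSharedData = Value;
  //
  // Without the barrier the compiler is free to reorder or merge these two
  // plain stores; with it, mDataReady can only be set after mSharedData.
  //
  _ReadWriteBarrier ();
  mDataReady = 1;
}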

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__
( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__
( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use the inline
assembly for any more complex code.

Thanks,

Andrew Fish


The issue you pointed to was around SetJump()/LongJump().  Can we limit
this BZ to only those 2 functions?


Mike
From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
<edk2-devel@lists.01.org>; Kinney, Michael D
<michael.d.kinney@intel.com>

Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code


Andrew:
 I want to keep only one implementation. If the inline assembly C source is
preferred, I suggest removing its NASM implementation.


Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]

Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming <liming.gao@intel.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
<edk2-devel@lists.01.org>; Kinney,
Michael D <michael.d.kinney@intel.com>

Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code





On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:


Andrew:
BZ 1163 is to remove inline X86 assembly code in C source files. But this
patch is wrong. I have given my comments to update this patch.



Why do we want to remove inline X86 assembly? As I mentioned, it will lead
to larger binaries, slower binaries, and less optimized code.


For example take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
 VOID
 )
{
 __asm__ __volatile__ ("int $3");
}


Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from
the caller's point of view.


a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll
0x1fb2                    ; CpuBreakpoint



plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3
a.out[0x1fb3] <+1>: c3     retl


And there is also an extra push and pop on the stack. The other issue is that a
call to a function that is unknown to the compiler will act like a
_ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++
2017). Is the deprecation of some of these intrinsics in VC++ driving the
removal of inline assembly? For GCC, inline assembly works great for local
compilation, and for clang LTO it works in whole-program optimization.


The LTO bitcode includes inline assembly and constraints so that the
optimizer knows what to do so it can get optimized just like the abstract
bitcode during the Link Time Optimization.


This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
 call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"()
#2, !srcloc !3

 ret void
}


This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
 %2 = alloca i32, align 4
 %3 = alloca i64, align 8
 store i32 %0, i32* %2, align 4
 %4 = load i32, i32* %2, align 4
 %5 = call i64 asm sideeffect "rdmsr",
"=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4

 store i64 %5, i64* %3, align 8
 %6 = load i64, i64* %3, align 8
 ret i64 %6
}



I agree it makes sense to remove the .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For Xcode clang, since it emits frame pointers by default, we need to
add the extra 4 bytes to the assembly functions so the stack can get
unwound.


a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.


The change is to reduce the duplicated implementations. It will be good
for code maintenance. Recently, one patch had to update both the .c and .nasm files.
Please see
https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
Besides this change, I propose to remove all GAS assembly code for the IA32 and
X64 arch. After those changes, the patch owner will collect the impact on the
image size.


Thanks
Liming

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  6:09                   ` Yao, Jiewen
@ 2019-03-07  6:45                     ` Andrew Fish
  2019-03-07  7:15                       ` Gao, Liming
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-07  6:45 UTC (permalink / raw)
  To: Yao, Jiewen; +Cc: Mike Kinney, Gao, Liming, edk2-devel-01



> On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:
> 
> Thanks Andrew.
> Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.
>  
> And I also agree with your case.
>  
> Let’s discuss another case. The two functions below, SetJump/LongJump, were updated recently - by me, unfortunately.
> 
> It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
> But obviously, it brings a maintenance issue for me.
> And I don’t believe it meets the criteria you mentioned above.
>  

Jiewen,

I agree - it seems like, given the rules we agree on, this code should be written in NASM.

Thanks,

Andrew Fish

> __declspec (naked)
> RETURNS_TWICE
> UINTN
> EFIAPI
> SetJump (
>   OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
>   )
> {
>   _asm {
>     push    [esp + 4]
>     call    InternalAssertJumpBuffer
>     pop     ecx
>     pop     ecx
>     mov     edx, [esp]
>  
>     xor     eax, eax
>     mov     [edx + 24], eax        ; save 0 to SSP
>  
>     mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test    eax, eax
>     jz      CetDone
>     _emit      0x0F
>     _emit      0x20
>     _emit      0xE0                ; mov     eax, cr4
>     bt      eax, 23                ; check if CET is enabled
>     jnc     CetDone
>  
>     mov     eax, 1
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0xAE
>     _emit      0xE8                ; INCSSP EAX to read original SSP
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0x1E
>     _emit      0xC8                ; READSSP EAX
>     mov     [edx + 0x24], eax      ; save SSP
>  
> CetDone:
>  
>     mov     [edx], ebx
>     mov     [edx + 4], esi
>     mov     [edx + 8], edi
>     mov     [edx + 12], ebp
>     mov     [edx + 16], esp
>     mov     [edx + 20], ecx
>     xor     eax, eax
>     jmp     ecx
>   }
> }
>  
> __declspec (naked)
> VOID
> EFIAPI
> InternalLongJump (
>   IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
>   IN      UINTN                     Value
>   )
> {
>   _asm {
>     mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test    eax, eax
>     jz      CetDone
>     _emit      0x0F
>     _emit      0x20
>     _emit      0xE0                ; mov     eax, cr4
>     bt      eax, 23                ; check if CET is enabled
>     jnc     CetDone
>  
>     mov     edx, [esp + 4]         ; edx = JumpBuffer
>     mov     edx, [edx + 24]        ; edx = target SSP
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0x1E
>     _emit      0xC8                ; READSSP EAX
>     sub     edx, eax               ; edx = delta
>     mov     eax, edx               ; eax = delta
>  
>     shr     eax, 2                 ; eax = delta/sizeof(UINT32)
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0xAE
>     _emit      0xE8                ; INCSSP EAX
>  
> CetDone:
>  
>     pop     eax                         ; skip return address
>     pop     edx                         ; edx <- JumpBuffer
>     pop     eax                         ; eax <- Value
>     mov     ebx, [edx]
>     mov     esi, [edx + 4]
>     mov     edi, [edx + 8]
>     mov     ebp, [edx + 12]
>     mov     esp, [edx + 16]
>     jmp     dword ptr [edx + 20]
>   }
> }
>  
>  
> From: afish@apple.com [mailto:afish@apple.com]
> Sent: Wednesday, March 6, 2019 9:56 PM
> To: Yao, Jiewen <jiewen.yao@intel.com>
> Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>  
>  
> 
> 
> On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:
>  
> Hi Mike and Andrew
> The problem is maintainability.
> 
> If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist at all. - That increases the complexity.
> 
> If we have everything, there MAY be 5 copies - ASM, NASM, S, GCC inline, and MS inline, theoretically.
> Now, we remove ASM. That is a good first step.
> But we may still have 4 copies. I suggest we consider doing more.
> 
>  
> Jiewen,
>  
> I think you are trying to do the right thing, but are optimizing the wrong thing.
> 
> Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts a single x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.
> 
> Let's look at the history:
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
> The last logic fix was Jun 1, 2010
> 
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
> Ok so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeating an instruction. But prior to that it was 2010.
> 
> The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have; if any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar for adding anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.
> 
> I gave an example in this mail thread on how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.
>  
> VOID
> EFIAPI
> CpuBreakpoint (
>   VOID
>   )
> {
>   __asm__ __volatile__ ("int $3");
> }
> 
> 
> Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.
>  
> a.out[0x1fa5] <+6>:  cc              int3   
> 
> 
> But if we move that to NASM:
> 
> 
> a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint
> 
> 
> plus:
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl  %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3   
> a.out[0x1f9d] <+4>: 5d     popl   %ebp
> a.out[0x1f9e] <+5>: c3     retl   
> 
> 
> For any compiler that emits the frame pointer, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.
> 
> 
> Thanks,
>  
> Andrew Fish
>  
> PS for clang LTO, the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.
>  
> The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen. 
>  
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>   call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
>   ret void
> }
> 
> 
> Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:
>  
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>   %2 = alloca i32, align 4
>   %3 = alloca i64, align 8
>   store i32 %0, i32* %2, align 4
>   %4 = load i32, i32* %2, align 4
>   %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>   store i64 %5, i64* %3, align 8
>   %6 = load i64, i64* %3, align 8
>   ret i64 %6
> }
> 
> 
> The alloca, store, load, etc are the same bitcode instructions you would see with C code. 
> 
> 
> A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.
> 
> Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
> Then I noticed that there is a separate copy of the IA32 MS inline implementation in Jump.c - I also needed to update that. That is frustrating.
> 
> I think there should be a balance between optimization and code readability/maintainability.
> 
> Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
> We can do evaluation at function level, case by case.
> If we see a huge size benefit, I agree to keep this function.
> If we see no size benefit, I suggest we do the cleanup for this function.
> 
> 
> Thank you
> Yao Jiewen
> 
> 
> 
> 
> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
> Andrew Fish via edk2-devel
> Sent: Wednesday, March 6, 2019 5:31 PM
> To: Kinney, Michael D <michael.d.kinney@intel.com>
> Cc: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming
> <liming.gao@intel.com>
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> 
> 
> On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
> <michael.d.kinney@intel.com> wrote:
> 
> 
> Liming,
> 
> That does not work either because inline assembly uses compiler specific
> syntax.
> 
> 
> Please update with the specific list of functions for which you think the inline
> assembly should be removed to improve maintenance with no impact on size/perf.
> 
>  
> 
> Mike,
> 
> It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
> __volatile__"`
> 
> The main files are:
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLib
> Intrinsic/IoLibGcc.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X
> 64/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/I
> a32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
> hronizationLib/Ia32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
> hronizationLib/X64/GccInline.c
> 
> This is really just compiler optimizer control.
> Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define
> _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
> 
> Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__
> ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__
> ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );
> 
> It seems like we have a reasonable set and should not use the inline
> assembly for any more complex code.
> 
> Thanks,
> 
> Andrew Fish
> 
> 
> The issue you pointed to was around SetJump()/LongJump().  Can we limit
> this BZ to only those 2 functions?
> 
> 
> Mike
> From: Gao, Liming
> Sent: Wednesday, March 6, 2019 4:28 PM
> To: afish@apple.com
> Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
> <edk2-devel@lists.01.org>; Kinney, Michael D
> <michael.d.kinney@intel.com>
> 
> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> Andrew:
>  I want to keep only one implementation. If the inline assembly C source is
> preferred, I suggest removing its NASM implementation.
> 
> 
> Thanks
> Liming
> From: afish@apple.com [mailto:afish@apple.com]
> 
> Sent: Tuesday, March 5, 2019 2:44 PM
> To: Gao, Liming <liming.gao@intel.com>
> Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
> <edk2-devel@lists.01.org>; Kinney,
> Michael D <michael.d.kinney@intel.com>
> 
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> 
> 
> 
> On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:
> 
> 
> Andrew:
> BZ 1163 is to remove inline X86 assembly code in C source files. But this
> patch is wrong. I have given my comments to update this patch.
> 
> 
> 
> Why do we want to remove inline X86 assembly? As I mentioned, it will lead
> to larger binaries, slower binaries, and less optimized code.
> 
> 
> For example take this simple inline assembly function:
> 
> VOID
> EFIAPI
> CpuBreakpoint (
>  VOID
>  )
> {
>  __asm__ __volatile__ ("int $3");
> }
> 
> 
> Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from
> the caller's point of view.
> 
> 
> a.out[0x1fa5] <+6>:  cc              int3
> 
> 
> But if we move that to NASM:
> 
> 
> a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll
> 0x1fb2                    ; CpuBreakpoint
> 
> 
> 
> plus:
> a.out`CpuBreakpoint:
> a.out[0x1fb2] <+0>: cc     int3
> a.out[0x1fb3] <+1>: c3     retl
> 
> 
> And there is also an extra push and pop on the stack. The other issue is that a
> call to a function that is unknown to the compiler will act like a
> _ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++
> 2017). Is the deprecation of some of these intrinsics in VC++ driving the
> removal of inline assembly? For GCC, inline assembly works great for local
> compilation, and for clang LTO it works in whole-program optimization.
> 
> 
> The LTO bitcode includes inline assembly and constraints so that the
> optimizer knows what to do so it can get optimized just like the abstract
> bitcode during the Link Time Optimization.
> 
> 
> This is CpuBreakpoint():
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"()
> #2, !srcloc !3
> 
>  ret void
> }
> 
> 
> This is AsmReadMsr64():
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>  %2 = alloca i32, align 4
>  %3 = alloca i64, align 8
>  store i32 %0, i32* %2, align 4
>  %4 = load i32, i32* %2, align 4
>  %5 = call i64 asm sideeffect "rdmsr",
> "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
> 
>  store i64 %5, i64* %3, align 8
>  %6 = load i64, i64* %3, align 8
>  ret i64 %6
> }
> 
> 
> 
> I agree it makes sense to remove the .S and .asm files and only have .nasm files.
> 
> Thanks,
> 
> Andrew Fish
> 
> PS For Xcode clang, since it emits frame pointers by default, we need to
> add the extra 4 bytes to the assembly functions so the stack can get
> unwound.
> 
> 
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl  %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3
> a.out[0x1f9d] <+4>: 5d     popl   %ebp
> a.out[0x1f9e] <+5>: c3     retl
> 
> 
> So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.
> 
> 
> The change is to reduce the duplicated implementations. It will be good
> for code maintenance. Recently, one patch had to update both the .c and .nasm files.
> Please see
> https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
> Besides this change, I propose to remove all GAS assembly code for the IA32 and
> X64 arch. After those changes, the patch owner will collect the impact on the
> image size.
> 
> 
> Thanks
> Liming
> 
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  6:45                     ` Andrew Fish
@ 2019-03-07  7:15                       ` Gao, Liming
  2019-03-07  7:25                         ` Yao, Jiewen
  0 siblings, 1 reply; 21+ messages in thread
From: Gao, Liming @ 2019-03-07  7:15 UTC (permalink / raw)
  To: afish@apple.com, Yao, Jiewen; +Cc: Kinney, Michael D, edk2-devel-01

Thanks for your clarification. Now, we will focus on SetJump/LongJump() first.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 10:45 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code




On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.

And I also agree with your case.

Let’s discuss another case. The two functions below, SetJump/LongJump, were updated recently - by me, unfortunately.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
But obviously, it brings a maintenance issue for me.
And I don’t believe it meets the criteria you mentioned above.


Jiewen,

I agree - it seems like, given the rules we agree on, this code should be written in NASM.

Thanks,

Andrew Fish


__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax        ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX to read original SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    mov     [edx + 0x24], eax      ; save SSP

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]         ; edx = JumpBuffer
    mov     edx, [edx + 24]        ; edx = target SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    sub     edx, eax               ; edx = delta
    mov     eax, edx               ; eax = delta

    shr     eax, 2                 ; eax = delta/sizeof(UINT32)
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX

CetDone:

    pop     eax                         ; skip return address
    pop     edx                         ; edx <- JumpBuffer
    pop     eax                         ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}



From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code





On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Hi Mike and Andrew
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist at all. - That increases the complexity.

If we have everything, there MAY be 5 copies - ASM, NASM, S, GCC inline, and MS inline, theoretically.
Now, we remove ASM. That is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts a single x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
Ok so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeating an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have; if any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar for adding anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.

I gave an example in this mail thread on how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}



Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3



But if we move that to NASM:



a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint



plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl



For any compiler that emits the frame pointer, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.



Thanks,

Andrew Fish

PS for clang LTO, the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen.

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}



Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}



The alloca, store, load, etc are the same bitcode instructions you would see with C code.



A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
Then I noticed that there is a separate copy of the IA32 MS inline implementation in Jump.c - I also needed to update that. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
We can do evaluation at function level, case by case.
If we see a huge size benefit, I agree to keep this function.
If we see no size benefit, I suggest we do the cleanup for this function.


Thank you
Yao Jiewen





-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D <michael.d.kinney@intel.com>
Cc: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming
<liming.gao@intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code





On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
<michael.d.kinney@intel.com> wrote:



Liming,

That does not work either because inline assembly uses compiler specific
syntax.



Please update with the specific list of functions for which you think the inline
assembly should be removed to improve maintenance with no impact on size/perf.




Mike,

It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
__volatile__"`

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLib
Intrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X
64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/I
a32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
hronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSync
hronizationLib/X64/GccInline.c

This is really just compiler optimizer control.
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define
_ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__
( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__
( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use the inline
assembly for any more complex code.

Thanks,

Andrew Fish



The issue you pointed to was around SetJump()/LongJump().  Can we limit
this BZ to only those 2 functions?



Mike
From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
<edk2-devel@lists.01.org>; Kinney, Michael D
<michael.d.kinney@intel.com>


Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code



Andrew:
 I want to keep only one implementation. If the inline assembly C source is
preferred, I suggest removing its NASM implementation.



Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]

Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming <liming.gao@intel.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01
<edk2-devel@lists.01.org>; Kinney,
Michael D <michael.d.kinney@intel.com>


Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code






On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:



Andrew:
BZ 1163 is to remove inline X86 assembly code in C source files. But this
patch is wrong. I have given my comments to update this patch.




Why do we want to remove inline X86 assembly? As I mentioned, it will lead
to larger binaries, slower binaries, and less optimized code.



For example take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
 VOID
 )
{
 __asm__ __volatile__ ("int $3");
}


Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from
the caller's point of view.



a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll
0x1fb2                    ; CpuBreakpoint




plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3
a.out[0x1fb3] <+1>: c3     retl


And there is also an extra push and pop on the stack. The other issue is that a
call to a function that is unknown to the compiler will act like a
_ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++
2017). Is the deprecation of some of these intrinsics in VC++ driving the
removal of inline assembly? For GCC, inline assembly works great for local
compilation, and for clang LTO it works in whole-program optimization.



The LTO bitcode includes inline assembly and constraints so that the
optimizer knows what to do so it can get optimized just like the abstract
bitcode during the Link Time Optimization.



This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
 call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"()
#2, !srcloc !3


 ret void
}


This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
 %2 = alloca i32, align 4
 %3 = alloca i64, align 8
 store i32 %0, i32* %2, align 4
 %4 = load i32, i32* %2, align 4
 %5 = call i64 asm sideeffect "rdmsr",
"=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4


 store i64 %5, i64* %3, align 8
 %6 = load i64, i64* %3, align 8
 ret i64 %6
}



I agree it makes sense to remove the .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For Xcode clang, since it emits frame pointers by default, we need to
add the extra 4 bytes to the assembly functions so the stack can get
unwound.



a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.


The change is to reduce the duplicated implementations. It will be good
for code maintenance. Recently, one patch had to update both the .c and .nasm files.
Please see
https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
Besides this change, I propose to remove all GAS assembly code for the IA32 and
X64 arch. After those changes, the patch owner will collect the impact on the
image size.



Thanks
Liming

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  7:15                       ` Gao, Liming
@ 2019-03-07  7:25                         ` Yao, Jiewen
  2019-03-09  0:07                           ` Kinney, Michael D
  0 siblings, 1 reply; 21+ messages in thread
From: Yao, Jiewen @ 2019-03-07  7:25 UTC (permalink / raw)
  To: Gao, Liming, afish@apple.com; +Cc: Kinney, Michael D, edk2-devel-01

Thanks.

I also recommend taking a look at MdePkg\Library\BaseSynchronizationLib.

That one is also complicated and not very readable to me.

And we had 8 patches in 2018 to *fix* the real problem there.
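For context, a rough sketch of the kind of GCC inline compare-exchange wrapper involved - illustrative only, not the exact tree contents; the constraint strings are exactly the sort of detail such patches have to fix:

UINT32
EFIAPI
InternalSyncCompareExchange32 (
  IN volatile UINT32  *Value,
  IN UINT32           CompareValue,
  IN UINT32           ExchangeValue
  )
{
  __asm__ __volatile__ (
    "lock cmpxchgl %2, %1"
    : "+a" (CompareValue),   // EAX: compare value in, original *Value out
      "+m" (*Value)          // the memory operand actually updated
    : "r"  (ExchangeValue)   // stored to *Value only if the compare succeeds
    : "memory", "cc"
    );
  //
  // Returns the original *Value; it equals the caller's compare value on success.
  //
  return CompareValue;
}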

Thank you
Yao Jiewen


From: Gao, Liming
Sent: Wednesday, March 6, 2019 11:15 PM
To: afish@apple.com; Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks for your clarification. Now, we will focus on SetJump/LongJump() first.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 10:45 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code



On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>> wrote:

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.

And I also agree with your case.

Let's discuss another case. The two functions below, SetJump/LongJump, were updated recently, unfortunately by me.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
But obviously, it brings a maintenance issue for me.
And I don't believe it meets the criteria you mentioned above.


Jiewen,

I agree; it seems like, given the rules we agreed on, this code should be written in NASM.

Thanks,

Andrew Fish

__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax        ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX to read original SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    mov     [edx + 0x24], eax      ; save SSP

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]         ; edx = JumpBuffer
    mov     edx, [edx + 24]        ; edx = target SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    sub     edx, eax               ; edx = delta
    mov     eax, edx               ; eax = delta

    shr     eax, 2                 ; eax = delta/sizeof(UINT32)
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX

CetDone:

    pop     eax                         ; skip return address
    pop     edx                         ; edx <- JumpBuffer
    pop     eax                         ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}
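
For comparison, the core of this in NASM would look roughly like the
following (a minimal sketch that omits the CET handling above; the actual
MdePkg/Library/BaseLib/Ia32/SetJump.nasm is the authoritative version):

    SECTION .text

extern ASM_PFX(InternalAssertJumpBuffer)

global ASM_PFX(SetJump)
ASM_PFX(SetJump):
    push    DWORD [esp + 4]
    call    ASM_PFX(InternalAssertJumpBuffer)
    pop     ecx
    pop     ecx                    ; ecx <- return address
    mov     edx, [esp]             ; edx <- JumpBuffer
    mov     [edx], ebx             ; save the non-volatile registers
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx        ; save the return address
    xor     eax, eax               ; a direct SetJump call returns 0
    jmp     ecx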



From: afish@apple.com<mailto:afish@apple.com> [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
Cc: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; Gao, Liming <liming.gao@intel.com<mailto:liming.gao@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code




On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>> wrote:

Hi Mike and Andrew
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist. That increases the complexity.

If we have everything, there MAY be 5 copies, theoretically: ASM, NASM, S, GCC inline, and MS inline.
Now, we remove ASM. It is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c; that code mostly just abstracts an x86 instruction, and the functions are very simple, so it is NOT code that needs ongoing maintenance.

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.

I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and it will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}


Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint


plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


For any compiler that emits frame pointers, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.
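
For reference, the NASM source behind that frame-pointer version is tiny; a
sketch, assuming the usual edk2 ASM_PFX convention (NASM emits no prologue
for you, so the frame setup is written out):

    SECTION .text

global ASM_PFX(CpuBreakpoint)
ASM_PFX(CpuBreakpoint):
    push ebp            ; keep a frame so the stack can be unwound
    mov  ebp, esp
    int  3
    pop  ebp
    ret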


Thanks,

Andrew Fish

PS for clang LTO the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen.

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}


Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}


The alloca, store, load, etc. are the same bitcode instructions you would see with C code.
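
For reference, the C inline assembler that produces that bitcode is roughly
the following (a sketch reconstructed from the "=A,{cx}" constraint string;
the actual Ia32/GccInline.c may differ in detail):

UINT64
EFIAPI
AsmReadMsr64 (
  IN UINT32  Index
  )
{
  UINT64  Data;

  __asm__ __volatile__ (
    "rdmsr"
    : "=A" (Data)    // "=A" pins the 64-bit result to the EDX:EAX pair
    : "c"  (Index)   // "c" passes the MSR index in ECX
    );

  return Data;
}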


A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
Then I noticed that there is a separate copy, the IA32 Jump.c MS inline, which I also needed to update. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
We can do the evaluation at the function level, case by case.
If we see a huge size benefit, I agree to keep this function.
If we see no size benefit, I suggest we do the cleanup for this function.


Thank you
Yao Jiewen




-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>
Cc: edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>; Gao, Liming
<liming.gao@intel.com<mailto:liming.gao@intel.com>>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code




On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
<michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>> wrote:


Liming,

That does not work either, because inline assembly uses compiler-specific
syntax.


Please update with the specific list of functions for which you think the
inline should be removed to improve maintenance with no impact on size/perf.



Mike,

It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
__volatile__"`

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c

This is really just compiler optimizer control.
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
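
A barrier like that emits no instructions; it only keeps the compiler from
reordering memory accesses across it. A minimal sketch of the kind of code it
protects (hypothetical names, not an existing edk2 function):

VOID
SignalDevice (
  IN UINT32  *Request,     // hypothetical memory-mapped request word
  IN UINT32  *Doorbell     // hypothetical memory-mapped doorbell
  )
{
  *Request = 1;            // without the barrier, the compiler could
  _ReadWriteBarrier ();    // reorder or merge these two plain stores
  *Doorbell = 1;
}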

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use the inline
assembly for any more complex code.

Thanks,

Andrew Fish


The issue you pointed to was around SetJump()/LongJump().  Can we limit
this BZ to only those 2 functions?


Mike
From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com<mailto:afish@apple.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com<mailto:shenglei.zhang@intel.com>>; edk2-devel-01
<edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>; Kinney, Michael D
<michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>

Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code


Andrew:
 I want to keep only one implementation. If the inline assembly C source is
preferred, I suggest removing its NASM implementation.


Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming <liming.gao@intel.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>

Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code





On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com<mailto:liming.gao@intel.com>
<mailto:liming.gao@intel.com>> wrote:


Andrew:
BZ 1163 is to remove inline X86 assembly code in C source files. But this
patch is wrong; I have given my comments to update this patch.



Why do we want to remove inline X86 assembly? As I mentioned, it will lead
to larger binaries, slower binaries, and less optimized code.


For example take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
 VOID
 )
{
 __asm__ __volatile__ ("int $3");
}


Today with clang LTO turned on, calling CpuBreakpoint() looks like this from
the caller's point of view.


a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint



plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3
a.out[0x1fb3] <+1>: c3     retl


And there is also an extra push and pop on the stack. The other issue is that
the call to a function that is unknown to the compiler will act like a
_ReadWriteBarrier (Yikes, I see _ReadWriteBarrier is deprecated in VC++
2017). Is the deprecation of some of these intrinsics in VC++ driving the
removal of inline assembly? For GCC, inline assembly works great for local
compile, and for clang LTO it works in entire-program optimization.


The LTO bitcode includes the inline assembly and its constraints, so the
optimizer knows what to do, and it can get optimized just like the abstract
bitcode during Link Time Optimization.


This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}


This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}



I agree it makes sense to remove .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For the Xcode clang, since it emits frame pointers by default, we need to
add the extra 4 bytes to the assembly functions to save the frame pointer so
the stack can get unwound.


a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.


The change is to reduce the duplicated implementation. It will be good
for code maintenance. Recently, one patch had to update both the .c and the .nasm file.
Please see
https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html
Besides this change, I propose to remove all GAS assembly code for the IA32 and
X64 arch. After those changes, the patch owner will collect the impact on
image size.


Thanks
Liming

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
https://lists.01.org/mailman/listinfo/edk2-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-07  7:25                         ` Yao, Jiewen
@ 2019-03-09  0:07                           ` Kinney, Michael D
  2019-03-09  0:15                             ` Andrew Fish
  0 siblings, 1 reply; 21+ messages in thread
From: Kinney, Michael D @ 2019-03-09  0:07 UTC (permalink / raw)
  To: Yao, Jiewen, Gao, Liming, afish@apple.com, Kinney, Michael D
  Cc: edk2-devel-01

Jiewen,

There are not many of those functions and they can be on speed paths.

I do not recommend converting them to only NASM.

I do recommend we add comments to the NASM files and the C files with inline assembly that state that updates to one require updates to the others.
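
For example, a note along these lines at the top of each copy (illustrative
wording only, not an existing edk2 comment):

//
// NOTE: This function has a parallel implementation in NASM
// (Ia32/SetJump.nasm). Any logic change here must be mirrored there,
// and vice versa.
//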

Mike

From: Yao, Jiewen
Sent: Wednesday, March 6, 2019 11:25 PM
To: Gao, Liming <liming.gao@intel.com>; afish@apple.com
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks.

I also recommend taking a look at MdePkg\Library\BaseSynchronizationLib.

That is also complicated and not so readable for me.

And we have 8 patches to *fix* the real problem in 2018.

Thank you
Yao Jiewen


From: Gao, Liming
Sent: Wednesday, March 6, 2019 11:15 PM
To: afish@apple.com<mailto:afish@apple.com>; Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
Cc: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks for your clarification. Now, we will focus on SetJump/LongJump() first.

Thanks
Liming
From: afish@apple.com<mailto:afish@apple.com> [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 10:45 PM
To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
Cc: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; Gao, Liming <liming.gao@intel.com<mailto:liming.gao@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code



On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>> wrote:

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.

And I also agree with your case.

Let's discuss another case. The two functions below, SetJump/LongJump, were updated recently, unfortunately by me.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
But obviously, it brings a maintenance issue for me.
And I don't believe it meets the criteria you mentioned above.


Jiewen,

I agree; it seems like, given the rules we agreed on, this code should be written in NASM.

Thanks,

Andrew Fish

__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax        ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX to read original SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    mov     [edx + 0x24], eax      ; save SSP

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]         ; edx = JumpBuffer
    mov     edx, [edx + 24]        ; edx = target SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    sub     edx, eax               ; edx = delta
    mov     eax, edx               ; eax = delta

    shr     eax, 2                 ; eax = delta/sizeof(UINT32)
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX

CetDone:

    pop     eax                         ; skip return address
    pop     edx                         ; edx <- JumpBuffer
    pop     eax                         ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}



From: afish@apple.com<mailto:afish@apple.com> [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
Cc: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; Gao, Liming <liming.gao@intel.com<mailto:liming.gao@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code



On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>> wrote:

Hi Mike and Andrew
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist. That increases the complexity.

If we have everything, there MAY be 5 copies, theoretically: ASM, NASM, S, GCC inline, and MS inline.
Now, we remove ASM. It is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c; that code mostly just abstracts an x86 instruction, and the functions are very simple, so it is NOT code that needs ongoing maintenance.

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.

I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and it will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3

But if we move that to NASM:

a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint

plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl

For any compiler that emits frame pointers, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.

Thanks,

Andrew Fish

PS for clang LTO the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen.

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}

The alloca, store, load, etc. are the same bitcode instructions you would see with C code.

A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
Then I noticed that there is a separate copy, the IA32 Jump.c MS inline, which I also needed to update. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
We can do the evaluation at the function level, case by case.
If we see a huge size benefit, I agree to keep this function.
If we see no size benefit, I suggest we do the cleanup for this function.


Thank you
Yao Jiewen



-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>
Cc: edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>; Gao, Liming
<liming.gao@intel.com<mailto:liming.gao@intel.com>>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code



On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
<michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>> wrote:

Liming,

That does not work either, because inline assembly uses compiler-specific
syntax.

Please update with the specific list of functions for which you think the
inline should be removed to improve maintenance with no impact on size/perf.


Mike,

It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
__volatile__"`

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c

This is really just compiler optimizer control.
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use the inline
assembly for any more complex code.

Thanks,

Andrew Fish

The issue you pointed to was around SetJump()/LongJump().  Can we limit
this BZ to only those 2 functions?

Mike
From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com<mailto:afish@apple.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com<mailto:shenglei.zhang@intel.com>>; edk2-devel-01
<edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>; Kinney, Michael D
<michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code

Andrew:
 I want to keep only one implementation. If the inline assembly C source is
preferred, I suggest removing its NASM implementation.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming <liming.gao@intel.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code




On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com<mailto:liming.gao@intel.com>
<mailto:liming.gao@intel.com>> wrote:

Andrew:
BZ 1163 is to remove inline X86 assembly code in C source file. But, this
patch is wrong. I have gave my comments to update this patch.


Why do we want to remove inline X86 assembly? As I mentioned, it will lead
to larger binaries, slower binaries, and less optimized code.

For example take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
 VOID
 )
{
 __asm__ __volatile__ ("int $3");
}


Today with clang LTO turned on, calling CpuBreakpoint() looks like this from
the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint


plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3
a.out[0x1fb3] <+1>: c3     retl


And there is also an extra push and pop on the stack. The other issue is that
the call to a function that is unknown to the compiler will act like a
_ReadWriteBarrier (Yikes, I see _ReadWriteBarrier is deprecated in VC++
2017). Is the deprecation of some of these intrinsics in VC++ driving the
removal of inline assembly? For GCC, inline assembly works great for local
compile, and for clang LTO it works in entire-program optimization.

The LTO bitcode includes the inline assembly and its constraints, so the
optimizer knows what to do, and it can get optimized just like the abstract
bitcode during Link Time Optimization.

This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}


This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}



I agree it makes sense to remove .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For the Xcode clang, since it emits frame pointers by default, we need to
add the extra 4 bytes to the assembly functions to save the frame pointer so
the stack can get unwound.

a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.


The change is to reduce the duplicated implementation. It will be good
for code maintenance. Recently, one patch had to update both the .c and the .nasm file.
Please see
https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html
Besides this change, I propose to remove all GAS assembly code for the IA32 and
X64 arch. After those changes, the patch owner will collect the impact on
image size.

Thanks
Liming

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
https://lists.01.org/mailman/listinfo/edk2-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-09  0:07                           ` Kinney, Michael D
@ 2019-03-09  0:15                             ` Andrew Fish
  2019-03-14 22:27                               ` Gao, Liming
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Fish @ 2019-03-09  0:15 UTC (permalink / raw)
  To: Mike Kinney; +Cc: Yao, Jiewen, Gao, Liming, edk2-devel-01



> On Mar 8, 2019, at 4:07 PM, Kinney, Michael D <michael.d.kinney@intel.com> wrote:
> 
> Jiewen,
>  
> There are not many of those functions and they can be on speed paths.
>  
> I do not recommend converting them to only NASM.
>  
> I do recommend we add comments to the NASM files and the C files with inline assembly that state that updates to one require updates to the others.
>  

Mike,

It looks like NASM files only exist to support the INTEL compiler. Could that be an outdated assumption? Over time, has the INTEL compiler added more compatibility with GCC inline assembly, or VC++ intrinsics? I assume people are still using the INTEL compiler?

Thanks,

Andrew Fish


> Mike
>
> From: Yao, Jiewen 
> Sent: Wednesday, March 6, 2019 11:25 PM
> To: Gao, Liming <liming.gao@intel.com>; afish@apple.com
> Cc: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>  
> Thanks.
>  
> I also recommend taking a look at MdePkg\Library\BaseSynchronizationLib.
>  
> That is also complicated and not so readable for me.
>  
> And we have 8 patches to *fix* the real problem in 2018.
>  
> Thank you
> Yao Jiewen
>  
>  
> From: Gao, Liming
> Sent: Wednesday, March 6, 2019 11:15 PM
> To: afish@apple.com <mailto:afish@apple.com>; Yao, Jiewen <jiewen.yao@intel.com <mailto:jiewen.yao@intel.com>>
> Cc: Kinney, Michael D <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>>
> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>  
> Thanks for your clarification. Now, we will focus on SetJump/LongJump() first.
>  
> Thanks
> Liming
> From: afish@apple.com <mailto:afish@apple.com> [mailto:afish@apple.com <mailto:afish@apple.com>] 
> Sent: Wednesday, March 6, 2019 10:45 PM
> To: Yao, Jiewen <jiewen.yao@intel.com <mailto:jiewen.yao@intel.com>>
> Cc: Kinney, Michael D <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>>; Gao, Liming <liming.gao@intel.com <mailto:liming.gao@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>>
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>  
>  
>  
> 
> On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com <mailto:jiewen.yao@intel.com>> wrote:
>  
> Thanks Andrew.
> Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.
>  
> And I also agree with your case.
>  
> Let's discuss another case. The two functions below, SetJump/LongJump, were updated recently, unfortunately by me.
> 
> It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
> But obviously, it brings a maintenance issue for me.
> And I don't believe it meets the criteria you mentioned above.
>  
>  
> Jiewen,
>  
> I agree; it seems like, given the rules we agreed on, this code should be written in NASM.
>  
> Thanks,
>  
> Andrew Fish
>  
> 
> __declspec (naked)
> RETURNS_TWICE
> UINTN
> EFIAPI
> SetJump (
>   OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
>   )
> {
>   _asm {
>     push    [esp + 4]
>     call    InternalAssertJumpBuffer
>     pop     ecx
>     pop     ecx
>     mov     edx, [esp]
>  
>     xor     eax, eax
>     mov     [edx + 24], eax        ; save 0 to SSP
>  
>     mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test    eax, eax
>     jz      CetDone
>     _emit      0x0F
>     _emit      0x20
>     _emit      0xE0                ; mov     eax, cr4
>     bt      eax, 23                ; check if CET is enabled
>     jnc     CetDone
>  
>     mov     eax, 1
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0xAE
>     _emit      0xE8                ; INCSSP EAX to read original SSP
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0x1E
>     _emit      0xC8                ; READSSP EAX
>     mov     [edx + 0x24], eax      ; save SSP
>  
> CetDone:
>  
>     mov     [edx], ebx
>     mov     [edx + 4], esi
>     mov     [edx + 8], edi
>     mov     [edx + 12], ebp
>     mov     [edx + 16], esp
>     mov     [edx + 20], ecx
>     xor     eax, eax
>     jmp     ecx
>   }
> }
>  
> __declspec (naked)
> VOID
> EFIAPI
> InternalLongJump (
>   IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
>   IN      UINTN                     Value
>   )
> {
>   _asm {
>     mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test    eax, eax
>     jz      CetDone
>     _emit      0x0F
>     _emit      0x20
>     _emit      0xE0                ; mov     eax, cr4
>     bt      eax, 23                ; check if CET is enabled
>     jnc     CetDone
>  
>     mov     edx, [esp + 4]         ; edx = JumpBuffer
>     mov     edx, [edx + 24]        ; edx = target SSP
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0x1E
>     _emit      0xC8                ; READSSP EAX
>     sub     edx, eax               ; edx = delta
>     mov     eax, edx               ; eax = delta
>  
>     shr     eax, 2                 ; eax = delta/sizeof(UINT32)
>     _emit      0xF3
>     _emit      0x0F
>     _emit      0xAE
>     _emit      0xE8                ; INCSSP EAX
>  
> CetDone:
>  
>     pop     eax                         ; skip return address
>     pop     edx                         ; edx <- JumpBuffer
>     pop     eax                         ; eax <- Value
>     mov     ebx, [edx]
>     mov     esi, [edx + 4]
>     mov     edi, [edx + 8]
>     mov     ebp, [edx + 12]
>     mov     esp, [edx + 16]
>     jmp     dword ptr [edx + 20]
>   }
> }
>  
>  
>  
> From: afish@apple.com <mailto:afish@apple.com> [mailto:afish@apple.com <mailto:afish@apple.com>] 
> Sent: Wednesday, March 6, 2019 9:56 PM
> To: Yao, Jiewen <jiewen.yao@intel.com <mailto:jiewen.yao@intel.com>>
> Cc: Kinney, Michael D <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>>; Gao, Liming <liming.gao@intel.com <mailto:liming.gao@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>>
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>  
>  
>  
> 
> On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com <mailto:jiewen.yao@intel.com>> wrote:
>  
> Hi Mike and Andrew
> The problem is maintainability.
> 
> If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist. That increases the complexity.
> 
> If we have everything, there MAY be 5 copies, theoretically: ASM, NASM, S, GCC inline, and MS inline.
> Now, we remove ASM. It is a good first step.
> But we may still have 4 copies. I suggest we consider doing more.
> 
>  
> Jiewen,
>  
> I think you are trying to do the right thing, but are optimizing the wrong thing.
> 
> Most of the GCC/Clang inline assembler code is in GccInline.c; that code mostly just abstracts an x86 instruction, and the functions are very simple, so it is NOT code that needs ongoing maintenance.
> 
> Let's look at the history:
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
> The last logic fix was Jun 1, 2010
> 
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
> OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.
> 
> The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.
> 
> I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and it will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.
>  
> VOID
> EFIAPI
> CpuBreakpoint (
>   VOID
>   )
> {
>   __asm__ __volatile__ ("int $3");
> }
>  
> 
> Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.
>  
> a.out[0x1fa5] <+6>:  cc              int3   
>  
> 
> But if we move that to NASM:
>  
> 
> a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint
>  
> 
> plus:
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl  %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3   
> a.out[0x1f9d] <+4>: 5d     popl   %ebp
> a.out[0x1f9e] <+5>: c3     retl   
>  
> 
> For any compiler that emits frame pointers, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.
>  
> 
> Thanks,
>  
> Andrew Fish
>  
> PS for clang LTO the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.
>  
> The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen. 
>  
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>   call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
>   ret void
> }
>  
> 
> Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:
>  
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>   %2 = alloca i32, align 4
>   %3 = alloca i64, align 8
>   store i32 %0, i32* %2, align 4
>   %4 = load i32, i32* %2, align 4
>   %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>   store i64 %5, i64* %3, align 8
>   %6 = load i64, i64* %3, align 8
>   ret i64 %6
> }
>  
> 
> The alloca, store, load, etc. are the same bitcode instructions you would see with C code.
>  
> 
> A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.
> 
> Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
> Then I noticed that there is a separate copy, the IA32 Jump.c MS inline, which I also needed to update. That is frustrating.
> 
> I think there should be a balance between optimization and code readability/maintainability.
> 
> Do we have data on what size benefit we can get with these inline functions, at the whole-image level?
> We can do the evaluation at the function level, case by case.
> If we see a huge size benefit, I agree to keep this function.
> If we see no size benefit, I suggest we do the cleanup for this function.
> 
> 
> Thank you
> Yao Jiewen
> 
> 
> 
> 
> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org <mailto:edk2-devel-bounces@lists.01.org>] On Behalf Of
> Andrew Fish via edk2-devel
> Sent: Wednesday, March 6, 2019 5:31 PM
> To: Kinney, Michael D <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>>
> Cc: edk2-devel-01 <edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>>; Gao, Liming
> <liming.gao@intel.com <mailto:liming.gao@intel.com>>
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> 
> 
> On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
> <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>> wrote:
> 
> 
> Liming,
> 
> That does not work either, because inline assembly uses compiler-specific
> syntax.
> 
> 
> Please update with the specific list of functions for which you think the
> inline should be removed to improve maintenance with no impact on size/perf.
> 
>  
> 
> Mike,
> 
> It is easy to find the gcc/clang inline assembler, just `git grep "__asm__
> __volatile__"`
> 
> The main files are:
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c
> 
> This is really just compiler optimizer control.
> Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
> 
> Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );
> 
> It seems like we have a reasonable set and should not use the inline
> assembly for any more complex code.
> 
> Thanks,
> 
> Andrew Fish
> 
> 
> The issue you pointed to was around SetJump()/LongJump().  Can we limit
> this BZ to only those 2 functions?
> 
> 
> Mike
>
> From: Gao, Liming
> Sent: Wednesday, March 6, 2019 4:28 PM
> To: afish@apple.com <mailto:afish@apple.com>
> Cc: Zhang, Shenglei <shenglei.zhang@intel.com <mailto:shenglei.zhang@intel.com>>; edk2-devel-01
> <edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>>; Kinney, Michael D
> <michael.d.kinney@intel.com <mailto:michael.d.kinney@intel.com>>
> 
> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> Andrew:
>  I want to keep only one implementation. If the inline assembly C source is
> preferred, I suggest removing its NASM implementation.
> 
> 
> Thanks
> Liming
> From: afish@apple.com [mailto:afish@apple.com]
> Sent: Tuesday, March 5, 2019 2:44 PM
> To: Gao, Liming <liming.gao@intel.com>
> Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>
> 
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
> inline X86 assembly code
> 
> 
> 
> 
> 
> On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com <mailto:liming.gao@intel.com>
> <mailto:liming.gao@intel.com <mailto:liming.gao@intel.com>>> wrote:
> 
> 
> Andrew:
> BZ 1163 is to remove inline X86 assembly code in C source files. But this
> patch is wrong; I have given my comments to update this patch.
> 
> 
> 
> Why do we want to remove inline X86 assembly? As I mentioned, it will lead
> to larger binaries, slower binaries, and less optimized code.
> 
> 
> For example take this simple inline assembly function:
> 
> VOID
> EFIAPI
> CpuBreakpoint (
>  VOID
>  )
> {
>  __asm__ __volatile__ ("int $3");
> }
> 
> 
> Today with clang LTO turned on, calling CpuBreakpoint() looks like this from
> the caller's point of view.
> 
> 
> a.out[0x1fa5] <+6>:  cc              int3
> 
> 
> But if we move that to NASM:
> 
> 
> a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint
> 
> 
> 
> plus:
> a.out`CpuBreakpoint:
> a.out[0x1fb2] <+0>: cc     int3
> a.out[0x1fb3] <+1>: c3     retl
> 
> 
> And there is also an extra push and pop on the stack. The other issue is that
> the call to a function that is unknown to the compiler will act like a
> _ReadWriteBarrier (Yikes, I see _ReadWriteBarrier is deprecated in VC++
> 2017). Is the deprecation of some of these intrinsics in VC++ driving the
> removal of inline assembly? For GCC, inline assembly works great for local
> compile, and for clang LTO it works in entire-program optimization.
> 
> 
> The LTO bitcode includes the inline assembly and its constraints, so the
> optimizer knows what to do, and it can get optimized just like the abstract
> bitcode during Link Time Optimization.
> 
> 
> This is CpuBreakpoint():
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>   call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
>   ret void
> }
> 
> 
> This is AsmReadMsr64():
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>   %2 = alloca i32, align 4
>   %3 = alloca i64, align 8
>   store i32 %0, i32* %2, align 4
>   %4 = load i32, i32* %2, align 4
>   %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>   store i64 %5, i64* %3, align 8
>   %6 = load i64, i64* %3, align 8
>   ret i64 %6
> }
> 
> 
> 
> I agree it makes sense to remove .S and .asm files and only have .nasm files.
> 
> Thanks,
> 
> Andrew Fish
> 
> PS For the Xcode clang, since it emits frame pointers by default, we need to
> add the extra 4 bytes to the assembly functions to save the frame pointer so
> the stack can get unwound.
> 
> 
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl  %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3
> a.out[0x1f9d] <+4>: 5d     popl   %ebp
> a.out[0x1f9e] <+5>: c3     retl
> 
> 
> So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.
> 
> 
> The change is to reduce the duplicated implementation. It will be good
> for code maintenance. Recently, one patch had to update both the .c and the .nasm file.
> Please see
> https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html
> Besides this change, I propose to remove all GAS assembly code for the IA32 and
> X64 arch. After those changes, the patch owner will collect the impact on
> image size.
> 
> 
> Thanks
> Liming
> 
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org <mailto:edk2-devel@lists.01.org>
> https://lists.01.org/mailman/listinfo/edk2-devel <https://lists.01.org/mailman/listinfo/edk2-devel>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
  2019-03-09  0:15                             ` Andrew Fish
@ 2019-03-14 22:27                               ` Gao, Liming
  0 siblings, 0 replies; 21+ messages in thread
From: Gao, Liming @ 2019-03-14 22:27 UTC (permalink / raw)
  To: afish@apple.com, Kinney, Michael D; +Cc: Yao, Jiewen, edk2-devel-01

Andrew:
  I don't know who still uses the Intel compiler. I propose to remove the INTEL tool chain (ICC9 and ICC11) from tools_def.txt if no one rejects it. After the INTEL tool chain is removed, if all compilers (VS, GCC, XCODE, and CLANG) support the inline X86 assembly code, I will propose removing the NASM files that exist only for the INTEL compiler.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Friday, March 8, 2019 4:15 PM
To: Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Yao, Jiewen <jiewen.yao@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code




On Mar 8, 2019, at 4:07 PM, Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>> wrote:

Jiewen,

There are not many of those functions and they can be on speed paths.

I do not recommend converting them to only NASM.

I do recommend we add comments to the NASM files and the C files with inline assembly that state that updates to one require updates to the others.


Mike,

It looks like NASM files only exist to support the INTEL compiler. Could that be an outdated assumption? Over time, has the INTEL compiler added more compatibility with GCC inline assembly, or VC++ intrinsics? I assume people are still using the INTEL compiler?

Thanks,

Andrew Fish



Mike

From: Yao, Jiewen
Sent: Wednesday, March 6, 2019 11:25 PM
To: Gao, Liming <liming.gao@intel.com>; afish@apple.com
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks.

I also recommend taking a look at MdePkg\Library\BaseSynchronizationLib.

That is also complicated and not so readable for me.

And we had 8 patches to *fix* the real problem in 2018.

Thank you
Yao Jiewen


From: Gao, Liming
Sent: Wednesday, March 6, 2019 11:15 PM
To: afish@apple.com; Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks for your clarification. Now, we will focus on SetJump/LongJump() first.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 10:45 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code



On Mar 6, 2019, at 10:09 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.

And I also agree with your case.

Let’s discuss another case. The 2 functions below, SetJump/LongJump, were updated recently - by me, unfortunately.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
But it obviously brings a maintenance burden to me.
And I don’t believe it meets the criteria you mentioned above.


Jiewen,

I agree - it seems like, given the rules we agreed on, this code should be written in NASM.

Thanks,

Andrew Fish

__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax        ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX to read original SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    mov     [edx + 24], eax        ; save SSP (same offset LongJump reads)

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit      0x0F
    _emit      0x20
    _emit      0xE0                ; mov     eax, cr4
    bt      eax, 23                ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]         ; edx = JumpBuffer
    mov     edx, [edx + 24]        ; edx = target SSP
    _emit      0xF3
    _emit      0x0F
    _emit      0x1E
    _emit      0xC8                ; READSSP EAX
    sub     edx, eax               ; edx = delta
    mov     eax, edx               ; eax = delta

    shr     eax, 2                 ; eax = delta/sizeof(UINT32)
    _emit      0xF3
    _emit      0x0F
    _emit      0xAE
    _emit      0xE8                ; INCSSP EAX

CetDone:

    pop     eax                         ; skip return address
    pop     edx                         ; edx <- JumpBuffer
    pop     eax                         ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}
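
For comparison, a minimal NASM sketch of just the register save path (omitting the CET/shadow-stack handling above) shows the shape such a port would take. This is illustrative only, not the actual Ia32/SetJump.nasm:

    SECTION .text

extern ASM_PFX(InternalAssertJumpBuffer)

; UINTN EFIAPI SetJump (BASE_LIBRARY_JUMP_BUFFER *JumpBuffer);
; ASM_PFX comes from the edk2 NASM build glue.
global ASM_PFX(SetJump)
ASM_PFX(SetJump):
    push    DWORD [esp + 4]        ; re-push JumpBuffer for the assert
    call    ASM_PFX(InternalAssertJumpBuffer)
    pop     ecx                    ; discard the pushed argument
    pop     ecx                    ; ecx <- caller's return address
    mov     edx, [esp]             ; edx <- JumpBuffer
    mov     [edx], ebx             ; save non-volatile registers,
    mov     [edx + 4], esi         ; the stack pointer, and the
    mov     [edx + 8], edi         ; return address
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax               ; SetJump itself returns 0
    jmp     ecx                    ; return to the caller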



From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Kinney, Michael D <michael.d.kinney@intel.com>; Gao, Liming <liming.gao@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code



On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Hi Mike and Andrew,
The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist at all. - That increases the complexity.

If we have everything, there MAY be 5 copies - ASM, NASM, S, GCC inline, MS inline - theoretically.
Now we remove ASM. That is a good first step.
But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts a single x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.

Let's look at the history:
https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
The last logic fix was Jun 1, 2010

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU de-featuring an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and should not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.
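
For example, something at the level of CpuPause() is about the right bar - a one-instruction wrapper (a sketch in the style of the existing GccInline.c functions):

VOID
EFIAPI
CpuPause (
  VOID
  )
{
  //
  // PAUSE is a spin-loop hint with no C-level equivalent.
  //
  __asm__ __volatile__ ("pause");
}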

I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, then it gets replaced with a 5-byte call instruction plus an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3

But if we move that to NASM:

a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint

plus:
a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl

For any compiler that emits the frame pointer, if you move the INT 3 to assembly you also need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.

Thanks,

Andrew Fish

PS For clang LTO, the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen.

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}

The alloca, store, load, etc. are the same bitcode instructions you would see with C code.
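
For reference, the inline-assembly C source behind that bitcode looks roughly like this (a sketch in the Ia32/GccInline.c style; the "=A" constraint returns the 64-bit result in edx:eax and "c" pins Index to ecx):

UINT64
EFIAPI
AsmReadMsr64 (
  IN UINT32  Index
  )
{
  UINT64  Data;

  __asm__ __volatile__ (
    "rdmsr"                        // read the MSR selected by ecx
    : "=A" (Data)                  // output: 64-bit value in edx:eax
    : "c"  (Index)                 // input:  MSR index in ecx
    );

  return Data;
}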

A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64.

Later, to my surprise, only the X64 NASM version worked, and the IA32 NASM version did not.
Then I noticed that there is a different copy in the IA32 Jump.c MS inline assembly that I also needed to update. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get from these inline functions, at the whole-image level?
We can do the evaluation at the function level, case by case.
If we see a huge size benefit, I agree to keep the function.
If we see no size benefit, I suggest we do the cleanup for that function.


Thank you
Yao Jiewen




-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D <michael.d.kinney@intel.com>
Cc: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming <liming.gao@intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code




On Mar 6, 2019, at 4:41 PM, Kinney, Michael D
<michael.d.kinney@intel.com> wrote:

Liming,

That does not work either, because inline assembly uses compiler-specific syntax.

Please update the BZ with the specific list of functions for which you think the inline assembly should be removed to improve maintenance with no impact on size/perf.


Mike,

It is easy to find the gcc/clang inline assembler, just `git grep "__asm__ __volatile__"`

The main files are:
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c

This is really just compiler optimizer control.
Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
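
That define is an empty asm statement whose "memory" clobber keeps the optimizer from caching or reordering memory accesses across it. A sketch of the kind of use it enables (the names here are made up for illustration):

volatile BOOLEAN  mDataReady;
UINTN             mSharedData;

VOID
PublishData (
  UINTN  Value
  )
{
  mSharedData = Value;
  //
  // The barrier keeps the compiler from sinking the mSharedData store
  // below the mDataReady store; CPU-level ordering is a separate concern.
  //
  _ReadWriteBarrier ();
  mDataReady = TRUE;
}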

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic.
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set and should not use the inline
assembly for any more complex code.

Thanks,

Andrew Fish


The issue you pointed to was around SetJump()/LongJump().  Can we limit
this BZ to only those 2 functions?

Mike
From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove
inline X86 assembly code

Andrew:
 I want to keep only one implementation. If the inline assembly C source is preferred, I suggest removing its NASM implementation.

Thanks
Liming
From: afish@apple.com [mailto:afish@apple.com]
Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming <liming.gao@intel.com>
Cc: Zhang, Shenglei <shenglei.zhang@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>; Kinney, Michael D <michael.d.kinney@intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code




On Mar 5, 2019, at 1:32 PM, Gao, Liming <liming.gao@intel.com> wrote:

Andrew:
BZ 1163 is to remove the inline X86 assembly code in C source files. But this patch is wrong. I have given my comments to update this patch.


Why do we want to remove inline X86 assembly? As I mentioned, it will lead to larger binaries, slower binaries, and less optimized code.

For example, take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}


Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view.

a.out[0x1fa5] <+6>:  cc              int3


But if we move that to NASM:


a.out[0x1fa6] <+7>:  e8 07 00 00 00  calll  0x1fb2                    ; CpuBreakpoint


plus:
a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc     int3
a.out[0x1fb3] <+1>: c3     retl


And there is also an extra push and pop on the stack. The other issue is that a
call to a function that is unknown to the compiler will act like a
_ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++
2017). Is the deprecation of some of these intrinsics in VC++ driving the
removal of inline assembly? For GCC, inline assembly works great for local
compiles, and for clang LTO it works in entire-program optimization.

The LTO bitcode includes the inline assembly and its constraints, so the
optimizer knows what to do with it and it gets optimized just like the abstract
bitcode during Link Time Optimization.

This is CpuBreakpoint():
; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
 call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
 ret void
}


This is AsmReadMsr64():
; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
 %2 = alloca i32, align 4
 %3 = alloca i64, align 8
 store i32 %0, i32* %2, align 4
 %4 = load i32, i32* %2, align 4
 %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
 store i64 %5, i64* %3, align 8
 %6 = load i64, i64* %3, align 8
 ret i64 %6
}



I agree it makes sense to remove .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For the Xcode clang, since it emits frame pointers by default, we need to
add the extra 4 bytes to the assembly functions to save the frame pointer so
the stack can get unwound.

a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl  %ebp
a.out[0x1f9a] <+1>: 89 e5  movl   %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl   %ebp
a.out[0x1f9e] <+5>: c3     retl


So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.


The change is to reduce the duplicated implementation. It will be good
for code maintenance. Recently, one patch had to update .c and .nasm both.
Please see
https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
Besides this change, I propose to remove all GAS assembly code for the IA32
and X64 arches. After those changes, the patch owner will collect the impact
on image size.

Thanks
Liming

_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-03-14 22:27 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-03-05  1:40 [PATCH 0/3] MdePkg: Remove inline X86 assembly code Shenglei Zhang
2019-03-05  1:40 ` [PATCH 1/3] MdePkg/BaseCpuLib: " Shenglei Zhang
2019-03-05  1:40 ` [PATCH 2/3] MdePkg/BaseLib: " Shenglei Zhang
2019-03-05  1:40 ` [PATCH 3/3] MdePkg/BaseSynchronizationLib: " Shenglei Zhang
2019-03-05 20:57   ` Andrew Fish
2019-03-05 21:32     ` Gao, Liming
2019-03-05 21:45       ` Kinney, Michael D
2019-03-05 22:43       ` Andrew Fish
2019-03-07  0:28         ` Gao, Liming
2019-03-07  0:41           ` Kinney, Michael D
2019-03-07  1:31             ` Andrew Fish
2019-03-07  5:06               ` Yao, Jiewen
2019-03-07  5:56                 ` Andrew Fish
2019-03-07  6:09                   ` Yao, Jiewen
2019-03-07  6:45                     ` Andrew Fish
2019-03-07  7:15                       ` Gao, Liming
2019-03-07  7:25                         ` Yao, Jiewen
2019-03-09  0:07                           ` Kinney, Michael D
2019-03-09  0:15                             ` Andrew Fish
2019-03-14 22:27                               ` Gao, Liming
2019-03-05  2:10 ` [PATCH 0/3] MdePkg: " Gao, Liming

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox