[PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib

public inbox for devel@edk2.groups.io

* [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib
@ 2016-09-09 14:00 Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 1/4] MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits Ard Biesheuvel
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-09 14:00 UTC (permalink / raw)
  To: edk2-devel, liming.gao, leif.lindholm, michael.d.kinney; +Cc: Ard Biesheuvel

This adds ARM and AARCH64 support to both BaseMemoryLib (generic C) and
BaseMemoryLibOptDxe (accelerated). The former can be used anywhere; the
latter only where the caches are guaranteed to be on, both because it
performs unaligned accesses and because it uses DC ZVA instructions to
clear memory (AArch64 only).

Liming: I will need your R-b for patch #4 (assuming you are ok with it). Thanks.

I have tested this version of the series with various emulated, virtualized and
bare metal implementations, and I think this is good to go in now. I will follow
up with a series that adds BaseMemoryLibOptDxe to ArmVirtQemu and other
platforms once I have independent confirmation that everything works as expected
(in other words, Tested-by's are highly appreciated).

Changes since v4:
- update SetMem() for ARM yet again (reduce code size, and minor performance
  tweak)
- add patch #4 to disallow BaseMemoryLibOptDxe in SEC and PEI phases on ARM
  and AARCH64

Branch can be found here
https://git.linaro.org/people/ard.biesheuvel/uefi-next.git/shortlog/refs/heads/arm64-basememorylib-v5

Changes since v3:
- added Liming's R-b
- updated SetMem() to avoid unaligned strd (store pair) instructions, which
  require 32-bit alignment even in cases where ordinary loads and stores do
  tolerate unaligned accesses (#2)
- fix Clang issue in NEON dialect (#3)

Branch can be found here
https://git.linaro.org/people/ard.biesheuvel/uefi-next.git/shortlog/refs/heads/arm64-basememorylib-v4

Changes since v2:
- avoid open coded 64-bit shift (#1)
- tweak SetMem implementation (#2)

Ard Biesheuvel (4):
  MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits
  MdePkg/BaseMemoryLibOptDxe: add accelerated ARM routines
  MdePkg/BaseMemoryLibOptDxe: add accelerated AARCH64 routines
  MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI
    phases

 MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf             |   2 +-
 MdePkg/Library/BaseMemoryLib/CopyMem.c                     | 112 +++++++-
 MdePkg/Library/BaseMemoryLib/SetMem.c                      |  40 ++-
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S    | 142 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S       | 284 ++++++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S       | 161 +++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S        | 244 +++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S        | 138 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm      | 140 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S           | 172 ++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm         | 147 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S           | 146 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm         | 147 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c    | 142 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S            |  77 ++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm          |  84 ++++++
 MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf |  46 +++-
 17 files changed, 2196 insertions(+), 28 deletions(-)
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S
 create mode 100644 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm

-- 
2.7.4




* [PATCH v5 1/4] MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits
  2016-09-09 14:00 [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib Ard Biesheuvel
@ 2016-09-09 14:00 ` Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 2/4] MdePkg/BaseMemoryLibOptDxe: add accelerated ARM routines Ard Biesheuvel
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-09 14:00 UTC (permalink / raw)
  To: edk2-devel, liming.gao, leif.lindholm, michael.d.kinney; +Cc: Ard Biesheuvel

Since the default BaseMemoryLib should be callable from any context,
including ones where unaligned accesses are not allowed, it implements
InternalMemCopyMem() and InternalMemSetMem() using byte accesses only.
However, especially in a context where the MMU is off, such narrow
accesses may be disproportionately costly. So if the size and alignment
of the access allow it, use 32-bit or even 64-bit loads and stores (the
latter may be beneficial even on 32-bit architectures like ARM, which
has load pair/store pair instructions).
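
As a rough illustration of the widening described above, here is a minimal
standalone sketch of the SetMem() case. It is not the patch itself: the name
SketchSetMem is hypothetical, standard C types stand in for the EDK2 ones
(UINT8, UINTN), and the fill byte is replicated through a 32-bit temporary
rather than with the shifts used in the diff below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for InternalMemSetMem(): fill Length bytes at Buffer
 * with Value, using the widest store that the buffer alignment permits.
 * The pointers are volatile, as in the patch, so that the compiler does not
 * turn the loops back into a call to memset().
 */
void *
SketchSetMem (void *Buffer, size_t Length, uint8_t Value)
{
  volatile uint8_t   *Pointer8;
  volatile uint32_t  *Pointer32;
  volatile uint64_t  *Pointer64;
  uint32_t           Value32;
  uint64_t           Value64;

  assert (Buffer != NULL || Length == 0);

  /* Replicate the byte into all four lanes of a 32-bit word. */
  Value32  = Value;
  Value32 |= Value32 << 8;
  Value32 |= Value32 << 16;

  if ((((uintptr_t)Buffer & 0x7) == 0) && (Length >= 8)) {
    /* 8-byte aligned: replicate once more and store 64 bits at a time. */
    Value64   = ((uint64_t)Value32 << 32) | Value32;
    Pointer64 = (volatile uint64_t *)Buffer;
    while (Length >= 8) {
      *(Pointer64++) = Value64;
      Length -= 8;
    }
    Pointer8 = (volatile uint8_t *)Pointer64;
  } else if ((((uintptr_t)Buffer & 0x3) == 0) && (Length >= 4)) {
    /* 4-byte aligned only: store 32 bits at a time. */
    Pointer32 = (volatile uint32_t *)Buffer;
    while (Length >= 4) {
      *(Pointer32++) = Value32;
      Length -= 4;
    }
    Pointer8 = (volatile uint8_t *)Pointer32;
  } else {
    Pointer8 = (volatile uint8_t *)Buffer;
  }

  /* Byte stores handle the tail, or the whole buffer if it is unaligned. */
  while (Length-- > 0) {
    *(Pointer8++) = Value;
  }
  return Buffer;
}
```

Only the start address and the length select the path, so a buffer that is
not at least 4-byte aligned is still filled with byte stores alone, which
keeps the routine usable where unaligned accesses are not allowed.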

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Liming Gao <liming.gao@intel.com>
---
 MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf |   2 +-
 MdePkg/Library/BaseMemoryLib/CopyMem.c         | 112 ++++++++++++++++++--
 MdePkg/Library/BaseMemoryLib/SetMem.c          |  40 ++++++-
 3 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf b/MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
index 6d906e93faf3..358eeed4f449 100644
--- a/MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
+++ b/MdePkg/Library/BaseMemoryLib/BaseMemoryLib.inf
@@ -26,7 +26,7 @@ [Defines]
 
 
 #
-#  VALID_ARCHITECTURES           = IA32 X64 IPF EBC
+#  VALID_ARCHITECTURES           = IA32 X64 IPF EBC ARM AARCH64
 #
 
 [Sources]
diff --git a/MdePkg/Library/BaseMemoryLib/CopyMem.c b/MdePkg/Library/BaseMemoryLib/CopyMem.c
index 37f03660df5f..6f4fd900df5d 100644
--- a/MdePkg/Library/BaseMemoryLib/CopyMem.c
+++ b/MdePkg/Library/BaseMemoryLib/CopyMem.c
@@ -4,6 +4,9 @@
   particular platform easily if an optimized version is desired.
 
   Copyright (c) 2006 - 2010, Intel Corporation. All rights reserved.<BR>
+  Copyright (c) 2012 - 2013, ARM Ltd. All rights reserved.<BR>
+  Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+
   This program and the accompanying materials
   are licensed and made available under the terms and conditions of the BSD License
   which accompanies this distribution.  The full text of the license may be found at
@@ -44,18 +47,107 @@ InternalMemCopyMem (
   //
   volatile UINT8                    *Destination8;
   CONST UINT8                       *Source8;
+  volatile UINT32                   *Destination32;
+  CONST UINT32                      *Source32;
+  volatile UINT64                   *Destination64;
+  CONST UINT64                      *Source64;
+  UINTN                             Alignment;
+
+  if ((((UINTN)DestinationBuffer & 0x7) == 0) && (((UINTN)SourceBuffer & 0x7) == 0) && (Length >= 8)) {
+    if (SourceBuffer > DestinationBuffer) {
+      Destination64 = (UINT64*)DestinationBuffer;
+      Source64 = (CONST UINT64*)SourceBuffer;
+      while (Length >= 8) {
+        *(Destination64++) = *(Source64++);
+        Length -= 8;
+      }
+
+      // Finish if there are still some bytes to copy
+      Destination8 = (UINT8*)Destination64;
+      Source8 = (CONST UINT8*)Source64;
+      while (Length-- != 0) {
+        *(Destination8++) = *(Source8++);
+      }
+    } else if (SourceBuffer < DestinationBuffer) {
+      Destination64 = (UINT64*)((UINTN)DestinationBuffer + Length);
+      Source64 = (CONST UINT64*)((UINTN)SourceBuffer + Length);
+
+      // Destination64 and Source64 were aligned on a 64-bit boundary
+      // but if length is not a multiple of 8 bytes then they won't be
+      // anymore.
+
+      Alignment = Length & 0x7;
+      if (Alignment != 0) {
+        Destination8 = (UINT8*)Destination64;
+        Source8 = (CONST UINT8*)Source64;
+
+        while (Alignment-- != 0) {
+          *(--Destination8) = *(--Source8);
+          --Length;
+        }
+        Destination64 = (UINT64*)Destination8;
+        Source64 = (CONST UINT64*)Source8;
+      }
+
+      while (Length > 0) {
+        *(--Destination64) = *(--Source64);
+        Length -= 8;
+      }
+    }
+  } else if ((((UINTN)DestinationBuffer & 0x3) == 0) && (((UINTN)SourceBuffer & 0x3) == 0) && (Length >= 4)) {
+    if (SourceBuffer > DestinationBuffer) {
+      Destination32 = (UINT32*)DestinationBuffer;
+      Source32 = (CONST UINT32*)SourceBuffer;
+      while (Length >= 4) {
+        *(Destination32++) = *(Source32++);
+        Length -= 4;
+      }
+
+      // Finish if there are still some bytes to copy
+      Destination8 = (UINT8*)Destination32;
+      Source8 = (CONST UINT8*)Source32;
+      while (Length-- != 0) {
+        *(Destination8++) = *(Source8++);
+      }
+    } else if (SourceBuffer < DestinationBuffer) {
+      Destination32 = (UINT32*)((UINTN)DestinationBuffer + Length);
+      Source32 = (CONST UINT32*)((UINTN)SourceBuffer + Length);
+
+      // Destination32 and Source32 were aligned on a 32-bit boundary
+      // but if length is not a multiple of 4 bytes then they won't be
+      // anymore.
+
+      Alignment = Length & 0x3;
+      if (Alignment != 0) {
+        Destination8 = (UINT8*)Destination32;
+        Source8 = (CONST UINT8*)Source32;
+
+        while (Alignment-- != 0) {
+          *(--Destination8) = *(--Source8);
+          --Length;
+        }
+        Destination32 = (UINT32*)Destination8;
+        Source32 = (CONST UINT32*)Source8;
+      }
 
-  if (SourceBuffer > DestinationBuffer) {
-    Destination8 = (UINT8*)DestinationBuffer;
-    Source8 = (CONST UINT8*)SourceBuffer;
-    while (Length-- != 0) {
-      *(Destination8++) = *(Source8++);
+      while (Length > 0) {
+        *(--Destination32) = *(--Source32);
+        Length -= 4;
+      }
     }
-  } else if (SourceBuffer < DestinationBuffer) {
-    Destination8 = (UINT8*)DestinationBuffer + Length;
-    Source8 = (CONST UINT8*)SourceBuffer + Length;
-    while (Length-- != 0) {
-      *(--Destination8) = *(--Source8);
+  } else {
+    if (SourceBuffer > DestinationBuffer) {
+      Destination8 = (UINT8*)DestinationBuffer;
+      Source8 = (CONST UINT8*)SourceBuffer;
+      while (Length-- != 0) {
+        *(Destination8++) = *(Source8++);
+      }
+    } else if (SourceBuffer < DestinationBuffer) {
+      Destination8 = (UINT8*)DestinationBuffer + Length;
+      Source8 = (CONST UINT8*)SourceBuffer + Length;
+      while (Length-- != 0) {
+        *(--Destination8) = *(--Source8);
+      }
     }
   }
   return DestinationBuffer;
diff --git a/MdePkg/Library/BaseMemoryLib/SetMem.c b/MdePkg/Library/BaseMemoryLib/SetMem.c
index 5e74085c56f0..b6fb811c388a 100644
--- a/MdePkg/Library/BaseMemoryLib/SetMem.c
+++ b/MdePkg/Library/BaseMemoryLib/SetMem.c
@@ -5,6 +5,9 @@
   is desired.
 
   Copyright (c) 2006 - 2010, Intel Corporation. All rights reserved.<BR>
+  Copyright (c) 2012 - 2013, ARM Ltd. All rights reserved.<BR>
+  Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+
   This program and the accompanying materials
   are licensed and made available under the terms and conditions of the BSD License
   which accompanies this distribution.  The full text of the license may be found at
@@ -43,11 +46,42 @@ InternalMemSetMem (
   // volatile to prevent the optimizer from replacing this function with
   // the intrinsic memset()
   //
-  volatile UINT8                    *Pointer;
+  volatile UINT8                    *Pointer8;
+  volatile UINT32                   *Pointer32;
+  volatile UINT64                   *Pointer64;
+  UINT32                            Value32;
+  UINT64                            Value64;
+
+  if ((((UINTN)Buffer & 0x7) == 0) && (Length >= 8)) {
+    // Generate the 64bit value
+    Value32 = (Value << 24) | (Value << 16) | (Value << 8) | Value;
+    Value64 = LShiftU64 (Value32, 32) | Value32;
+
+    Pointer64 = (UINT64*)Buffer;
+    while (Length >= 8) {
+      *(Pointer64++) = Value64;
+      Length -= 8;
+    }
 
-  Pointer = (UINT8*)Buffer;
+    // Finish with bytes if needed
+    Pointer8 = (UINT8*)Pointer64;
+  } else if ((((UINTN)Buffer & 0x3) == 0) && (Length >= 4)) {
+    // Generate the 32bit value
+    Value32 = (Value << 24) | (Value << 16) | (Value << 8) | Value;
+
+    Pointer32 = (UINT32*)Buffer;
+    while (Length >= 4) {
+      *(Pointer32++) = Value32;
+      Length -= 4;
+    }
+
+    // Finish with bytes if needed
+    Pointer8 = (UINT8*)Pointer32;
+  } else {
+    Pointer8 = (UINT8*)Buffer;
+  }
   while (Length-- > 0) {
-    *(Pointer++) = Value;
+    *(Pointer8++) = Value;
   }
   return Buffer;
 }
-- 
2.7.4




* [PATCH v5 2/4] MdePkg/BaseMemoryLibOptDxe: add accelerated ARM routines
  2016-09-09 14:00 [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 1/4] MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits Ard Biesheuvel
@ 2016-09-09 14:00 ` Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 3/4] MdePkg/BaseMemoryLibOptDxe: add accelerated AARCH64 routines Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases Ard Biesheuvel
  3 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-09 14:00 UTC (permalink / raw)
  To: edk2-devel, liming.gao, leif.lindholm, michael.d.kinney; +Cc: Ard Biesheuvel

This adds ARM support to BaseMemoryLibOptDxe, partially based on the
cortex-strings library (ScanMem) and the existing CopyMem() implementation
from BaseMemoryLibStm in ArmPkg.

All string routines are accelerated except ScanMem16, ScanMem32,
ScanMem64 and IsZeroBuffer, which can wait for another day. (Very few
occurrences exist in the codebase)

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Liming Gao <liming.gao@intel.com>
---
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S        | 138 ++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm      | 140 ++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S           | 172 ++++++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm         | 147 +++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S           | 146 +++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm         | 147 +++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c    | 142 ++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S            |  77 +++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm          |  84 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf |  30 ++--
 10 files changed, 1209 insertions(+), 14 deletions(-)

diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S
new file mode 100644
index 000000000000..951d15777a38
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.S
@@ -0,0 +1,138 @@
+//
+// Copyright (c) 2013 - 2016, Linaro Limited
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//     * Redistributions of source code must retain the above copyright
+//       notice, this list of conditions and the following disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in the
+//       documentation and/or other materials provided with the distribution.
+//     * Neither the name of the Linaro nor the
+//       names of its contributors may be used to endorse or promote products
+//       derived from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+// Parameters and result.
+#define src1      r0
+#define src2      r1
+#define limit     r2
+#define result    r0
+
+// Internal variables.
+#define data1     r3
+#define data2     r4
+#define limit_wd  r5
+#define diff      r6
+#define tmp1      r7
+#define tmp2      r12
+#define pos       r8
+#define mask      r14
+
+    .text
+    .thumb
+    .syntax unified
+    .align  5
+ASM_GLOBAL ASM_PFX(InternalMemCompareMem)
+ASM_PFX(InternalMemCompareMem):
+    push    {r4-r8, lr}
+    eor     tmp1, src1, src2
+    tst     tmp1, #3
+    bne     .Lmisaligned4
+    ands    tmp1, src1, #3
+    bne     .Lmutual_align
+    add     limit_wd, limit, #3
+    nop.w
+    lsr     limit_wd, limit_wd, #2
+
+    // Start of performance-critical section  -- one 32B cache line.
+.Lloop_aligned:
+    ldr     data1, [src1], #4
+    ldr     data2, [src2], #4
+.Lstart_realigned:
+    subs    limit_wd, limit_wd, #1
+    eor     diff, data1, data2        // Non-zero if differences found.
+    cbnz    diff, 0f
+    bne     .Lloop_aligned
+    // End of performance-critical section  -- one 32B cache line.
+
+    // Not reached the limit, must have found a diff.
+0:  cbnz    limit_wd, .Lnot_limit
+
+    // Limit % 4 == 0 => all bytes significant.
+    ands    limit, limit, #3
+    beq     .Lnot_limit
+
+    lsl     limit, limit, #3              // Bytes -> bits.
+    mov     mask, #~0
+    lsl     mask, mask, limit
+    bic     data1, data1, mask
+    bic     data2, data2, mask
+
+    orr     diff, diff, mask
+
+.Lnot_limit:
+    rev     diff, diff
+    rev     data1, data1
+    rev     data2, data2
+
+    // The MS-non-zero bit of DIFF marks either the first bit
+    // that is different, or the end of the significant data.
+    // Shifting left now will bring the critical information into the
+    // top bits.
+    clz     pos, diff
+    lsl     data1, data1, pos
+    lsl     data2, data2, pos
+
+    // But we need to zero-extend (char is unsigned) the value and then
+    // perform a signed 32-bit subtraction.
+    lsr     data1, data1, #28
+    sub     result, data1, data2, lsr #28
+    pop     {r4-r8, pc}
+
+.Lmutual_align:
+    // Sources are mutually aligned, but are not currently at an
+    // alignment boundary.  Round down the addresses and then mask off
+    // the bytes that precede the start point.
+    bic     src1, src1, #3
+    bic     src2, src2, #3
+    add     limit, limit, tmp1          // Adjust the limit for the extra.
+    lsl     tmp1, tmp1, #2              // Bytes beyond alignment -> bits.
+    ldr     data1, [src1], #4
+    neg     tmp1, tmp1                  // Bits to alignment -32.
+    ldr     data2, [src2], #4
+    mov     tmp2, #~0
+
+    // Little-endian.  Early bytes are at LSB.
+    lsr     tmp2, tmp2, tmp1            // Shift (tmp1 & 31).
+    add     limit_wd, limit, #3
+    orr     data1, data1, tmp2
+    orr     data2, data2, tmp2
+    lsr     limit_wd, limit_wd, #2
+    b       .Lstart_realigned
+
+.Lmisaligned4:
+    sub     limit, limit, #1
+1:
+    // Perhaps we can do better than this.
+    ldrb    data1, [src1], #1
+    ldrb    data2, [src2], #1
+    subs    limit, limit, #1
+    it      cs
+    cmpcs   data1, data2
+    beq     1b
+    sub     result, data1, data2
+    pop     {r4-r8, pc}
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm
new file mode 100644
index 000000000000..47b49ee16473
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CompareMem.asm
@@ -0,0 +1,140 @@
+;
+; Copyright (c) 2013 - 2016, Linaro Limited
+; All rights reserved.
+;
+; Redistribution and use in source and binary forms, with or without
+; modification, are permitted provided that the following conditions are met:
+;     * Redistributions of source code must retain the above copyright
+;       notice, this list of conditions and the following disclaimer.
+;     * Redistributions in binary form must reproduce the above copyright
+;       notice, this list of conditions and the following disclaimer in the
+;       documentation and/or other materials provided with the distribution.
+;     * Neither the name of the Linaro nor the
+;       names of its contributors may be used to endorse or promote products
+;       derived from this software without specific prior written permission.
+;
+; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+; "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+; LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+; A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+; HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+; SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+; LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+; DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+; THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+; OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;
+
+; Parameters and result.
+#define src1      r0
+#define src2      r1
+#define limit     r2
+#define result    r0
+
+; Internal variables.
+#define data1     r3
+#define data2     r4
+#define limit_wd  r5
+#define diff      r6
+#define tmp1      r7
+#define tmp2      r12
+#define pos       r8
+#define mask      r14
+
+    EXPORT  InternalMemCompareMem
+    THUMB
+    AREA    CompareMem, CODE, READONLY
+
+InternalMemCompareMem
+    push    {r4-r8, lr}
+    eor     tmp1, src1, src2
+    tst     tmp1, #3
+    bne     Lmisaligned4
+    ands    tmp1, src1, #3
+    bne     Lmutual_align
+    add     limit_wd, limit, #3
+    nop.w
+    lsr     limit_wd, limit_wd, #2
+
+    ; Start of performance-critical section  -- one 32B cache line.
+Lloop_aligned
+    ldr     data1, [src1], #4
+    ldr     data2, [src2], #4
+Lstart_realigned
+    subs    limit_wd, limit_wd, #1
+    eor     diff, data1, data2        ; Non-zero if differences found.
+    cbnz    diff, L0
+    bne     Lloop_aligned
+    ; End of performance-critical section  -- one 32B cache line.
+
+    ; Not reached the limit, must have found a diff.
+L0
+    cbnz    limit_wd, Lnot_limit
+
+    // Limit % 4 == 0 => all bytes significant.
+    ands    limit, limit, #3
+    beq     Lnot_limit
+
+    lsl     limit, limit, #3              // Bytes -> bits.
+    mov     mask, #~0
+    lsl     mask, mask, limit
+    bic     data1, data1, mask
+    bic     data2, data2, mask
+
+    orr     diff, diff, mask
+
+Lnot_limit
+    rev     diff, diff
+    rev     data1, data1
+    rev     data2, data2
+
+    ; The MS-non-zero bit of DIFF marks either the first bit
+    ; that is different, or the end of the significant data.
+    ; Shifting left now will bring the critical information into the
+    ; top bits.
+    clz     pos, diff
+    lsl     data1, data1, pos
+    lsl     data2, data2, pos
+
+    ; But we need to zero-extend (char is unsigned) the value and then
+    ; perform a signed 32-bit subtraction.
+    lsr     data1, data1, #28
+    sub     result, data1, data2, lsr #28
+    pop     {r4-r8, pc}
+
+Lmutual_align
+    ; Sources are mutually aligned, but are not currently at an
+    ; alignment boundary.  Round down the addresses and then mask off
+    ; the bytes that precede the start point.
+    bic     src1, src1, #3
+    bic     src2, src2, #3
+    add     limit, limit, tmp1          ; Adjust the limit for the extra.
+    lsl     tmp1, tmp1, #2              ; Bytes beyond alignment -> bits.
+    ldr     data1, [src1], #4
+    neg     tmp1, tmp1                  ; Bits to alignment -32.
+    ldr     data2, [src2], #4
+    mov     tmp2, #~0
+
+    ; Little-endian.  Early bytes are at LSB.
+    lsr     tmp2, tmp2, tmp1            ; Shift (tmp1 & 31).
+    add     limit_wd, limit, #3
+    orr     data1, data1, tmp2
+    orr     data2, data2, tmp2
+    lsr     limit_wd, limit_wd, #2
+    b       Lstart_realigned
+
+Lmisaligned4
+    sub     limit, limit, #1
+L1
+    // Perhaps we can do better than this.
+    ldrb    data1, [src1], #1
+    ldrb    data2, [src2], #1
+    subs    limit, limit, #1
+    it      cs
+    cmpcs   data1, data2
+    beq     L1
+    sub     result, data1, data2
+    pop     {r4-r8, pc}
+
+    END
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S
new file mode 100644
index 000000000000..fb5293befc10
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.S
@@ -0,0 +1,172 @@
+#------------------------------------------------------------------------------
+#
+# CopyMem() worker for ARM
+#
+# This file started out as C code that did 64 bit moves if the buffer was
+# 32-bit aligned, else it does a byte copy. It also does a byte copy for
+# any trailing bytes. It was updated to do 32-byte copies using stm/ldm.
+#
+# Copyright (c) 2008 - 2010, Apple Inc. All rights reserved.<BR>
+# Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+# This program and the accompanying materials
+# are licensed and made available under the terms and conditions of the BSD License
+# which accompanies this distribution.  The full text of the license may be found at
+# http://opensource.org/licenses/bsd-license.php
+#
+# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+# WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+#
+#------------------------------------------------------------------------------
+
+    .text
+    .thumb
+    .syntax unified
+
+/**
+  Copy Length bytes from Source to Destination. Overlap is OK.
+
+  This implementation
+
+  @param  Destination Target of copy
+  @param  Source      Place to copy from
+  @param  Length      Number of bytes to copy
+
+  @return Destination
+
+
+VOID *
+EFIAPI
+InternalMemCopyMem (
+  OUT     VOID                      *DestinationBuffer,
+  IN      CONST VOID                *SourceBuffer,
+  IN      UINTN                     Length
+  )
+**/
+ASM_GLOBAL ASM_PFX(InternalMemCopyMem)
+ASM_PFX(InternalMemCopyMem):
+    push    {r4-r11, lr}
+    // Save the input parameters in extra registers (r11 = destination, r14 = source, r12 = length)
+    mov     r11, r0
+    mov     r10, r0
+    mov     r12, r2
+    mov     r14, r1
+
+    cmp     r11, r1
+    // If (dest < source)
+    bcc     memcopy_check_optim_default
+
+    // If (source + length < dest)
+    rsb     r3, r1, r11
+    cmp     r12, r3
+    bcc     memcopy_check_optim_default
+    b       memcopy_check_optim_overlap
+
+memcopy_check_optim_default:
+    // Check if we can use an optimized path ((length >= 32) && destination 16-byte aligned && source 16-byte aligned) for the memcopy (optimized path if r0 == 1)
+    tst     r0, #0xF
+    it      ne
+    movne   r0, #0
+    bne     memcopy_default
+    tst     r1, #0xF
+    ite     ne
+    movne   r3, #0
+    moveq   r3, #1
+    cmp     r2, #31
+    ite     ls
+    movls   r0, #0
+    andhi   r0, r3, #1
+    b       memcopy_default
+
+memcopy_check_optim_overlap:
+    // r10 = dest_end, r14 = source_end
+    add     r10, r11, r12
+    add     r14, r12, r1
+
+    // Are we in the optimized case ((length >= 32) && dest_end 16-byte aligned && source_end 16-byte aligned)
+    cmp     r2, #31
+    ite     ls
+    movls   r0, #0
+    movhi   r0, #1
+    tst     r10, #0xF
+    it      ne
+    movne   r0, #0
+    tst     r14, #0xF
+    it      ne
+    movne   r0, #0
+    b       memcopy_overlapped
+
+memcopy_overlapped_non_optim:
+    // We read 1 byte from the end of the source buffer
+    sub     r3, r14, #1
+    sub     r12, r12, #1
+    ldrb    r3, [r3, #0]
+    sub     r2, r10, #1
+    cmp     r12, #0
+    // We write 1 byte at the end of the dest buffer
+    sub     r10, r10, #1
+    sub     r14, r14, #1
+    strb    r3, [r2, #0]
+    bne     memcopy_overlapped_non_optim
+    b       memcopy_end
+
+// r10 = dest_end, r14 = source_end
+memcopy_overlapped:
+    // Are we in the optimized case ?
+    cmp     r0, #0
+    beq     memcopy_overlapped_non_optim
+
+    // Optimized Overlapped - Read 32 bytes
+    sub     r14, r14, #32
+    sub     r12, r12, #32
+    cmp     r12, #31
+    ldmia   r14, {r2-r9}
+
+    // If length is less than 32 then disable optim
+    it      ls
+    movls   r0, #0
+
+    cmp     r12, #0
+
+    // Optimized Overlapped - Write 32 bytes
+    sub     r10, r10, #32
+    stmia   r10, {r2-r9}
+
+    // while (length != 0)
+    bne     memcopy_overlapped
+    b       memcopy_end
+
+memcopy_default_non_optim:
+    // Byte copy
+    ldrb    r3, [r14], #1
+    sub     r12, r12, #1
+    strb    r3, [r10], #1
+
+memcopy_default:
+    cmp     r12, #0
+    beq     memcopy_end
+
+// r10 = dest, r14 = source
+memcopy_default_loop:
+    cmp     r0, #0
+    beq     memcopy_default_non_optim
+
+    // Optimized memcopy - Read 32 Bytes
+    sub     r12, r12, #32
+    cmp     r12, #31
+    ldmia   r14!, {r2-r9}
+
+    // If length is less than 32 then disable optim
+    it      ls
+    movls   r0, #0
+
+    cmp     r12, #0
+
+    // Optimized memcopy - Write 32 Bytes
+    stmia   r10!, {r2-r9}
+
+    // while (length != 0)
+    bne     memcopy_default_loop
+
+memcopy_end:
+    mov     r0, r11
+    pop     {r4-r11, pc}
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm
new file mode 100644
index 000000000000..2034807954d7
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/CopyMem.asm
@@ -0,0 +1,147 @@
+;------------------------------------------------------------------------------
+;
+; CopyMem() worker for ARM
+;
+; This file started out as C code that did 64 bit moves if the buffer was
+; 32-bit aligned, else it does a byte copy. It also does a byte copy for
+; any trailing bytes. It was updated to do 32-byte copies using stm/ldm.
+;
+; Copyright (c) 2008 - 2010, Apple Inc. All rights reserved.<BR>
+; Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+; This program and the accompanying materials
+; are licensed and made available under the terms and conditions of the BSD License
+; which accompanies this distribution.  The full text of the license may be found at
+; http://opensource.org/licenses/bsd-license.php
+;
+; THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+; WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+;
+;------------------------------------------------------------------------------
+
+    EXPORT  InternalMemCopyMem
+    AREA    SetMem, CODE, READONLY
+    THUMB
+
+InternalMemCopyMem
+  stmfd  sp!, {r4-r11, lr}
+  // Save the input parameters in extra registers (r11 = destination, r14 = source, r12 = length)
+  mov  r11, r0
+  mov  r10, r0
+  mov  r12, r2
+  mov  r14, r1
+
+memcopy_check_overlapped
+  cmp  r11, r1
+  // If (dest < source)
+  bcc  memcopy_check_optim_default
+
+  // If (source + length < dest)
+  rsb  r3, r1, r11
+  cmp  r12, r3
+  bcc  memcopy_check_optim_default
+  b     memcopy_check_optim_overlap
+
+memcopy_check_optim_default
+  // Check if we can use an optimized path ((length >= 32) && destination 16-byte aligned && source 16-byte aligned) for the memcopy (optimized path if r0 == 1)
+  tst  r0, #0xF
+  movne  r0, #0
+  bne   memcopy_default
+  tst  r1, #0xF
+  movne  r3, #0
+  moveq  r3, #1
+  cmp  r2, #31
+  movls  r0, #0
+  andhi  r0, r3, #1
+  b     memcopy_default
+
+memcopy_check_optim_overlap
+  // r10 = dest_end, r14 = source_end
+  add  r10, r11, r12
+  add  r14, r12, r1
+
+  // Are we in the optimized case ((length >= 32) && dest_end 16-byte aligned && source_end 16-byte aligned)
+  cmp  r2, #31
+  movls  r0, #0
+  movhi  r0, #1
+  tst  r10, #0xF
+  movne  r0, #0
+  tst  r14, #0xF
+  movne  r0, #0
+  b  memcopy_overlapped
+
+memcopy_overlapped_non_optim
+  // We read 1 byte from the end of the source buffer
+  sub  r3, r14, #1
+  sub  r12, r12, #1
+  ldrb  r3, [r3, #0]
+  sub  r2, r10, #1
+  cmp  r12, #0
+  // We write 1 byte at the end of the dest buffer
+  sub  r10, r10, #1
+  sub  r14, r14, #1
+  strb  r3, [r2, #0]
+  bne  memcopy_overlapped_non_optim
+  b   memcopy_end
+
+// r10 = dest_end, r14 = source_end
+memcopy_overlapped
+  // Are we in the optimized case ?
+  cmp  r0, #0
+  beq  memcopy_overlapped_non_optim
+
+  // Optimized Overlapped - Read 32 bytes
+  sub  r14, r14, #32
+  sub  r12, r12, #32
+  cmp  r12, #31
+  ldmia  r14, {r2-r9}
+
+  // If length is less than 32 then disable optim
+  movls  r0, #0
+
+  cmp  r12, #0
+
+  // Optimized Overlapped - Write 32 bytes
+  sub  r10, r10, #32
+  stmia  r10, {r2-r9}
+
+  // while (length != 0)
+  bne  memcopy_overlapped
+  b   memcopy_end
+
+memcopy_default_non_optim
+  // Byte copy
+  ldrb  r3, [r14], #1
+  sub  r12, r12, #1
+  strb  r3, [r10], #1
+
+memcopy_default
+  cmp  r12, #0
+  beq  memcopy_end
+
+// r10 = dest, r14 = source
+memcopy_default_loop
+  cmp  r0, #0
+  beq  memcopy_default_non_optim
+
+  // Optimized memcopy - Read 32 Bytes
+  sub  r12, r12, #32
+  cmp  r12, #31
+  ldmia  r14!, {r2-r9}
+
+  // If length is less than 32 then disable optim
+  movls  r0, #0
+
+  cmp  r12, #0
+
+  // Optimized memcopy - Write 32 Bytes
+  stmia  r10!, {r2-r9}
+
+  // while (length != 0)
+  bne  memcopy_default_loop
+
+memcopy_end
+  mov  r0, r11
+  ldmfd  sp!, {r4-r11, pc}
+
+  END
+
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S
new file mode 100644
index 000000000000..dc0e74e8657c
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.S
@@ -0,0 +1,146 @@
+// Copyright (c) 2010-2011, Linaro Limited
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+//
+//    * Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+//
+//    * Redistributions in binary form must reproduce the above copyright
+//    notice, this list of conditions and the following disclaimer in the
+//    documentation and/or other materials provided with the distribution.
+//
+//    * Neither the name of Linaro Limited nor the names of its
+//    contributors may be used to endorse or promote products derived
+//    from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+//
+// Written by Dave Gilbert <david.gilbert@linaro.org>
+//
+// This memchr routine is optimised on a Cortex-A9 and should work on
+// all ARMv7 processors.   It has a fast path for short sizes, and has
+// an optimised path for large data sets; the worst case is finding the
+// match early in a large data set.
+//
+
+
+// 2011-02-07 david.gilbert@linaro.org
+//    Extracted from local git a5b438d861
+// 2011-07-14 david.gilbert@linaro.org
+//    Import endianness fix from local git ea786f1b
+// 2011-12-07 david.gilbert@linaro.org
+//    Removed unneeded cbz from align loop
+
+// this lets us check a flag in a 00/ff byte easily in either endianness
+#define CHARTSTMASK(c) 1<<(c*8)
+
+    .text
+    .thumb
+    .syntax unified
+
+    .type ASM_PFX(InternalMemScanMem8), %function
+ASM_GLOBAL ASM_PFX(InternalMemScanMem8)
+ASM_PFX(InternalMemScanMem8):
+    // r0 = start of memory to scan
+    // r1 = length
+    // r2 = character to look for
+    // returns r0 = pointer to character or NULL if not found
+    uxtb    r2, r2        // Don't think we can trust the caller to actually pass a char
+
+    cmp     r1, #16       // If it's short don't bother with anything clever
+    blt     20f
+
+    tst     r0, #7        // If it's already aligned skip the next bit
+    beq     10f
+
+    // Work up to an aligned point
+5:
+    ldrb    r3, [r0],#1
+    subs    r1, r1, #1
+    cmp     r3, r2
+    beq     50f           // If it matches exit found
+    tst     r0, #7
+    bne     5b            // If not aligned yet then do next byte
+
+10:
+    // At this point, we are aligned, we know we have at least 8 bytes to work with
+    push    {r4-r7}
+    orr     r2, r2, r2, lsl #8  // expand the match word across to all bytes
+    orr     r2, r2, r2, lsl #16
+    bic     r4, r1, #7    // Number of double words to work with
+    mvns    r7, #0        // all F's
+    movs    r3, #0
+
+15:
+    ldmia   r0!, {r5,r6}
+    subs    r4, r4, #8
+    eor     r5, r5, r2    // Get it so that r5,r6 have 00's where the bytes match the target
+    eor     r6, r6, r2
+    uadd8   r5, r5, r7    // Parallel add 0xff - sets the GE bits for anything that wasn't 0
+    sel     r5, r3, r7    // bytes are 00 for non-00 bytes, or ff for 00 bytes - NOTE INVERSION
+    uadd8   r6, r6, r7    // Parallel add 0xff - sets the GE bits for anything that wasn't 0
+    sel     r6, r5, r7    // chained....bytes are 00 for non-00 bytes, or ff for 00 bytes - NOTE INVERSION
+    cbnz    r6, 60f
+    bne     15b           // (Flags from the subs above) If not run out of bytes then go around again
+
+    pop     {r4-r7}
+    and     r2, r2, #0xff // Get r2 back to a single character from the expansion above
+    and     r1, r1, #7    // Leave the count remaining as the number after the double words have been done
+
+20:
+    cbz     r1, 40f       // 0 length or hit the end already then not found
+
+21: // Post aligned section, or just a short call
+    ldrb    r3, [r0], #1
+    subs    r1, r1, #1
+    eor     r3, r3, r2    // r3 = 0 if match - doesn't break flags from sub
+    cbz     r3, 50f
+    bne     21b           // on r1 flags
+
+40:
+    movs    r0, #0        // not found
+    bx      lr
+
+50:
+    subs    r0, r0, #1    // found
+    bx      lr
+
+60: // We're here because the fast path found a hit - now we have to track down exactly which word it was
+    // r0 points to the start of the double word after the one that was tested
+    // r5 has the 00/ff pattern for the first word, r6 has the chained value
+    cmp     r5, #0
+    itte    eq
+    moveq   r5, r6        // the end is in the 2nd word
+    subeq   r0, r0, #3    // Points to 2nd byte of 2nd word
+    subne   r0, r0, #7    // or 2nd byte of 1st word
+
+    // r0 currently points to the 2nd byte of the word containing the hit
+    tst     r5, #CHARTSTMASK(0)     // 1st character
+    bne     61f
+    adds    r0, r0, #1
+    tst     r5, #CHARTSTMASK(1)     // 2nd character
+    ittt    eq
+    addeq   r0, r0 ,#1
+    tsteq   r5, #(3 << 15)          // 2nd & 3rd character
+    // If not the 3rd must be the last one
+    addeq   r0, r0, #1
+
+61:
+    pop     {r4-r7}
+    subs    r0, r0, #1
+    bx      lr
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm
new file mode 100644
index 000000000000..abda87320e37
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMem.asm
@@ -0,0 +1,147 @@
+; Copyright (c) 2010-2011, Linaro Limited
+; All rights reserved.
+;
+; Redistribution and use in source and binary forms, with or without
+; modification, are permitted provided that the following conditions
+; are met:
+;
+;    * Redistributions of source code must retain the above copyright
+;    notice, this list of conditions and the following disclaimer.
+;
+;    * Redistributions in binary form must reproduce the above copyright
+;    notice, this list of conditions and the following disclaimer in the
+;    documentation and/or other materials provided with the distribution.
+;
+;    * Neither the name of Linaro Limited nor the names of its
+;    contributors may be used to endorse or promote products derived
+;    from this software without specific prior written permission.
+;
+; THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+; "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+; LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+; A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+; HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+; SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+; LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+; DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+; THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+; (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+; OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+;
+
+;
+; Written by Dave Gilbert <david.gilbert@linaro.org>
+;
+; This memchr routine is optimised on a Cortex-A9 and should work on
+; all ARMv7 processors.   It has a fast path for short sizes, and has
+; an optimised path for large data sets; the worst case is finding the
+; match early in a large data set.
+;
+
+
+; 2011-02-07 david.gilbert@linaro.org
+;    Extracted from local git a5b438d861
+; 2011-07-14 david.gilbert@linaro.org
+;    Import endianness fix from local git ea786f1b
+; 2011-12-07 david.gilbert@linaro.org
+;    Removed unneeded cbz from align loop
+
+; this lets us check a flag in a 00/ff byte easily in either endianness
+#define CHARTSTMASK(c) 1<<(c*8)
+
+    EXPORT  InternalMemScanMem8
+    AREA    ScanMem, CODE, READONLY
+    THUMB
+
+InternalMemScanMem8
+    ; r0 = start of memory to scan
+    ; r1 = length
+    ; r2 = character to look for
+    ; returns r0 = pointer to character or NULL if not found
+    uxtb    r2, r2        ; Don't think we can trust the caller to actually pass a char
+
+    cmp     r1, #16       ; If it's short don't bother with anything clever
+    blt     L20
+
+    tst     r0, #7        ; If it's already aligned skip the next bit
+    beq     L10
+
+    ; Work up to an aligned point
+L5
+    ldrb    r3, [r0],#1
+    subs    r1, r1, #1
+    cmp     r3, r2
+    beq     L50           ; If it matches exit found
+    tst     r0, #7
+    bne     L5            ; If not aligned yet then do next byte
+
+L10
+    ; At this point, we are aligned, we know we have at least 8 bytes to work with
+    push    {r4-r7}
+    orr     r2, r2, r2, lsl #8  ; expand the match word across to all bytes
+    orr     r2, r2, r2, lsl #16
+    bic     r4, r1, #7    ; Number of double words to work with
+    mvns    r7, #0        ; all F's
+    movs    r3, #0
+
+L15
+    ldmia   r0!, {r5,r6}
+    subs    r4, r4, #8
+    eor     r5, r5, r2    ; Get it so that r5,r6 have 00's where the bytes match the target
+    eor     r6, r6, r2
+    uadd8   r5, r5, r7    ; Parallel add 0xff - sets the GE bits for anything that wasn't 0
+    sel     r5, r3, r7    ; bytes are 00 for non-00 bytes, or ff for 00 bytes - NOTE INVERSION
+    uadd8   r6, r6, r7    ; Parallel add 0xff - sets the GE bits for anything that wasn't 0
+    sel     r6, r5, r7    ; chained....bytes are 00 for non-00 bytes, or ff for 00 bytes - NOTE INVERSION
+    cbnz    r6, L60
+    bne     L15           ; (Flags from the subs above) If not run out of bytes then go around again
+
+    pop     {r4-r7}
+    and     r2, r2, #0xff ; Get r2 back to a single character from the expansion above
+    and     r1, r1, #7    ; Leave the count remaining as the number after the double words have been done
+
+L20
+    cbz     r1, L40       ; 0 length or hit the end already then not found
+
+L21 ; Post aligned section, or just a short call
+    ldrb    r3, [r0], #1
+    subs    r1, r1, #1
+    eor     r3, r3, r2    ; r3 = 0 if match - doesn't break flags from sub
+    cbz     r3, L50
+    bne     L21           ; on r1 flags
+
+L40
+    movs    r0, #0        ; not found
+    bx      lr
+
+L50
+    subs    r0, r0, #1    ; found
+    bx      lr
+
+L60 ; We're here because the fast path found a hit - now we have to track down exactly which word it was
+    ; r0 points to the start of the double word after the one that was tested
+    ; r5 has the 00/ff pattern for the first word, r6 has the chained value
+    cmp     r5, #0
+    itte    eq
+    moveq   r5, r6        ; the end is in the 2nd word
+    subeq   r0, r0, #3    ; Points to 2nd byte of 2nd word
+    subne   r0, r0, #7    ; or 2nd byte of 1st word
+
+    ; r0 currently points to the 2nd byte of the word containing the hit
+    tst     r5, #CHARTSTMASK(0)     ; 1st character
+    bne     L61
+    adds    r0, r0, #1
+    tst     r5, #CHARTSTMASK(1)     ; 2nd character
+    ittt    eq
+    addeq   r0, r0 ,#1
+    tsteq   r5, #(3 << 15)          ; 2nd & 3rd character
+    ; If not the 3rd must be the last one
+    addeq   r0, r0, #1
+
+L61
+    pop     {r4-r7}
+    subs    r0, r0, #1
+    bx      lr
+
+    END
+
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c
new file mode 100644
index 000000000000..20fa7e9be697
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/ScanMemGeneric.c
@@ -0,0 +1,142 @@
+/** @file
+  Architecture Independent Base Memory Library Implementation.
+
+  The following BaseMemoryLib instances contain the same copy of this file:
+    BaseMemoryLib
+    PeiMemoryLib
+    UefiMemoryLib
+
+  Copyright (c) 2006 - 2016, Intel Corporation. All rights reserved.<BR>
+  This program and the accompanying materials
+  are licensed and made available under the terms and conditions of the BSD License
+  which accompanies this distribution.  The full text of the license may be found at
+  http://opensource.org/licenses/bsd-license.php.
+
+  THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+  WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+
+**/
+
+#include "../MemLibInternals.h"
+
+/**
+  Scans a target buffer for a 16-bit value, and returns a pointer to the
+  matching 16-bit value in the target buffer.
+
+  @param  Buffer  The pointer to the target buffer to scan.
+  @param  Length  The count of 16-bit value to scan. Must be non-zero.
+  @param  Value   The value to search for in the target buffer.
+
+  @return The pointer to the first occurrence, or NULL if not found.
+
+**/
+CONST VOID *
+EFIAPI
+InternalMemScanMem16 (
+  IN      CONST VOID                *Buffer,
+  IN      UINTN                     Length,
+  IN      UINT16                    Value
+  )
+{
+  CONST UINT16                      *Pointer;
+
+  Pointer = (CONST UINT16*)Buffer;
+  do {
+    if (*Pointer == Value) {
+      return Pointer;
+    }
+    ++Pointer;
+  } while (--Length != 0);
+  return NULL;
+}
+
+/**
+  Scans a target buffer for a 32-bit value, and returns a pointer to the
+  matching 32-bit value in the target buffer.
+
+  @param  Buffer  The pointer to the target buffer to scan.
+  @param  Length  The count of 32-bit value to scan. Must be non-zero.
+  @param  Value   The value to search for in the target buffer.
+
+  @return The pointer to the first occurrence, or NULL if not found.
+
+**/
+CONST VOID *
+EFIAPI
+InternalMemScanMem32 (
+  IN      CONST VOID                *Buffer,
+  IN      UINTN                     Length,
+  IN      UINT32                    Value
+  )
+{
+  CONST UINT32                      *Pointer;
+
+  Pointer = (CONST UINT32*)Buffer;
+  do {
+    if (*Pointer == Value) {
+      return Pointer;
+    }
+    ++Pointer;
+  } while (--Length != 0);
+  return NULL;
+}
+
+/**
+  Scans a target buffer for a 64-bit value, and returns a pointer to the
+  matching 64-bit value in the target buffer.
+
+  @param  Buffer  The pointer to the target buffer to scan.
+  @param  Length  The count of 64-bit value to scan. Must be non-zero.
+  @param  Value   The value to search for in the target buffer.
+
+  @return The pointer to the first occurrence, or NULL if not found.
+
+**/
+CONST VOID *
+EFIAPI
+InternalMemScanMem64 (
+  IN      CONST VOID                *Buffer,
+  IN      UINTN                     Length,
+  IN      UINT64                    Value
+  )
+{
+  CONST UINT64                      *Pointer;
+
+  Pointer = (CONST UINT64*)Buffer;
+  do {
+    if (*Pointer == Value) {
+      return Pointer;
+    }
+    ++Pointer;
+  } while (--Length != 0);
+  return NULL;
+}
+
+/**
+  Checks whether the contents of a buffer are all zeros.
+
+  @param  Buffer  The pointer to the buffer to be checked.
+  @param  Length  The size of the buffer (in bytes) to be checked.
+
+  @retval TRUE    Contents of the buffer are all zeros.
+  @retval FALSE   Contents of the buffer are not all zeros.
+
+**/
+BOOLEAN
+EFIAPI
+InternalMemIsZeroBuffer (
+  IN CONST VOID  *Buffer,
+  IN UINTN       Length
+  )
+{
+  CONST UINT8 *BufferData;
+  UINTN       Index;
+
+  BufferData = Buffer;
+  for (Index = 0; Index < Length; Index++) {
+    if (BufferData[Index] != 0) {
+      return FALSE;
+    }
+  }
+  return TRUE;
+}
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S
new file mode 100644
index 000000000000..c1755539d36a
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.S
@@ -0,0 +1,77 @@
+#------------------------------------------------------------------------------
+#
+# Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+#
+# This program and the accompanying materials are licensed and made available
+# under the terms and conditions of the BSD License which accompanies this
+# distribution.  The full text of the license may be found at
+# http://opensource.org/licenses/bsd-license.php
+#
+# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+# WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+#
+#------------------------------------------------------------------------------
+
+    .text
+    .thumb
+    .syntax unified
+    .align  5
+ASM_GLOBAL ASM_PFX(InternalMemZeroMem)
+ASM_PFX(InternalMemZeroMem):
+    movs    r2, #0
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem)
+ASM_PFX(InternalMemSetMem):
+    uxtb    r2, r2
+    orr     r2, r2, r2, lsl #8
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem16)
+ASM_PFX(InternalMemSetMem16):
+    uxth    r2, r2
+    orr     r2, r2, r2, lsl #16
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem32)
+ASM_PFX(InternalMemSetMem32):
+    mov     r3, r2
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem64)
+ASM_PFX(InternalMemSetMem64):
+    push    {r4, lr}
+    cmp     r1, #16                 // fewer than 16 bytes of input?
+    add     r1, r1, r0              // r1 := dst + length
+    add     lr, r0, #16
+    blt     2f
+    bic     lr, lr, #15             // align output pointer
+
+    str     r2, [r0]                // potentially unaligned store of 4 bytes
+    str     r3, [r0, #4]            // potentially unaligned store of 4 bytes
+    str     r2, [r0, #8]            // potentially unaligned store of 4 bytes
+    str     r3, [r0, #12]           // potentially unaligned store of 4 bytes
+    beq     1f
+
+0:  add     lr, lr, #16             // advance the output pointer by 16 bytes
+    subs    r4, r1, lr              // past the output?
+    blt     3f                      // break out of the loop
+    strd    r2, r3, [lr, #-16]      // aligned store of 16 bytes
+    strd    r2, r3, [lr, #-8]
+    bne     0b                      // goto beginning of loop
+1:  pop     {r4, pc}
+
+2:  subs    r4, r1, lr
+3:  adds    r4, r4, #16
+    subs    r1, r1, #8
+    cmp     r4, #4                  // between 4 and 15 bytes?
+    blt     4f
+    cmp     r4, #8                  // between 8 and 15 bytes?
+    str     r2, [lr, #-16]          // overlapping store of 4 + (4 + 4) + 4 bytes
+    itt     gt
+    strgt   r3, [lr, #-12]
+    strgt   r2, [r1]
+    str     r3, [r1, #4]
+    pop     {r4, pc}
+
+4:  cmp     r4, #2                  // 2 or 3 bytes?
+    strb    r2, [lr, #-16]          // store 1 byte
+    it      ge
+    strhge  r2, [r1, #6]            // store 2 bytes
+    pop     {r4, pc}
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm
new file mode 100644
index 000000000000..2a8dc7d019f4
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/Arm/SetMem.asm
@@ -0,0 +1,84 @@
+;------------------------------------------------------------------------------
+;
+; Copyright (c) 2016, Linaro Ltd. All rights reserved.<BR>
+;
+; This program and the accompanying materials are licensed and made available
+; under the terms and conditions of the BSD License which accompanies this
+; distribution.  The full text of the license may be found at
+; http://opensource.org/licenses/bsd-license.php
+;
+; THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS,
+; WITHOUT WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED.
+;
+;------------------------------------------------------------------------------
+
+    EXPORT  InternalMemZeroMem
+    EXPORT  InternalMemSetMem
+    EXPORT  InternalMemSetMem16
+    EXPORT  InternalMemSetMem32
+    EXPORT  InternalMemSetMem64
+
+    AREA    SetMem, CODE, READONLY, CODEALIGN, ALIGN=5
+    THUMB
+
+InternalMemZeroMem
+    movs    r2, #0
+
+InternalMemSetMem
+    uxtb    r2, r2
+    orr     r2, r2, r2, lsl #8
+
+InternalMemSetMem16
+    uxth    r2, r2
+    orr     r2, r2, r2, lsl #16
+
+InternalMemSetMem32
+    mov     r3, r2
+
+InternalMemSetMem64
+    push    {r4, lr}
+    cmp     r1, #16                 ; fewer than 16 bytes of input?
+    add     r1, r1, r0              ; r1 := dst + length
+    add     lr, r0, #16
+    blt     L2
+    bic     lr, lr, #15             ; align output pointer
+
+    str     r2, [r0]                ; potentially unaligned store of 4 bytes
+    str     r3, [r0, #4]            ; potentially unaligned store of 4 bytes
+    str     r2, [r0, #8]            ; potentially unaligned store of 4 bytes
+    str     r3, [r0, #12]           ; potentially unaligned store of 4 bytes
+    beq     L1
+
+L0
+    add     lr, lr, #16             ; advance the output pointer by 16 bytes
+    subs    r4, r1, lr              ; past the output?
+    blt     L3                      ; break out of the loop
+    strd    r2, r3, [lr, #-16]      ; aligned store of 16 bytes
+    strd    r2, r3, [lr, #-8]
+    bne     L0                      ; goto beginning of loop
+L1
+    pop     {r4, pc}
+
+L2
+    subs    r4, r1, lr
+L3
+    adds    r4, r4, #16
+    subs    r1, r1, #8
+    cmp     r4, #4                  ; between 4 and 15 bytes?
+    blt     L4
+    cmp     r4, #8                  ; between 8 and 15 bytes?
+    str     r2, [lr, #-16]          ; overlapping store of 4 + (4 + 4) + 4 bytes
+    itt     gt
+    strgt   r3, [lr, #-12]
+    strgt   r2, [r1]
+    str     r3, [r1, #4]
+    pop     {r4, pc}
+
+L4
+    cmp     r4, #2                  ; 2 or 3 bytes?
+    strb    r2, [lr, #-16]          ; store 1 byte
+    it      ge
+    strhge  r2, [r1, #6]            ; store 2 bytes
+    pop     {r4, pc}
+
+    END
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
index 71691b9859e3..d95eb599ea9e 100644
--- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
@@ -27,7 +27,7 @@ [Defines]
 
 
 #
-#  VALID_ARCHITECTURES           = IA32 X64
+#  VALID_ARCHITECTURES           = IA32 X64 ARM
 #
 
 [Sources]
@@ -79,19 +79,6 @@ [Sources.Ia32]
   Ia32/CopyMem.nasm
   Ia32/CopyMem.asm
   Ia32/IsZeroBuffer.nasm
-  ScanMem64Wrapper.c
-  ScanMem32Wrapper.c
-  ScanMem16Wrapper.c
-  ScanMem8Wrapper.c
-  ZeroMemWrapper.c
-  CompareMemWrapper.c
-  SetMem64Wrapper.c
-  SetMem32Wrapper.c
-  SetMem16Wrapper.c
-  SetMemWrapper.c
-  CopyMemWrapper.c
-  IsZeroBufferWrapper.c
-  MemLibGuid.c
 
 [Sources.X64]
   X64/ScanMem64.nasm
@@ -128,6 +115,21 @@ [Sources.X64]
   X64/CopyMem.asm
   X64/CopyMem.S
   X64/IsZeroBuffer.nasm
+
+[Sources.ARM]
+  Arm/ScanMem.S       |GCC
+  Arm/SetMem.S        |GCC
+  Arm/CopyMem.S       |GCC
+  Arm/CompareMem.S    |GCC
+
+  Arm/ScanMem.asm     |RVCT
+  Arm/SetMem.asm      |RVCT
+  Arm/CopyMem.asm     |RVCT
+  Arm/CompareMem.asm  |RVCT
+
+  Arm/ScanMemGeneric.c
+
+[Sources]
   ScanMem64Wrapper.c
   ScanMem32Wrapper.c
   ScanMem16Wrapper.c
-- 
2.7.4




* [PATCH v5 3/4] MdePkg/BaseMemoryLibOptDxe: add accelerated AARCH64 routines
  2016-09-09 14:00 [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 1/4] MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 2/4] MdePkg/BaseMemoryLibOptDxe: add accelerated ARM routines Ard Biesheuvel
@ 2016-09-09 14:00 ` Ard Biesheuvel
  2016-09-09 14:00 ` [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases Ard Biesheuvel
  3 siblings, 0 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-09 14:00 UTC (permalink / raw)
  To: edk2-devel, liming.gao, leif.lindholm, michael.d.kinney; +Cc: Ard Biesheuvel

This adds AARCH64 support to BaseMemoryLibOptDxe, based on the cortex-strings
library. All string routines are accelerated except ScanMem16, ScanMem32,
ScanMem64 and IsZeroBuffer, which are deferred because very few callers exist
in the codebase.
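
For reviewers who do not know the cortex-strings approach, here is a rough C
model of what the accelerated CompareMem does: compare 8 bytes per iteration
while the data matches, then narrow down to the first differing byte and return
the difference of the two bytes as unsigned values. This is only a sketch under
stated assumptions. The name model_compare_mem is hypothetical and not part of
this series, and the trailing byte loop stands in for the rev/clz sequence the
assembly uses to locate the differing byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of a doubleword-at-a-time compare; not part of the patch. */
static int model_compare_mem(const void *a, const void *b, size_t len)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    assert(a != NULL && b != NULL);

    /* Fast path: compare 8 bytes per iteration while a full doubleword remains. */
    while (len >= 8) {
        uint64_t x;
        uint64_t y;

        /* memcpy keeps the loads legal for any alignment. */
        memcpy(&x, p, sizeof x);
        memcpy(&y, q, sizeof y);
        if (x != y) {
            break;              /* the first difference lies in these 8 bytes */
        }
        p += 8;
        q += 8;
        len -= 8;
    }

    /* Slow path: find the first differing byte, or finish the tail. */
    while (len-- != 0) {
        if (*p != *q) {
            return (int)*p - (int)*q;   /* bytes are compared as unsigned */
        }
        p++;
        q++;
    }
    return 0;
}
```

The model returns zero for equal buffers and the signed difference of the first
mismatching byte pair otherwise, which is the contract the wrappers expect from
InternalMemCompareMem.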

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Liming Gao <liming.gao@intel.com>
---
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S    | 142 ++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S       | 284 ++++++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S       | 161 +++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S        | 244 +++++++++++++++++
 MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf |   9 +-
 5 files changed, 839 insertions(+), 1 deletion(-)

diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S
new file mode 100644
index 000000000000..a54de6948be1
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CompareMem.S
@@ -0,0 +1,142 @@
+//
+// Copyright (c) 2013, Linaro Limited
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//     * Redistributions of source code must retain the above copyright
+//       notice, this list of conditions and the following disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in the
+//       documentation and/or other materials provided with the distribution.
+//     * Neither the name of the Linaro nor the
+//       names of its contributors may be used to endorse or promote products
+//       derived from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+// Assumptions:
+//
+// ARMv8-a, AArch64
+//
+
+
+// Parameters and result.
+#define src1      x0
+#define src2      x1
+#define limit     x2
+#define result    x0
+
+// Internal variables.
+#define data1     x3
+#define data1w    w3
+#define data2     x4
+#define data2w    w4
+#define diff      x6
+#define endloop   x7
+#define tmp1      x8
+#define tmp2      x9
+#define pos       x11
+#define limit_wd  x12
+#define mask      x13
+
+    .p2align 6
+ASM_GLOBAL ASM_PFX(InternalMemCompareMem)
+ASM_PFX(InternalMemCompareMem):
+    eor     tmp1, src1, src2
+    tst     tmp1, #7
+    b.ne    .Lmisaligned8
+    ands    tmp1, src1, #7
+    b.ne    .Lmutual_align
+    add     limit_wd, limit, #7
+    lsr     limit_wd, limit_wd, #3
+
+    // Start of performance-critical section  -- one 64B cache line.
+.Lloop_aligned:
+    ldr     data1, [src1], #8
+    ldr     data2, [src2], #8
+.Lstart_realigned:
+    subs    limit_wd, limit_wd, #1
+    eor     diff, data1, data2        // Non-zero if differences found.
+    csinv   endloop, diff, xzr, ne    // Last Dword or differences.
+    cbz     endloop, .Lloop_aligned
+    // End of performance-critical section  -- one 64B cache line.
+
+    // Not reached the limit, must have found a diff.
+    cbnz    limit_wd, .Lnot_limit
+
+    // Limit % 8 == 0 => all bytes significant.
+    ands    limit, limit, #7
+    b.eq    .Lnot_limit
+
+    lsl     limit, limit, #3              // Bytes -> bits.
+    mov     mask, #~0
+    lsl     mask, mask, limit
+    bic     data1, data1, mask
+    bic     data2, data2, mask
+
+    orr     diff, diff, mask
+
+.Lnot_limit:
+    rev     diff, diff
+    rev     data1, data1
+    rev     data2, data2
+
+    // The MS-non-zero bit of DIFF marks either the first bit
+    // that is different, or the end of the significant data.
+    // Shifting left now will bring the critical information into the
+    // top bits.
+    clz     pos, diff
+    lsl     data1, data1, pos
+    lsl     data2, data2, pos
+
+    // But we need to zero-extend (char is unsigned) the value and then
+    // perform a signed subtraction.
+    lsr     data1, data1, #56
+    sub     result, data1, data2, lsr #56
+    ret
+
+.Lmutual_align:
+    // Sources are mutually aligned, but are not currently at an
+    // alignment boundary.  Round down the addresses and then mask off
+    // the bytes that precede the start point.
+    bic     src1, src1, #7
+    bic     src2, src2, #7
+    add     limit, limit, tmp1          // Adjust the limit for the extra.
+    lsl     tmp1, tmp1, #3              // Bytes beyond alignment -> bits.
+    ldr     data1, [src1], #8
+    neg     tmp1, tmp1                  // Bits to alignment -64.
+    ldr     data2, [src2], #8
+    mov     tmp2, #~0
+
+    // Little-endian.  Early bytes are at LSB.
+    lsr     tmp2, tmp2, tmp1            // Shift (tmp1 & 63).
+    add     limit_wd, limit, #7
+    orr     data1, data1, tmp2
+    orr     data2, data2, tmp2
+    lsr     limit_wd, limit_wd, #3
+    b       .Lstart_realigned
+
+    .p2align 6
+.Lmisaligned8:
+    sub     limit, limit, #1
+1:
+    // Perhaps we can do better than this.
+    ldrb    data1w, [src1], #1
+    ldrb    data2w, [src2], #1
+    subs    limit, limit, #1
+    ccmp    data1w, data2w, #0, cs      // NZCV = 0b0000.
+    b.eq    1b
+    sub     result, data1, data2
+    ret
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S
new file mode 100644
index 000000000000..10b55b065c47
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/CopyMem.S
@@ -0,0 +1,284 @@
+//
+// Copyright (c) 2012 - 2016, Linaro Limited
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//     * Redistributions of source code must retain the above copyright
+//       notice, this list of conditions and the following disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in the
+//       documentation and/or other materials provided with the distribution.
+//     * Neither the name of the Linaro nor the
+//       names of its contributors may be used to endorse or promote products
+//       derived from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+//
+// Copyright (c) 2015 ARM Ltd
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// 1. Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+// 2. Redistributions in binary form must reproduce the above copyright
+//    notice, this list of conditions and the following disclaimer in the
+//    documentation and/or other materials provided with the distribution.
+// 3. The name of the company may not be used to endorse or promote
+//    products derived from this software without specific prior written
+//    permission.
+//
+// THIS SOFTWARE IS PROVIDED BY ARM LTD ``AS IS'' AND ANY EXPRESS OR IMPLIED
+// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+// IN NO EVENT SHALL ARM LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+// TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+// Assumptions:
+//
+// ARMv8-a, AArch64, unaligned accesses.
+//
+//
+
+#define dstin     x0
+#define src       x1
+#define count     x2
+#define dst       x3
+#define srcend    x4
+#define dstend    x5
+#define A_l       x6
+#define A_lw      w6
+#define A_h       x7
+#define A_hw      w7
+#define B_l       x8
+#define B_lw      w8
+#define B_h       x9
+#define C_l       x10
+#define C_h       x11
+#define D_l       x12
+#define D_h       x13
+#define E_l       x14
+#define E_h       x15
+#define F_l       srcend
+#define F_h       dst
+#define tmp1      x9
+#define tmp2      x3
+
+#define L(l) .L ## l
+
+// Copies are split into 3 main cases: small copies of up to 16 bytes,
+// medium copies of 17..96 bytes which are fully unrolled. Large copies
+// of more than 96 bytes align the destination and use an unrolled loop
+// processing 64 bytes per iteration.
+// Small and medium copies read all data before writing, allowing any
+// kind of overlap, and memmove tailcalls memcpy for these cases as
+// well as non-overlapping copies.
+
+__memcpy:
+    prfm    PLDL1KEEP, [src]
+    add     srcend, src, count
+    add     dstend, dstin, count
+    cmp     count, 16
+    b.ls    L(copy16)
+    cmp     count, 96
+    b.hi    L(copy_long)
+
+    // Medium copies: 17..96 bytes.
+    sub     tmp1, count, 1
+    ldp     A_l, A_h, [src]
+    tbnz    tmp1, 6, L(copy96)
+    ldp     D_l, D_h, [srcend, -16]
+    tbz     tmp1, 5, 1f
+    ldp     B_l, B_h, [src, 16]
+    ldp     C_l, C_h, [srcend, -32]
+    stp     B_l, B_h, [dstin, 16]
+    stp     C_l, C_h, [dstend, -32]
+1:
+    stp     A_l, A_h, [dstin]
+    stp     D_l, D_h, [dstend, -16]
+    ret
+
+    .p2align 4
+    // Small copies: 0..16 bytes.
+L(copy16):
+    cmp     count, 8
+    b.lo    1f
+    ldr     A_l, [src]
+    ldr     A_h, [srcend, -8]
+    str     A_l, [dstin]
+    str     A_h, [dstend, -8]
+    ret
+    .p2align 4
+1:
+    tbz     count, 2, 1f
+    ldr     A_lw, [src]
+    ldr     A_hw, [srcend, -4]
+    str     A_lw, [dstin]
+    str     A_hw, [dstend, -4]
+    ret
+
+    // Copy 0..3 bytes.  Use a branchless sequence that copies the same
+    // byte 3 times if count==1, or the 2nd byte twice if count==2.
+1:
+    cbz     count, 2f
+    lsr     tmp1, count, 1
+    ldrb    A_lw, [src]
+    ldrb    A_hw, [srcend, -1]
+    ldrb    B_lw, [src, tmp1]
+    strb    A_lw, [dstin]
+    strb    B_lw, [dstin, tmp1]
+    strb    A_hw, [dstend, -1]
+2:  ret
+
+    .p2align 4
+    // Copy 65..96 bytes.  Copy 64 bytes from the start and
+    // 32 bytes from the end.
+L(copy96):
+    ldp     B_l, B_h, [src, 16]
+    ldp     C_l, C_h, [src, 32]
+    ldp     D_l, D_h, [src, 48]
+    ldp     E_l, E_h, [srcend, -32]
+    ldp     F_l, F_h, [srcend, -16]
+    stp     A_l, A_h, [dstin]
+    stp     B_l, B_h, [dstin, 16]
+    stp     C_l, C_h, [dstin, 32]
+    stp     D_l, D_h, [dstin, 48]
+    stp     E_l, E_h, [dstend, -32]
+    stp     F_l, F_h, [dstend, -16]
+    ret
+
+    // Align DST to 16 byte alignment so that we don't cross cache line
+    // boundaries on both loads and stores. There are at least 96 bytes
+    // to copy, so copy 16 bytes unaligned and then align.  The loop
+    // copies 64 bytes per iteration and prefetches one iteration ahead.
+
+    .p2align 4
+L(copy_long):
+    and     tmp1, dstin, 15
+    bic     dst, dstin, 15
+    ldp     D_l, D_h, [src]
+    sub     src, src, tmp1
+    add     count, count, tmp1      // Count is now 16 too large.
+    ldp     A_l, A_h, [src, 16]
+    stp     D_l, D_h, [dstin]
+    ldp     B_l, B_h, [src, 32]
+    ldp     C_l, C_h, [src, 48]
+    ldp     D_l, D_h, [src, 64]!
+    subs    count, count, 128 + 16  // Test and readjust count.
+    b.ls    2f
+1:
+    stp     A_l, A_h, [dst, 16]
+    ldp     A_l, A_h, [src, 16]
+    stp     B_l, B_h, [dst, 32]
+    ldp     B_l, B_h, [src, 32]
+    stp     C_l, C_h, [dst, 48]
+    ldp     C_l, C_h, [src, 48]
+    stp     D_l, D_h, [dst, 64]!
+    ldp     D_l, D_h, [src, 64]!
+    subs    count, count, 64
+    b.hi    1b
+
+    // Write the last full set of 64 bytes.  The remainder is at most 64
+    // bytes, so it is safe to always copy 64 bytes from the end even if
+    // there is just 1 byte left.
+2:
+    ldp     E_l, E_h, [srcend, -64]
+    stp     A_l, A_h, [dst, 16]
+    ldp     A_l, A_h, [srcend, -48]
+    stp     B_l, B_h, [dst, 32]
+    ldp     B_l, B_h, [srcend, -32]
+    stp     C_l, C_h, [dst, 48]
+    ldp     C_l, C_h, [srcend, -16]
+    stp     D_l, D_h, [dst, 64]
+    stp     E_l, E_h, [dstend, -64]
+    stp     A_l, A_h, [dstend, -48]
+    stp     B_l, B_h, [dstend, -32]
+    stp     C_l, C_h, [dstend, -16]
+    ret
+
+
+//
+// All memmoves up to 96 bytes are done by memcpy as it supports overlaps.
+// Larger backwards copies are also handled by memcpy. The only remaining
+// case is forward large copies.  The destination is aligned, and an
+// unrolled loop processes 64 bytes per iteration.
+//
+
+ASM_GLOBAL ASM_PFX(InternalMemCopyMem)
+ASM_PFX(InternalMemCopyMem):
+    sub     tmp2, dstin, src
+    cmp     count, 96
+    ccmp    tmp2, count, 2, hi
+    b.hs    __memcpy
+
+    cbz     tmp2, 3f
+    add     dstend, dstin, count
+    add     srcend, src, count
+
+    // Align dstend to 16 byte alignment so that we don't cross cache line
+    // boundaries on both loads and stores. There are at least 96 bytes
+    // to copy, so copy 16 bytes unaligned and then align. The loop
+    // copies 64 bytes per iteration and prefetches one iteration ahead.
+
+    and     tmp2, dstend, 15
+    ldp     D_l, D_h, [srcend, -16]
+    sub     srcend, srcend, tmp2
+    sub     count, count, tmp2
+    ldp     A_l, A_h, [srcend, -16]
+    stp     D_l, D_h, [dstend, -16]
+    ldp     B_l, B_h, [srcend, -32]
+    ldp     C_l, C_h, [srcend, -48]
+    ldp     D_l, D_h, [srcend, -64]!
+    sub     dstend, dstend, tmp2
+    subs    count, count, 128
+    b.ls    2f
+    nop
+1:
+    stp     A_l, A_h, [dstend, -16]
+    ldp     A_l, A_h, [srcend, -16]
+    stp     B_l, B_h, [dstend, -32]
+    ldp     B_l, B_h, [srcend, -32]
+    stp     C_l, C_h, [dstend, -48]
+    ldp     C_l, C_h, [srcend, -48]
+    stp     D_l, D_h, [dstend, -64]!
+    ldp     D_l, D_h, [srcend, -64]!
+    subs    count, count, 64
+    b.hi    1b
+
+    // Write the last full set of 64 bytes. The remainder is at most 64
+    // bytes, so it is safe to always copy 64 bytes from the start even if
+    // there is just 1 byte left.
+2:
+    ldp     E_l, E_h, [src, 48]
+    stp     A_l, A_h, [dstend, -16]
+    ldp     A_l, A_h, [src, 32]
+    stp     B_l, B_h, [dstend, -32]
+    ldp     B_l, B_h, [src, 16]
+    stp     C_l, C_h, [dstend, -48]
+    ldp     C_l, C_h, [src]
+    stp     D_l, D_h, [dstend, -64]
+    stp     E_l, E_h, [dstin, 48]
+    stp     A_l, A_h, [dstin, 32]
+    stp     B_l, B_h, [dstin, 16]
+    stp     C_l, C_h, [dstin]
+3:  ret
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S
new file mode 100644
index 000000000000..08e1fbb17082
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/ScanMem.S
@@ -0,0 +1,161 @@
+//
+// Copyright (c) 2014, ARM Limited
+// All rights Reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//     * Redistributions of source code must retain the above copyright
+//       notice, this list of conditions and the following disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in the
+//       documentation and/or other materials provided with the distribution.
+//     * Neither the name of the company nor the names of its contributors
+//       may be used to endorse or promote products derived from this
+//       software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+// Assumptions:
+//
+// ARMv8-a, AArch64
+// Neon Available.
+//
+
+// Arguments and results.
+#define srcin     x0
+#define cntin     x1
+#define chrin     w2
+
+#define result    x0
+
+#define src       x3
+#define tmp       x4
+#define wtmp2     w5
+#define synd      x6
+#define soff      x9
+#define cntrem    x10
+
+#define vrepchr   v0
+#define vdata1    v1
+#define vdata2    v2
+#define vhas_chr1 v3
+#define vhas_chr2 v4
+#define vrepmask  v5
+#define vend      v6
+
+//
+// Core algorithm:
+//
+// For each 32-byte chunk we calculate a 64-bit syndrome value, with two bits
+// per byte. For each tuple, bit 0 is set if the relevant byte matched the
+// requested character and bit 1 is not used (faster than using a 32bit
+// syndrome). Since the bits in the syndrome reflect exactly the order in which
+// things occur in the original string, counting trailing zeros allows us to
+// identify exactly which byte has matched.
+//
+
+ASM_GLOBAL ASM_PFX(InternalMemScanMem8)
+ASM_PFX(InternalMemScanMem8):
+    // Do not dereference srcin if no bytes to compare.
+    cbz     cntin, .Lzero_length
+    //
+    // Magic constant 0x40100401 allows us to identify which lane matches
+    // the requested byte.
+    //
+    mov     wtmp2, #0x0401
+    movk    wtmp2, #0x4010, lsl #16
+    dup     vrepchr.16b, chrin
+    // Work with aligned 32-byte chunks
+    bic     src, srcin, #31
+    dup     vrepmask.4s, wtmp2
+    ands    soff, srcin, #31
+    and     cntrem, cntin, #31
+    b.eq    .Lloop
+
+    //
+    // Input string is not 32-byte aligned. We calculate the syndrome
+    // value for the aligned 32 bytes block containing the first bytes
+    // and mask the irrelevant part.
+    //
+
+    ld1     {vdata1.16b, vdata2.16b}, [src], #32
+    sub     tmp, soff, #32
+    adds    cntin, cntin, tmp
+    cmeq    vhas_chr1.16b, vdata1.16b, vrepchr.16b
+    cmeq    vhas_chr2.16b, vdata2.16b, vrepchr.16b
+    and     vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
+    and     vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
+    addp    vend.16b, vhas_chr1.16b, vhas_chr2.16b        // 256->128
+    addp    vend.16b, vend.16b, vend.16b                  // 128->64
+    mov     synd, vend.d[0]
+    // Clear the soff*2 lower bits
+    lsl     tmp, soff, #1
+    lsr     synd, synd, tmp
+    lsl     synd, synd, tmp
+    // The first block can also be the last
+    b.ls    .Lmasklast
+    // Have we found something already?
+    cbnz    synd, .Ltail
+
+.Lloop:
+    ld1     {vdata1.16b, vdata2.16b}, [src], #32
+    subs    cntin, cntin, #32
+    cmeq    vhas_chr1.16b, vdata1.16b, vrepchr.16b
+    cmeq    vhas_chr2.16b, vdata2.16b, vrepchr.16b
+    // If we're out of data we finish regardless of the result
+    b.ls    .Lend
+    // Use a fast check for the termination condition
+    orr     vend.16b, vhas_chr1.16b, vhas_chr2.16b
+    addp    vend.2d, vend.2d, vend.2d
+    mov     synd, vend.d[0]
+    // We're not out of data, loop if we haven't found the character
+    cbz     synd, .Lloop
+
+.Lend:
+    // Termination condition found, let's calculate the syndrome value
+    and     vhas_chr1.16b, vhas_chr1.16b, vrepmask.16b
+    and     vhas_chr2.16b, vhas_chr2.16b, vrepmask.16b
+    addp    vend.16b, vhas_chr1.16b, vhas_chr2.16b      // 256->128
+    addp    vend.16b, vend.16b, vend.16b                // 128->64
+    mov     synd, vend.d[0]
+    // Only do the clear for the last possible block
+    b.hi    .Ltail
+
+.Lmasklast:
+    // Clear the (32 - ((cntrem + soff) % 32)) * 2 upper bits
+    add     tmp, cntrem, soff
+    and     tmp, tmp, #31
+    sub     tmp, tmp, #32
+    neg     tmp, tmp, lsl #1
+    lsl     synd, synd, tmp
+    lsr     synd, synd, tmp
+
+.Ltail:
+    // Count the trailing zeros using bit reversing
+    rbit    synd, synd
+    // Compensate the last post-increment
+    sub     src, src, #32
+    // Check that we have found a character
+    cmp     synd, #0
+    // And count the leading zeros
+    clz     synd, synd
+    // Compute the potential result
+    add     result, src, synd, lsr #1
+    // Select result or NULL
+    csel    result, xzr, result, eq
+    ret
+
+.Lzero_length:
+    mov   result, #0
+    ret
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S
new file mode 100644
index 000000000000..7f361110d4fe
--- /dev/null
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/AArch64/SetMem.S
@@ -0,0 +1,244 @@
+//
+// Copyright (c) 2012 - 2016, Linaro Limited
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are met:
+//     * Redistributions of source code must retain the above copyright
+//       notice, this list of conditions and the following disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in the
+//       documentation and/or other materials provided with the distribution.
+//     * Neither the name of the Linaro nor the
+//       names of its contributors may be used to endorse or promote products
+//       derived from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+//
+// Copyright (c) 2015 ARM Ltd
+// All rights reserved.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions
+// are met:
+// 1. Redistributions of source code must retain the above copyright
+//    notice, this list of conditions and the following disclaimer.
+// 2. Redistributions in binary form must reproduce the above copyright
+//    notice, this list of conditions and the following disclaimer in the
+//    documentation and/or other materials provided with the distribution.
+// 3. The name of the company may not be used to endorse or promote
+//    products derived from this software without specific prior written
+//    permission.
+//
+// THIS SOFTWARE IS PROVIDED BY ARM LTD ``AS IS'' AND ANY EXPRESS OR IMPLIED
+// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+// IN NO EVENT SHALL ARM LTD BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+// TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+//
+
+// Assumptions:
+//
+// ARMv8-a, AArch64, unaligned accesses
+//
+//
+
+#define dstin     x0
+#define count     x1
+#define val       x2
+#define valw      w2
+#define dst       x3
+#define dstend    x4
+#define tmp1      x5
+#define tmp1w     w5
+#define tmp2      x6
+#define tmp2w     w6
+#define zva_len   x7
+#define zva_lenw  w7
+
+#define L(l) .L ## l
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem16)
+ASM_PFX(InternalMemSetMem16):
+    dup     v0.8H, valw
+    b       0f
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem32)
+ASM_PFX(InternalMemSetMem32):
+    dup     v0.4S, valw
+    b       0f
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem64)
+ASM_PFX(InternalMemSetMem64):
+    dup     v0.2D, val
+    b       0f
+
+ASM_GLOBAL ASM_PFX(InternalMemZeroMem)
+ASM_PFX(InternalMemZeroMem):
+    movi    v0.16B, #0
+    b       0f
+
+ASM_GLOBAL ASM_PFX(InternalMemSetMem)
+ASM_PFX(InternalMemSetMem):
+    dup     v0.16B, valw
+0:  add     dstend, dstin, count
+    mov     val, v0.D[0]
+
+    cmp     count, 96
+    b.hi    L(set_long)
+    cmp     count, 16
+    b.hs    L(set_medium)
+
+    // Set 0..15 bytes.
+    tbz     count, 3, 1f
+    str     val, [dstin]
+    str     val, [dstend, -8]
+    ret
+    nop
+1:  tbz     count, 2, 2f
+    str     valw, [dstin]
+    str     valw, [dstend, -4]
+    ret
+2:  cbz     count, 3f
+    strb    valw, [dstin]
+    tbz     count, 1, 3f
+    strh    valw, [dstend, -2]
+3:  ret
+
+    // Set 16..96 bytes.
+L(set_medium):
+    str     q0, [dstin]
+    tbnz    count, 6, L(set96)
+    str     q0, [dstend, -16]
+    tbz     count, 5, 1f
+    str     q0, [dstin, 16]
+    str     q0, [dstend, -32]
+1:  ret
+
+    .p2align 4
+    // Set 64..96 bytes.  Write 64 bytes from the start and
+    // 32 bytes from the end.
+L(set96):
+    str     q0, [dstin, 16]
+    stp     q0, q0, [dstin, 32]
+    stp     q0, q0, [dstend, -32]
+    ret
+
+    .p2align 3
+    nop
+L(set_long):
+    bic     dst, dstin, 15
+    str     q0, [dstin]
+    cmp     count, 256
+    ccmp    val, 0, 0, cs
+    b.eq    L(try_zva)
+L(no_zva):
+    sub     count, dstend, dst        // Count is 16 too large.
+    add     dst, dst, 16
+    sub     count, count, 64 + 16     // Adjust count and bias for loop.
+1:  stp     q0, q0, [dst], 64
+    stp     q0, q0, [dst, -32]
+L(tail64):
+    subs    count, count, 64
+    b.hi    1b
+2:  stp     q0, q0, [dstend, -64]
+    stp     q0, q0, [dstend, -32]
+    ret
+
+    .p2align 3
+L(try_zva):
+    mrs     tmp1, dczid_el0
+    tbnz    tmp1w, 4, L(no_zva)
+    and     tmp1w, tmp1w, 15
+    cmp     tmp1w, 4                  // ZVA size is 64 bytes.
+    b.ne    L(zva_128)
+
+    // Write the first and last 64 byte aligned block using stp rather
+    // than using DC ZVA.  This is faster on some cores.
+L(zva_64):
+    str     q0, [dst, 16]
+    stp     q0, q0, [dst, 32]
+    bic     dst, dst, 63
+    stp     q0, q0, [dst, 64]
+    stp     q0, q0, [dst, 96]
+    sub     count, dstend, dst         // Count is now 128 too large.
+    sub     count, count, 128+64+64    // Adjust count and bias for loop.
+    add     dst, dst, 128
+    nop
+1:  dc      zva, dst
+    add     dst, dst, 64
+    subs    count, count, 64
+    b.hi    1b
+    stp     q0, q0, [dst, 0]
+    stp     q0, q0, [dst, 32]
+    stp     q0, q0, [dstend, -64]
+    stp     q0, q0, [dstend, -32]
+    ret
+
+    .p2align 3
+L(zva_128):
+    cmp     tmp1w, 5                    // ZVA size is 128 bytes.
+    b.ne    L(zva_other)
+
+    str     q0, [dst, 16]
+    stp     q0, q0, [dst, 32]
+    stp     q0, q0, [dst, 64]
+    stp     q0, q0, [dst, 96]
+    bic     dst, dst, 127
+    sub     count, dstend, dst          // Count is now 128 too large.
+    sub     count, count, 128+128       // Adjust count and bias for loop.
+    add     dst, dst, 128
+1:  dc      zva, dst
+    add     dst, dst, 128
+    subs    count, count, 128
+    b.hi    1b
+    stp     q0, q0, [dstend, -128]
+    stp     q0, q0, [dstend, -96]
+    stp     q0, q0, [dstend, -64]
+    stp     q0, q0, [dstend, -32]
+    ret
+
+L(zva_other):
+    mov     tmp2w, 4
+    lsl     zva_lenw, tmp2w, tmp1w
+    add     tmp1, zva_len, 64           // Max alignment bytes written.
+    cmp     count, tmp1
+    blo     L(no_zva)
+
+    sub     tmp2, zva_len, 1
+    add     tmp1, dst, zva_len
+    add     dst, dst, 16
+    subs    count, tmp1, dst            // Actual alignment bytes to write.
+    bic     tmp1, tmp1, tmp2            // Aligned dc zva start address.
+    beq     2f
+1:  stp     q0, q0, [dst], 64
+    stp     q0, q0, [dst, -32]
+    subs    count, count, 64
+    b.hi    1b
+2:  mov     dst, tmp1
+    sub     count, dstend, tmp1         // Remaining bytes to write.
+    subs    count, count, zva_len
+    b.lo    4f
+3:  dc      zva, dst
+    add     dst, dst, zva_len
+    subs    count, count, zva_len
+    b.hs    3b
+4:  add     count, count, zva_len
+    b       L(tail64)
diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
index d95eb599ea9e..64d11b09ef06 100644
--- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
@@ -27,7 +27,7 @@ [Defines]
 
 
 #
-#  VALID_ARCHITECTURES           = IA32 X64 ARM
+#  VALID_ARCHITECTURES           = IA32 X64 ARM AARCH64
 #
 
 [Sources]
@@ -127,6 +127,13 @@ [Sources.ARM]
   Arm/CopyMem.asm     |RVCT
   Arm/CompareMem.asm  |RVCT
 
+[Sources.AARCH64]
+  AArch64/ScanMem.S
+  AArch64/SetMem.S
+  AArch64/CopyMem.S
+  AArch64/CompareMem.S
+
+[Sources.ARM, Sources.AARCH64]
   Arm/ScanMemGeneric.c
 
 [Sources]
-- 
2.7.4
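
For anyone reading the "Core algorithm" comment in ScanMem.S, the syndrome computation can be modelled in plain C roughly as below. This is only a sketch under stated assumptions: the helper names (chunk_syndrome, mask_head, first_match_index) are made up for illustration and do not appear in the patch, and the scalar loop stands in for the NEON cmeq/and/addp sequence rather than reproducing it.

```c
#include <stdint.h>

/* Hypothetical scalar model of the syndrome for one aligned 32-byte chunk:
 * bit 2*i is set when chunk[i] equals ch; the odd bits stay clear. */
static uint64_t chunk_syndrome(const uint8_t chunk[32], uint8_t ch)
{
    uint64_t synd = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (chunk[i] == ch) {
            synd |= (uint64_t)1 << (2 * i);
        }
    }
    return synd;
}

/* Drop matches in the soff bytes that precede the real start of the buffer
 * (soff is 0..31), mirroring the lsr/lsl pair in the unaligned-head path. */
static uint64_t mask_head(uint64_t synd, unsigned soff)
{
    return (synd >> (2 * soff)) << (2 * soff);
}

/* Index of the first matching byte in the chunk, or -1 if there is none.
 * The assembly gets the same count from rbit + clz and then halves it. */
static int first_match_index(uint64_t synd)
{
    int tz = 0;
    if (synd == 0) {
        return -1;
    }
    while ((synd & 1) == 0) {
        synd >>= 1;
        tz++;
    }
    return tz / 2;
}
```

Because the syndrome keeps the bytes in memory order, the position of the lowest set bit divided by two is the offset of the first match within the chunk, which is what the .Ltail code adds to the chunk base address.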




* [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2016-09-09 14:00 [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib Ard Biesheuvel
                   ` (2 preceding siblings ...)
  2016-09-09 14:00 ` [PATCH v5 3/4] MdePkg/BaseMemoryLibOptDxe: add accelerated AARCH64 routines Ard Biesheuvel
@ 2016-09-09 14:00 ` Ard Biesheuvel
  2016-09-13 14:49   ` Ard Biesheuvel
  2017-04-05 20:12   ` Jeremy Linton
  3 siblings, 2 replies; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-09 14:00 UTC (permalink / raw)
  To: edk2-devel, liming.gao, leif.lindholm, michael.d.kinney; +Cc: Ard Biesheuvel

The new accelerated ARM and AARCH64 implementations take advantage of
features that are only available when the MMU and Dcache are on. So
restrict the use of this library to the DXE phase or later.

Contributed-under: TianoCore Contribution Agreement 1.0
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
index 64d11b09ef06..5ddc0cbc2d77 100644
--- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
+++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
@@ -116,6 +116,15 @@ [Sources.X64]
   X64/CopyMem.S
   X64/IsZeroBuffer.nasm
 
+[Defines.ARM, Defines.AARCH64]
+  #
+  # The ARM implementations of this library may perform unaligned accesses, and
+  # may use DC ZVA instructions that are only allowed when the MMU and D-cache
+  # are on. Since SEC, PEI_CORE and PEIM modules may execute with the MMU off,
+  # omit them from the supported module types list for this library.
+  #
+  LIBRARY_CLASS = BaseMemoryLib|DXE_CORE DXE_DRIVER DXE_RUNTIME_DRIVER UEFI_DRIVER UEFI_APPLICATION
+
 [Sources.ARM]
   Arm/ScanMem.S       |GCC
   Arm/SetMem.S        |GCC
-- 
2.7.4
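
To illustrate the DC ZVA dependency that the INF comment refers to, the checks SetMem.S makes before using the instruction can be modelled in C roughly as follows. This is a sketch, not code from the series: zva_block_bytes and should_try_zva are hypothetical names, and the caller is assumed to have already read DCZID_EL0 (the assembly does this with mrs).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a DCZID_EL0 value. Returns the DC ZVA block size in bytes, or 0
 * when bit 4 (DZP) says the instruction is prohibited. The low four bits
 * hold log2 of the block size in 4-byte words. */
static uint32_t zva_block_bytes(uint64_t dczid)
{
    if (dczid & (1u << 4)) {
        return 0;
    }
    return 4u << (dczid & 15);
}

/* Mirrors "cmp count, 256; ccmp val, 0, 0, cs; b.eq L(try_zva)": the ZVA
 * path is only considered for zero fills of at least 256 bytes. */
static bool should_try_zva(size_t count, uint64_t val)
{
    return count >= 256 && val == 0;
}
```

A block-size field of 4 decodes to 64 bytes and 5 to 128 bytes, which are the two cases the assembly special-cases before falling back to the generic zva_other path.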




* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2016-09-09 14:00 ` [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases Ard Biesheuvel
@ 2016-09-13 14:49   ` Ard Biesheuvel
  2016-09-13 15:00     ` Gao, Liming
  2017-04-05 20:12   ` Jeremy Linton
  1 sibling, 1 reply; 14+ messages in thread
From: Ard Biesheuvel @ 2016-09-13 14:49 UTC (permalink / raw)
  To: edk2-devel-01, Gao, Liming, Leif Lindholm, Kinney, Michael D
  Cc: Ard Biesheuvel

Liming: do you have any comments on this patch?


On 9 September 2016 at 15:00, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> The new accelerated ARM and AARCH64 implementations take advantage of
> features that are only available when the MMU and Dcache are on. So
> restrict the use of this library to the DXE phase or later.
>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> index 64d11b09ef06..5ddc0cbc2d77 100644
> --- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> +++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> @@ -116,6 +116,15 @@ [Sources.X64]
>    X64/CopyMem.S
>    X64/IsZeroBuffer.nasm
>
> +[Defines.ARM, Defines.AARCH64]
> +  #
> +  # The ARM implementations of this library may perform unaligned accesses, and
> +  # may use DC ZVA instructions that are only allowed when the MMU and D-cache
> +  # are on. Since SEC, PEI_CORE and PEIM modules may execute with the MMU off,
> +  # omit them from the supported module types list for this library.
> +  #
> +  LIBRARY_CLASS = BaseMemoryLib|DXE_CORE DXE_DRIVER DXE_RUNTIME_DRIVER UEFI_DRIVER UEFI_APPLICATION
> +
>  [Sources.ARM]
>    Arm/ScanMem.S       |GCC
>    Arm/SetMem.S        |GCC
> --
> 2.7.4
>



* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2016-09-13 14:49   ` Ard Biesheuvel
@ 2016-09-13 15:00     ` Gao, Liming
  0 siblings, 0 replies; 14+ messages in thread
From: Gao, Liming @ 2016-09-13 15:00 UTC (permalink / raw)
  To: Ard Biesheuvel, edk2-devel-01, Leif Lindholm, Kinney, Michael D

I have no comment.
Reviewed-by: Liming Gao <liming.gao@intel.com>

From: Ard Biesheuvel [mailto:ard.biesheuvel@linaro.org]
Sent: Tuesday, September 13, 2016 10:50 PM
To: edk2-devel-01 <edk2-devel@lists.01.org>; Gao, Liming <liming.gao@intel.com>; Leif Lindholm <leif.lindholm@linaro.org>; Kinney, Michael D <michael.d.kinney@intel.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases

Liming: do you have any comments on this patch?


On 9 September 2016 at 15:00, Ard Biesheuvel wrote:
> The new accelerated ARM and AARCH64 implementations take advantage of
> features that are only available when the MMU and Dcache are on. So
> restrict the use of this library to the DXE phase or later.
>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Ard Biesheuvel
> ---
>  MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> index 64d11b09ef06..5ddc0cbc2d77 100644
> --- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> +++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> @@ -116,6 +116,15 @@ [Sources.X64]
>    X64/CopyMem.S
>    X64/IsZeroBuffer.nasm
>
> +[Defines.ARM, Defines.AARCH64]
> +  #
> +  # The ARM implementations of this library may perform unaligned accesses, and
> +  # may use DC ZVA instructions that are only allowed when the MMU and D-cache
> +  # are on. Since SEC, PEI_CORE and PEIM modules may execute with the MMU off,
> +  # omit them from the supported module types list for this library.
> +  #
> +  LIBRARY_CLASS = BaseMemoryLib|DXE_CORE DXE_DRIVER DXE_RUNTIME_DRIVER UEFI_DRIVER UEFI_APPLICATION
> +
>  [Sources.ARM]
>    Arm/ScanMem.S       |GCC
>    Arm/SetMem.S        |GCC
> --
> 2.7.4
>


* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2016-09-09 14:00 ` [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases Ard Biesheuvel
  2016-09-13 14:49   ` Ard Biesheuvel
@ 2017-04-05 20:12   ` Jeremy Linton
  2017-04-05 20:34     ` Ard Biesheuvel
  1 sibling, 1 reply; 14+ messages in thread
From: Jeremy Linton @ 2017-04-05 20:12 UTC (permalink / raw)
  To: Ard Biesheuvel, edk2-devel, liming.gao, leif.lindholm,
	michael.d.kinney

Hi,

On 09/09/2016 09:00 AM, Ard Biesheuvel wrote:
> The new accelerated ARM and AARCH64 implementations take advantage of
> features that are only available when the MMU and Dcache are on. So
> restrict the use of this library to the DXE phase or later.

I don't think this is sufficient because DC ZVA doesn't work against 
device memory/etc. That means that users have to somehow know the 
page/etc attributes of memory regions before they call SetMemXX() on them.

I think this is a problem because nowhere in the UEFI specs do I see 
such restrictions on those memory operations.

For a specific problematic example, the LcdGraphicsOutputBlt.c uses it 
for BltVideoFill() and the target of that is likely not regular cached 
video memory.
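
To make the constraint concrete: a fill that is safe no matter how the
target is mapped has to restrict itself to naturally aligned stores and
must not use DC ZVA. The following is only a minimal sketch of that idea
in plain C; FillAlignedOnly() is a hypothetical name and is not a routine
from BaseMemoryLib or from this series.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of BaseMemoryLib): fill Length bytes at
 * Buffer with Value using only naturally aligned stores, so that no access
 * relies on the mapping tolerating unaligned accesses or DC ZVA.
 */
void *
FillAlignedOnly (void *Buffer, size_t Length, uint8_t Value)
{
  volatile uint8_t *Byte = Buffer;
  uint64_t         Wide  = 0x0101010101010101ULL * Value;

  /* Head: byte stores until the pointer is 8-byte aligned. */
  while (Length > 0 && ((uintptr_t)Byte & 7) != 0) {
    *Byte++ = Value;
    Length--;
  }

  /* Body: naturally aligned 64-bit stores. */
  while (Length >= 8) {
    *(volatile uint64_t *)Byte = Wide;
    Byte   += 8;
    Length -= 8;
  }

  /* Tail: whatever is left, one byte at a time. */
  while (Length > 0) {
    *Byte++ = Value;
    Length--;
  }

  return Buffer;
}
```

This trades speed for robustness, which is exactly the trade-off being
questioned here.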



>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> index 64d11b09ef06..5ddc0cbc2d77 100644
> --- a/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> +++ b/MdePkg/Library/BaseMemoryLibOptDxe/BaseMemoryLibOptDxe.inf
> @@ -116,6 +116,15 @@ [Sources.X64]
>    X64/CopyMem.S
>    X64/IsZeroBuffer.nasm
>
> +[Defines.ARM, Defines.AARCH64]
> +  #
> +  # The ARM implementations of this library may perform unaligned accesses, and
> +  # may use DC ZVA instructions that are only allowed when the MMU and D-cache
> +  # are on. Since SEC, PEI_CORE and PEIM modules may execute with the MMU off,
> +  # omit them from the supported module types list for this library.
> +  #
> +  LIBRARY_CLASS = BaseMemoryLib|DXE_CORE DXE_DRIVER DXE_RUNTIME_DRIVER UEFI_DRIVER UEFI_APPLICATION
> +
>  [Sources.ARM]
>    Arm/ScanMem.S       |GCC
>    Arm/SetMem.S        |GCC
>




* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-05 20:12   ` Jeremy Linton
@ 2017-04-05 20:34     ` Ard Biesheuvel
  2017-04-05 21:28       ` Jeremy Linton
  0 siblings, 1 reply; 14+ messages in thread
From: Ard Biesheuvel @ 2017-04-05 20:34 UTC (permalink / raw)
  To: Jeremy Linton, Leif Lindholm
  Cc: edk2-devel@lists.01.org, Gao, Liming, Kinney, Michael D

On 5 April 2017 at 21:12, Jeremy Linton <jeremy.linton@arm.com> wrote:
> Hi,
>
> On 09/09/2016 09:00 AM, Ard Biesheuvel wrote:
>>
>> The new accelerated ARM and AARCH64 implementations take advantage of
>> features that are only available when the MMU and Dcache are on. So
>> restrict the use of this library to the DXE phase or later.
>
>
> I don't think this is sufficient because DC ZVA doesn't work against device
> memory/etc. That means that users have to somehow know the page/etc
> attributes of memory regions before they call SetMemXX() on them.
>

Yes. I literally found this out myself yesterday. Note that this
applies equally to unaligned accesses.


> I think this is a problem because nowhere in the UEFI specs do I see such
> restrictions on those memory operations.
>

Using device attributes for memory is something we should ban for
AArch64 in the spec.

> For a specific problematic example, the LcdGraphicsOutputBlt.c uses it for
> BltVideoFill() and the target of that is likely not regular cached video
> memory.
>

Those drivers should be using EFI_MEMORY_WC, not EFI_MEMORY_UC, for the
VRAM mapping. Note that EFI_MEMORY_UC maps to Device-nGnRnE memory,
which is unnecessarily restrictive for a framebuffer.
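
As a rough sketch of that distinction (illustrative only):
IsNormalMemoryAttribute() is a made-up helper, and the claim that _WC, _WT
and _WB all end up as some flavour of Normal memory is an assumption about
how the ARM MMU code translates these attributes. Only the bit values
themselves come from the UEFI spec.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cacheability attribute bits as defined by the UEFI specification. */
#define EFI_MEMORY_UC  0x0000000000000001ULL
#define EFI_MEMORY_WC  0x0000000000000002ULL
#define EFI_MEMORY_WT  0x0000000000000004ULL
#define EFI_MEMORY_WB  0x0000000000000008ULL

/*
 * Hypothetical helper: returns true if a region with the given attributes
 * is expected to be Normal memory on AArch64, i.e. memory on which
 * unaligned accesses and DC ZVA do not fault.
 * Assumption: only EFI_MEMORY_UC is mapped as Device (nGnRnE) memory.
 */
bool
IsNormalMemoryAttribute (uint64_t Attributes)
{
  uint64_t CacheType;

  CacheType = Attributes &
              (EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT | EFI_MEMORY_WB);

  /* No cacheability attribute at all: treat as unknown, so not safe. */
  if (CacheType == 0) {
    return false;
  }

  /* Device memory: unaligned accesses and DC ZVA are not permitted. */
  if ((CacheType & EFI_MEMORY_UC) != 0) {
    return false;
  }

  return true;
}
```

With this, a framebuffer mapped EFI_MEMORY_WC would pass the check, while
one mapped EFI_MEMORY_UC would not.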

I agree there is a general issue here, which we should address by
tightening the spec. I don't see a lot of value in avoiding DC ZVA and
unaligned accesses altogether; I'd rather fix the code instead.

Thanks,
Ard.



* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-05 20:34     ` Ard Biesheuvel
@ 2017-04-05 21:28       ` Jeremy Linton
  2017-04-05 21:55         ` Ard Biesheuvel
  0 siblings, 1 reply; 14+ messages in thread
From: Jeremy Linton @ 2017-04-05 21:28 UTC (permalink / raw)
  To: Ard Biesheuvel, Leif Lindholm
  Cc: edk2-devel@lists.01.org, Gao, Liming, Kinney, Michael D

Hi,

On 04/05/2017 03:34 PM, Ard Biesheuvel wrote:
> On 5 April 2017 at 21:12, Jeremy Linton <jeremy.linton@arm.com> wrote:
>> Hi,
>>
>> On 09/09/2016 09:00 AM, Ard Biesheuvel wrote:
>>>
>>> The new accelerated ARM and AARCH64 implementations take advantage of
>>> features that are only available when the MMU and Dcache are on. So
>>> restrict the use of this library to the DXE phase or later.
>>
>>
>> I don't think this is sufficient because DC ZVA doesn't work against device
>> memory/etc. That means that users have to somehow know the page/etc
>> attributes of memory regions before they call SetMemXX() on them.
>>
>
> Yes. I literally found this out myself yesterday. Note that this
> applies equally to unaligned accesses.
>
>
>> I think this is a problem because nowhere in the UEFI specs do I see such
>> restrictions on those memory operations.
>>
>
> Using device attributes for memory is something we should ban for
> AArch64 in the spec.
>
>> For a specific problematic example, the LcdGraphicsOutputBlt.c uses it for
>> BltVideoFill() and the target of that is likely not regular cached video
>> memory.
>>
>
> Those drivers should be using EFI_MEMORY_WC not EFI_MEMORY_UC for the
> VRAM mapping. Note that EFI_MEMORY_UC is nGnRnE which is unnecessarily
> restrictive.
>
> I agree there is a general issue here which we should address by
> tightening the spec. I don't see a lot of value in avoiding DC ZVA and
> unaligned accesses altogether, I'd rather fix the code instead.


While I agree with the general sentiment, I find the result brittle. If
it were used as a DEBUG build way to locate sub-optimal code I would be
more on board. But shipping it like this puts it into situations where
the user inadvertently changes something (say making the background
black and therefore triggering the DC) or some obscure option ROM (we
will get there right??!!) triggers it in a place where it can't be
debugged.
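
To spell out why a black background matters: DC ZVA can only write
zeroes, so a fill routine can only take that path when the fill value is
0. A sketch of such a decision follows. The DCZID_EL0 field layout is
taken from the ARMv8-A architecture, but FillWouldUseDcZva() and its
"at least one block" threshold are hypothetical and are not claimed to
match what the SetMem() in this series actually does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* DCZID_EL0 fields as described by the ARMv8-A architecture. */
#define DCZID_BS_MASK  0xFu        /* log2 of the block size in 32-bit words */
#define DCZID_DZP      (1u << 4)   /* set when DC ZVA is prohibited */

/* Size in bytes of the block zeroed by a single DC ZVA. */
size_t
DcZvaBlockSize (uint32_t Dczid)
{
  return (size_t)4 << (Dczid & DCZID_BS_MASK);
}

/*
 * Hypothetical policy: take the DC ZVA path only for zero fills that
 * cover at least one full block, and only if DC ZVA is not prohibited.
 */
bool
FillWouldUseDcZva (uint32_t Dczid, uint8_t Value, size_t Length)
{
  if (Value != 0 || (Dczid & DCZID_DZP) != 0) {
    return false;
  }

  return Length >= DcZvaBlockSize (Dczid);
}
```

Under a policy like this, the same BltVideoFill() call behaves differently
depending only on the colour, which is what makes the failure hard to
reproduce.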

Particularly since we are talking boot, where the few percent perf 
improvement on this operation is likely completely undetectable. The one 
place where I can think it might even be measurable is in routines to 
clear system memory, and those routines could be a special case anyway.





* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-05 21:28       ` Jeremy Linton
@ 2017-04-05 21:55         ` Ard Biesheuvel
  2017-04-06  9:35           ` Leif Lindholm
  0 siblings, 1 reply; 14+ messages in thread
From: Ard Biesheuvel @ 2017-04-05 21:55 UTC (permalink / raw)
  To: Jeremy Linton
  Cc: Leif Lindholm, edk2-devel@lists.01.org, Gao, Liming,
	Kinney, Michael D

On 5 April 2017 at 22:28, Jeremy Linton <jeremy.linton@arm.com> wrote:
> Hi,
>
>
> On 04/05/2017 03:34 PM, Ard Biesheuvel wrote:
>>
>> On 5 April 2017 at 21:12, Jeremy Linton <jeremy.linton@arm.com> wrote:
>>>
>>> Hi,
>>>
>>> On 09/09/2016 09:00 AM, Ard Biesheuvel wrote:
>>>>
>>>>
>>>> The new accelerated ARM and AARCH64 implementations take advantage of
>>>> features that are only available when the MMU and Dcache are on. So
>>>> restrict the use of this library to the DXE phase or later.
>>>
>>>
>>>
>>> I don't think this is sufficient because DC ZVA doesn't work against
>>> device
>>> memory/etc. That means that users have to somehow know the page/etc
>>> attributes of memory regions before they call SetMemXX() on them.
>>>
>>
>> Yes. I literally found this out myself yesterday. Note that this
>> applies equally to unaligned accesses.
>>
>>
>>> I think this is a problem because nowhere in the UEFI specs do I see such
>>> restrictions on those memory operations.
>>>
>>
>> Using device attributes for memory is something we should ban for
>> AArch64 in the spec.
>>
>>> For a specific problematic example, the LcdGraphicsOutputBlt.c uses it
>>> for
>>> BltVideoFill() and the target of that is likely not regular cached video
>>> memory.
>>>
>>
>> Those drivers should be using EFI_MEMORY_WC not EFI_MEMORY_UC for the
>> VRAM mapping. Note that EFI_MEMORY_UC is nGnRnE which is unnecessarily
>> restrictive.
>>
>> I agree there is a general issue here which we should address by
>> tightening the spec. I don't see a lot of value in avoiding DC ZVA and
>> unaligned accesses altogether, I'd rather fix the code instead.
>
>
>
> While I agree with the general sentiment, I find the result brittle. If it
> were used as a DEBUG build way to locate sub-optimal code I would be more
> on board. But shipping it like this puts it into situations where the user
> inadvertently changes something (say making the background black and
> therefore triggering the DC) or some obscure option ROM (we will get there
> right??!!) triggers it in a place where it can't be debugged.
>
> Particularly since we are talking boot, where the few percent perf
> improvement on this operation is likely completely undetectable. The one
> place where I can think it might even be measurable is in routines to clear
> system memory, and those routines could be a special case anyway.
>

I guess this depends on the use case. For server, it may not matter,
but the case is different for mobile, and the Broadcom engineers who
did some benchmarks on this code were very pleased with the result
(the speedup was significant, although I don't know which routines
are the hotspots).

As for option ROMs: those will link to their own BaseMemoryLib
implementation (assuming that they are EDK2 based) so the only way
they would have access to these routines is via the CopyMem() and
SetMem() boot services. Note that this does not dismiss the concern at
all; it is just a clarification.

Leif, any thoughts?



* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-05 21:55         ` Ard Biesheuvel
@ 2017-04-06  9:35           ` Leif Lindholm
  2017-04-06  9:43             ` Ard Biesheuvel
  0 siblings, 1 reply; 14+ messages in thread
From: Leif Lindholm @ 2017-04-06  9:35 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jeremy Linton, edk2-devel@lists.01.org, Gao, Liming,
	Kinney, Michael D, Charles Garcia-Tobin, Dong Wei, Evan Lloyd

On Wed, Apr 05, 2017 at 10:55:49PM +0100, Ard Biesheuvel wrote:
> >>> I think this is a problem because nowhere in the UEFI specs do I see such
> >>> restrictions on those memory operations.
> >>
> >> Using device attributes for memory is something we should ban for
> >> AArch64 in the spec.

Yes, completely agree. And doing so is generally the result of
misunderstanding the memory model (i.e., it probably won't provide the
guarantee that was sought).
Charles/Dong? Something to add to the list?

Can we insert a test preventing the device memory type from being set for
regions with the _WB attribute? Or is that already only possible through
manual trickery?

> >>> For a specific problematic example, the LcdGraphicsOutputBlt.c uses it
> >>> for
> >>> BltVideoFill() and the target of that is likely not regular cached video
> >>> memory.
> >>
> >> Those drivers should be using EFI_MEMORY_WC not EFI_MEMORY_UC for the
> >> VRAM mapping. Note that EFI_MEMORY_UC is nGnRnE which is unnecessarily
> >> restrictive.
> >>
> >> I agree there is a general issue here which we should address by
> >> tightening the spec. I don't see a lot of value in avoiding DC ZVA and
> >> unaligned accesses altogether, I'd rather fix the code instead.
> >
> > While I agree with the general sentiment, I find the result brittle. If it
> > were used as a DEBUG build way to locate sub-optimal code I would be more
> > on board. But shipping it like this puts it into situations where the user
> > inadvertently changes something (say making the background black and
> > therefore triggering the DC) or some obscure option ROM (we will get there
> > right??!!) triggers it in a place where it can't be debugged.
> >
> > Particularly since we are talking boot, where the few percent perf
> > improvement on this operation is likely completely undetectable. The one
> > place where I can think it might even be measurable is in routines to clear
> > system memory, and those routines could be a special case anyway.
> 
> I guess this depends on the use case. For server, it may not matter,
> but the case is different for mobile, and the Broadcom engineers that
> did some benchmarks on this code were very pleased with the result
> (and the speedup was significant, although I don't know which routines
> are the hotspots)
> 
> As for option ROMs: those will link to their own BaseMemoryLib
> implementation (assuming that they are EDK2 based) so the only way
> they would have access to these routines is via the CopyMem() and
> SetMem() boot services. Note that that does not dismiss the concern at
> all, it is just a clarification.
>
> Leif, any thoughts?

I would prefer if we could resolve this without waiting for a new spec
version.

My gut feeling is that the (end-user, I care a _lot_ less
about development platforms) devices that _could_ be affected by this
won't be releasing updated firmwares completely rebased onto a newer
edk2 HEAD. Rather they're likely to be cherry-picking individual
bugfixes and improvements.

But certainly having some input from the abovementioned Broadcom team,
Evan & co, and others is important before we make a call.

/
    Leif



* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-06  9:35           ` Leif Lindholm
@ 2017-04-06  9:43             ` Ard Biesheuvel
  2017-04-06 10:16               ` Leif Lindholm
  0 siblings, 1 reply; 14+ messages in thread
From: Ard Biesheuvel @ 2017-04-06  9:43 UTC (permalink / raw)
  To: Leif Lindholm
  Cc: Jeremy Linton, edk2-devel@lists.01.org, Gao, Liming,
	Kinney, Michael D, Charles Garcia-Tobin, Dong Wei, Evan Lloyd

On 6 April 2017 at 10:35, Leif Lindholm <leif.lindholm@linaro.org> wrote:
> On Wed, Apr 05, 2017 at 10:55:49PM +0100, Ard Biesheuvel wrote:
>> >>> I think this is a problem because nowhere in the UEFI specs do I see such
>> >>> restrictions on those memory operations.
>> >>
>> >> Using device attributes for memory is something we should ban for
>> >> AArch64 in the spec.
>
> Yes, completely agree. And doing so is generally the result of
> misunderstanding the memory model (i.e., it probably won't provide the
> guarantee that was sought).
> Charles/Dong? Something to add to list?
>

As an additional note, the UEFI spec mandates that unaligned accesses
are enabled for AArch64, which clearly expresses the intent that
routines operating on memory should be able to do so without going out
of their way to avoid unaligned accesses.

> Can we insert a test preventing device memory type to be set for
> regions with _WB attribute? Or is that already only possible through
> manual trickery?
>

We should simply remove the _UC attribute from all memory. I have
already done so for many of the platforms I more or less maintain (and
for virt, we removed _WT and _WC as well, because KVM only supports
_WB).

Note that this does not prevent the NOR and RTC drivers from creating
_UC regions for their own MMIO registers, it just prevents them from
being remapped _UC via the DXE services.
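
As a simplified model of what that means (a sketch, not the actual DXE
core code): each region carries a capabilities mask, and a request to set
an attribute that is not in the mask is refused. REGION, STATUS and
SetRegionAttributes() are hypothetical names used only for this
illustration.

```c
#include <stdint.h>

/* Cacheability attribute bits as defined by the UEFI specification. */
#define EFI_MEMORY_UC  0x0000000000000001ULL
#define EFI_MEMORY_WC  0x0000000000000002ULL
#define EFI_MEMORY_WT  0x0000000000000004ULL
#define EFI_MEMORY_WB  0x0000000000000008ULL

/* Hypothetical, simplified stand-ins for the real status and descriptor. */
typedef enum {
  STATUS_SUCCESS,
  STATUS_UNSUPPORTED
} STATUS;

typedef struct {
  uint64_t Capabilities;  /* attributes the platform allows for the region */
  uint64_t Attributes;    /* attributes currently in effect */
} REGION;

/*
 * Refuse any attribute that is not listed in the region's capabilities.
 * If the platform does not declare _UC for DRAM, DRAM can never be
 * remapped as device memory through this interface.
 */
STATUS
SetRegionAttributes (REGION *Region, uint64_t Attributes)
{
  if ((Attributes & ~Region->Capabilities) != 0) {
    return STATUS_UNSUPPORTED;
  }

  Region->Attributes = Attributes;
  return STATUS_SUCCESS;
}
```

A DRAM region declared with only _WC and _WB capabilities would then
reject a _UC request and keep its current attributes.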

>> >>> For a specific problematic example, the LcdGraphicsOutputBlt.c uses it
>> >>> for
>> >>> BltVideoFill() and the target of that is likely not regular cached video
>> >>> memory.
>> >>
>> >> Those drivers should be using EFI_MEMORY_WC not EFI_MEMORY_UC for the
>> >> VRAM mapping. Note that EFI_MEMORY_UC is nGnRnE which is unnecessarily
>> >> restrictive.
>> >>
>> >> I agree there is a general issue here which we should address by
>> >> tightening the spec. I don't see a lot of value in avoiding DC ZVA and
>> >> unaligned accesses altogether, I'd rather fix the code instead.
>> >
>> > While I agree with the general sentiment, I find the result brittle. If it
>> > were used as a DEBUG build way to locate sub-optimal code I would be more
>> > on board. But shipping it like this puts it into situations where the user
>> > inadvertently changes something (say making the background black and
>> > therefore triggering the DC) or some obscure option ROM (we will get there
>> > right??!!) triggers it in a place where it can't be debugged.
>> >
>> > Particularly since we are talking boot, where the few percent perf
>> > improvement on this operation is likely completely undetectable. The one
>> > place where I can think it might even be measurable is in routines to clear
>> > system memory, and those routines could be a special case anyway.
>>
>> I guess this depends on the use case. For server, it may not matter,
>> but the case is different for mobile, and the Broadcom engineers that
>> did some benchmarks on this code were very pleased with the result
>> (and the speedup was significant, although I don't know which routines
>> are the hotspots)
>>
>> As for option ROMs: those will link to their own BaseMemoryLib
>> implementation (assuming that they are EDK2 based) so the only way
>> they would have access to these routines is via the CopyMem() and
>> SetMem() boot services. Note that that does not dismiss the concern at
>> all, it is just a clarification.
>>
>> Leif, any thoughts?
>
> I would prefer if we could resolve this without waiting for a new spec
> version.
>
> My gut feeling is that the (end-user, I care a _lot_ less
> about development platforms) devices that _could_ be affected by this
> won't be releasing updated firmwares completely rebased onto a newer
> edk2 HEAD. Rather they're likely to be cherry-picking individual
> bugfixes and improvements.
>
> But certainly having some input from abovementioned Broadcom team,
> Evan & co, and others is important before we make a call.
>
> /
>     Leif



* Re: [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases
  2017-04-06  9:43             ` Ard Biesheuvel
@ 2017-04-06 10:16               ` Leif Lindholm
  0 siblings, 0 replies; 14+ messages in thread
From: Leif Lindholm @ 2017-04-06 10:16 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Jeremy Linton, edk2-devel@lists.01.org, Gao, Liming,
	Kinney, Michael D, Charles Garcia-Tobin, Dong Wei, Evan Lloyd

On Thu, Apr 06, 2017 at 10:43:57AM +0100, Ard Biesheuvel wrote:
> On 6 April 2017 at 10:35, Leif Lindholm <leif.lindholm@linaro.org> wrote:
> > On Wed, Apr 05, 2017 at 10:55:49PM +0100, Ard Biesheuvel wrote:
> >> >>> I think this is a problem because nowhere in the UEFI specs do I see such
> >> >>> restrictions on those memory operations.
> >> >>
> >> >> Using device attributes for memory is something we should ban for
> >> >> AArch64 in the spec.
> >
> > Yes, completely agree. And doing so is generally the result of
> > misunderstanding the memory model (i.e., it probably won't provide the
> > guarantee that was sought).
> > Charles/Dong? Something to add to list?
> 
> As an additional note, the UEFI spec mandates that unaligned accesses
> are enabled for AArch64, which clearly expresses the intent that
> routines operating on memory should be able to do so without going out
> of their way to avoid unaligned accesses.

It does - but only if you already understand the memory model.

> > Can we insert a test preventing device memory type to be set for
> > regions with _WB attribute? Or is that already only possible through
> > manual trickery?
> 
> We should simply remove the _UC attribute from all memory. I have
> already done so for many of the platforms I more or less maintain (and
> for virt, we removed _WT and _WC as well, because KVM only supports
> _WB)

Agreed.

> Note that this does not prevent the NOR and RTC drivers from creating
> _UC regions for their own MMIO registers, it just prevents them from
> being remapped _UC via the DXE services.

OK, good.

/
    Leif



end of thread, other threads:[~2017-04-06 10:16 UTC | newest]

Thread overview: 14+ messages
2016-09-09 14:00 [PATCH v5 0/4] MdePkg: add ARM/AARCH64 support to BaseMemoryLib Ard Biesheuvel
2016-09-09 14:00 ` [PATCH v5 1/4] MdePkg/BaseMemoryLib: widen aligned accesses to 32 or 64 bits Ard Biesheuvel
2016-09-09 14:00 ` [PATCH v5 2/4] MdePkg/BaseMemoryLibOptDxe: add accelerated ARM routines Ard Biesheuvel
2016-09-09 14:00 ` [PATCH v5 3/4] MdePkg/BaseMemoryLibOptDxe: add accelerated AARCH64 routines Ard Biesheuvel
2016-09-09 14:00 ` [PATCH v5 4/4] MdePkg/BaseMemoryLibOptDxe ARM|AARCH64: disallow use in SEC & PEI phases Ard Biesheuvel
2016-09-13 14:49   ` Ard Biesheuvel
2016-09-13 15:00     ` Gao, Liming
2017-04-05 20:12   ` Jeremy Linton
2017-04-05 20:34     ` Ard Biesheuvel
2017-04-05 21:28       ` Jeremy Linton
2017-04-05 21:55         ` Ard Biesheuvel
2017-04-06  9:35           ` Leif Lindholm
2017-04-06  9:43             ` Ard Biesheuvel
2017-04-06 10:16               ` Leif Lindholm
