public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: "Compostella, Jeremy" <jeremy.compostella@intel.com>
To: gaoliming <gaoliming@byosoft.com.cn>, <devel@edk2.groups.io>
Subject: Re: 回复: [edk2-devel] [PATCH] BaseMemoryLibSse2: Take advantage of write combining buffers
Date: Thu, 15 Oct 2020 10:47:49 -0700	[thread overview]
Message-ID: <87ft6f5qka.fsf@jcompost-mobl.ch.intel.com> (raw)
In-Reply-To: <006801d6a110$7e0d5540$7a27ffc0$@byosoft.com.cn> (gaoliming@byosoft.com.cn's message of "Tue, 13 Oct 2020 11:25:25 +0800")

Thank for the review Liming.  Is there anything else I should to help
with the review/merge process ?

Regards,
Jeremy

gaoliming <gaoliming@byosoft.com.cn> writes:

> This is a good enhancement. The change is good to me. 
>
> Reviewed-by: Liming Gao <gaoliming@byosoft.com.cn>
>
> Thanks
> Liming
>> -----邮件原件-----
>> 发件人: bounce+27952+66093+4905953+8761045@groups.io
>> <bounce+27952+66093+4905953+8761045@groups.io> 代表 Compostella,
>> Jeremy
>> 发送时间: 2020年10月10日 4:43
>> 收件人: devel@edk2.groups.io
>> 主题: [edk2-devel] [PATCH] BaseMemoryLibSse2: Take advantage of write
>> combining buffers
>> 
>> The current SSE2 implementation of the ZeroMem(), SetMem(),
>> SetMem16(), SetMem32 and SetMem64 functions is writing 16 bytes per 16
>> bytes. It hurts the performances so bad that this is even slower than
>> a simple 'rep stos' (4% slower) in regular DRAM.
>> 
>> To take full advantages of the 'movntdq' instruction it is better to
>> "queue" a total of 64 bytes in the write combining buffers.  This
>> patch implement such a change.  Below is a table where I measured
>> (with 'rdtsc') the time to write an entire 100MB RAM buffer. These
>> functions operate almost two times faster.
>> 
>> | Function | Arch | Untouched | 64 bytes | Result |
>> |----------+------+-----------+----------+--------|
>> | ZeroMem  | Ia32 |  17765947 |  9136062 | 1.945x |
>> | ZeroMem  | X64  |  17525170 |  9233391 | 1.898x |
>> | SetMem   | Ia32 |  17522291 |  9137272 | 1.918x |
>> | SetMem   | X64  |  17949261 |  9176978 | 1.956x |
>> | SetMem16 | Ia32 |  18219673 |  9372062 | 1.944x |
>> | SetMem16 | X64  |  17523331 |  9275184 | 1.889x |
>> | SetMem32 | Ia32 |  18495036 |  9273053 | 1.994x |
>> | SetMem32 | X64  |  17368864 |  9285885 | 1.870x |
>> | SetMem64 | Ia32 |  18564473 |  9241362 | 2.009x |
>> | SetMem64 | X64  |  17506951 |  9280148 | 1.886x |
>> 
>> Signed-off-by: Jeremy Compostella <jeremy.compostella@intel.com>
>> ---
>>  .../BaseMemoryLibSse2/Ia32/SetMem.nasm        | 11 ++++++----
>>  .../BaseMemoryLibSse2/Ia32/SetMem16.nasm      | 11 ++++++----
>>  .../BaseMemoryLibSse2/Ia32/SetMem32.nasm      |  9 ++++++---
>>  .../BaseMemoryLibSse2/Ia32/SetMem64.nasm      | 20
>> +++++++++++++++----
>>  .../BaseMemoryLibSse2/Ia32/ZeroMem.nasm       | 11 ++++++----
>>  .../Library/BaseMemoryLibSse2/X64/SetMem.nasm |  9 ++++++---
>>  .../BaseMemoryLibSse2/X64/SetMem16.nasm       | 11 ++++++----
>>  .../BaseMemoryLibSse2/X64/SetMem32.nasm       |  9 ++++++---
>>  .../BaseMemoryLibSse2/X64/SetMem64.nasm       | 19
>> ++++++++++++++----
>>  .../BaseMemoryLibSse2/X64/ZeroMem.nasm        | 13 +++++++-----
>>  10 files changed, 85 insertions(+), 38 deletions(-)
>> 
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm
>> index 24313cb4b3..a8744300c6 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm
>> @@ -34,7 +34,7 @@ ASM_PFX(InternalMemSetMem):
>>      mov     al, [esp + 16]              ; al <- Value
>>      xor     ecx, ecx
>>      sub     ecx, edi
>> -    and     ecx, 15                     ; ecx + edi aligns on 16-byte
>> boundary
>> +    and     ecx, 63                     ; ecx + edi aligns on 16-byte
>> boundary
>>      jz      .0
>>      cmp     ecx, edx
>>      cmova   ecx, edx
>> @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem):
>>      rep     stosb
>>  .0:
>>      mov     ecx, edx
>> -    and     edx, 15
>> -    shr     ecx, 4                      ; ecx <- # of DQwords to set
>> +    and     edx, 63
>> +    shr     ecx, 6                      ; ecx <- # of DQwords to set
>>      jz      @SetBytes
>>      mov     ah, al                      ; ax <- Value | (Value << 8)
>>      add     esp, -16
>> @@ -53,7 +53,10 @@ ASM_PFX(InternalMemSetMem):
>>      movlhps xmm0, xmm0                  ; xmm0 <- Value repeats
>> 16 times
>>  .1:
>>      movntdq [edi], xmm0                 ; edi should be 16-byte
>> aligned
>> -    add     edi, 16
>> +    movntdq [edi + 16], xmm0
>> +    movntdq [edi + 32], xmm0
>> +    movntdq [edi + 48], xmm0
>> +    add     edi, 64
>>      loop    .1
>>      mfence
>>      movdqu  xmm0, [esp]                 ; restore xmm0
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm
>> index 6e308b5594..d461ee086c 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm
>> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16):
>>      mov     edi, [esp + 8]
>>      xor     ecx, ecx
>>      sub     ecx, edi
>> -    and     ecx, 15                     ; ecx + edi aligns on 16-byte
>> boundary
>> +    and     ecx, 63                     ; ecx + edi aligns on 16-byte
>> boundary
>>      mov     eax, [esp + 16]
>>      jz      .0
>>      shr     ecx, 1
>> @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16):
>>      rep     stosw
>>  .0:
>>      mov     ecx, edx
>> -    and     edx, 7
>> -    shr     ecx, 3
>> +    and     edx, 31
>> +    shr     ecx, 5
>>      jz      @SetWords
>>      movd    xmm0, eax
>>      pshuflw xmm0, xmm0, 0
>>      movlhps xmm0, xmm0
>>  .1:
>>      movntdq [edi], xmm0                 ; edi should be 16-byte
>> aligned
>> -    add     edi, 16
>> +    movntdq [edi + 16], xmm0
>> +    movntdq [edi + 32], xmm0
>> +    movntdq [edi + 48], xmm0
>> +    add     edi, 64
>>      loop    .1
>>      mfence
>>  @SetWords:
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm
>> index 2cfc8cb0dd..3ffdcd07d7 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm
>> @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32):
>>      rep     stosd
>>  .0:
>>      mov     ecx, edx
>> -    and     edx, 3
>> -    shr     ecx, 2
>> +    and     edx, 15
>> +    shr     ecx, 4
>>      jz      @SetDwords
>>      movd    xmm0, eax
>>      pshufd  xmm0, xmm0, 0
>>  .1:
>>      movntdq [edi], xmm0
>> -    add     edi, 16
>> +    movntdq [edi + 16], xmm0
>> +    movntdq [edi + 32], xmm0
>> +    movntdq [edi + 48], xmm0
>> +    add     edi, 64
>>      loop    .1
>>      mfence
>>  @SetDwords:
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm
>> index e153495a68..cd000648ae 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm
>> @@ -38,17 +38,29 @@ ASM_PFX(InternalMemSetMem64):
>>      add     edx, 8
>>      dec     ecx
>>  .0:
>> -    shr     ecx, 1
>> +    push    ebx
>> +    mov     ebx, ecx
>> +    and     ebx, 7
>> +    shr     ecx, 3
>>      jz      @SetQwords
>>      movlhps xmm0, xmm0
>>  .1:
>>      movntdq [edx], xmm0
>> -    lea     edx, [edx + 16]
>> +    movntdq [edx + 16], xmm0
>> +    movntdq [edx + 32], xmm0
>> +    movntdq [edx + 48], xmm0
>> +    lea     edx, [edx + 64]
>>      loop    .1
>>      mfence
>>  @SetQwords:
>> -    jnc     .2
>> +    test    ebx, ebx
>> +    jz .3
>> +    mov     ecx, ebx
>> +.2
>>      movq    qword [edx], xmm0
>> -.2:
>> +    lea     edx, [edx + 8]
>> +    loop    .2
>> +.3:
>> +    pop ebx
>>      ret
>> 
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm
>> index cd34006f59..0e0828551b 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm
>> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemZeroMem):
>>      xor     ecx, ecx
>>      sub     ecx, edi
>>      xor     eax, eax
>> -    and     ecx, 15
>> +    and     ecx, 63
>>      jz      .0
>>      cmp     ecx, edx
>>      cmova   ecx, edx
>> @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem):
>>      rep     stosb
>>  .0:
>>      mov     ecx, edx
>> -    and     edx, 15
>> -    shr     ecx, 4
>> +    and     edx, 63
>> +    shr     ecx, 6
>>      jz      @ZeroBytes
>>      pxor    xmm0, xmm0
>>  .1:
>>      movntdq [edi], xmm0
>> -    add     edi, 16
>> +    movntdq [edi + 16], xmm0
>> +    movntdq [edi + 32], xmm0
>> +    movntdq [edi + 48], xmm0
>> +    add     edi, 64
>>      loop    .1
>>      mfence
>>  @ZeroBytes:
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm
>> index 5bd1c2262d..28b11ee586 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm
>> @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem):
>>      rep     stosb
>>  .0:
>>      mov     rcx, rdx
>> -    and     rdx, 15
>> -    shr     rcx, 4
>> +    and     rdx, 63
>> +    shr     rcx, 6
>>      jz      @SetBytes
>>      mov     ah, al                      ; ax <- Value repeats twice
>>      movdqa  [rsp + 0x10], xmm0           ; save xmm0
>> @@ -52,7 +52,10 @@ ASM_PFX(InternalMemSetMem):
>>      movlhps xmm0, xmm0                  ; xmm0 <- Value repeats
>> 16 times
>>  .1:
>>      movntdq [rdi], xmm0                 ; rdi should be 16-byte
>> aligned
>> -    add     rdi, 16
>> +    movntdq [rdi + 16], xmm0
>> +    movntdq [rdi + 32], xmm0
>> +    movntdq [rdi + 48], xmm0
>> +    add     rdi, 64
>>      loop    .1
>>      mfence
>>      movdqa  xmm0, [rsp + 0x10]           ; restore xmm0
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm
>> index 90d159820a..375be19313 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm
>> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16):
>>      mov     r9, rdi
>>      xor     rcx, rcx
>>      sub     rcx, rdi
>> -    and     rcx, 15
>> +    and     rcx, 63
>>      mov     rax, r8
>>      jz      .0
>>      shr     rcx, 1
>> @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16):
>>      rep     stosw
>>  .0:
>>      mov     rcx, rdx
>> -    and     edx, 7
>> -    shr     rcx, 3
>> +    and     edx, 31
>> +    shr     rcx, 5
>>      jz      @SetWords
>>      movd    xmm0, eax
>>      pshuflw xmm0, xmm0, 0
>>      movlhps xmm0, xmm0
>>  .1:
>>      movntdq [rdi], xmm0
>> -    add     rdi, 16
>> +    movntdq [rdi + 16], xmm0
>> +    movntdq [rdi + 32], xmm0
>> +    movntdq [rdi + 48], xmm0
>> +    add     rdi, 64
>>      loop    .1
>>      mfence
>>  @SetWords:
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm
>> index 928e086889..5d12beaa9a 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm
>> @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32):
>>      rep     stosd
>>  .0:
>>      mov     rcx, rdx
>> -    and     edx, 3
>> -    shr     rcx, 2
>> +    and     edx, 15
>> +    shr     rcx, 4
>>      jz      @SetDwords
>>      movd    xmm0, eax
>>      pshufd  xmm0, xmm0, 0
>>  .1:
>>      movntdq [rdi], xmm0
>> -    add     rdi, 16
>> +    movntdq [rdi + 16], xmm0
>> +    movntdq [rdi + 32], xmm0
>> +    movntdq [rdi + 48], xmm0
>> +    add     rdi, 64
>>      loop    .1
>>      mfence
>>  @SetDwords:
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm
>> index d771810542..265983d5ad 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm
>> @@ -37,17 +37,28 @@ ASM_PFX(InternalMemSetMem64):
>>      add     rdx, 8
>>      dec     rcx
>>  .0:
>> -    shr     rcx, 1
>> +    push    rbx
>> +    mov     rbx, rcx
>> +    and     rbx, 7
>> +    shr     rcx, 3
>>      jz      @SetQwords
>>      movlhps xmm0, xmm0
>>  .1:
>>      movntdq [rdx], xmm0
>> -    lea     rdx, [rdx + 16]
>> +    movntdq [rdx + 16], xmm0
>> +    movntdq [rdx + 32], xmm0
>> +    movntdq [rdx + 48], xmm0
>> +    lea     rdx, [rdx + 64]
>>      loop    .1
>>      mfence
>>  @SetQwords:
>> -    jnc     .2
>> -    mov     [rdx], r8
>> +    push    rdi
>> +    mov     rcx, rbx
>> +    mov     rax, r8
>> +    mov     rdi, rdx
>> +    rep     stosq
>> +    pop     rdi
>>  .2:
>> +    pop rbx
>>      ret
>> 
>> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm
>> b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm
>> index 5ddcae9ca5..21f504e3b7 100644
>> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm
>> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm
>> @@ -32,7 +32,7 @@ ASM_PFX(InternalMemZeroMem):
>>      xor     rcx, rcx
>>      xor     eax, eax
>>      sub     rcx, rdi
>> -    and     rcx, 15
>> +    and     rcx, 63
>>      mov     r8, rdi
>>      jz      .0
>>      cmp     rcx, rdx
>> @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem):
>>      rep     stosb
>>  .0:
>>      mov     rcx, rdx
>> -    and     edx, 15
>> -    shr     rcx, 4
>> +    and     edx, 63
>> +    shr     rcx, 6
>>      jz      @ZeroBytes
>>      pxor    xmm0, xmm0
>>  .1:
>> -    movntdq [rdi], xmm0                 ; rdi should be 16-byte
>> aligned
>> -    add     rdi, 16
>> +    movntdq [rdi], xmm0
>> +    movntdq [rdi + 16], xmm0
>> +    movntdq [rdi + 32], xmm0
>> +    movntdq [rdi + 48], xmm0
>> +    add     rdi, 64
>>      loop    .1
>>      mfence
>>  @ZeroBytes:
>> --
>> 2.25.3
>> 
>> 
>> 
>> 
>> 

  reply	other threads:[~2020-10-15 17:47 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-09 20:42 [PATCH] BaseMemoryLibSse2: Take advantage of write combining buffers Compostella, Jeremy
2020-10-13  3:25 ` 回复: [edk2-devel] " gaoliming
2020-10-15 17:47   ` Compostella, Jeremy [this message]
2020-10-16  0:59     ` 回复: " gaoliming

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87ft6f5qka.fsf@jcompost-mobl.ch.intel.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox