From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.byosoft.com.cn (mail.byosoft.com.cn [58.240.74.242]) by mx.groups.io with SMTP id smtpd.web10.3691.1602559527564721027 for ; Mon, 12 Oct 2020 20:25:28 -0700 Authentication-Results: mx.groups.io; dkim=missing; spf=none, err=permanent DNS error (domain: byosoft.com.cn, ip: 58.240.74.242, mailfrom: gaoliming@byosoft.com.cn) Received: from DESKTOPS6D0PVI ([58.246.60.130]) (envelope-sender ) by 192.168.6.13 with ESMTP for ; Tue, 13 Oct 2020 11:25:24 +0800 X-WM-Sender: gaoliming@byosoft.com.cn X-WM-AuthFlag: YES X-WM-AuthUser: gaoliming@byosoft.com.cn From: "gaoliming" To: , References: <87v9fj6shx.fsf@jcompost-mobl.ch.intel.com> In-Reply-To: <87v9fj6shx.fsf@jcompost-mobl.ch.intel.com> Subject: =?UTF-8?B?5Zue5aSNOiBbZWRrMi1kZXZlbF0gW1BBVENIXSBCYXNlTWVtb3J5TGliU3NlMjogVGFrZSBhZHZhbnRhZ2Ugb2Ygd3JpdGUgY29tYmluaW5nIGJ1ZmZlcnM=?= Date: Tue, 13 Oct 2020 11:25:25 +0800 Message-ID: <006801d6a110$7e0d5540$7a27ffc0$@byosoft.com.cn> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQJbqOYgGlNXEA2jnZ7h0FjHjLBe3KiKzuVA Content-Type: text/plain; charset="gb2312" Content-Transfer-Encoding: quoted-printable Content-Language: zh-cn This is a good enhancement. The change is good to me.=20 Reviewed-by: Liming Gao Thanks Liming > -----=D3=CA=BC=FE=D4=AD=BC=FE----- > =B7=A2=BC=FE=C8=CB: bounce+27952+66093+4905953+8761045@groups.io > =B4=FA=B1=ED Compostella, > Jeremy > =B7=A2=CB=CD=CA=B1=BC=E4: 2020=C4=EA10=D4=C210=C8=D5 4:43 > =CA=D5=BC=FE=C8=CB: devel@edk2.groups.io > =D6=F7=CC=E2: [edk2-devel] [PATCH] BaseMemoryLibSse2: Take advantage of = write > combining buffers >=20 > The current SSE2 implementation of the ZeroMem(), SetMem(), > SetMem16(), SetMem32 and SetMem64 functions is writing 16 bytes per 16 > bytes. It hurts the performances so bad that this is even slower than > a simple 'rep stos' (4% slower) in regular DRAM. >=20 > To take full advantages of the 'movntdq' instruction it is better to > "queue" a total of 64 bytes in the write combining buffers. This > patch implement such a change. Below is a table where I measured > (with 'rdtsc') the time to write an entire 100MB RAM buffer. These > functions operate almost two times faster. >=20 > | Function | Arch | Untouched | 64 bytes | Result | > |----------+------+-----------+----------+--------| > | ZeroMem | Ia32 | 17765947 | 9136062 | 1.945x | > | ZeroMem | X64 | 17525170 | 9233391 | 1.898x | > | SetMem | Ia32 | 17522291 | 9137272 | 1.918x | > | SetMem | X64 | 17949261 | 9176978 | 1.956x | > | SetMem16 | Ia32 | 18219673 | 9372062 | 1.944x | > | SetMem16 | X64 | 17523331 | 9275184 | 1.889x | > | SetMem32 | Ia32 | 18495036 | 9273053 | 1.994x | > | SetMem32 | X64 | 17368864 | 9285885 | 1.870x | > | SetMem64 | Ia32 | 18564473 | 9241362 | 2.009x | > | SetMem64 | X64 | 17506951 | 9280148 | 1.886x | >=20 > Signed-off-by: Jeremy Compostella > --- > .../BaseMemoryLibSse2/Ia32/SetMem.nasm | 11 ++++++---- > .../BaseMemoryLibSse2/Ia32/SetMem16.nasm | 11 ++++++---- > .../BaseMemoryLibSse2/Ia32/SetMem32.nasm | 9 ++++++--- > .../BaseMemoryLibSse2/Ia32/SetMem64.nasm | 20 > +++++++++++++++---- > .../BaseMemoryLibSse2/Ia32/ZeroMem.nasm | 11 ++++++---- > .../Library/BaseMemoryLibSse2/X64/SetMem.nasm | 9 ++++++--- > .../BaseMemoryLibSse2/X64/SetMem16.nasm | 11 ++++++---- > .../BaseMemoryLibSse2/X64/SetMem32.nasm | 9 ++++++--- > .../BaseMemoryLibSse2/X64/SetMem64.nasm | 19 > ++++++++++++++---- > .../BaseMemoryLibSse2/X64/ZeroMem.nasm | 13 +++++++----- > 10 files changed, 85 insertions(+), 38 deletions(-) >=20 > diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > index 24313cb4b3..a8744300c6 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > @@ -34,7 +34,7 @@ ASM_PFX(InternalMemSetMem): > mov al, [esp + 16] ; al <- Value > xor ecx, ecx > sub ecx, edi > - and ecx, 15 ; ecx + edi aligns on 16-byte > boundary > + and ecx, 63 ; ecx + edi aligns on 16-byte > boundary > jz .0 > cmp ecx, edx > cmova ecx, edx > @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem): > rep stosb > .0: > mov ecx, edx > - and edx, 15 > - shr ecx, 4 ; ecx <- # of DQwords to set > + and edx, 63 > + shr ecx, 6 ; ecx <- # of DQwords to set > jz @SetBytes > mov ah, al ; ax <- Value | (Value << 8) > add esp, -16 > @@ -53,7 +53,10 @@ ASM_PFX(InternalMemSetMem): > movlhps xmm0, xmm0 ; xmm0 <- Value repeats > 16 times > .1: > movntdq [edi], xmm0 ; edi should be 16-byte > aligned > - add edi, 16 > + movntdq [edi + 16], xmm0 > + movntdq [edi + 32], xmm0 > + movntdq [edi + 48], xmm0 > + add edi, 64 > loop .1 > mfence > movdqu xmm0, [esp] ; restore xmm0 > diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > index 6e308b5594..d461ee086c 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16): > mov edi, [esp + 8] > xor ecx, ecx > sub ecx, edi > - and ecx, 15 ; ecx + edi aligns on 16-byte > boundary > + and ecx, 63 ; ecx + edi aligns on 16-byte > boundary > mov eax, [esp + 16] > jz .0 > shr ecx, 1 > @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16): > rep stosw > .0: > mov ecx, edx > - and edx, 7 > - shr ecx, 3 > + and edx, 31 > + shr ecx, 5 > jz @SetWords > movd xmm0, eax > pshuflw xmm0, xmm0, 0 > movlhps xmm0, xmm0 > .1: > movntdq [edi], xmm0 ; edi should be 16-byte > aligned > - add edi, 16 > + movntdq [edi + 16], xmm0 > + movntdq [edi + 32], xmm0 > + movntdq [edi + 48], xmm0 > + add edi, 64 > loop .1 > mfence > @SetWords: > diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > index 2cfc8cb0dd..3ffdcd07d7 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32): > rep stosd > .0: > mov ecx, edx > - and edx, 3 > - shr ecx, 2 > + and edx, 15 > + shr ecx, 4 > jz @SetDwords > movd xmm0, eax > pshufd xmm0, xmm0, 0 > .1: > movntdq [edi], xmm0 > - add edi, 16 > + movntdq [edi + 16], xmm0 > + movntdq [edi + 32], xmm0 > + movntdq [edi + 48], xmm0 > + add edi, 64 > loop .1 > mfence > @SetDwords: > diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > index e153495a68..cd000648ae 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > @@ -38,17 +38,29 @@ ASM_PFX(InternalMemSetMem64): > add edx, 8 > dec ecx > .0: > - shr ecx, 1 > + push ebx > + mov ebx, ecx > + and ebx, 7 > + shr ecx, 3 > jz @SetQwords > movlhps xmm0, xmm0 > .1: > movntdq [edx], xmm0 > - lea edx, [edx + 16] > + movntdq [edx + 16], xmm0 > + movntdq [edx + 32], xmm0 > + movntdq [edx + 48], xmm0 > + lea edx, [edx + 64] > loop .1 > mfence > @SetQwords: > - jnc .2 > + test ebx, ebx > + jz .3 > + mov ecx, ebx > +.2 > movq qword [edx], xmm0 > -.2: > + lea edx, [edx + 8] > + loop .2 > +.3: > + pop ebx > ret >=20 > diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > index cd34006f59..0e0828551b 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > @@ -33,7 +33,7 @@ ASM_PFX(InternalMemZeroMem): > xor ecx, ecx > sub ecx, edi > xor eax, eax > - and ecx, 15 > + and ecx, 63 > jz .0 > cmp ecx, edx > cmova ecx, edx > @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem): > rep stosb > .0: > mov ecx, edx > - and edx, 15 > - shr ecx, 4 > + and edx, 63 > + shr ecx, 6 > jz @ZeroBytes > pxor xmm0, xmm0 > .1: > movntdq [edi], xmm0 > - add edi, 16 > + movntdq [edi + 16], xmm0 > + movntdq [edi + 32], xmm0 > + movntdq [edi + 48], xmm0 > + add edi, 64 > loop .1 > mfence > @ZeroBytes: > diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > index 5bd1c2262d..28b11ee586 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem): > rep stosb > .0: > mov rcx, rdx > - and rdx, 15 > - shr rcx, 4 > + and rdx, 63 > + shr rcx, 6 > jz @SetBytes > mov ah, al ; ax <- Value repeats twice > movdqa [rsp + 0x10], xmm0 ; save xmm0 > @@ -52,7 +52,10 @@ ASM_PFX(InternalMemSetMem): > movlhps xmm0, xmm0 ; xmm0 <- Value repeats > 16 times > .1: > movntdq [rdi], xmm0 ; rdi should be 16-byte > aligned > - add rdi, 16 > + movntdq [rdi + 16], xmm0 > + movntdq [rdi + 32], xmm0 > + movntdq [rdi + 48], xmm0 > + add rdi, 64 > loop .1 > mfence > movdqa xmm0, [rsp + 0x10] ; restore xmm0 > diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > index 90d159820a..375be19313 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16): > mov r9, rdi > xor rcx, rcx > sub rcx, rdi > - and rcx, 15 > + and rcx, 63 > mov rax, r8 > jz .0 > shr rcx, 1 > @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16): > rep stosw > .0: > mov rcx, rdx > - and edx, 7 > - shr rcx, 3 > + and edx, 31 > + shr rcx, 5 > jz @SetWords > movd xmm0, eax > pshuflw xmm0, xmm0, 0 > movlhps xmm0, xmm0 > .1: > movntdq [rdi], xmm0 > - add rdi, 16 > + movntdq [rdi + 16], xmm0 > + movntdq [rdi + 32], xmm0 > + movntdq [rdi + 48], xmm0 > + add rdi, 64 > loop .1 > mfence > @SetWords: > diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > index 928e086889..5d12beaa9a 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32): > rep stosd > .0: > mov rcx, rdx > - and edx, 3 > - shr rcx, 2 > + and edx, 15 > + shr rcx, 4 > jz @SetDwords > movd xmm0, eax > pshufd xmm0, xmm0, 0 > .1: > movntdq [rdi], xmm0 > - add rdi, 16 > + movntdq [rdi + 16], xmm0 > + movntdq [rdi + 32], xmm0 > + movntdq [rdi + 48], xmm0 > + add rdi, 64 > loop .1 > mfence > @SetDwords: > diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > index d771810542..265983d5ad 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > @@ -37,17 +37,28 @@ ASM_PFX(InternalMemSetMem64): > add rdx, 8 > dec rcx > .0: > - shr rcx, 1 > + push rbx > + mov rbx, rcx > + and rbx, 7 > + shr rcx, 3 > jz @SetQwords > movlhps xmm0, xmm0 > .1: > movntdq [rdx], xmm0 > - lea rdx, [rdx + 16] > + movntdq [rdx + 16], xmm0 > + movntdq [rdx + 32], xmm0 > + movntdq [rdx + 48], xmm0 > + lea rdx, [rdx + 64] > loop .1 > mfence > @SetQwords: > - jnc .2 > - mov [rdx], r8 > + push rdi > + mov rcx, rbx > + mov rax, r8 > + mov rdi, rdx > + rep stosq > + pop rdi > .2: > + pop rbx > ret >=20 > diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > index 5ddcae9ca5..21f504e3b7 100644 > --- a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > @@ -32,7 +32,7 @@ ASM_PFX(InternalMemZeroMem): > xor rcx, rcx > xor eax, eax > sub rcx, rdi > - and rcx, 15 > + and rcx, 63 > mov r8, rdi > jz .0 > cmp rcx, rdx > @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem): > rep stosb > .0: > mov rcx, rdx > - and edx, 15 > - shr rcx, 4 > + and edx, 63 > + shr rcx, 6 > jz @ZeroBytes > pxor xmm0, xmm0 > .1: > - movntdq [rdi], xmm0 ; rdi should be 16-byte > aligned > - add rdi, 16 > + movntdq [rdi], xmm0 > + movntdq [rdi + 16], xmm0 > + movntdq [rdi + 32], xmm0 > + movntdq [rdi + 48], xmm0 > + add rdi, 64 > loop .1 > mfence > @ZeroBytes: > -- > 2.25.3 >=20 >=20 >=20 >=20 >=20