From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail.byosoft.com.cn (mail.byosoft.com.cn [58.240.74.242]) by mx.groups.io with SMTP id smtpd.web10.4505.1602809987883286262 for ; Thu, 15 Oct 2020 17:59:49 -0700 Authentication-Results: mx.groups.io; dkim=missing; spf=none, err=permanent DNS error (domain: byosoft.com.cn, ip: 58.240.74.242, mailfrom: gaoliming@byosoft.com.cn) Received: from DESKTOPS6D0PVI ([58.246.60.130]) (envelope-sender ) by 192.168.6.13 with ESMTP for ; Fri, 16 Oct 2020 08:59:42 +0800 X-WM-Sender: gaoliming@byosoft.com.cn X-WM-AuthFlag: YES X-WM-AuthUser: gaoliming@byosoft.com.cn From: "gaoliming" To: , References: <87v9fj6shx.fsf@jcompost-mobl.ch.intel.com> <006801d6a110$7e0d5540$7a27ffc0$@byosoft.com.cn> <87ft6f5qka.fsf@jcompost-mobl.ch.intel.com> In-Reply-To: <87ft6f5qka.fsf@jcompost-mobl.ch.intel.com> Subject: =?UTF-8?B?5Zue5aSNOiDlm57lpI06IFtlZGsyLWRldmVsXSBbUEFUQ0hdIEJhc2VNZW1vcnlMaWJTc2UyOiBUYWtlIGFkdmFudGFnZSBvZiB3cml0ZSBjb21iaW5pbmcgYnVmZmVycw==?= Date: Fri, 16 Oct 2020 08:59:44 +0800 Message-ID: <000501d6a357$a355cdc0$ea016940$@byosoft.com.cn> MIME-Version: 1.0 X-Mailer: Microsoft Outlook 16.0 Thread-Index: AQJbqOYgGlNXEA2jnZ7h0FjHjLBe3AIWPVUMAQmqFsWodl4ckA== Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Content-Language: zh-cn I have created pull request for this patch set https://github.com/tianocore= /edk2/pull/1009 Thanks Liming > -----=E9=82=AE=E4=BB=B6=E5=8E=9F=E4=BB=B6----- > =E5=8F=91=E4=BB=B6=E4=BA=BA: bounce+27952+66282+4905953+8761045@groups.i= o > =E4=BB=A3=E8=A1=A8 Compos= tella, > Jeremy > =E5=8F=91=E9=80=81=E6=97=B6=E9=97=B4: 2020=E5=B9=B410=E6=9C=8816=E6=97= =A5 1:48 > =E6=94=B6=E4=BB=B6=E4=BA=BA: gaoliming ; devel= @edk2.groups.io > =E4=B8=BB=E9=A2=98: Re: =E5=9B=9E=E5=A4=8D: [edk2-devel] [PATCH] BaseMem= oryLibSse2: Take advantage > of write combining buffers >=20 > Thank for the review Liming. Is there anything else I should to help > with the review/merge process ? >=20 > Regards, > Jeremy >=20 > gaoliming writes: >=20 > > This is a good enhancement. The change is good to me. > > > > Reviewed-by: Liming Gao > > > > Thanks > > Liming > >> -----=E9=82=AE=E4=BB=B6=E5=8E=9F=E4=BB=B6----- > >> =E5=8F=91=E4=BB=B6=E4=BA=BA: bounce+27952+66093+4905953+8761045@group= s.io > >> =E4=BB=A3=E8=A1=A8 > Compostella, > >> Jeremy > >> =E5=8F=91=E9=80=81=E6=97=B6=E9=97=B4: 2020=E5=B9=B410=E6=9C=8810=E6= =97=A5 4:43 > >> =E6=94=B6=E4=BB=B6=E4=BA=BA: devel@edk2.groups.io > >> =E4=B8=BB=E9=A2=98: [edk2-devel] [PATCH] BaseMemoryLibSse2: Take adva= ntage of > write > >> combining buffers > >> > >> The current SSE2 implementation of the ZeroMem(), SetMem(), > >> SetMem16(), SetMem32 and SetMem64 functions is writing 16 bytes per > 16 > >> bytes. It hurts the performances so bad that this is even slower than > >> a simple 'rep stos' (4% slower) in regular DRAM. > >> > >> To take full advantages of the 'movntdq' instruction it is better to > >> "queue" a total of 64 bytes in the write combining buffers. This > >> patch implement such a change. Below is a table where I measured > >> (with 'rdtsc') the time to write an entire 100MB RAM buffer. These > >> functions operate almost two times faster. > >> > >> | Function | Arch | Untouched | 64 bytes | Result | > >> |----------+------+-----------+----------+--------| > >> | ZeroMem | Ia32 | 17765947 | 9136062 | 1.945x | > >> | ZeroMem | X64 | 17525170 | 9233391 | 1.898x | > >> | SetMem | Ia32 | 17522291 | 9137272 | 1.918x | > >> | SetMem | X64 | 17949261 | 9176978 | 1.956x | > >> | SetMem16 | Ia32 | 18219673 | 9372062 | 1.944x | > >> | SetMem16 | X64 | 17523331 | 9275184 | 1.889x | > >> | SetMem32 | Ia32 | 18495036 | 9273053 | 1.994x | > >> | SetMem32 | X64 | 17368864 | 9285885 | 1.870x | > >> | SetMem64 | Ia32 | 18564473 | 9241362 | 2.009x | > >> | SetMem64 | X64 | 17506951 | 9280148 | 1.886x | > >> > >> Signed-off-by: Jeremy Compostella > >> --- > >> .../BaseMemoryLibSse2/Ia32/SetMem.nasm | 11 ++++++---- > >> .../BaseMemoryLibSse2/Ia32/SetMem16.nasm | 11 ++++++---- > >> .../BaseMemoryLibSse2/Ia32/SetMem32.nasm | 9 ++++++--- > >> .../BaseMemoryLibSse2/Ia32/SetMem64.nasm | 20 > >> +++++++++++++++---- > >> .../BaseMemoryLibSse2/Ia32/ZeroMem.nasm | 11 ++++++---- > >> .../Library/BaseMemoryLibSse2/X64/SetMem.nasm | 9 ++++++--- > >> .../BaseMemoryLibSse2/X64/SetMem16.nasm | 11 ++++++---- > >> .../BaseMemoryLibSse2/X64/SetMem32.nasm | 9 ++++++--- > >> .../BaseMemoryLibSse2/X64/SetMem64.nasm | 19 > >> ++++++++++++++---- > >> .../BaseMemoryLibSse2/X64/ZeroMem.nasm | 13 > +++++++----- > >> 10 files changed, 85 insertions(+), 38 deletions(-) > >> > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > >> index 24313cb4b3..a8744300c6 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem.nasm > >> @@ -34,7 +34,7 @@ ASM_PFX(InternalMemSetMem): > >> mov al, [esp + 16] ; al <- Value > >> xor ecx, ecx > >> sub ecx, edi > >> - and ecx, 15 ; ecx + edi aligns on > 16-byte > >> boundary > >> + and ecx, 63 ; ecx + edi aligns on > 16-byte > >> boundary > >> jz .0 > >> cmp ecx, edx > >> cmova ecx, edx > >> @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem): > >> rep stosb > >> .0: > >> mov ecx, edx > >> - and edx, 15 > >> - shr ecx, 4 ; ecx <- # of DQwords to > set > >> + and edx, 63 > >> + shr ecx, 6 ; ecx <- # of DQwords to > set > >> jz @SetBytes > >> mov ah, al ; ax <- Value | (Value << > 8) > >> add esp, -16 > >> @@ -53,7 +53,10 @@ ASM_PFX(InternalMemSetMem): > >> movlhps xmm0, xmm0 ; xmm0 <- Value > repeats > >> 16 times > >> .1: > >> movntdq [edi], xmm0 ; edi should be 16-byte > >> aligned > >> - add edi, 16 > >> + movntdq [edi + 16], xmm0 > >> + movntdq [edi + 32], xmm0 > >> + movntdq [edi + 48], xmm0 > >> + add edi, 64 > >> loop .1 > >> mfence > >> movdqu xmm0, [esp] ; restore xmm0 > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > >> index 6e308b5594..d461ee086c 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem16.nasm > >> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16): > >> mov edi, [esp + 8] > >> xor ecx, ecx > >> sub ecx, edi > >> - and ecx, 15 ; ecx + edi aligns on > 16-byte > >> boundary > >> + and ecx, 63 ; ecx + edi aligns on > 16-byte > >> boundary > >> mov eax, [esp + 16] > >> jz .0 > >> shr ecx, 1 > >> @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16): > >> rep stosw > >> .0: > >> mov ecx, edx > >> - and edx, 7 > >> - shr ecx, 3 > >> + and edx, 31 > >> + shr ecx, 5 > >> jz @SetWords > >> movd xmm0, eax > >> pshuflw xmm0, xmm0, 0 > >> movlhps xmm0, xmm0 > >> .1: > >> movntdq [edi], xmm0 ; edi should be 16-byte > >> aligned > >> - add edi, 16 > >> + movntdq [edi + 16], xmm0 > >> + movntdq [edi + 32], xmm0 > >> + movntdq [edi + 48], xmm0 > >> + add edi, 64 > >> loop .1 > >> mfence > >> @SetWords: > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > >> index 2cfc8cb0dd..3ffdcd07d7 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem32.nasm > >> @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32): > >> rep stosd > >> .0: > >> mov ecx, edx > >> - and edx, 3 > >> - shr ecx, 2 > >> + and edx, 15 > >> + shr ecx, 4 > >> jz @SetDwords > >> movd xmm0, eax > >> pshufd xmm0, xmm0, 0 > >> .1: > >> movntdq [edi], xmm0 > >> - add edi, 16 > >> + movntdq [edi + 16], xmm0 > >> + movntdq [edi + 32], xmm0 > >> + movntdq [edi + 48], xmm0 > >> + add edi, 64 > >> loop .1 > >> mfence > >> @SetDwords: > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > >> index e153495a68..cd000648ae 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/SetMem64.nasm > >> @@ -38,17 +38,29 @@ ASM_PFX(InternalMemSetMem64): > >> add edx, 8 > >> dec ecx > >> .0: > >> - shr ecx, 1 > >> + push ebx > >> + mov ebx, ecx > >> + and ebx, 7 > >> + shr ecx, 3 > >> jz @SetQwords > >> movlhps xmm0, xmm0 > >> .1: > >> movntdq [edx], xmm0 > >> - lea edx, [edx + 16] > >> + movntdq [edx + 16], xmm0 > >> + movntdq [edx + 32], xmm0 > >> + movntdq [edx + 48], xmm0 > >> + lea edx, [edx + 64] > >> loop .1 > >> mfence > >> @SetQwords: > >> - jnc .2 > >> + test ebx, ebx > >> + jz .3 > >> + mov ecx, ebx > >> +.2 > >> movq qword [edx], xmm0 > >> -.2: > >> + lea edx, [edx + 8] > >> + loop .2 > >> +.3: > >> + pop ebx > >> ret > >> > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > >> index cd34006f59..0e0828551b 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/Ia32/ZeroMem.nasm > >> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemZeroMem): > >> xor ecx, ecx > >> sub ecx, edi > >> xor eax, eax > >> - and ecx, 15 > >> + and ecx, 63 > >> jz .0 > >> cmp ecx, edx > >> cmova ecx, edx > >> @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem): > >> rep stosb > >> .0: > >> mov ecx, edx > >> - and edx, 15 > >> - shr ecx, 4 > >> + and edx, 63 > >> + shr ecx, 6 > >> jz @ZeroBytes > >> pxor xmm0, xmm0 > >> .1: > >> movntdq [edi], xmm0 > >> - add edi, 16 > >> + movntdq [edi + 16], xmm0 > >> + movntdq [edi + 32], xmm0 > >> + movntdq [edi + 48], xmm0 > >> + add edi, 64 > >> loop .1 > >> mfence > >> @ZeroBytes: > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > >> index 5bd1c2262d..28b11ee586 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem.nasm > >> @@ -42,8 +42,8 @@ ASM_PFX(InternalMemSetMem): > >> rep stosb > >> .0: > >> mov rcx, rdx > >> - and rdx, 15 > >> - shr rcx, 4 > >> + and rdx, 63 > >> + shr rcx, 6 > >> jz @SetBytes > >> mov ah, al ; ax <- Value repeats > twice > >> movdqa [rsp + 0x10], xmm0 ; save xmm0 > >> @@ -52,7 +52,10 @@ ASM_PFX(InternalMemSetMem): > >> movlhps xmm0, xmm0 ; xmm0 <- Value > repeats > >> 16 times > >> .1: > >> movntdq [rdi], xmm0 ; rdi should be 16-byte > >> aligned > >> - add rdi, 16 > >> + movntdq [rdi + 16], xmm0 > >> + movntdq [rdi + 32], xmm0 > >> + movntdq [rdi + 48], xmm0 > >> + add rdi, 64 > >> loop .1 > >> mfence > >> movdqa xmm0, [rsp + 0x10] ; restore xmm0 > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > >> index 90d159820a..375be19313 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem16.nasm > >> @@ -33,7 +33,7 @@ ASM_PFX(InternalMemSetMem16): > >> mov r9, rdi > >> xor rcx, rcx > >> sub rcx, rdi > >> - and rcx, 15 > >> + and rcx, 63 > >> mov rax, r8 > >> jz .0 > >> shr rcx, 1 > >> @@ -43,15 +43,18 @@ ASM_PFX(InternalMemSetMem16): > >> rep stosw > >> .0: > >> mov rcx, rdx > >> - and edx, 7 > >> - shr rcx, 3 > >> + and edx, 31 > >> + shr rcx, 5 > >> jz @SetWords > >> movd xmm0, eax > >> pshuflw xmm0, xmm0, 0 > >> movlhps xmm0, xmm0 > >> .1: > >> movntdq [rdi], xmm0 > >> - add rdi, 16 > >> + movntdq [rdi + 16], xmm0 > >> + movntdq [rdi + 32], xmm0 > >> + movntdq [rdi + 48], xmm0 > >> + add rdi, 64 > >> loop .1 > >> mfence > >> @SetWords: > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > >> index 928e086889..5d12beaa9a 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem32.nasm > >> @@ -43,14 +43,17 @@ ASM_PFX(InternalMemSetMem32): > >> rep stosd > >> .0: > >> mov rcx, rdx > >> - and edx, 3 > >> - shr rcx, 2 > >> + and edx, 15 > >> + shr rcx, 4 > >> jz @SetDwords > >> movd xmm0, eax > >> pshufd xmm0, xmm0, 0 > >> .1: > >> movntdq [rdi], xmm0 > >> - add rdi, 16 > >> + movntdq [rdi + 16], xmm0 > >> + movntdq [rdi + 32], xmm0 > >> + movntdq [rdi + 48], xmm0 > >> + add rdi, 64 > >> loop .1 > >> mfence > >> @SetDwords: > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > >> index d771810542..265983d5ad 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/SetMem64.nasm > >> @@ -37,17 +37,28 @@ ASM_PFX(InternalMemSetMem64): > >> add rdx, 8 > >> dec rcx > >> .0: > >> - shr rcx, 1 > >> + push rbx > >> + mov rbx, rcx > >> + and rbx, 7 > >> + shr rcx, 3 > >> jz @SetQwords > >> movlhps xmm0, xmm0 > >> .1: > >> movntdq [rdx], xmm0 > >> - lea rdx, [rdx + 16] > >> + movntdq [rdx + 16], xmm0 > >> + movntdq [rdx + 32], xmm0 > >> + movntdq [rdx + 48], xmm0 > >> + lea rdx, [rdx + 64] > >> loop .1 > >> mfence > >> @SetQwords: > >> - jnc .2 > >> - mov [rdx], r8 > >> + push rdi > >> + mov rcx, rbx > >> + mov rax, r8 > >> + mov rdi, rdx > >> + rep stosq > >> + pop rdi > >> .2: > >> + pop rbx > >> ret > >> > >> diff --git a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > >> b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > >> index 5ddcae9ca5..21f504e3b7 100644 > >> --- a/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > >> +++ b/MdePkg/Library/BaseMemoryLibSse2/X64/ZeroMem.nasm > >> @@ -32,7 +32,7 @@ ASM_PFX(InternalMemZeroMem): > >> xor rcx, rcx > >> xor eax, eax > >> sub rcx, rdi > >> - and rcx, 15 > >> + and rcx, 63 > >> mov r8, rdi > >> jz .0 > >> cmp rcx, rdx > >> @@ -41,13 +41,16 @@ ASM_PFX(InternalMemZeroMem): > >> rep stosb > >> .0: > >> mov rcx, rdx > >> - and edx, 15 > >> - shr rcx, 4 > >> + and edx, 63 > >> + shr rcx, 6 > >> jz @ZeroBytes > >> pxor xmm0, xmm0 > >> .1: > >> - movntdq [rdi], xmm0 ; rdi should be 16-byte > >> aligned > >> - add rdi, 16 > >> + movntdq [rdi], xmm0 > >> + movntdq [rdi + 16], xmm0 > >> + movntdq [rdi + 32], xmm0 > >> + movntdq [rdi + 48], xmm0 > >> + add rdi, 64 > >> loop .1 > >> mfence > >> @ZeroBytes: > >> -- > >> 2.25.3 > >> > >> > >> > >> > >> >=20 >=20 >=20 >=20