From: Andrew Fish <afish@apple.com>
Date: Wed, 06 Mar 2019 22:45:25 -0800
Message-id: <0F79F0FC-123A-4862-926D-DFE08CF5740F@apple.com>
In-reply-to: <74D8A39837DF1E4DA445A8C0B3885C503F55722E@shsmsx102.ccr.corp.intel.com>
To: "Yao, Jiewen"
Cc: Mike Kinney, "Gao, Liming", edk2-devel-01
Subject: Re: [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
> On Mar 6, 2019, at 10:09 PM, Yao, Jiewen wrote:
>
> Thanks, Andrew.
> Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about.
>
> And I also agree with your case.
>
> Let’s discuss another case. The two functions below, SetJump/LongJump, were updated recently - by me, unfortunately.
>
> It is unclear to me what size optimization we can get by inlining SetJump/LongJump.
> But obviously, it brings a maintenance issue for me.
> And I don’t believe it meets the criteria you mentioned above.
Jiewen,

I agree: given the rules we agree on, this code should be written in NASM.

Thanks,

Andrew Fish

> __declspec (naked)
> RETURNS_TWICE
> UINTN
> EFIAPI
> SetJump (
>   OUT BASE_LIBRARY_JUMP_BUFFER *JumpBuffer
>   )
> {
>   _asm {
>     push [esp + 4]
>     call InternalAssertJumpBuffer
>     pop ecx
>     pop ecx
>     mov edx, [esp]
>
>     xor eax, eax
>     mov [edx + 24], eax        ; save 0 to SSP
>
>     mov eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test eax, eax
>     jz CetDone
>     _emit 0x0F
>     _emit 0x20
>     _emit 0xE0                 ; mov eax, cr4
>     bt eax, 23                 ; check if CET is enabled
>     jnc CetDone
>
>     mov eax, 1
>     _emit 0xF3
>     _emit 0x0F
>     _emit 0xAE
>     _emit 0xE8                 ; INCSSP EAX to read original SSP
>     _emit 0xF3
>     _emit 0x0F
>     _emit 0x1E
>     _emit 0xC8                 ; READSSP EAX
>     mov [edx + 0x24], eax      ; save SSP
>
> CetDone:
>
>     mov [edx], ebx
>     mov [edx + 4], esi
>     mov [edx + 8], edi
>     mov [edx + 12], ebp
>     mov [edx + 16], esp
>     mov [edx + 20], ecx
>     xor eax, eax
>     jmp ecx
>   }
> }
>
> __declspec (naked)
> VOID
> EFIAPI
> InternalLongJump (
>   IN BASE_LIBRARY_JUMP_BUFFER *JumpBuffer,
>   IN UINTN Value
>   )
> {
>   _asm {
>     mov eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
>     test eax, eax
>     jz CetDone
>     _emit 0x0F
>     _emit 0x20
>     _emit 0xE0                 ; mov eax, cr4
>     bt eax, 23                 ; check if CET is enabled
>     jnc CetDone
>
>     mov edx, [esp + 4]         ; edx = JumpBuffer
>     mov edx, [edx + 24]        ; edx = target SSP
>     _emit 0xF3
>     _emit 0x0F
>     _emit 0x1E
>     _emit 0xC8                 ; READSSP EAX
>     sub edx, eax               ; edx = delta
>     mov eax, edx               ; eax = delta
>
>     shr eax, 2                 ; eax = delta/sizeof(UINT32)
>     _emit 0xF3
>     _emit 0x0F
>     _emit 0xAE
>     _emit 0xE8                 ; INCSSP EAX
>
> CetDone:
>
>     pop eax                    ; skip return address
>     pop edx                    ; edx <- JumpBuffer
>     pop eax                    ; eax <- Value
>     mov ebx, [edx]
>     mov esi, [edx + 4]
>     mov edi, [edx + 8]
>     mov ebp, [edx + 12]
>     mov esp, [edx + 16]
>     jmp dword ptr [edx + 20]
>   }
> }
>
> From: afish@apple.com [mailto:afish@apple.com]
> Sent:
> Wednesday, March 6, 2019 9:56 PM
> To: Yao, Jiewen
> Cc: Kinney, Michael D; Gao, Liming; edk2-devel-01
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>
> On Mar 6, 2019, at 9:06 PM, Yao, Jiewen wrote:
>
> Hi Mike and Andrew,
> The problem is maintainability.
>
> If we have multiple copies of an implementation, a developer needs to verify every copy when making an update. Before that, the developer has to be aware that multiple copies exist at all - that increases the complexity.
>
> If we have everything, there may be 5 copies - ASM, NASM, S, GCC inline, MS inline - theoretically.
> Now we have removed ASM. That is a good first step.
> But we may still have 4 copies. I suggest we consider doing more.
>
> Jiewen,
>
> I think you are trying to do the right thing, but are optimizing the wrong thing.
>
> Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts an x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.
>
> Let's look at the history:
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c
> The last logic fix was Jun 1, 2010.
>
> https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
> OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.
>
> The set of C inline assembler functions we have should be static and not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required, we should just write it in NASM.
> So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.
>
> I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint() will always get inlined into the caller's code and will only take the one byte for the int 3 instruction. If that code moves to NASM, it gets replaced with a 5-byte call instruction plus an actual C ABI function for the breakpoint.
>
> VOID
> EFIAPI
> CpuBreakpoint (
>   VOID
>   )
> {
>   __asm__ __volatile__ ("int $3");
> }
>
> Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view:
>
> a.out[0x1fa5] <+6>: cc  int3
>
> But if we move that to NASM:
>
> a.out[0x1fa6] <+7>: e8 07 00 00 00  calll 0x1fb2  ; CpuBreakpoint
>
> plus:
>
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3
> a.out[0x1f9d] <+4>: 5d     popl %ebp
> a.out[0x1f9e] <+5>: c3     retl
>
> For any compiler that emits the frame pointer, if you move the INT 3 to assembly you need the frame pointer, or the exception lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.
>
> Thanks,
>
> Andrew Fish
>
> PS For clang LTO the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen.
> For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.
>
> The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen:
>
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>   call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
>   ret void
> }
>
> Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:
>
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>   %2 = alloca i32, align 4
>   %3 = alloca i64, align 8
>   store i32 %0, i32* %2, align 4
>   %4 = load i32, i32* %2, align 4
>   %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>   store i64 %5, i64* %3, align 8
>   %6 = load i64, i64* %3, align 8
>   ret i64 %6
> }
>
> The alloca, store, load, etc. are the same bitcode instructions you would see with C code.
>
> A recent case: for SetJump/LongJump, I updated the NASM files for both IA32 and X64.
>
> Later, to my surprise, only the X64 NASM version worked; the IA32 NASM version did not.
> Then I noticed that there is a separate copy in the IA32 Jump.c MS inline assembler that I also needed to update. That was frustrating.
>
> I think there should be a balance between optimization and code readability/maintainability.
>
> Do we have data on what size benefit we can get from these inline functions, at the whole-image level?
> We can do the evaluation at the function level, case by case.
> If we see a huge size benefit, I agree to keep the function.
> If we see no size benefit, I suggest we do the cleanup for that function.
> Thank you,
> Yao Jiewen
>
> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Andrew Fish via edk2-devel
> Sent: Wednesday, March 6, 2019 5:31 PM
> To: Kinney, Michael D
> Cc: edk2-devel-01; Gao, Liming
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>
> On Mar 6, 2019, at 4:41 PM, Kinney, Michael D wrote:
>
> Liming,
>
> That does not work either, because inline assembly uses compiler-specific syntax.
>
> Please update with the specific list of functions for which you think the inline assembler should be removed to improve maintenance with no impact on size/perf.
>
> Mike,
>
> It is easy to find the gcc/clang inline assembler; just `git grep "__asm__ __volatile__"`.
>
> The main files are:
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
> https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c
>
> This one is really just compiler optimizer control:
> Library/BaseSynchronizationLib/SynchronizationGcc.c:21:#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)
>
> This looks like it is working around the alignment ASSERT in BaseIoLibIntrinsic:
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43: __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
> OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67: __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );
>
> It seems like we have a reasonable set, and we should not use inline assembly for any more complex code.
>
> Thanks,
>
> Andrew Fish
>
> The issue you pointed to was around SetJump()/LongJump(). Can we limit this BZ to only those 2 functions?
>
> Mike
>
> From: Gao, Liming
> Sent: Wednesday, March 6, 2019 4:28 PM
> To: afish@apple.com
> Cc: Zhang, Shenglei; edk2-devel-01; Kinney, Michael D
> Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>
> Andrew:
> I want to keep only one implementation. If the inline assembly C source is preferred, I suggest removing its NASM implementation.
>
> Thanks,
> Liming
>
> From: afish@apple.com [mailto:afish@apple.com]
> Sent: Tuesday, March 5, 2019 2:44 PM
> To: Gao, Liming
> Cc: Zhang, Shenglei; edk2-devel-01; Kinney, Michael D
> Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code
>
> On Mar 5, 2019, at 1:32 PM, Gao, Liming wrote:
>
> Andrew:
> BZ 1163 is to remove inline X86 assembly code in C source files. But this patch is wrong; I have given my comments to update this patch.
>
> Why do we want to remove inline X86 assembly? As I mentioned, it will lead to larger binaries, slower binaries, and less optimized code.
>
> For example, take this simple inline assembly function:
>
> VOID
> EFIAPI
> CpuBreakpoint (
>   VOID
>   )
> {
>   __asm__ __volatile__ ("int $3");
> }
>
> Today, with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view:
> a.out[0x1fa5] <+6>: cc  int3
>
> But if we move that to NASM:
>
> a.out[0x1fa6] <+7>: e8 07 00 00 00  calll 0x1fb2  ; CpuBreakpoint
>
> plus:
>
> a.out`CpuBreakpoint:
> a.out[0x1fb2] <+0>: cc  int3
> a.out[0x1fb3] <+1>: c3  retl
>
> And there is also an extra push and pop on the stack. The other issue is that a call to a function unknown to the compiler will act like a _ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++ 2017). Is the deprecation of some of these intrinsics in VC++ driving the removal of inline assembly? For GCC, inline assembly works great for local compiles, and for clang LTO it works in entire-program optimization.
>
> The LTO bitcode includes the inline assembly and its constraints, so the optimizer knows what to do and the code gets optimized just like the abstract bitcode during Link Time Optimization.
>
> This is CpuBreakpoint():
>
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define void @CpuBreakpoint() #0 {
>   call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
>   ret void
> }
>
> This is AsmReadMsr64():
>
> ; Function Attrs: noinline nounwind optnone ssp uwtable
> define i64 @AsmReadMsr64(i32) #0 {
>   %2 = alloca i32, align 4
>   %3 = alloca i64, align 8
>   store i32 %0, i32* %2, align 4
>   %4 = load i32, i32* %2, align 4
>   %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
>   store i64 %5, i64* %3, align 8
>   %6 = load i64, i64* %3, align 8
>   ret i64 %6
> }
>
> I agree it makes sense to remove the .S and .asm files and only have .nasm files.
>
> Thanks,
>
> Andrew Fish
>
> PS For the Xcode clang, since it emits frame pointers by default, we need to add the extra frame-pointer bytes to the assembly functions so the stack can get unwound.
> a.out`CpuBreakpoint:
> a.out[0x1f99] <+0>: 55     pushl %ebp
> a.out[0x1f9a] <+1>: 89 e5  movl %esp, %ebp
> a.out[0x1f9c] <+3>: cc     int3
> a.out[0x1f9d] <+4>: 5d     popl %ebp
> a.out[0x1f9e] <+5>: c3     retl
>
> So the breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.
>
> The change is to reduce the duplicated implementations, which will be good for code maintenance. Recently, one patch had to update both the .c and .nasm files; please see https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html.
> Besides this change, I propose to remove all GAS assembly code for the IA32 and X64 arches. After those changes, the patch owner will collect the impact on image size.
>
> Thanks,
> Liming
>
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel