From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: "afish@apple.com"
Cc: "Kinney, Michael D", "Gao, Liming", edk2-devel-01
Date: Thu, 7 Mar 2019 06:09:24 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C503F55722E@shsmsx102.ccr.corp.intel.com>
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Thanks Andrew.
Now I 100% agree with you - I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler *should also be simple and just be a function abstracting a CPU op-code* that is not available to C. This is how we prevent the maintenance issues you are worrying about. And I also agree with your case.

Let's discuss another case. The two functions below, SetJump/LongJump, were updated recently - by me, unfortunately.

It is unclear to me what size optimization we can get by inlining SetJump/LongJump. But it obviously brings a maintenance issue for me, and I don't believe it meets the criteria you mentioned above.

__declspec (naked)
RETURNS_TWICE
UINTN
EFIAPI
SetJump (
  OUT     BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer
  )
{
  _asm {
    push    [esp + 4]
    call    InternalAssertJumpBuffer
    pop     ecx
    pop     ecx
    mov     edx, [esp]

    xor     eax, eax
    mov     [edx + 24], eax       ; save 0 to SSP

    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit   0x0F
    _emit   0x20
    _emit   0xE0                  ; mov eax, cr4
    bt      eax, 23               ; check if CET is enabled
    jnc     CetDone

    mov     eax, 1
    _emit   0xF3
    _emit   0x0F
    _emit   0xAE
    _emit   0xE8                  ; INCSSP EAX to read original SSP
    _emit   0xF3
    _emit   0x0F
    _emit   0x1E
    _emit   0xC8                  ; READSSP EAX
    mov     [edx + 24], eax       ; save SSP

CetDone:

    mov     [edx], ebx
    mov     [edx + 4], esi
    mov     [edx + 8], edi
    mov     [edx + 12], ebp
    mov     [edx + 16], esp
    mov     [edx + 20], ecx
    xor     eax, eax
    jmp     ecx
  }
}

__declspec (naked)
VOID
EFIAPI
InternalLongJump (
  IN      BASE_LIBRARY_JUMP_BUFFER  *JumpBuffer,
  IN      UINTN                     Value
  )
{
  _asm {
    mov     eax, [PcdGet32 (PcdControlFlowEnforcementPropertyMask)]
    test    eax, eax
    jz      CetDone
    _emit   0x0F
    _emit   0x20
    _emit   0xE0                  ; mov eax, cr4
    bt      eax, 23               ; check if CET is enabled
    jnc     CetDone

    mov     edx, [esp + 4]        ; edx = JumpBuffer
    mov     edx, [edx + 24]       ; edx = target SSP
    _emit   0xF3
    _emit   0x0F
    _emit   0x1E
    _emit   0xC8                  ; READSSP EAX
    sub     edx, eax              ; edx = delta
    mov     eax, edx              ; eax = delta

    shr     eax, 2                ; eax = delta/sizeof(UINT32)
    _emit   0xF3
    _emit   0x0F
    _emit   0xAE
    _emit   0xE8                  ; INCSSP EAX

CetDone:

    pop     eax                   ; skip return address
    pop     edx                   ; edx <- JumpBuffer
    pop     eax                   ; eax <- Value
    mov     ebx, [edx]
    mov     esi, [edx + 4]
    mov     edi, [edx + 8]
    mov     ebp, [edx + 12]
    mov     esp, [edx + 16]
    jmp     dword ptr [edx + 20]
  }
}

From: afish@apple.com [mailto:afish@apple.com]
Sent: Wednesday, March 6, 2019 9:56 PM
To: Yao, Jiewen
Cc: Kinney, Michael D; Gao, Liming; edk2-devel-01
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

On Mar 6, 2019, at 9:06 PM, Yao, Jiewen <jiewen.yao@intel.com> wrote:

Hi Mike and Andrew

The problem is maintainability.

If we have multiple copies of an implementation, a developer needs to verify every copy whenever we make an update. Before that, the developer has to be aware that multiple copies exist at all - that increases the complexity.

If we have everything, there MAY be 5 copies - ASM, NASM, S, GCC inline, MS inline - theoretically.

Now we have removed ASM. That is a good first step. But we may still have 4 copies. I suggest we consider doing more.

Jiewen,

I think you are trying to do the right thing, but are optimizing the wrong thing.

Most of the GCC/Clang inline assembler code is in GccInline.c, and since that code mostly just abstracts an x86 instruction and the functions are very simple, it is NOT code that needs ongoing maintenance.
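For a concrete picture, the functions in those files are mostly one-instruction wrappers of roughly this shape (a representative sketch in the GccInline.c style, not an exact copy from the tree):

VOID
EFIAPI
CpuPause (
  VOID
  )
{
  //
  // Emit a single PAUSE instruction; the compiler still sees the
  // surrounding C and can inline this call at every call site.
  //
  __asm__ __volatile__ ("pause");
}

There is no state or control flow here for a maintainer to break, which is why this class of code stays static for years.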
Let's look at the history:

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/X64/GccInline.c

The last logic fix was Jun 1, 2010.

https://github.com/tianocore/edk2/commits/master/MdePkg/Library/BaseLib/Ia32/GccInline.c

OK, so Mike had to make a fix in this file in 2015 to make something optional, due to an embedded CPU defeaturing an instruction. But prior to that it was 2010.

The set of C inline assembler functions we have should be static and not require much maintenance. More complex code should be written in C and use the C inline assembler functions we already have. If any more complex assembly code is required we should just write it in NASM. So I think it is fine to restrict new C inline assembler, or at least have a very high bar to add anything new. Any new inline assembler should also be simple and just be a function abstracting a CPU op-code that is not available to C. This is how we prevent the maintenance issues you are worrying about.

I gave an example in this mail thread of how a breakpoint goes from being 1 byte to 11 bytes if you remove the C inline assembler. For clang with LTO enabled, a CpuBreakpoint will always get inlined into the caller's code and it will only take the one byte for the int 3 instruction. If that code moves to NASM, it gets replaced with a 5-byte call instruction and an actual C ABI function for the breakpoint.

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view:

a.out[0x1fa5] <+6>: cc  int3

But if we move that to NASM:

a.out[0x1fa6] <+7>: e8 07 00 00 00  calll 0x1fb2  ; CpuBreakpoint

plus:

a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl %ebp
a.out[0x1f9a] <+1>: 89 e5  movl  %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl  %ebp
a.out[0x1f9e] <+5>: c3     retl

For any compiler that emits frame pointers, if you move the INT 3 to assembly you need the frame pointer, or the Exception Lib is not going to be able to print out the stack backtrace of the code when you hit a breakpoint. So this is how I get to 11 vs. 1 bytes.

Thanks,

Andrew Fish

PS For clang LTO the compiler compiles to bitcode, and that is an abstracted assembly language. At link time all that code gets optimized together and then passed through the CPU-specific code gen. For C inline assembler, the assembly instructions end up in the bitcode with lots of information about the constraints. That is why these GccInline.c functions almost always get inlined with clang and LTO.

The CpuBreakpoint() function looks like this in bitcode, but the function call gets optimized away by LTO in the code gen:

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

Even something more complex like AsmReadMsr64() can get inlined, but it contains a lot more info about the constraints:

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}

The alloca, store, load, etc. are the same bitcode instructions you would see with C code.
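For reference, the C source that lowers to that AsmReadMsr64 bitcode is an inline-assembly wrapper along these lines (a sketch following the IA32 GccInline.c pattern; the "=A" constraint binds the 64-bit result to edx:eax and "c" pins Index to ecx, matching the "=A,{cx},..." constraint string above):

UINT64
EFIAPI
AsmReadMsr64 (
  IN UINT32  Index
  )
{
  UINT64  Data;

  //
  // RDMSR reads the MSR selected by ecx and returns it in edx:eax.
  //
  __asm__ __volatile__ (
    "rdmsr"
    : "=A" (Data)    // output: 64-bit value in edx:eax
    : "c"  (Index)   // input: MSR index in ecx
    );

  return Data;
}

Because the constraints travel with the bitcode, the link-time optimizer can inline this wrapper exactly as it would any C function.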
A recent case is SetJump/LongJump: I updated the NASM files for both IA32 and X64. Later, to my surprise, only the X64 NASM version worked; the IA32 NASM version did not. Then I noticed that there is a different copy in the IA32 Jump.c MS inline assembly, which I also needed to update. That is frustrating.

I think there should be a balance between optimization and code readability/maintainability.

Do we have data on what size benefit we can get with these inline functions, at the whole-image level? We can do the evaluation at the function level, case by case. If we see a huge size benefit, I agree to keep the function. If we see no size benefit, I suggest we do the cleanup for that function.

Thank you
Yao Jiewen

-----Original Message-----
From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Andrew Fish via edk2-devel
Sent: Wednesday, March 6, 2019 5:31 PM
To: Kinney, Michael D
Cc: edk2-devel-01; Gao, Liming
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

On Mar 6, 2019, at 4:41 PM, Kinney, Michael D wrote:

Liming,

That does not work either, because inline assembly uses compiler-specific syntax. Please follow up with the specific list of functions for which you think the inline assembly should be removed to improve maintenance with no impact on size/perf.

Mike,

It is easy to find the gcc/clang inline assembler - just `git grep "__asm__ __volatile__"`. The main files are:

https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseIoLibIntrinsic/IoLibGcc.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/X64/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/Ia32/GccInline.c
https://github.com/tianocore/edk2/blob/master/MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c

This is really just compiler optimizer control:

Library/BaseSynchronizationLib/SynchronizationGcc.c:21:
#define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); } while(0)

Looks like this is working around the alignment ASSERT in BaseIoLibIntrinsic:

OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:43:  __asm__ __volatile__ ( "outl %0, %1" : : "a" (Value), "d" ((UINT16)Port) );
OvmfPkg/QemuVideoDxe/UnalignedIoGcc.c:67:  __asm__ __volatile__ ( "inl %1, %0" : "=a" (Data) : "d" ((UINT16)Port) );

It seems like we have a reasonable set, and we should not use inline assembly for any more complex code.

Thanks,

Andrew Fish

The issue you pointed to was around SetJump()/LongJump(). Can we limit this BZ to only those 2 functions?

Mike

From: Gao, Liming
Sent: Wednesday, March 6, 2019 4:28 PM
To: afish@apple.com
Cc: Zhang, Shenglei; edk2-devel-01; Kinney, Michael D
Subject: RE: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

Andrew:
I want to keep only one implementation. If the inline assembly C source is preferred, I suggest removing its NASM implementation.

Thanks
Liming

From: afish@apple.com [mailto:afish@apple.com]
Sent: Tuesday, March 5, 2019 2:44 PM
To: Gao, Liming
Cc: Zhang, Shenglei; edk2-devel-01; Kinney, Michael D
Subject: Re: [edk2] [PATCH 3/3] MdePkg/BaseSynchronizationLib: Remove inline X86 assembly code

On Mar 5, 2019, at 1:32 PM, Gao, Liming wrote:

Andrew:
BZ 1163 is to remove the inline X86 assembly code in C source files. But this patch is wrong. I have given my comments to update this patch.

Why do we want to remove inline X86 assembly?
As I mentioned, it will lead to larger binaries, slower binaries, and less optimized code. For example, take this simple inline assembly function:

VOID
EFIAPI
CpuBreakpoint (
  VOID
  )
{
  __asm__ __volatile__ ("int $3");
}

Today with clang LTO turned on, calling CpuBreakpoint() looks like this from the caller's point of view:

a.out[0x1fa5] <+6>: cc  int3

But if we move that to NASM:

a.out[0x1fa6] <+7>: e8 07 00 00 00  calll 0x1fb2  ; CpuBreakpoint

plus:

a.out`CpuBreakpoint:
a.out[0x1fb2] <+0>: cc  int3
a.out[0x1fb3] <+1>: c3  retl

And there is also an extra push and pop on the stack. The other issue is that a call to a function that is unknown to the compiler will act like a _ReadWriteBarrier (yikes, I see _ReadWriteBarrier is deprecated in VC++ 2017). Is the deprecation of some of these intrinsics in VC++ driving the removal of inline assembly?

For GCC, inline assembly works great for a local compile, and for clang LTO it works in entire-program optimization. The LTO bitcode includes the inline assembly and constraints, so the optimizer knows what to do and the code gets optimized just like the abstract bitcode during Link Time Optimization.

This is CpuBreakpoint():

; Function Attrs: noinline nounwind optnone ssp uwtable
define void @CpuBreakpoint() #0 {
  call void asm sideeffect "int $$3", "~{dirflag},~{fpsr},~{flags}"() #2, !srcloc !3
  ret void
}

This is AsmReadMsr64():

; Function Attrs: noinline nounwind optnone ssp uwtable
define i64 @AsmReadMsr64(i32) #0 {
  %2 = alloca i32, align 4
  %3 = alloca i64, align 8
  store i32 %0, i32* %2, align 4
  %4 = load i32, i32* %2, align 4
  %5 = call i64 asm sideeffect "rdmsr", "=A,{cx},~{dirflag},~{fpsr},~{flags}"(i32 %4) #2, !srcloc !4
  store i64 %5, i64* %3, align 8
  %6 = load i64, i64* %3, align 8
  ret i64 %6
}

I agree it makes sense to remove the .S and .asm files and only have .nasm files.

Thanks,

Andrew Fish

PS For the Xcode clang, since it emits frame pointers by default, we need the extra bytes to save the frame pointer in the assembly functions so the stack can get unwound.

a.out`CpuBreakpoint:
a.out[0x1f99] <+0>: 55     pushl %ebp
a.out[0x1f9a] <+1>: 89 e5  movl  %esp, %ebp
a.out[0x1f9c] <+3>: cc     int3
a.out[0x1f9d] <+4>: 5d     popl  %ebp
a.out[0x1f9e] <+5>: c3     retl

So breakpoint goes from 1 byte to 11 bytes if we get rid of the intrinsics.

The change is to reduce the duplicated implementation. It will be good for code maintenance. Recently, one patch had to update both the .c and the .nasm files. Please see https://lists.01.org/pipermail/edk2-devel/2019-February/037165.html. Besides this change, I propose to remove all GAS assembly code for the IA32 and X64 arch. After those changes, the patch owner will collect the impact on the image size.

Thanks
Liming
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel