From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 3045D81F22 for ; Wed, 16 Nov 2016 17:10:05 -0800 (PST) Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP; 16 Nov 2016 17:10:09 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.31,650,1473145200"; d="scan'208,217";a="32242278" Received: from orsmsx104.amr.corp.intel.com ([10.22.225.131]) by fmsmga005.fm.intel.com with ESMTP; 16 Nov 2016 17:10:09 -0800 Received: from orsmsx112.amr.corp.intel.com (10.22.240.13) by ORSMSX104.amr.corp.intel.com (10.22.225.131) with Microsoft SMTP Server (TLS) id 14.3.248.2; Wed, 16 Nov 2016 17:10:09 -0800 Received: from orsmsx113.amr.corp.intel.com ([169.254.9.63]) by ORSMSX112.amr.corp.intel.com ([169.254.3.163]) with mapi id 14.03.0248.002; Wed, 16 Nov 2016 17:10:08 -0800 From: "Kinney, Michael D" To: "afish@apple.com" , Laszlo Ersek , "Kinney, Michael D" CC: "Fan, Jeff" , "edk2-devel@ml01.01.org" , Paolo Bonzini , "Tian, Feng" , "Yao, Jiewen" Thread-Topic: [edk2] [PATCH v2 0/2] Add volatile for mNumberToFinish Thread-Index: AQHSP1rNjKZBHgmXqkGDdQ/d754BhqDb6ZDAgACbDwCAAB+WgP//ujOg Date: Thu, 17 Nov 2016 01:10:08 +0000 Message-ID: References: <20161115140809.5292-1-jeff.fan@intel.com> <605a5187-5b91-e0f8-5560-8a02ddad5057@redhat.com> <87503cca-0496-e78c-cc94-c27c1417901e@redhat.com> <512424A8-31F2-4333-9FAC-AD4D35C785AC@apple.com> In-Reply-To: <512424A8-31F2-4333-9FAC-AD4D35C785AC@apple.com> Accept-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiOWZlNmI0NWItMjk4NS00M2Q5LWIxODQtMTI3MjM3ZjM2NzkwIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE1LjkuNi42IiwiVHJ1c3RlZExhYmVsSGFzaCI6IlJYS2QxVkorSWxHVXhFbkRpQUNkcTBzVGpmRVNqV1VqQXkrVVFcL1hGZnhVPSJ9 x-originating-ip: [10.22.254.140] MIME-Version: 1.0 X-Content-Filtered-By: Mailman/MimeDel 2.1.21 Subject: Re: [PATCH v2 0/2] Add volatile for mNumberToFinish X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 17 Nov 2016 01:10:05 -0000 Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Hi Andrew, Thanks for all the examples! I have done a more detailed review of the BaseSynchronizationLib. I do not= think there are any missing memory barriers in the implementation. I have also experimented with adding volatile qualifier to parameters in th= e Interlocked*() functions, and for the builds I have done so far, that app= ears to be a compatible change. I did have to update the internal function= s in that lib to also consistently use volatile, but that makes sense based= on the ANSI C spec items that Laszlo quoted. I have also been looking at Jeff's patch to CpuS3.c to add volatile to the = mNumberToFinish global variable, and I have a slightly different patch I ha= ve evaluated to avoid the type case from a volatile variable to a non-volat= ile variable in all cases. It does require that the calls to SwitchStack()= and DisablePaging64() treat the parameter as a UINTN address of the semaph= ore, instead of a volatile pointer to the semaphore. I will send some patches soon for review. Mike From: afish@apple.com [mailto:afish@apple.com] Sent: Wednesday, November 16, 2016 1:14 PM To: Laszlo Ersek Cc: Kinney, Michael D ; Fan, Jeff ; edk2-devel@ml01.01.org; Paolo Bonzini ; Tian,= Feng ; Yao, Jiewen Subject: Re: [edk2] [PATCH v2 0/2] Add volatile for mNumberToFinish On Nov 16, 2016, at 11:21 AM, Laszlo Ersek > wrote: On 11/16/16 19:18, Kinney, Michael D wrote: Laszlo, Thanks for the details from and ANSI C spec. For this compiler issue, are there more details on the assembly code generated by the GCC 5.4 compiler in the failing mode? Unfortunately, I have little first-hand experience with this kind of failure, I've just read different accounts of it. Perhaps Paolo can comment -- my experience with contacting GCC maintainers directly hasn't been stellar. If it reproduces in clang I've had a lot of luck getting help from those fo= lks. In general the best thing to do if you think you have a compiler bug it to = write a small command line program that reproduces the issues. If it is a c= ode generation issue, you don't even need to make the code link or run, you= can just compile the code directly to assembler and write up in the bug re= port why you think the code generation is in error. Basically like the exam= ples I post to this list from time to time. Well unless it is a link time o= ptimization issue... I also see Liming's observation that the internal implementation of the SynchronizationLib adds volatile in some internal worker functions. Other implementation artifacts include the use of a read/write memory barrier to prevent the optimizing compiler from optimizing across a boundary. These read/write barriers are used in the spin lock functions, but not the Interlocked*() functions. I want to make sure we have studied the code generation to see if the issue is related to volatile or a read/write memory barrier. It could be adding volatile forced the compiler to internally add read/write barrier. In my understanding, MemoryFence() [MdePkg/Library/BaseLib/X64/GccInline.c] is just a compiler-level barrier, not a processor level barrier. __asm__ __volatile__ ("":::"memory"); Processor level barriers shouldn't be a problem here; as far as I recall from the SDM from a few days ago, the XCHG instruction "engages the CPU's locking protocol even without the LOCK prefix", plus I see an explicit LOCK in InternalSyncDecrement() [MdePkg/Library/BaseSynchronizationLib/X64/GccInline.c] for example. I'm pretty sure the compiler opimized away the access because it saw no intervening access (it doesn't know about the code running on other CPUs). I think "volatile" and MemoryFence() -- with the latter telling GCC that all memory should be considered clobbered -- should fix the issue just the same, in this situation. There is a good chance it is similar to my example from my previous email w= hen the return value is a constant since the compiler wrote the value and r= emember the value it wrote. You will need to make the global volatile or pl= ace a barrier between the write and the read to tell the compiler what opti= mizations are legal. There can also be other insidious things in place that hide bugs. For examp= le in this code: mNumberToFinish =3D mAcpiCpuData.NumberOfCpus - 1; mExchangeInfo->ApFunction =3D (VOID *) (UINTN) EarlyMPRendezvousProcedur= e; // // Send INIT IPI - SIPI to all APs // SendInitSipiSipiAllExcludingSelf ((UINT32)mAcpiCpuData.StartupVector); while (mNumberToFinish > 0) { CpuPause (); } What if some versions of SendInitSipiSipiAllExcludingSelf() have side effec= ts that make it hard to optimize and some do not. You could get a different= result... Anyway I think this optimization is more subtle than my previous examples. = If you just compile this code to assembler it looks correct (you see a loop= ). But if you do link time optimization the compiler will make more assumpt= ions about how mNumberToFinish is used globally and be more aggressive abou= t optimizing it out. You can get the same behavior from the assembler gene= ration if you make mNumberToFinish a static local. ~/work/Compiler>cat barriers.c #define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); }= while(0) unsigned int mNumberToFinish =3D 0; void SendInit2() { *(int *)0x12345678 =3D 0; } int main() { mNumberToFinish =3D 3; SendInit2(); while (mNumberToFinish > 0) { ; } return 0; } ~/work/Compiler>clang -Os -flto -g barriers.c ~/work/Compiler>lldb a.out (lldb) target create "a.out" Current executable set to 'a.out' (x86_64). (lldb) dis -n main -b a.out`main: a.out[0x100000fa7] <+0>: 55 pushq %rbp a.out[0x100000fa8] <+1>: 48 89 e5 movq %rsp, %r= bp a.out[0x100000fab] <+4>: c7 04 25 78 56 34 12 00 00 00 00 movl $0x0, 0x= 12345678 a.out[0x100000fb6] <+15>: eb fe jmp 0x100000= fb6 ; <+15> at barriers.c:16 ~/work/Compiler>clang -Os -S barriers.c ~/work/Compiler>cat barriers.s .globl _SendInit2 _SendInit2: ## @SendInit2 pushq %rbp movq %rsp, %rbp movl $0, 305419896 popq %rbp retq .globl _main _main: ## @main pushq %rbp movq %rsp, %rbp movl $3, _mNumberToFinish(%rip) xorl %eax, %eax movl %eax, 305419896 cmpl %eax, _mNumberToFinish(%rip) je LBB1_2 LBB1_1: ## =3D>This Inner Loop Header: Dept= h=3D1 jmp LBB1_1 LBB1_2: xorl %eax, %eax popq %rbp retq .globl _mNumberToFinish ## @mNumberToFinish .zerofill __DATA,__common,_mNumberToFinish,4,2 Live in fear of the optimizer as it is getting smarter every day. Sorry if = I picked the wrong bit of code as the example. Thanks, Andrew Fish PS This is a good example of how barriers don't do what you may think. Noti= ce how the barrier only adds a single read and then a return or an infinite= loop. It makes sense if you think about it.... After the barrier the compi= ler is clueless as to what is going on so it reads the global. Given the gl= obal can't change it decides to return or fall into the infinite loop. This= is really good example of the difference between a barrier and a volatile = variable. This really highlights that barrier is a one shot vs. the volatil= e is a property for the variable. ~/work/Compiler>cat barriers.c #define _ReadWriteBarrier() do { __asm__ __volatile__ ("": : : "memory"); }= while(0) unsigned int mNumberToFinish =3D 0; void SendInit2() { *(int *)0x12345678 =3D 0; } int main() { mNumberToFinish =3D 3; SendInit2(); _ReadWriteBarrier(); while (mNumberToFinish > 0) { ; } return 0; } ~/work/Compiler>clang -Os -flto -g barriers.c ~/work/Compiler>lldb a.out (lldb) target create "a.out" Current executable set to 'a.out' (x86_64). (lldb) dis -n main -b a.out`main: a.out[0x100000f8d] <+0>: 55 pushq %rbp a.out[0x100000f8e] <+1>: 48 89 e5 movq %rsp, %r= bp a.out[0x100000f91] <+4>: c6 05 68 00 00 00 01 movb $0x1, 0x= 68(%rip) a.out[0x100000f98] <+11>: c7 04 25 78 56 34 12 00 00 00 00 movl $0x0, 0x= 12345678 a.out[0x100000fa3] <+22>: 0f b6 05 56 00 00 00 movzbl 0x56(%ri= p), %eax ; mNumberToFinish a.out[0x100000faa] <+29>: 83 e0 01 andl $0x1, %e= ax a.out[0x100000fad] <+32>: 83 f8 01 cmpl $0x1, %e= ax a.out[0x100000fb0] <+35>: 75 02 jne 0x100000= fb4 ; <+39> at barriers.c:21 a.out[0x100000fb2] <+37>: eb fe jmp 0x100000= fb2 ; <+37> at barriers.c:17 a.out[0x100000fb4] <+39>: 31 c0 xorl %eax, %e= ax a.out[0x100000fb6] <+41>: 5d popq %rbp a.out[0x100000fb7] <+42>: c3 retq Also, given the implementation I see in the SynchronizationLib I am not sure the cast to (UINT32 *) is required in the proposed patch. Do we get compiler warnings/errors if those casts are not included? Yes. Passing an argument to a function whose prototype is visible invokes type conversion just like in assignment: 6.5.2.2 Function calls 7 If the expression that denotes the called function has a type that does include a prototype, the arguments are implicitly converted, as if by assignment, to the types of the corresponding parameters, taking the type of each parameter to be the unqualified version of its declared type. [...] (IMPORTANT: the de-qualification mentioned above refers to the pointer object itself, not to the pointed-to object! It just means if you have a function parameter "const long x", and pass the argument 5 (which has type int), then it is converted to "long", and not "const long".) So, passing a (volatile UINT32 *) argument to a (UINT32 *) parameter is the same as the similar assignment volatile UINT32 *x; UINT32 *y; y =3D x; Then, regarding assignment, we have 6.5.16.1 Simple assignment Constraints 1 One of the following shall hold: [...] - both operands are pointers to qualified or unqualified versions of compatible types, and the type pointed to by the left has all the qualifiers of the type pointed to by the right; [...] Finally we have diagnostics, 5.1.1.3 Diagnostics 1 A conforming implementation shall produce at least one diagnostic message (identified in an implementation-defined manner) if a preprocessing translation unit or translation unit contains a violation of any syntax rule or constraint, even if the behavior is also explicitly specified as undefined or implementation-defined. [...] In summary, without adding the explicit cast, we break the constraint on simple assignment, which requires the compiler to emit a diagnostic message= . (In turn if we add the explicit cast, we break other requirements, as discussed in my other email -- and here we should investigate whether we care about breaking that exact rule.) The second topic is what the SynchronizationLib API interfaces should have been from the beginning. In retrospect, I think they should have been defined with volatile pointers. I agree! The spin lock APIs do use volatile pointers, but that is embedded in the typedef for SPIN_LOCK, so it is not as obvious. Thanks Laszlo -----Original Message----- From: Laszlo Ersek [mailto:lersek@redhat.com] Sent: Tuesday, November 15, 2016 8:10 AM To: Fan, Jeff >; edk2-devel@m= l01.01.org Cc: Paolo Bonzini >; Yao, J= iewen >; Tian, Feng >; Kinney, Michael D <= michael.d.kinney@intel.com> Subject: Re: [PATCH v2 0/2] Add volatile for mNumberToFinish Jeff, On 11/15/16 15:08, Jeff Fan wrote: v2: Add patch #1 per Laszlo's comments at https://lists.01.org/pipermail/edk2-devel/2016-November/004697.html About the comments updated SynchronizationLib to add volatile for input parameter, I will send in another serial patches. Cc: Paolo Bonzini > Cc: Laszlo Ersek > Cc: Jiewen Yao > Cc: Feng Tian > Cc: Michael D Kinney > Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Jeff Fan > Jeff Fan (2): UefiCpuPkg/PiSmmCpuDxeSmm: Add volatile for parameter NumberToFinish UefiCpuPkg/PiSmmCpuDxeSmm: Add volatile for mNumberToFinish UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c | 4 ++-- UefiCpuPkg/PiSmmCpuDxeSmm/Ia32/SmmFuncsArch.c | 4 ++-- UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.h | 2 +- UefiCpuPkg/PiSmmCpuDxeSmm/X64/SmmFuncsArch.c | 2 +- 4 files changed, 6 insertions(+), 6 deletions(-) if you want to keep GCC5.4 from optimizing away the access, the synchronization object itself, and all pointers to it must remain volatile. Wherever you cast away the volatile qualifier, for example in a function call, GCC can break code on the next level, even if you don't actually access the object through that pointer (i.e., if you cast the pointer back to volatile just in time for the access). So, the safe way to go about this is to change function prototypes from callee towards callers -- first change the callee (because both volatile and non-volatile can be accepted as volatile), then change the caller (make sure what you pass in is volatile, and propagate it outwards). It is also okay to convert the original volatile pointer to UINTN, and to pass it to assembly code like that, or to convert it back to a volatile pointer from UINTN before use. >>From the C99 standard: 6.3 Conversions 6.3.2.3 Pointers 2 For any qualifier q, a pointer to a non-q-qualified type may be converted to a pointer to the q-qualified version of the type; the values stored in the original and converted pointers shall compare equal. 5 An integer may be converted to any pointer type. Except as previously specified, the result is implementation-defined, might not be correctly aligned, might not point to an entity of the referenced type, and might be a trap representation. 6 Any pointer type may be converted to an integer type. Except as previously specified, the result is implementation-defined. If the result cannot be represented in the integer type, the behavior is undefined. The result need not be in the range of values of any integer type. 6.7.3 Type qualifiers, paragraph 5: If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined. If an attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type, the behavior is undefined. In summary: - casting away "volatile" even just temporarily (without actual accesses) may give gcc license to break the code (6.3.2.3 p2) - accessing without volatile is known to break the code (6.7.3 p5) - you can always cast from non-volatile to volatile (6.3.2.3 p2), but not the other way around! - you can cast from (volatile VOID *) to UINTN and back as much as you want (6.3.2.3 p5 p6), because our execution environment makes that safe ("implementation-defined") We might want to play loose with 6.3.2.3 p2 -- that is, cast away volatile from the pointer only temporarily, and cast it back just before accessing the object through the pointer --, but that could be unsafe in the long term. The *really* safe method is to cast it to UINTN, and then back the same way. Yes, this would affect functions like SwitchStack() too -- the Context1 and Context2 parameters would have to change their types to UINTN. I think what we should discuss at this point is whether we'd like to care about 6.3.2.3 p2; that is, whether we consider casting away volatile temporarily. The direction I've been experiencing with GCC is that its optimization methods are becoming more aggressive. For some optimizations, there are flags that disable them; I'm not sure they provide a specific flag for preventing GCC from exploiting violations of 6.3.2.3 p2. Thanks Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel