From mboxrd@z Thu Jan 1 00:00:00 1970
To: "Dong, Eric", "edk2-devel@lists.01.org"
Cc: "Ni, Ruiyu"
References: <20180629032047.6340-1-eric.dong@intel.com> <2eac3f3f-972f-9844-6567-5503a0403a85@redhat.com> <3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com> <055fb2f1-cd73-e5a9-11b2-407f31e81305@redhat.com>
From: Laszlo Ersek
Date: Wed, 25 Jul 2018 12:13:35 +0200
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
List-Id: EDK II Development

On 07/25/18 05:50, Dong, Eric wrote:
> Hi Laszlo,
>
> I have root-caused this issue. It is triggered by an AP hanging in
> the procedure while the PiSmmCpuDxeSmm driver starts up.
>
> When the PiSmmCpuDxeSmm driver starts up, it calls StartAllAps to set
> memory attributes. In StartAllAps, after calling WakeUpAp to start
> the APs, it calls CheckAllAps to wait until all APs have finished the
> task. CheckAllAps inspects the AP state to tell whether the AP has
> finished its task. The old code checked whether the AP state was
> CpuStateFinished. That state is set only by the AP, and only when it
> has truly finished the task. In the new logic, CpuStateFinished has
> been replaced with CpuStateIdle, and CpuStateIdle is also the initial
> state of the AP. The AP changes its state from CpuStateIdle to
> CpuStateBusy when it starts executing the procedure, and changes it
> back to CpuStateIdle after it has finished the procedure.
>
> So when the hang occurs, the AP has not yet changed its state to
> CpuStateBusy at the time the BSP calls CheckAllAps to check whether
> the AP has finished its task. The AP is still in CpuStateIdle, so the
> BSP concludes that the AP has finished its task. In this case, the
> BSP thinks all APs have finished their tasks, and it continues to
> boot.

Awesome analysis!

So, this looks like an "inverse" variant of the classic "ABA problem":

https://en.wikipedia.org/wiki/ABA_problem

> But some AP may wake up later and fail to return from the
> procedure.

Ah! So that explains another symptom I've since seen as well --
although *very* rarely.
Namely, if an AP wakes up *after* PiSmmCpuDxeSmm moves on, thinking that
all APs are finished, the AP can execute garbage in "no man's land" --
and that crashes the guest. Basically, QEMU/KVM pause the guest with
"emulation failure", and QEMU dumps the VCPU register state to the
standard error on the host side. In particular, the register state
indicates that the crashed VCPU is *not* in SMM.

When I first encountered this symptom, while playing some more with
your patches, it reminded me of earlier problems with MpInitLib. And now
your analysis makes perfect sense of this additional symptom!

> In this case, the AP state stays at CpuStateBusy. So later, in the
> ChangeApLoopCallback function, because this AP is still in
> CpuStateBusy, it will not trigger the procedure. But the BSP waits
> for all APs to trigger the procedure (it waits for the APs to
> decrement mNumberToFinish in the procedure) before it continues the
> boot, so the hang occurs.

This completes the explanation.

> I think we should keep an intermediate state that tells us whether
> the AP has truly finished its task. I will send another patch series
> for this issue. Please help check the new patches.

Yes, I'll test them too.

Thanks!
Laszlo
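
The race analyzed in this thread can be illustrated with a small
deterministic model. This is only a sketch under stated assumptions, not
the MpInitLib code: the state names come from the discussion above,
while the Ap class, the step() scheduling model, and the run() helper
are hypothetical and exist only for illustration. It compares using
CpuStateIdle as the completion marker (the new logic) against using a
distinct CpuStateFinished (the old logic) when the BSP polls before the
AP has taken its first step.

```python
from enum import Enum


class CpuState(Enum):
    IDLE = "CpuStateIdle"
    BUSY = "CpuStateBusy"
    FINISHED = "CpuStateFinished"


class Ap:
    """Hypothetical model of one AP; not the MpInitLib data structure."""

    def __init__(self, done_state):
        self.state = CpuState.IDLE      # initial state, as described above
        self.done_state = done_state    # state the AP sets when it is done
        self.procedure_ran = False

    def step(self):
        """Advance the woken AP by one scheduling step."""
        if self.state is CpuState.IDLE and not self.procedure_ran:
            self.state = CpuState.BUSY      # AP begins the procedure
        elif self.state is CpuState.BUSY:
            self.procedure_ran = True       # procedure body completes
            self.state = self.done_state    # AP reports completion


def run(done_state, ap_steps_before_check):
    """Let the AP take N steps, then poll once as the BSP would.

    Returns (bsp_thinks_done, procedure_actually_ran).
    """
    ap = Ap(done_state)
    for _ in range(ap_steps_before_check):
        ap.step()
    bsp_thinks_done = ap.state is done_state
    return bsp_thinks_done, ap.procedure_ran


if __name__ == "__main__":
    for done_state in (CpuState.IDLE, CpuState.FINISHED):
        for steps in (0, 1, 2):
            print(done_state.value, steps, run(done_state, steps))
```

In this model, run(CpuState.IDLE, 0) reports completion although the
procedure never ran, which corresponds to the BSP continuing the boot
too early. With CpuState.FINISHED as the marker, the same early poll
reports "not done", so the BSP would keep waiting.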