From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received-SPF: Pass (sender SPF authorized) identity=mailfrom;
	client-ip=66.187.233.73; helo=mx1.redhat.com;
	envelope-from=lersek@redhat.com; receiver=edk2-devel@lists.01.org
Received: from mx1.redhat.com (mx3-rdu2.redhat.com [66.187.233.73])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ml01.01.org (Postfix) with ESMTPS id 32580210C1EED
	for ; Wed, 25 Jul 2018 08:35:29 -0700 (PDT)
Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id 436E48550F;
	Wed, 25 Jul 2018 15:35:28 +0000 (UTC)
Received: from lacos-laptop-7.usersys.redhat.com (ovpn-122-38.rdu2.redhat.com [10.10.122.38])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 906431684A;
	Wed, 25 Jul 2018 15:35:27 +0000 (UTC)
To: "Dong, Eric" , "edk2-devel@lists.01.org"
Cc: "Ni, Ruiyu"
References: <20180629032047.6340-1-eric.dong@intel.com>
	<2eac3f3f-972f-9844-6567-5503a0403a85@redhat.com>
	<3ec340cf-3bf1-ad22-3b7b-aa1b2c1fcaa8@redhat.com>
	<055fb2f1-cd73-e5a9-11b2-407f31e81305@redhat.com>
From: Laszlo Ersek
Message-ID: <9082dc2e-a239-18c5-14a7-e2bb18c0985b@redhat.com>
Date: Wed, 25 Jul 2018 17:35:26 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To:
X-Scanned-By: MIMEDefang 2.79 on 10.11.54.5
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.11.55.2]); Wed, 25 Jul 2018 15:35:28 +0000 (UTC)
X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.2]);
	Wed, 25 Jul 2018 15:35:28 +0000 (UTC) for IP:'10.11.54.5'
	DOMAIN:'int-mx05.intmail.prod.int.rdu2.redhat.com'
	HELO:'smtp.corp.redhat.com' FROM:'lersek@redhat.com' RCPT:''
Subject: Re: [Patch V2] UefiCpuPkg/MpInitLib: Remove redundant parameter.
X-BeenThere: edk2-devel@lists.01.org
X-Mailman-Version: 2.1.27
Precedence: list
List-Id: EDK II Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 25 Jul 2018 15:35:29 -0000
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit

On 07/25/18 13:35, Dong, Eric wrote:

>> From: Laszlo Ersek [mailto:lersek@redhat.com]
>> Sent: Wednesday, July 25, 2018 6:14 PM
>> To: Dong, Eric ; edk2-devel@lists.01.org
>> Cc: Ni, Ruiyu

>> On 07/25/18 05:50, Dong, Eric wrote:

>>> But some AP may wake up later and it failed to return from the
>>> procedure.
>>
>> Ah! So that explains another symptom I've since seen as well --
>> although *very* rarely. Namely, if an AP wakes up *after*
>> PiSmmCpuDxeSmm moves on, thinking that all APs are finished, the AP
>> can execute garbage in "no man's land" -- and that crashes the guest.
>> Basically, QEMU/KVM pause the guest with "emulation failure", and
>> QEMU dumps the VCPU register state to the standard error on the host
>> side. In particular, the register state indicates that the crashed
>> VCPU is *not* in SMM.
>
> Where to get these QEMU dumps info?

QEMU simply writes the register dump to stderr. If you use QEMU together
with libvirt, then QEMU's stderr is saved in:

  /var/log/libvirt/qemu/$DOMAIN.log

Here's an example:

KVM internal error.
Suberror: 1
emulation failure
RAX=0000000000000000 RBX=0000000000000005 RCX=000000007eab9828 RDX=0000000000000002
RSI=000000000009f000 RDI=000000007eef2ae0 RBP=000000007eef2e40 RSP=000000007eef2a08
R8 =0000000000000002 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
RIP=000000007ea9cdb2 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
GS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT=     000000007ebed998 00000047
IDT=     000000007e237280 00000fff
CR0=c0010033 CR2=0000000000000000 CR3=000000007ec01000 CR4=00000668
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000500
Code=89 d0 5f c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc f3 90 cc cc cc cc cc cc cc cc cc cc cc cc cc 53 89 c8 89 d1 50 0f a2 4c 8b 54 24 38 4d 85 d2

(Anyway, this is just for general discussion now -- your v3 series works
perfectly for me.)

> I'm not clear who calls StartAllAPs function when I send this mail, I
> just based on the debug message found StartAllAps trigged right after
> PiSmmCpuDxeSmm driver. but I think this not blocks my patches for this
> issue. So I sent the patches and back to dig it more.

Sure, I think your analysis was spot-on, regardless of how you produced
it. The register dumps / emulation failures that I mentioned in my
previous email in this thread are not a new issue with your v3 posting.
The v3 series is fine.
Instead, these failures were a *further* (but less frequent) symptom
with the *original* series, and I encountered them separately from the
boot hangs that I reported at first.

> Now I found it's PiSmmIpl driver calls gDS->SetMemorySpaceAttributes
> in SmmIplDxeDispatchEventNotify function which finally calls the
> StartAllAps.
> The Ap procedure is SetMtrrsFromBuffer function in CpuDxe driver. In
> SetMtrrsFromBuffer function, it just call MtrrSetAllMtrrs to update
> the MTRR, and it use local variable to save the MTRR settings.
>
> When the hang issue raised, in the time the AP begin to run the
> procedure, the BSP has leave current function. So the saved MTRR
> setting in the local variable is not valid anymore. So the
> MtrrSetAllMtrrs function use an invalid Mtrr Setttings to set mtrrs
> and never exit the procedure. Do you think it's reasonable?

Absolutely. Basically, if an AP is allowed to continue running the user
procedure after the BSP in MpInitLib thinks that the AP is finished,
anything at all can happen -- the AP escapes control, it may access
memory that has been freed, and perhaps it might even execute code that
has been freed and overwritten. The actual symptoms can vary wildly.

> I found till the ExitBootService point, the AP still in busy state. I
> check the ApWakeupFunction function, found between the AP procedure
> and set AP state to Idle, no other possible code exist. So I think the
> AP should not return back from procedure. I try to add debug message
> to know where the AP is stopped at. But after I add debug message in
> MtrrSetAllMtrrs, the issue can't reproduce anymore. I tried about 30
> times. Do you have other message to get to know where the AP is?

No, I don't have any other ideas at hand to deduce where the AP was
stuck -- as far as I'm concerned, it had just wandered off into the
weeds :) I think your idea to check the AP state at the time of the
ExitBootServices handler hang was great.
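
For general discussion only, here is a minimal sketch of the lifetime
hazard described above: a procedure that runs late and consumes a buffer
handed to it by a caller that has already returned. It is a
single-threaded model in plain C, not EDK2 code and not what the v3
series does. All names (DEMO_SETTINGS, DemoQueueFromCaller,
DemoLateApWakeup, and so on) are hypothetical stand-ins; the only point
is that a buffer with static storage stays valid after the caller's
stack frame is gone, whereas a pointer to the caller's local variable
would not.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the MTRR settings buffer; the real
   structure in EDK2 is larger and differently laid out. */
typedef struct {
  uint64_t Value[4];
} DEMO_SETTINGS;

/* A job queued for an "AP" that may only start after the caller returned. */
typedef struct {
  void (*Procedure)(void *Buffer);
  void *Buffer;
} DEMO_JOB;

static DEMO_JOB      mPendingJob;     /* the single queued job */
static DEMO_SETTINGS mStableSettings; /* static storage: outlives any caller frame */
static uint64_t      mLastApplied;    /* what the late "AP" actually observed */

/* Stand-in for the AP procedure: consumes the buffer it was handed. */
static void DemoApplySettings(void *Buffer)
{
  const DEMO_SETTINGS *Settings = Buffer;

  mLastApplied = Settings->Value[0];
}

/* Models a caller that queues the job and returns before the AP runs.
   The settings are copied into mStableSettings, so the queued pointer
   remains valid after this function's stack frame is released. */
static void DemoQueueFromCaller(uint64_t Seed)
{
  DEMO_SETTINGS Local;

  memset(&Local, 0, sizeof Local);
  Local.Value[0] = Seed;

  mStableSettings       = Local;            /* copy out of the stack frame */
  mPendingJob.Procedure = DemoApplySettings;
  mPendingJob.Buffer    = &mStableSettings; /* deliberately NOT &Local */
}

/* Models the AP waking up late and running whatever was queued. */
static void DemoLateApWakeup(void)
{
  if (mPendingJob.Procedure != NULL) {
    mPendingJob.Procedure(mPendingJob.Buffer);
    mPendingJob.Procedure = NULL;
  }
}

/* Queues a job, lets the caller return, then runs the late wakeup.
   Returns the value the late "AP" observed. */
uint64_t DemoRun(uint64_t Seed)
{
  DemoQueueFromCaller(Seed);
  DemoLateApWakeup();
  return mLastApplied;
}
```

Calling DemoRun() with some seed returns that same seed, because the
late procedure reads the static copy. Had the job been queued with a
pointer to Local instead, the late wakeup would read a dead stack frame,
which is the kind of "anything can happen" situation discussed above.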

So, again, in this thread I didn't mean to report a new issue, I just
meant to report another -- less frequent -- symptom of the same "ABA"
issue. Everything is fine with your v3 posting as far as I can tell
(just please update the commit messages before you push).

Thanks!
Laszlo