From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 3EB0781D70 for ; Thu, 10 Nov 2016 02:41:50 -0800 (PST) Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga105.fm.intel.com with ESMTP; 10 Nov 2016 02:41:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.31,618,1473145200"; d="scan'208,217";a="1057795432" Received: from fmsmsx107.amr.corp.intel.com ([10.18.124.205]) by orsmga001.jf.intel.com with ESMTP; 10 Nov 2016 02:41:52 -0800 Received: from fmsmsx112.amr.corp.intel.com (10.18.116.6) by fmsmsx107.amr.corp.intel.com (10.18.124.205) with Microsoft SMTP Server (TLS) id 14.3.248.2; Thu, 10 Nov 2016 02:41:52 -0800 Received: from shsmsx152.ccr.corp.intel.com (10.239.6.52) by FMSMSX112.amr.corp.intel.com (10.18.116.6) with Microsoft SMTP Server (TLS) id 14.3.248.2; Thu, 10 Nov 2016 02:41:50 -0800 Received: from shsmsx102.ccr.corp.intel.com ([169.254.2.239]) by SHSMSX152.ccr.corp.intel.com ([169.254.6.138]) with mapi id 14.03.0248.002; Thu, 10 Nov 2016 18:41:48 +0800 From: "Yao, Jiewen" To: Laszlo Ersek CC: "Tian, Feng" , "edk2-devel@ml01.01.org" , "Kinney, Michael D" , Paolo Bonzini , "Fan, Jeff" , "Zeng, Star" Thread-Topic: [edk2] [PATCH V2 0/6] Enable SMM page level protection. Thread-Index: AQHSNn5Npc+l51c2NUCmigdj/tAIBqDNyfkAgAJjohCAAHPkAIABR3PQ Date: Thu, 10 Nov 2016 10:41:48 +0000 Message-ID: <74D8A39837DF1E4DA445A8C0B3885C50386CE375@shsmsx102.ccr.corp.intel.com> References: <1478251854-14660-1-git-send-email-jiewen.yao@intel.com> <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com> <74D8A39837DF1E4DA445A8C0B3885C50386C0CB8@shsmsx102.ccr.corp.intel.com> In-Reply-To: Accept-Language: zh-CN, en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.239.127.40] MIME-Version: 1.0 X-Content-Filtered-By: Mailman/MimeDel 2.1.21 Subject: Re: [PATCH V2 0/6] Enable SMM page level protection. X-BeenThere: edk2-devel@lists.01.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: EDK II Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 Nov 2016 10:41:50 -0000 Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Thanks to report case 3 issue on bugzillar. Let's focus on Case 8. It seems another random failure issue. I did more test. 1) I tested some other our internal real platform for UEFI32 OS boot. = I cannot reproduce the ASSERT. 2) I wrote a small test app to call ExitBootServices and send SMI. I r= un it on current my windows QEMU but I still cannot reproduce the ASSERT. It seem your env is the only way to repo the issue. I am trying to follow y= our step by step to install OS on QEMU/KVM. I haven't finish all thing yet,= because of some network proxy issue. :( Your information and analysis is great. It gives us some clue. I guess the same thing as you and checked: InitializeSpinLock (mSmmMpSyncDa= ta->CpuData[CpuIndex].Busy); This address is initialized in InitializeMpSyncData(), with gSmmCpuPrivate-= >SmmCoreEntryContext.NumberOfCpus which is got from MpServices->GetNumberOf= Processors(). I do not know why this address is zero. I also did not quite understand below log. * CPU #0: pc=3D0x00000000c1844555 (halted) thread_id=3D7835 CPU #1: pc=3D0x00000000c1844555 (halted) thread_id=3D7836 CPU #2: pc=3D0x00000000c1844555 (halted) thread_id=3D7837 CPU #3: pc=3D0x000000007ffd17ca thread_id=3D7838 As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not ente= r SMM mode. So why #3 can enter SMM mode? Is that expected behavior? Or unexpected beha= vior? If this is expected, how this happened? Does OS send SendSmiIpiAllExcluding= Self, after ExitBootServices()? I will see if I can finish QEMU/KVM installation tomorrow. If you have some idea on why and how #3 enter SMM, please let us know. Thank you Yao Jiewen From: Laszlo Ersek [mailto:lersek@redhat.com] Sent: Thursday, November 10, 2016 4:46 AM To: Yao, Jiewen Cc: Tian, Feng ; edk2-devel@ml01.01.org; Kinney, Micha= el D ; Paolo Bonzini ; Fan= , Jeff ; Zeng, Star Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. On 11/09/16 07:25, Yao, Jiewen wrote: > Hi Laszlo > I will fix DEBUG message issue in V3 patch. > > Below is rest issues: > > > l Case 13: S3 fails randomly. > A good news: I worked with Jeff Fan to root-cause the S3 resume issue. He= re is detail. > > > 1) We believe the dead CPU is AP. Not BSP. > The reason is that: > > 1.1) The BSP already transfer control to OS waking vector. The GDT/IDT/= CR3 should be set by OS. > > 1.2) The current dead CPU still has GDT/IDT point to a BIOS reserved me= mory. The CS/DS/SS is typical BIOS X64 mode setting. > > 1.3) The current dead CPU still has CR3 in SMM. (Which is obvious wrong= ) > > > 2) Based upon the 1), we reviewed S3 resume AP flow. > Current BSP will wake up AP in SMRAM, for security consideration. At that= time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SM= M mode yet. Even after SMM rebase, we can still use it because SMRR is not = set in first SMM rebase. > Current BSP just uses its own context to initialize AP. So that AP takes = BSP CR3, which is SMM CR3, unfortunately. > After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is = the last straw, because X64 mode halt still need paging. > > > 3) The error happen, once the AP receives an interrupt (for whatever= reason), AP starts executing code. However, that that time the AP might no= t be in SMM mode. It means SMM CR3 is not available. And then we see this. > > > 4) I guess we did not see the error, or this is RANDOM issue, becaus= e it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI. > > > 5) The fix, I think, should be below: > We should always put AP to protected mode, so that no paging is needed. > We should put AP in above 1M reserved memory, instead of <1M memory, beca= use <1M memory is restored. > > > Would you please file a bugzillar? I think we need assign CPU owner to fi= x that critical issue. > > There is no need to do more investigation. Thanks for your great help on = that. :) Thank you for your help! I filed . The title i= s BSP exits SMM and closes SMRAM on the S3 resume path before meeting with AP(s) I hope the title is mostly right. I didn't add any other details (I haven't gone through the thread in detail yet, and without that I can't even write up a semi-reasonable report myself). Instead, I referenced this message of yours in the report, and I also linked Paolo's analysis from elsewhere in the thread. I hope this will do for the report. (Also, thank you Paolo, from the amazing analysis -- I haven't digested it yet, but I can already tell it's amazing! :)) > l Case 17 - I do not think it is a real issue, because SMM is out of res= ource. > > > l Case 8 - that is a very weird issue. I talk with Jeff again. I do not = have a clear clue yet. >> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): S= pinLock !=3D ((void *) 0) > Here is code. We do not know why there is some code need InitializeSpinLo= ck after ExitBootServices. > SPIN_LOCK * > EFIAPI > InitializeSpinLock ( > OUT SPIN_LOCK *SpinLock > ) > { > ASSERT (SpinLock !=3D NULL); > > _ReadWriteBarrier(); > *SpinLock =3D SPIN_LOCK_RELEASED; > _ReadWriteBarrier(); > > return SpinLock; > } > > If you can have a quick check on below, that would be great. > > 1) Which processor triggers this ASSERT? BSP or AP. > > 2) Which module triggers this ASSERT? Which module contains current = RIP value? First, one additional piece of info I have learned is that the issue does not always present itself. Sometimes the boot just works fine, other times the assert fires. Using the QEMU monitor, I managed to get the following information with the "info cpus" command: * CPU #0: pc=3D0x00000000c1844555 (halted) thread_id=3D7835 CPU #1: pc=3D0x00000000c1844555 (halted) thread_id=3D7836 CPU #2: pc=3D0x00000000c1844555 (halted) thread_id=3D7837 CPU #3: pc=3D0x000000007ffd17ca thread_id=3D7838 VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to point into SMRAM again. In the OVMF log, I see Loading SMM driver at 0x0007FFC9000 EntryPoint=3D0x0007FFC9253 PiSmmCpuDxeSmm.efi So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the entry point, 0x8577, 0x253 bytes less). Running objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug first I see confirmation that start address 0x00000253 and then 000087bd : VOID EFIAPI CpuDeadLoop ( VOID ) { 87bd: 55 push %ebp 87be: 89 e5 mov %esp,%ebp 87c0: 83 ec 10 sub $0x10,%esp volatile UINTN Index; for (Index =3D 0; Index =3D=3D 0;); 87c3: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp) 87ca: 8b 45 fc mov -0x4(%ebp),%eax <-- HERE 87cd: 85 c0 test %eax,%eax 87cf: 74 f9 je 87ca } 87d1: c9 leave 87d2: c3 ret This seems consistent with an assertion failure. I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks like a possible caller: // // The BUSY lock is initialized to Released state. This needs to // be done early enough to be ready for BSP's SmmStartupThisAp() // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be // called immediately after AP's present flag is detected. // InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy); Just a guess, of course. > At same time, all my OS test is on real platform. I have not setup OVMF e= nv to run an OS yet. > If you can share a step by step to me, that would be great. (1) Grab a host computer with a CPU that supports VMX and EPT. (2) Download and install Fedora 24 (for example): https://getfedora.org/en/workstation/download/ http://docs.fedoraproject.org/install-guide (3) Install the "qemu-system-x86" package with DNF dnf install qemu-system-x86 (4) clone edk2 with git (5) embed OpenSSL optionally (for secure boot); see "CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt" (6) build OVMF: source edksetup.sh make -C "$EDK_TOOLS_PATH" # Ia32 build \ -a IA32 \ -p OvmfPkg/OvmfPkgIa32.dsc \ -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ -t GCC5 -b DEBUG # Ia32X64 build \ -a IA32 -a X64 \ -p OvmfPkg/OvmfPkgIa32X64.dsc \ -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \ -t GCC5 -b DEBUG (7) Create disk images: qemu-img create -f qcow2 -o compat=3D1.1 -o cluster_size=3D65536 \ -o preallocation=3Dmetadata -o lazy_refcounts=3Don disk-ia32.img 100G qemu-img create -f qcow2 -o compat=3D1.1 -o cluster_size=3D65536 \ -o preallocation=3Dmetadata -o lazy_refcounts=3Don disk-ia32x64.img 100G (8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation that you downloaded already (the ISO image). For 32-bit guest OS, this one used to work: https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/ minimally the 20141209 release. Hm... actually, I think the maintainer of that image has discontinued the downloadable files :( So, I don't know what 32-bit UEFI OS to recommend for testing. 32-bit Windows doesn't boot on OVMF (I looked into that earlier, several times, with some help from a Microsoft developer, but we couldn't solve it), so I can't recommend Windows as an alternative. Perhaps you can use https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trai= l-and.html as a 32-bit guest OS, I never tried. (9) Anyway, once you have an installer ISO, set the "ISO" environment variable to the ISO image's full pathname, and then run QEMU like this: # Settings for Ia32 only: ISO=3D... DISK=3D.../disk-ia32.img FW=3D.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd TEMPLATE=3D.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd VARS=3Dvars-32.fd QEMU_COMMAND=3D"qemu-system-i386 -cpu coreduo,-nx" DEBUG=3Ddebug-32.log # Settings for Ia32X64 only: ISO=3D... DISK=3D.../disk-ia32x64.img FW=3D.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd TEMPLATE=3D.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd VARS=3Dvars-3264.fd QEMU_COMMAND=3Dqemu-system-x86_64 DEBUG=3Ddebug-3264.log # Common commands for both target arches: # create variable store from varstore template # if the former doesn't exist yet if ! [ -e "$VARS" ]; then cp -- "$TEMPLATE" "$VARS" fi $QEMU_COMMAND \ -machine q35,smm=3Don,accel=3Dkvm \ -m 4096 \ -smp sockets=3D1,cores=3D2,threads=3D2 \ -global driver=3Dcfi.pflash01,property=3Dsecure,value=3Don \ -drive if=3Dpflash,format=3Draw,unit=3D0,file=3D${FW},readonly=3Don \ -drive if=3Dpflash,format=3Draw,unit=3D1,file=3D${VARS} \ \ -chardev file,id=3Ddebugfile,path=3D$DEBUG \ -device isa-debugcon,iobase=3D0x402,chardev=3Ddebugfile \ \ -chardev stdio,id=3Dchar0,signal=3Doff,mux=3Don \ -mon chardev=3Dchar0,mode=3Dreadline,default \ -serial chardev:char0 \ \ -drive id=3Diso,if=3Dnone,format=3Draw,readonly,file=3D$ISO \ -drive id=3Ddisk,if=3Dnone,format=3Dqcow2,file=3D$DISK \ \ -device virtio-scsi-pci,id=3Dscsi0 \ -device scsi-cd,drive=3Diso,bus=3Dscsi0.0,bootindex=3D2 \ -device scsi-hd,drive=3Ddisk,bus=3Dscsi0.0,bootindex=3D1 \ \ -device VGA This will capture the OVMF debug output in the $DEBUG file. Also, the terminal where you run the command can be switched between the guest's serial console and the QEMU monitor with [Ctrl-A C]. Thanks Laszlo > > Thank you > Yao Jiewen > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of La= szlo Ersek > Sent: Tuesday, November 8, 2016 9:22 AM > To: Yao, Jiewen > > Cc: Tian, Feng >; edk2-de= vel@ml01.01.org; Kinney, Michael D >; Paolo Bonzini >; Fan, Jeff >; Zeng, Star > > Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection. > > On 11/04/16 10:30, Jiewen Yao wrote: >> =3D=3D=3D=3D below is V2 description =3D=3D=3D=3D >> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue. >> 2) PiSmmCpu: Add debug info on StartupAp() fails. >> 3) PiSmmCpu: Add ASSERT for AllocatePages(). >> 4) PiSmmCpu: Add protection detail in commit message. >> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message. >> >> =3D=3D=3D=3D below is V1 description =3D=3D=3D=3D >> This series patch enables SMM page level protection. >> Features are: >> 1) PiSmmCore reports SMM PE image code/data information >> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned. >> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable >> and set XD for data page and RO for code page. >> 3) PiSmmCpu enables Static Paging for X64 according to >> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G >> is used as long as it is supported. >> 4) PiSmmCpu sets importance data structure to be read only, >> such as Gdt, Idt, SmmEntrypoint, and PageTable itself. >> >> tested platform: >> 1) Intel internal platform (X64). >> 2) EDKII Quark IA32 >> 3) EDKII Vlv2 X64 >> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8) >> >> Cc: Jeff Fan >> >> Cc: Feng Tian >> >> Cc: Star Zeng >> >> Cc: Michael D Kinney >> >> Cc: Laszlo Ersek >> >> Contributed-under: TianoCore Contribution Agreement 1.0 >> Signed-off-by: Jiewen Yao >> > > I have new test results. Let's start with the table again: > > Legend: > > - "untested" means the test was not executed because the same test > failed or proved unreliable in a less demanding configuration already, > > - "n/a" means a setting or test case was impossible, > > - "fail" and "unreliable" (lower case) are outside the scope of this > series; they either capture the pre-series status, or are expected > even with the series applied due to the pre-series status, > > - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the > series. > > In all cases, 36 bits were used as address width in the CPU HOB (--> up > to 64GB guest-phys address space). > > series OVMF = VCPU boot S3 resume > # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTa= ble topology result result > -- ------- -------- ------------------------------- ---------------------= --- -------- ------ --------- > 1 no Ia32 64 n/a = 1x2x2 pass unreliable > 2 no Ia32 255 n/a = 52x2x2 pass untested > 3 no Ia32 255 n/a = 53x2x2 unreliable untested > 4 no Ia32X64 64 n/a = 1x2x2 pass unreliable > 5 no Ia32X64 255 n/a = 52x2x2 pass untested > 6 no Ia32X64 255 n/a = 54x2x2 fail n/a > 7 v2 Ia32 64 FALSE = 1x2x2 pass untested > 8 v2 Ia32 64 TRUE = 1x2x2 FAIL untested > 9 v2 Ia32 255 FALSE = 52x2x2 pass untested > 10 v2 Ia32 255 FALSE = 53x2x2 untested untested > 11 v2 Ia32 255 TRUE = 52x2x2 untested untested > 12 v2 Ia32 255 TRUE = 53x2x2 untested untested > 13 v2 Ia32X64 64 FALSE = 1x2x2 pass unreliable > 14 v2 Ia32X64 64 TRUE = 1x2x2 pass untested > 15 v2 Ia32X64 255 FALSE = 52x2x2 pass untested > 16 v2 Ia32X64 255 FALSE = 54x2x2 untested untested > 17 v2 Ia32X64 255 TRUE = 52x2x2 FAIL untested > 18 v2 Ia32X64 255 TRUE = 54x2x2 untested untested > > * Case 8: this test case failed with v2 as well, but this time with > different symptoms: > >> FSOpen: Open '\EFI\fedora\grubia32.efi' Success >> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8 >> Loading driver at 0x0007DA7F000 EntryPoint=3D0x0007DA7F400 >> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90 >> PixelBlueGreenRedReserved8BitPerColor >> ConvertPages: Incompatible memory types >> PixelBlueGreenRedReserved8BitPerColor >> ConvertPages: Incompatible memory types >> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0 >> MpInitExitBootServicesCallback() done! >> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): S= pinLock !=3D ((void *) 0) > > I didn't try to narrow this down. > > * Case 13 (the "unreliable S3 resume" case): Here the news are both bad > and good. The good news is for Jiewen: this patch series does not > cause the unreliability, it "only" amplifies it severely. The bad news > is correspondingly for everyone else: S3 resume is actually unreliable > even in case 4, that is, without this series applied, it's just the > failure rate is much-much lower. > > Namely, in my new testing, in case 13, S3 resume failed 8 times out of > 21 tries. (I stopped testing at the 8th failure.) > > Whereas in case 4, S3 resume failed with *identical symptoms* (e.g., > exact same RIP=3D000000000009f0fd), 1 time out of 55 tries. (It was try > #12 that failed; I continued testing and aborted the test after the > 55th try.) > > So, while the series hugely amplifies the failure rate, the failure > does exist without the series. Which is why I modified the case 4 > results in the table, and also lower-cased the word "unreliable" in > case 13. > > Below I will return to this problem separately; let's go over the rest > of the table first. > > * Case 17: I guess this is not a real failure, I'm just including it for > completeness, as PcdCpuSmmStaticPageTable=3D=3DTRUE is known to present > additional SMRAM demand (see the commit message on patch V2 4/6). This > case fails with > >> SmmLockBox Command - 4 >> SmmLockBoxSmmLib SetLockBoxAttributes - Enter >> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success) >> SmmLockBox SmmLockBoxHandler Exit >> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success) >> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0 >> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp !=3D ((void *) = 0) > > which is an SMRAM allocation failure. If I lower the VCPU count to > 50x2x2, then the guest boots fine. > > ----*---- > > Before I get to the S3 resume problem (which, again, reproduces without > this series, although much less frequently), I'd like to comment on the > removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll() > function, on the return value of SmmBlockingStartupThisAp(). This change > allows v2 to proceed past that point; however, I'm seeing a whole lot of > >> !mSmmMpSyncData->CpuData[1].Present >> !mSmmMpSyncData->CpuData[2].Present >> !mSmmMpSyncData->CpuData[3].Present >> ... > > messages in the OVMF boot log, interspersed with > >> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065 > > style messages. (That is, one error message for each AP, per > ConvertPageEntryAttribute() message.) > > Is this okay / intentional? The number of these messages can go up to > several thousands and that sort of drowns out everything else in the > log. > > It's also not easy to mask the message, because it's logged on the > DEBUG_ERROR level. > > ----*---- > > * Okay, so the S3 problem. Last time I suspected that the failure point > (RIP=3D9f0fd) was in the Linux guest's S3 wakeup vector (which starts a= t > 9A1D0, according to the OVMF log). In order to test this idea, I > exercised this series with S3 against a Windows 8.1 guest (--> case 13 > again). The failure reproduced on the second S3 resume, with identical > RIP, despite the Windows wakeup vector being located elsewhere (at > 0x1000). > > Quoting the OVMF log leading up to the resume: > >> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase >> Install PPI: [PeiPostScriptTablePpi] >> Install PPI: [EfiEndOfPeiSignalPpi] >> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 85789= 5 >> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked >> Transfer to 16bit OS waking vector - 1000 > > QEMU log (same as before): > >> KVM internal error. Suberror: 1 >> KVM internal error. Suberror: 1 >> emulation failure >> emulation failure >> RAX=3D0000000000000002 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX= =3D000000007fedd058 >> RSI=3D0000000000000004 RDI=3D000000007fedd040 RBP=3D0000000000000000 RSP= =3D000000007e1a7000 >> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11= =3D0000000000000000 >> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15= =3D0000000000000000 >> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 S= MM=3D0 HLT=3D0 >> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA] >> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT >> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy >> GDT=3D 000000007f294000 00000047 >> IDT=3D 000000007f294048 00000fff >> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D00000= 220 >> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3= =3D0000000000000000 >> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400 >> EFER=3D0000000000000500 >> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? = ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? = ?? ?? ?? ?? >> RAX=3D0000000000000001 RBX=3D0000000000000000 RCX=3D000000007ffdb168 RDX= =3D000000007fedd070 >> RSI=3D0000000000000004 RDI=3D000000007fedd058 RBP=3D0000000000000000 RSP= =3D000000007e19f000 >> R8 =3D0000000000000000 R9 =3D0000000000000000 R10=3D0000000000000000 R11= =3D0000000000000000 >> R12=3D0000000000000000 R13=3D0000000000000000 R14=3D0000000000000000 R15= =3D0000000000000000 >> RIP=3D000000000009f0fd RFL=3D00010006 [-----P-] CPL=3D0 II=3D0 A20=3D1 S= MM=3D0 HLT=3D0 >> ES =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> CS =3D0038 0000000000000000 ffffffff 00a09b00 DPL=3D0 CS64 [-RA] >> SS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> DS =3D0030 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> FS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> GS =3D0018 0000000000000000 ffffffff 00c09300 DPL=3D0 DS [-WA] >> LDT=3D0000 0000000000000000 0000ffff 00008200 DPL=3D0 LDT >> TR =3D0000 0000000000000000 0000ffff 00008b00 DPL=3D0 TSS64-busy >> GDT=3D 000000007f294000 00000047 >> IDT=3D 000000007f294048 00000fff >> CR0=3De0000011 CR2=3D0000000000000000 CR3=3D000000007ff7f000 CR4=3D00000= 220 >> DR0=3D0000000000000000 DR1=3D0000000000000000 DR2=3D0000000000000000 DR3= =3D0000000000000000 >> DR6=3D00000000ffff0ff0 DR7=3D0000000000000400 >> EFER=3D0000000000000500 >> Code=3D?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? = ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? = ?? ?? ?? ?? > > So, we can exclude the suspicion that the problem is guest OS > dependent. > > * Then I looked for the base address of the page containing the > RIP=3D9f0fd address, in earlier parts of the OVMF log, on the hunch tha= t > some firmware component might have allocated that area actually. Here > we go: > >> Loading PEIM at 0x000008552C0 EntryPoint=3D0x000008554E0 CpuMpPei.efi >> AP Loop Mode is 1 >> WakeupBufferStart =3D 9F000, WakeupBufferSize =3D 1000 > > That is, the failure hits (when it hits -- not always) in the area > where the CpuMpPei driver *borrows* memory for the startup vector of > the APs, for the purposes of the MP service PPI. ("Wakeup" is an > overloaded word here; the "wakeup buffer" has nothing to do with S3 > resume, it just serves for booting the APs temporarily in PEI, for > implementing the MP service PPI.) > > When exiting the PEI phase (on the S3 resume path), CpuMpPei restores > the original contents of this area. This occurs just before > transfering control to the guest OS wakeup vector: see the > "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above, > quoted from the OVMF log. > > I documented (parts of) this logic in OVMF commit > > https://github.com/tianocore/edk2/commit/e3e3090a959a0 > > (see the code comments as well). > > * At that time, I thought to have identified a memory management bug in > CpuMpPei; see the following discussion and bug report for details: > > https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html > https://bugzilla.tianocore.org/show_bug.cgi?id=3D67 > > However, with the extraction / introduction of MpInitLib, this issue > has been fixed: GetWakeupBuffer() now calls > CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is > no longer; we shouldn't be looking there for the root cause. > > * Either way, I don't understand why anything would want to execute code > in the one page that happens to host the MP services PPI startup > buffer for APs during PEI. > > Not understanding the "why", I looked at the "what", and resorted to > tracing KVM. Because the problem readily reproduces with this series > applied (case 13), it wasn't hard to start the tracing while the guest > was suspended, and capture just the actions that led from the > KVM-level wakeup to the failure. > > The QEMU state dumps are visible above in the email. I've also > uploaded the compressed OVMF log and the textual KVM trace here: > > http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9= ac2a53db/ > > I sincerely hope that Paolo will have a field day with the KVM trace > :) I managed to identify the following curiosities (remember this is > all on the S3 resume path): > > * First, the VCPUs (there are four of them) enter and leave SMM in a > really funky pattern: > > vcpu#0 vcpu#1 vcpu#2 vcpu#3 > ------ ------ ------ ------ > enter > | > leave > > enter > | > leave > > enter > | > leave > > enter > | > leave > > enter enter > enter | enter | > | | | | > leave | | | > | | | > enter | | | > | | | | > leave leave leave leave > > That is, first we have each VCPU enter and leave SMM in complete > isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon > followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM > temporarily (it comes back in later), while the other three remain > in SMM. Finally all four of them leave SMM together. > > After which the problem occurs. > > * Second, the instruction that causes things to blow up is <0f aa>, > i.e., RSM. I have absolutely no clue why RSM is executed: > > (a) in the area that used to host the AP startup routine for the MP > services PPI -- note that we also have "Transfer to 16bit OS waking > vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed > area completeley! --, > > (b) and why *after* all four VCPUs have just left SMM, together. > > * The RSM instruction is handled successfully elsewhere, for example > when all four VCPUs leave SMM, at the bottom of the diagram above: > >> CPU-24447 [002] 39841.982810: kvm_emulate_insn: 0:7ffbf179: 0f aa >> CPU-24446 [000] 39841.982810: kvm_emulate_insn: 0:7ffbd179: 0f aa >> CPU-24445 [005] 39841.982810: kvm_emulate_insn: 0:7ffbb179: 0f aa >> CPU-24444 [006] 39841.982811: kvm_emulate_insn: 0:7ffb9179: 0f aa > > * The guest-phys address 7ff7f000 that we see just before the error: > >> CPU-24447 [002] 39841.982825: kvm_page_fault: address 7ff7f000 err= or_code 83 >> CPU-24446 [000] 39841.982825: kvm_page_fault: address 7ff7f000 err= or_code 83 >> CPU-24447 [002] 39841.982826: kvm_emulate_insn: 0:9f0fd: 0f aa >> CPU-24444 [006] 39841.982827: kvm_exit: reason EXTERNAL_INTE= RRUPT rip 0xffffffff813a954f info 0 800000fc >> CPU-24447 [002] 39841.982827: kvm_emulate_insn: 0:9f0fd: 0f aa FAIL >> CPU-24447 [002] 39841.982827: kvm_userspace_exit: reason KVM_EXIT_INTE= RNAL_ERROR (17) > > can be found higher up in the trace; namely, it is written to CR3 > several times. It's the root of the page tables. > > * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM. > > * I also tried the "info tlb" monitor command, via "virsh > qemu-monitor-command --hmp", while the guest was auto-paused after the > crash. > > I cannot provide results: QEMU appeared to return a message that would > be longer than 16MB after encoding by libvirt, and libvirt rejected > that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX). > > Anyway, the KVM trace, and the QEMU register dump, look consistent > with what Paolo said about "Code=3D?? ?? ??...": > > The question marks usually mean that the page tables do not map a > page at that address. > > CR3=3D000000007ff7f000 points into SMRAM, but we are outside of SMM > (SMM=3D0). We can't translate *any* guest-virtual address, as we can't > even begin walking the page tables. > > Thanks > Laszlo > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org > https://lists.01.org/mailman/listinfo/edk2-devel >