Re: [PATCH V2 0/6] Enable SMM page level protection.

From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: Laszlo Ersek <lersek@redhat.com>
Cc: "Tian, Feng" <feng.tian@intel.com>,
	"edk2-devel@ml01.01.org" <edk2-devel@ml01.01.org>,
	"Kinney, Michael D" <michael.d.kinney@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Fan, Jeff" <jeff.fan@intel.com>,
	"Zeng, Star" <star.zeng@intel.com>
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
Date: Thu, 10 Nov 2016 10:41:48 +0000	[thread overview]
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C50386CE375@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <e230a3ea-dfa9-4537-149d-07ceb465bc23@redhat.com>

Thanks to report case 3 issue on bugzillar.

Let's focus on Case 8.
It seems another random failure issue.

I did more test.

1)      I tested some other our internal real platform for UEFI32 OS boot. I cannot reproduce the ASSERT.

2)      I wrote a small test app to call ExitBootServices and send SMI. I run it on current my windows QEMU but I still cannot reproduce the ASSERT.

It seem your env is the only way to repo the issue. I am trying to follow your step by step to install OS on QEMU/KVM. I haven't finish all thing yet, because of some network proxy issue. :(

Your information and analysis is great. It gives us some clue.

I guess the same thing as you and checked: InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy);

This address is initialized in InitializeMpSyncData(), with gSmmCpuPrivate->SmmCoreEntryContext.NumberOfCpus which is got from MpServices->GetNumberOfProcessors().
I do not know why this address is zero.

I also did not quite understand below log.

* CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
  CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
  CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
  CPU #3: pc=0x000000007ffd17ca thread_id=7838

As I recall, writing to B2 only cause BSP get SMI on OVMF. AP does not enter SMM mode.
So why #3 can enter SMM mode? Is that expected behavior? Or unexpected behavior?
If this is expected, how this happened? Does OS send SendSmiIpiAllExcludingSelf, after ExitBootServices()?

I will see if I can finish QEMU/KVM installation tomorrow.

If you have some idea on why and how #3 enter SMM, please let us know.

Thank you
Yao Jiewen

From: Laszlo Ersek [mailto:lersek@redhat.com]
Sent: Thursday, November 10, 2016 4:46 AM
To: Yao, Jiewen <jiewen.yao@intel.com>
Cc: Tian, Feng <feng.tian@intel.com>; edk2-devel@ml01.01.org; Kinney, Michael D <michael.d.kinney@intel.com>; Paolo Bonzini <pbonzini@redhat.com>; Fan, Jeff <jeff.fan@intel.com>; Zeng, Star <star.zeng@intel.com>
Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.

On 11/09/16 07:25, Yao, Jiewen wrote:
> Hi Laszlo
> I will fix DEBUG message issue in V3 patch.
>
> Below is rest issues:
>
>
> l  Case 13: S3 fails randomly.
> A good news: I worked with Jeff Fan to root-cause the S3 resume issue. Here is detail.
>
>
> 1)      We believe the dead CPU is AP. Not BSP.
> The reason is that:
>
> 1.1)   The BSP already transfer control to OS waking vector. The GDT/IDT/CR3 should be set by OS.
>
> 1.2)   The current dead CPU still has GDT/IDT point to a BIOS reserved memory. The CS/DS/SS is typical BIOS X64 mode setting.
>
> 1.3)   The current dead CPU still has CR3 in SMM. (Which is obvious wrong)
>
>
> 2)      Based upon the 1), we reviewed S3 resume AP flow.
> Current BSP will wake up AP in SMRAM, for security consideration. At that time, we are using SMM mode CR3. It is OK for BSP because BSP is NOT in SMM mode yet. Even after SMM rebase, we can still use it because SMRR is not set in first SMM rebase.
> Current BSP just uses its own context to initialize AP. So that AP takes BSP CR3, which is SMM CR3, unfortunately.
> After BSP initialized APs, the AP is put to HALT-LOOP in X64 mode. It is the last straw, because X64 mode halt still need paging.
>
>
> 3)      The error happen, once the AP receives an interrupt (for whatever reason), AP starts executing code. However, that that time the AP might not be in SMM mode. It means SMM CR3 is not available. And then we see this.
>
>
> 4)      I guess we did not see the error, or this is RANDOM issue, because it depends on if AP receives an interrupt before BSP send INIT-SIPI-SIPI.
>
>
> 5)      The fix, I think, should be below:
> We should always put AP to protected mode, so that no paging is needed.
> We should put AP in above 1M reserved memory, instead of <1M memory, because <1M memory is restored.
>
>
> Would you please file a bugzillar? I think we need assign CPU owner to fix that critical issue.
>
> There is no need to do more investigation. Thanks for your great help on that. :)

Thank you for your help!

I filed <https://bugzilla.tianocore.org/show_bug.cgi?id=216>. The title is

    BSP exits SMM and closes SMRAM on the S3 resume path before
    meeting with AP(s)

I hope the title is mostly right. I didn't add any other details (I
haven't gone through the thread in detail yet, and without that I can't
even write up a semi-reasonable report myself). Instead, I referenced
this message of yours in the report, and I also linked Paolo's analysis
from elsewhere in the thread. I hope this will do for the report.

(Also, thank you Paolo, from the amazing analysis -- I haven't digested
it yet, but I can already tell it's amazing! :))

> l  Case 17 - I do not think it is a real issue, because SMM is out of resource.
>
>
> l  Case 8 - that is a very weird issue. I talk with Jeff again. I do not have a clear clue yet.
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): SpinLock != ((void *) 0)
> Here is code. We do not know why there is some code need InitializeSpinLock after ExitBootServices.
> SPIN_LOCK *
> EFIAPI
> InitializeSpinLock (
>   OUT      SPIN_LOCK                 *SpinLock
>   )
> {
>   ASSERT (SpinLock != NULL);
>
>   _ReadWriteBarrier();
>   *SpinLock = SPIN_LOCK_RELEASED;
>   _ReadWriteBarrier();
>
>   return SpinLock;
> }
>
> If you can have a quick check on below, that would be great.
>
> 1)      Which processor triggers this ASSERT? BSP or AP.
>
> 2)      Which module triggers this ASSERT? Which module contains current RIP value?

First, one additional piece of info I have learned is that the issue
does not always present itself. Sometimes the boot just works fine,
other times the assert fires.

Using the QEMU monitor, I managed to get the following information with
the "info cpus" command:

* CPU #0: pc=0x00000000c1844555 (halted) thread_id=7835
  CPU #1: pc=0x00000000c1844555 (halted) thread_id=7836
  CPU #2: pc=0x00000000c1844555 (halted) thread_id=7837
  CPU #3: pc=0x000000007ffd17ca thread_id=7838

VCPU#3 is an AP (the last AP), I think. The instruction pointer seems to
point into SMRAM again.

In the OVMF log, I see

Loading SMM driver at 0x0007FFC9000 EntryPoint=0x0007FFC9253
PiSmmCpuDxeSmm.efi

So the offset into PiSmmCpuDxeSmm.efi is 0x87CA (or, relative to the
entry point, 0x8577, 0x253 bytes less).

Running

  objdump -x -S Build/OvmfIa32/DEBUG_GCC48/IA32/PiSmmCpuDxeSmm.debug

first I see confirmation that

  start address 0x00000253

and then

000087bd <CpuDeadLoop>:
VOID
EFIAPI
CpuDeadLoop (
  VOID
  )
{
    87bd:       55                      push   %ebp
    87be:       89 e5                   mov    %esp,%ebp
    87c0:       83 ec 10                sub    $0x10,%esp
  volatile UINTN  Index;

  for (Index = 0; Index == 0;);
    87c3:       c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%ebp)
    87ca:       8b 45 fc                mov    -0x4(%ebp),%eax  <-- HERE
    87cd:       85 c0                   test   %eax,%eax
    87cf:       74 f9                   je     87ca <CpuDeadLoop+0xd>
}
    87d1:       c9                      leave
    87d2:       c3                      ret

This seems consistent with an assertion failure.

I searched UefiCpuPkg/PiSmmCpuDxeSmm/ for InitializeSpinLock(), and the
SmiRendezvous() function [UefiCpuPkg/PiSmmCpuDxeSmm/MpService.c] looks
like a possible caller:

      //
      // The BUSY lock is initialized to Released state. This needs to
      // be done early enough to be ready for BSP's SmmStartupThisAp()
      // call. E.g., with Relaxed AP flow, SmmStartupThisAp() may be
      // called immediately after AP's present flag is detected.
      //
      InitializeSpinLock (mSmmMpSyncData->CpuData[CpuIndex].Busy);

Just a guess, of course.

> At same time, all my OS test is on real platform. I have not setup OVMF env to run an OS yet.
> If you can share a step by step to me, that would be great.

(1) Grab a host computer with a CPU that supports VMX and EPT.

(2) Download and install Fedora 24 (for example):

https://getfedora.org/en/workstation/download/
http://docs.fedoraproject.org/install-guide

(3) Install the "qemu-system-x86" package with DNF

dnf install qemu-system-x86

(4) clone edk2 with git

(5) embed OpenSSL optionally (for secure boot); see
"CryptoPkg/Library/OpensslLib/Patch-HOWTO.txt"

(6) build OVMF:

source edksetup.sh
make -C "$EDK_TOOLS_PATH"

# Ia32
build \
  -a IA32 \
  -p OvmfPkg/OvmfPkgIa32.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

# Ia32X64
build \
  -a IA32 -a X64 \
  -p OvmfPkg/OvmfPkgIa32X64.dsc \
  -D SMM_REQUIRE -D SECURE_BOOT_ENABLE \
  -t GCC5 -b DEBUG

(7) Create disk images:

qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
  -o preallocation=metadata -o lazy_refcounts=on disk-ia32.img 100G

qemu-img create -f qcow2 -o compat=1.1 -o cluster_size=65536 \
  -o preallocation=metadata -o lazy_refcounts=on disk-ia32x64.img 100G

(8) For a 64-bit guest OS, you can again use the Fedora 24 Workstation
that you downloaded already (the ISO image).

For 32-bit guest OS, this one used to work:

https://www.happyassassin.net/fedlet-a-fedora-remix-for-bay-trail-tablets/

minimally the 20141209 release. Hm... actually, I think the maintainer
of that image has discontinued the downloadable files :(

So, I don't know what 32-bit UEFI OS to recommend for testing.

32-bit Windows doesn't boot on OVMF (I looked into that earlier, several
times, with some help from a Microsoft developer, but we couldn't solve
it), so I can't recommend Windows as an alternative.

Perhaps you can use

https://linuxiumcomau.blogspot.com/2016/10/running-ubuntu-on-intel-bay-trail-and.html

as a 32-bit guest OS, I never tried.

(9) Anyway, once you have an installer ISO, set the "ISO" environment
variable to the ISO image's full pathname, and then run QEMU like this:

# Settings for Ia32 only:

ISO=...
DISK=.../disk-ia32.img
FW=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=.../Build/OvmfIa32/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=vars-32.fd
QEMU_COMMAND="qemu-system-i386 -cpu coreduo,-nx"
DEBUG=debug-32.log

# Settings for Ia32X64 only:

ISO=...
DISK=.../disk-ia32x64.img
FW=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_CODE.fd
TEMPLATE=.../Build/Ovmf3264/DEBUG_GCC5/FV/OVMF_VARS.fd
VARS=vars-3264.fd
QEMU_COMMAND=qemu-system-x86_64
DEBUG=debug-3264.log

# Common commands for both target arches:

# create variable store from varstore template
# if the former doesn't exist yet
if ! [ -e "$VARS" ]; then
  cp -- "$TEMPLATE" "$VARS"
fi

$QEMU_COMMAND \
  -machine q35,smm=on,accel=kvm \
  -m 4096 \
  -smp sockets=1,cores=2,threads=2 \
  -global driver=cfi.pflash01,property=secure,value=on \
  -drive if=pflash,format=raw,unit=0,file=${FW},readonly=on \
  -drive if=pflash,format=raw,unit=1,file=${VARS} \
  \
  -chardev file,id=debugfile,path=$DEBUG \
  -device isa-debugcon,iobase=0x402,chardev=debugfile \
  \
  -chardev stdio,id=char0,signal=off,mux=on \
  -mon chardev=char0,mode=readline,default \
  -serial chardev:char0 \
  \
  -drive id=iso,if=none,format=raw,readonly,file=$ISO \
  -drive id=disk,if=none,format=qcow2,file=$DISK \
  \
  -device virtio-scsi-pci,id=scsi0 \
  -device scsi-cd,drive=iso,bus=scsi0.0,bootindex=2 \
  -device scsi-hd,drive=disk,bus=scsi0.0,bootindex=1 \
  \
  -device VGA

This will capture the OVMF debug output in the $DEBUG file. Also, the
terminal where you run the command can be switched between the guest's
serial console and the QEMU monitor with [Ctrl-A C].

Thanks
Laszlo

>
> Thank you
> Yao Jiewen
>
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek
> Sent: Tuesday, November 8, 2016 9:22 AM
> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>
> Cc: Tian, Feng <feng.tian@intel.com<mailto:feng.tian@intel.com>>; edk2-devel@ml01.01.org<mailto:edk2-devel@ml01.01.org>; Kinney, Michael D <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com>>; Paolo Bonzini <pbonzini@redhat.com<mailto:pbonzini@redhat.com>>; Fan, Jeff <jeff.fan@intel.com<mailto:jeff.fan@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>
> Subject: Re: [edk2] [PATCH V2 0/6] Enable SMM page level protection.
>
> On 11/04/16 10:30, Jiewen Yao wrote:
>> ==== below is V2 description ====
>> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
>> 2) PiSmmCpu: Add debug info on StartupAp() fails.
>> 3) PiSmmCpu: Add ASSERT for AllocatePages().
>> 4) PiSmmCpu: Add protection detail in commit message.
>> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>>
>> ==== below is V1 description ====
>> This series patch enables SMM page level protection.
>> Features are:
>> 1) PiSmmCore reports SMM PE image code/data information
>> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
>> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
>> and set XD for data page and RO for code page.
>> 3) PiSmmCpu enables Static Paging for X64 according to
>> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
>> is used as long as it is supported.
>> 4) PiSmmCpu sets importance data structure to be read only,
>> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>>
>> tested platform:
>> 1) Intel internal platform (X64).
>> 2) EDKII Quark IA32
>> 3) EDKII Vlv2  X64
>> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>>
>> Cc: Jeff Fan <jeff.fan@intel.com<mailto:jeff.fan@intel.com<mailto:jeff.fan@intel.com%3cmailto:jeff.fan@intel.com>>>
>> Cc: Feng Tian <feng.tian@intel.com<mailto:feng.tian@intel.com<mailto:feng.tian@intel.com%3cmailto:feng.tian@intel.com>>>
>> Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>>
>> Cc: Michael D Kinney <michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com<mailto:michael.d.kinney@intel.com%3cmailto:michael.d.kinney@intel.com>>>
>> Cc: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com<mailto:lersek@redhat.com%3cmailto:lersek@redhat.com>>>
>> Contributed-under: TianoCore Contribution Agreement 1.0
>> Signed-off-by: Jiewen Yao <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com<mailto:jiewen.yao@intel.com%3cmailto:jiewen.yao@intel.com>>>
>
> I have new test results. Let's start with the table again:
>
> Legend:
>
> - "untested" means the test was not executed because the same test
>   failed or proved unreliable in a less demanding configuration already,
>
> - "n/a" means a setting or test case was impossible,
>
> - "fail" and "unreliable" (lower case) are outside the scope of this
>   series; they either capture the pre-series status, or are expected
>   even with the series applied due to the pre-series status,
>
> - "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
>   series.
>
> In all cases, 36 bits were used as address width in the CPU HOB (--> up
> to 64GB guest-phys address space).
>
>    series  OVMF                                                              VCPU     boot       S3 resume
>  # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable topology result     result
> -- ------- -------- ------------------------------- ------------------------ -------- ------     ---------
>  1 no      Ia32      64                             n/a                      1x2x2    pass       unreliable
>  2 no      Ia32     255                             n/a                      52x2x2   pass       untested
>  3 no      Ia32     255                             n/a                      53x2x2   unreliable untested
>  4 no      Ia32X64   64                             n/a                      1x2x2    pass       unreliable
>  5 no      Ia32X64  255                             n/a                      52x2x2   pass       untested
>  6 no      Ia32X64  255                             n/a                      54x2x2   fail       n/a
>  7 v2      Ia32      64                             FALSE                    1x2x2    pass       untested
>  8 v2      Ia32      64                             TRUE                     1x2x2    FAIL       untested
>  9 v2      Ia32     255                             FALSE                    52x2x2   pass       untested
> 10 v2      Ia32     255                             FALSE                    53x2x2   untested   untested
> 11 v2      Ia32     255                             TRUE                     52x2x2   untested   untested
> 12 v2      Ia32     255                             TRUE                     53x2x2   untested   untested
> 13 v2      Ia32X64   64                             FALSE                    1x2x2    pass       unreliable
> 14 v2      Ia32X64   64                             TRUE                     1x2x2    pass       untested
> 15 v2      Ia32X64  255                             FALSE                    52x2x2   pass       untested
> 16 v2      Ia32X64  255                             FALSE                    54x2x2   untested   untested
> 17 v2      Ia32X64  255                             TRUE                     52x2x2   FAIL       untested
> 18 v2      Ia32X64  255                             TRUE                     54x2x2   untested   untested
>
> * Case 8: this test case failed with v2 as well, but this time with
>   different symptoms:
>
>> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
>> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
>> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400
>> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> PixelBlueGreenRedReserved8BitPerColor
>> ConvertPages: Incompatible memory types
>> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
>> MpInitExitBootServicesCallback() done!
>> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): SpinLock != ((void *) 0)
>
>   I didn't try to narrow this down.
>
> * Case 13 (the "unreliable S3 resume" case): Here the news are both bad
>   and good. The good news is for Jiewen: this patch series does not
>   cause the unreliability, it "only" amplifies it severely. The bad news
>   is correspondingly for everyone else: S3 resume is actually unreliable
>   even in case 4, that is, without this series applied, it's just the
>   failure rate is much-much lower.
>
>   Namely, in my new testing, in case 13, S3 resume failed 8 times out of
>   21 tries. (I stopped testing at the 8th failure.)
>
>   Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
>   exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try
>   #12 that failed; I continued testing and aborted the test after the
>   55th try.)
>
>   So, while the series hugely amplifies the failure rate, the failure
>   does exist without the series. Which is why I modified the case 4
>   results in the table, and also lower-cased the word "unreliable" in
>   case 13.
>
>   Below I will return to this problem separately; let's go over the rest
>   of the table first.
>
> * Case 17: I guess this is not a real failure, I'm just including it for
>   completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present
>   additional SMRAM demand (see the commit message on patch V2 4/6). This
>   case fails with
>
>> SmmLockBox Command - 4
>> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
>> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
>> SmmLockBox SmmLockBoxHandler Exit
>> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
>> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
>> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0)
>
>   which is an SMRAM allocation failure. If I lower the VCPU count to
>   50x2x2, then the guest boots fine.
>
> ----*----
>
> Before I get to the S3 resume problem (which, again, reproduces without
> this series, although much less frequently), I'd like to comment on the
> removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
> function, on the return value of SmmBlockingStartupThisAp(). This change
> allows v2 to proceed past that point; however, I'm seeing a whole lot of
>
>> !mSmmMpSyncData->CpuData[1].Present
>> !mSmmMpSyncData->CpuData[2].Present
>> !mSmmMpSyncData->CpuData[3].Present
>> ...
>
> messages in the OVMF boot log, interspersed with
>
>> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065
>
> style messages. (That is, one error message for each AP, per
> ConvertPageEntryAttribute() message.)
>
> Is this okay / intentional? The number of these messages can go up to
> several thousands and that sort of drowns out everything else in the
> log.
>
> It's also not easy to mask the message, because it's logged on the
> DEBUG_ERROR level.
>
> ----*----
>
> * Okay, so the S3 problem. Last time I suspected that the failure point
>   (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at
>   9A1D0, according to the OVMF log). In order to test this idea, I
>   exercised this series with S3 against a Windows 8.1 guest (--> case 13
>   again). The failure reproduced on the second S3 resume, with identical
>   RIP, despite the Windows wakeup vector being located elsewhere (at
>   0x1000).
>
>   Quoting the OVMF log leading up to the resume:
>
>> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
>> Install PPI: [PeiPostScriptTablePpi]
>> Install PPI: [EfiEndOfPeiSignalPpi]
>> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895
>> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
>> Transfer to 16bit OS waking vector - 1000
>
>   QEMU log (same as before):
>
>> KVM internal error. Suberror: 1
>> KVM internal error. Suberror: 1
>> emulation failure
>> emulation failure
>> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 RDX=000000007fedd058
>> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 RSP=000000007e1a7000
>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>> GDT=     000000007f294000 00000047
>> IDT=     000000007f294048 00000fff
>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> EFER=0000000000000500
>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 RDX=000000007fedd070
>> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 RSP=000000007e19f000
>> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
>> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
>> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
>> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
>> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
>> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
>> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
>> GDT=     000000007f294000 00000047
>> IDT=     000000007f294048 00000fff
>> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
>> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
>> DR6=00000000ffff0ff0 DR7=0000000000000400
>> EFER=0000000000000500
>> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
>
>   So, we can exclude the suspicion that the problem is guest OS
>   dependent.
>
> * Then I looked for the base address of the page containing the
>   RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that
>   some firmware component might have allocated that area actually. Here
>   we go:
>
>> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi
>> AP Loop Mode is 1
>> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
>
>   That is, the failure hits (when it hits -- not always) in the area
>   where the CpuMpPei driver *borrows* memory for the startup vector of
>   the APs, for the purposes of the MP service PPI. ("Wakeup" is an
>   overloaded word here; the "wakeup buffer" has nothing to do with S3
>   resume, it just serves for booting the APs temporarily in PEI, for
>   implementing the MP service PPI.)
>
>   When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
>   the original contents of this area. This occurs just before
>   transfering control to the guest OS wakeup vector: see the
>   "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
>   quoted from the OVMF log.
>
>   I documented (parts of) this logic in OVMF commit
>
>     https://github.com/tianocore/edk2/commit/e3e3090a959a0
>
>   (see the code comments as well).
>
> * At that time, I thought to have identified a memory management bug in
>   CpuMpPei; see the following discussion and bug report for details:
>
>     https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
>     https://bugzilla.tianocore.org/show_bug.cgi?id=67
>
>   However, with the extraction / introduction of MpInitLib, this issue
>   has been fixed: GetWakeupBuffer() now calls
>   CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
>   no longer; we shouldn't be looking there for the root cause.
>
> * Either way, I don't understand why anything would want to execute code
>   in the one page that happens to host the MP services PPI startup
>   buffer for APs during PEI.
>
>   Not understanding the "why", I looked at the "what", and resorted to
>   tracing KVM. Because the problem readily reproduces with this series
>   applied (case 13), it wasn't hard to start the tracing while the guest
>   was suspended, and capture just the actions that led from the
>   KVM-level wakeup to the failure.
>
>   The QEMU state dumps are visible above in the email. I've also
>   uploaded the compressed OVMF log and the textual KVM trace here:
>
>     http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/
>
>   I sincerely hope that Paolo will have a field day with the KVM trace
>   :) I managed to identify the following curiosities (remember this is
>   all on the S3 resume path):
>
>   * First, the VCPUs (there are four of them) enter and leave SMM in a
>     really funky pattern:
>
>       vcpu#0  vcpu#1  vcpu#2  vcpu#3
>       ------  ------  ------  ------
>               enter
>                |
>               leave
>
>                       enter
>                         |
>                       leave
>
>                               enter
>                                 |
>                               leave
>
>       enter
>         |
>       leave
>
>               enter           enter
>        enter    |     enter     |
>          |      |       |       |
>        leave    |       |       |
>                 |       |       |
>        enter    |       |       |
>          |      |       |       |
>        leave  leave   leave   leave
>
>     That is, first we have each VCPU enter and leave SMM in complete
>     isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
>     followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
>     temporarily (it comes back in later), while the other three remain
>     in SMM. Finally all four of them leave SMM together.
>
>     After which the problem occurs.
>
>   * Second, the instruction that causes things to blow up is <0f aa>,
>     i.e., RSM. I have absolutely no clue why RSM is executed:
>
>     (a) in the area that used to host the AP startup routine for the MP
>     services PPI -- note that we also have "Transfer to 16bit OS waking
>     vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
>     area completeley! --,
>
>     (b) and why *after* all four VCPUs have just left SMM, together.
>
>   * The RSM instruction is handled successfully elsewhere, for example
>     when all four VCPUs leave SMM, at the bottom of the diagram above:
>
>> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
>> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
>> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
>> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa
>
>   * The guest-phys address 7ff7f000 that we see just before the error:
>
>> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 error_code 83
>> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 error_code 83
>> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
>> CPU-24444 [006] 39841.982827: kvm_exit:             reason EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc
>> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
>> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)
>
>     can be found higher up in the trace; namely, it is written to CR3
>     several times. It's the root of the page tables.
>
>   * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.
>
> * I also tried the "info tlb" monitor command, via "virsh
>   qemu-monitor-command --hmp", while the guest was auto-paused after the
>   crash.
>
>   I cannot provide results: QEMU appeared to return a message that would
>   be longer than 16MB after encoding by libvirt, and libvirt rejected
>   that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).
>
>   Anyway, the KVM trace, and the QEMU register dump, look consistent
>   with what Paolo said about "Code=?? ?? ??...":
>
>     The question marks usually mean that the page tables do not map a
>     page at that address.
>
>   CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM
>   (SMM=0). We can't translate *any* guest-virtual address, as we can't
>   even begin walking the page tables.
>
> Thanks
> Laszlo
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>
> https://lists.01.org/mailman/listinfo/edk2-devel
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
> https://lists.01.org/mailman/listinfo/edk2-devel
>