public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: Laszlo Ersek <lersek@redhat.com>
To: Jiewen Yao <jiewen.yao@intel.com>
Cc: edk2-devel@ml01.01.org,
	Michael D Kinney <michael.d.kinney@intel.com>,
	Feng Tian <feng.tian@intel.com>, Jeff Fan <jeff.fan@intel.com>,
	Star Zeng <star.zeng@intel.com>,
	Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [PATCH V2 0/6] Enable SMM page level protection.
Date: Tue, 8 Nov 2016 02:22:14 +0100	[thread overview]
Message-ID: <08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com> (raw)
In-Reply-To: <1478251854-14660-1-git-send-email-jiewen.yao@intel.com>

On 11/04/16 10:30, Jiewen Yao wrote:
> ==== below is V2 description ====
> 1) PiSmmCpu: resolve OVMF multiple processors boot hang issue.
> 2) PiSmmCpu: Add debug info on StartupAp() fails.
> 3) PiSmmCpu: Add ASSERT for AllocatePages().
> 4) PiSmmCpu: Add protection detail in commit message.
> 5) UefiCpuPkg.dsc: Add page table footprint info in commit message.
>
> ==== below is V1 description ====
> This series patch enables SMM page level protection.
> Features are:
> 1) PiSmmCore reports SMM PE image code/data information
> in EdkiiPiSmmMemoryAttributeTable, if the SMM image is page aligned.
> 2) PiSmmCpu consumes EdkiiPiSmmMemoryAttributeTable
> and set XD for data page and RO for code page.
> 3) PiSmmCpu enables Static Paging for X64 according to
> PcdCpuSmmStaticPageTable. If it is true, 1G paging for above 4G
> is used as long as it is supported.
> 4) PiSmmCpu sets importance data structure to be read only,
> such as Gdt, Idt, SmmEntrypoint, and PageTable itself.
>
> tested platform:
> 1) Intel internal platform (X64).
> 2) EDKII Quark IA32
> 3) EDKII Vlv2  X64
> 4) EDKII OVMF IA32 and IA32X64. (with -smp 8)
>
> Cc: Jeff Fan <jeff.fan@intel.com>
> Cc: Feng Tian <feng.tian@intel.com>
> Cc: Star Zeng <star.zeng@intel.com>
> Cc: Michael D Kinney <michael.d.kinney@intel.com>
> Cc: Laszlo Ersek <lersek@redhat.com>
> Contributed-under: TianoCore Contribution Agreement 1.0
> Signed-off-by: Jiewen Yao <jiewen.yao@intel.com>

I have new test results. Let's start with the table again:

Legend:

- "untested" means the test was not executed because the same test
  failed or proved unreliable in a less demanding configuration already,

- "n/a" means a setting or test case was impossible,

- "fail" and "unreliable" (lower case) are outside the scope of this
  series; they either capture the pre-series status, or are expected
  even with the series applied due to the pre-series status,

- "FAIL" and "UNRELIABLE" mean regressions caused (or exposed) by the
  series.

In all cases, 36 bits were used as address width in the CPU HOB (--> up
to 64GB guest-phys address space).

   series  OVMF                                                              VCPU     boot       S3 resume
 # applied platform PcdCpuMaxLogicalProcessorNumber PcdCpuSmmStaticPageTable topology result     result
-- ------- -------- ------------------------------- ------------------------ -------- ------     ---------
 1 no      Ia32      64                             n/a                      1x2x2    pass       unreliable
 2 no      Ia32     255                             n/a                      52x2x2   pass       untested
 3 no      Ia32     255                             n/a                      53x2x2   unreliable untested
 4 no      Ia32X64   64                             n/a                      1x2x2    pass       unreliable
 5 no      Ia32X64  255                             n/a                      52x2x2   pass       untested
 6 no      Ia32X64  255                             n/a                      54x2x2   fail       n/a
 7 v2      Ia32      64                             FALSE                    1x2x2    pass       untested
 8 v2      Ia32      64                             TRUE                     1x2x2    FAIL       untested
 9 v2      Ia32     255                             FALSE                    52x2x2   pass       untested
10 v2      Ia32     255                             FALSE                    53x2x2   untested   untested
11 v2      Ia32     255                             TRUE                     52x2x2   untested   untested
12 v2      Ia32     255                             TRUE                     53x2x2   untested   untested
13 v2      Ia32X64   64                             FALSE                    1x2x2    pass       unreliable
14 v2      Ia32X64   64                             TRUE                     1x2x2    pass       untested
15 v2      Ia32X64  255                             FALSE                    52x2x2   pass       untested
16 v2      Ia32X64  255                             FALSE                    54x2x2   untested   untested
17 v2      Ia32X64  255                             TRUE                     52x2x2   FAIL       untested
18 v2      Ia32X64  255                             TRUE                     54x2x2   untested   untested

* Case 8: this test case failed with v2 as well, but this time with
  different symptoms:

> FSOpen: Open '\EFI\fedora\grubia32.efi' Success
> InstallProtocolInterface: [EfiLoadedImageProtocol] 7E4037A8
> Loading driver at 0x0007DA7F000 EntryPoint=0x0007DA7F400
> InstallProtocolInterface: [EfiLoadedImageDevicePathProtocol] 7E403A90
> PixelBlueGreenRedReserved8BitPerColor
> ConvertPages: Incompatible memory types
> PixelBlueGreenRedReserved8BitPerColor
> ConvertPages: Incompatible memory types
> SmmInstallProtocolInterface: [EdkiiSmmExitBootServicesProtocol] 0
> MpInitExitBootServicesCallback() done!
> ASSERT MdePkg/Library/BaseSynchronizationLib/SynchronizationGcc.c(73): SpinLock != ((void *) 0)

  I didn't try to narrow this down.

* Case 13 (the "unreliable S3 resume" case): Here the news are both bad
  and good. The good news is for Jiewen: this patch series does not
  cause the unreliability, it "only" amplifies it severely. The bad news
  is correspondingly for everyone else: S3 resume is actually unreliable
  even in case 4, that is, without this series applied, it's just the
  failure rate is much-much lower.

  Namely, in my new testing, in case 13, S3 resume failed 8 times out of
  21 tries. (I stopped testing at the 8th failure.)

  Whereas in case 4, S3 resume failed with *identical symptoms* (e.g.,
  exact same RIP=000000000009f0fd), 1 time out of 55 tries. (It was try
  #12 that failed; I continued testing and aborted the test after the
  55th try.)

  So, while the series hugely amplifies the failure rate, the failure
  does exist without the series. Which is why I modified the case 4
  results in the table, and also lower-cased the word "unreliable" in
  case 13.

  Below I will return to this problem separately; let's go over the rest
  of the table first.

* Case 17: I guess this is not a real failure, I'm just including it for
  completeness, as PcdCpuSmmStaticPageTable==TRUE is known to present
  additional SMRAM demand (see the commit message on patch V2 4/6). This
  case fails with

> SmmLockBox Command - 4
> SmmLockBoxSmmLib SetLockBoxAttributes - Enter
> SmmLockBoxSmmLib SetLockBoxAttributes - Exit (Success)
> SmmLockBox SmmLockBoxHandler Exit
> SmmLockBoxDxeLib SetLockBoxAttributes - Exit (Success)
> SmmInstallProtocolInterface: [EfiSmmReadyToLockProtocol] 0
> ASSERT UefiCpuPkg/PiSmmCpuDxeSmm/CpuS3.c(892): mGdtForAp != ((void *) 0)

  which is an SMRAM allocation failure. If I lower the VCPU count to
  50x2x2, then the guest boots fine.

----*----

Before I get to the S3 resume problem (which, again, reproduces without
this series, although much less frequently), I'd like to comment on the
removal of the ASSERT(), from v1 to v2, in the FlushTlbForAll()
function, on the return value of SmmBlockingStartupThisAp(). This change
allows v2 to proceed past that point; however, I'm seeing a whole lot of

> !mSmmMpSyncData->CpuData[1].Present
> !mSmmMpSyncData->CpuData[2].Present
> !mSmmMpSyncData->CpuData[3].Present
> ...

messages in the OVMF boot log, interspersed with

> ConvertPageEntryAttribute 0x7F92B067->0x7F92B065

style messages. (That is, one error message for each AP, per
ConvertPageEntryAttribute() message.)

Is this okay / intentional? The number of these messages can go up to
several thousands and that sort of drowns out everything else in the
log.

It's also not easy to mask the message, because it's logged on the
DEBUG_ERROR level.

----*----

* Okay, so the S3 problem. Last time I suspected that the failure point
  (RIP=9f0fd) was in the Linux guest's S3 wakeup vector (which starts at
  9A1D0, according to the OVMF log). In order to test this idea, I
  exercised this series with S3 against a Windows 8.1 guest (--> case 13
  again). The failure reproduced on the second S3 resume, with identical
  RIP, despite the Windows wakeup vector being located elsewhere (at
  0x1000).

  Quoting the OVMF log leading up to the resume:

> Call AsmDisablePaging64() to return to S3 Resume in PEI Phase
> Install PPI: [PeiPostScriptTablePpi]
> Install PPI: [EfiEndOfPeiSignalPpi]
> Notify: PPI Guid: [EfiEndOfPeiSignalPpi], Peim notify entry point: 857895
> PeiMpInitLib: CpuMpEndOfPeiCallback () invoked
> Transfer to 16bit OS waking vector - 1000

  QEMU log (same as before):

> KVM internal error. Suberror: 1
> KVM internal error. Suberror: 1
> emulation failure
> emulation failure
> RAX=0000000000000002 RBX=0000000000000000 RCX=000000007ffdb168 RDX=000000007fedd058
> RSI=0000000000000004 RDI=000000007fedd040 RBP=0000000000000000 RSP=000000007e1a7000
> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT=     000000007f294000 00000047
> IDT=     000000007f294048 00000fff
> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000500
> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
> RAX=0000000000000001 RBX=0000000000000000 RCX=000000007ffdb168 RDX=000000007fedd070
> RSI=0000000000000004 RDI=000000007fedd058 RBP=0000000000000000 RSP=000000007e19f000
> R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
> R12=0000000000000000 R13=0000000000000000 R14=0000000000000000 R15=0000000000000000
> RIP=000000000009f0fd RFL=00010006 [-----P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> CS =0038 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
> SS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> DS =0030 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> FS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> GS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
> LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
> TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
> GDT=     000000007f294000 00000047
> IDT=     000000007f294048 00000fff
> CR0=e0000011 CR2=0000000000000000 CR3=000000007ff7f000 CR4=00000220
> DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
> DR6=00000000ffff0ff0 DR7=0000000000000400
> EFER=0000000000000500
> Code=?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? <??> ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??

  So, we can exclude the suspicion that the problem is guest OS
  dependent.

* Then I looked for the base address of the page containing the
  RIP=9f0fd address, in earlier parts of the OVMF log, on the hunch that
  some firmware component might have allocated that area actually. Here
  we go:

> Loading PEIM at 0x000008552C0 EntryPoint=0x000008554E0 CpuMpPei.efi
> AP Loop Mode is 1
> WakeupBufferStart = 9F000, WakeupBufferSize = 1000

  That is, the failure hits (when it hits -- not always) in the area
  where the CpuMpPei driver *borrows* memory for the startup vector of
  the APs, for the purposes of the MP service PPI. ("Wakeup" is an
  overloaded word here; the "wakeup buffer" has nothing to do with S3
  resume, it just serves for booting the APs temporarily in PEI, for
  implementing the MP service PPI.)

  When exiting the PEI phase (on the S3 resume path), CpuMpPei restores
  the original contents of this area. This occurs just before
  transfering control to the guest OS wakeup vector: see the
  "EfiEndOfPeiSignalPpi" and "CpuMpEndOfPeiCallback" strings just above,
  quoted from the OVMF log.

  I documented (parts of) this logic in OVMF commit

    https://github.com/tianocore/edk2/commit/e3e3090a959a0

  (see the code comments as well).

* At that time, I thought to have identified a memory management bug in
  CpuMpPei; see the following discussion and bug report for details:

    https://www.mail-archive.com/edk2-devel@lists.01.org/msg13892.html
    https://bugzilla.tianocore.org/show_bug.cgi?id=67

  However, with the extraction / introduction of MpInitLib, this issue
  has been fixed: GetWakeupBuffer() now calls
  CheckOverlapWithAllocatedBuffer(), so that "memory management bug" is
  no longer; we shouldn't be looking there for the root cause.

* Either way, I don't understand why anything would want to execute code
  in the one page that happens to host the MP services PPI startup
  buffer for APs during PEI.

  Not understanding the "why", I looked at the "what", and resorted to
  tracing KVM. Because the problem readily reproduces with this series
  applied (case 13), it wasn't hard to start the tracing while the guest
  was suspended, and capture just the actions that led from the
  KVM-level wakeup to the failure.

  The QEMU state dumps are visible above in the email. I've also
  uploaded the compressed OVMF log and the textual KVM trace here:

    http://people.redhat.com/lersek/s3-crash-8d1dfed7-ca92-4e25-8d2b-b1c9ac2a53db/

  I sincerely hope that Paolo will have a field day with the KVM trace
  :) I managed to identify the following curiosities (remember this is
  all on the S3 resume path):

  * First, the VCPUs (there are four of them) enter and leave SMM in a
    really funky pattern:

      vcpu#0  vcpu#1  vcpu#2  vcpu#3
      ------  ------  ------  ------
              enter
               |
              leave

                      enter
                        |
                      leave

                              enter
                                |
                              leave

      enter
        |
      leave

              enter           enter
       enter    |     enter     |
         |      |       |       |
       leave    |       |       |
                |       |       |
       enter    |       |       |
         |      |       |       |
       leave  leave   leave   leave

    That is, first we have each VCPU enter and leave SMM in complete
    isolation (1, 2, 3, 0). Then VCPUs 1 and 3 enter SMM together, soon
    followed by VPCUS 0 and 2, also together. VCPU#0 drops out of SMM
    temporarily (it comes back in later), while the other three remain
    in SMM. Finally all four of them leave SMM together.

    After which the problem occurs.

  * Second, the instruction that causes things to blow up is <0f aa>,
    i.e., RSM. I have absolutely no clue why RSM is executed:

    (a) in the area that used to host the AP startup routine for the MP
    services PPI -- note that we also have "Transfer to 16bit OS waking
    vector" in the log, so CpuMpEndOfPeiCallback() restores the borrowed
    area completeley! --,

    (b) and why *after* all four VCPUs have just left SMM, together.

  * The RSM instruction is handled successfully elsewhere, for example
    when all four VCPUs leave SMM, at the bottom of the diagram above:

> CPU-24447 [002] 39841.982810: kvm_emulate_insn:     0:7ffbf179: 0f aa
> CPU-24446 [000] 39841.982810: kvm_emulate_insn:     0:7ffbd179: 0f aa
> CPU-24445 [005] 39841.982810: kvm_emulate_insn:     0:7ffbb179: 0f aa
> CPU-24444 [006] 39841.982811: kvm_emulate_insn:     0:7ffb9179: 0f aa

  * The guest-phys address 7ff7f000 that we see just before the error:

> CPU-24447 [002] 39841.982825: kvm_page_fault:       address 7ff7f000 error_code 83
> CPU-24446 [000] 39841.982825: kvm_page_fault:       address 7ff7f000 error_code 83
> CPU-24447 [002] 39841.982826: kvm_emulate_insn:     0:9f0fd: 0f aa
> CPU-24444 [006] 39841.982827: kvm_exit:             reason EXTERNAL_INTERRUPT rip 0xffffffff813a954f info 0 800000fc
> CPU-24447 [002] 39841.982827: kvm_emulate_insn:     0:9f0fd: 0f aa FAIL
> CPU-24447 [002] 39841.982827: kvm_userspace_exit:   reason KVM_EXIT_INTERNAL_ERROR (17)

    can be found higher up in the trace; namely, it is written to CR3
    several times. It's the root of the page tables.

  * The 7F80_1000..7FFF_FFFF guest-phys addresses are all in SMRAM.

* I also tried the "info tlb" monitor command, via "virsh
  qemu-monitor-command --hmp", while the guest was auto-paused after the
  crash.

  I cannot provide results: QEMU appeared to return a message that would
  be longer than 16MB after encoding by libvirt, and libvirt rejected
  that ("Unable to encode message payload", see VIR_NET_MESSAGE_MAX).

  Anyway, the KVM trace, and the QEMU register dump, look consistent
  with what Paolo said about "Code=?? ?? ??...":

    The question marks usually mean that the page tables do not map a
    page at that address.

  CR3=000000007ff7f000 points into SMRAM, but we are outside of SMM
  (SMM=0). We can't translate *any* guest-virtual address, as we can't
  even begin walking the page tables.

Thanks
Laszlo


  parent reply	other threads:[~2016-11-08  1:22 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-11-04  9:30 [PATCH V2 0/6] Enable SMM page level protection Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 1/6] MdeModulePkg/Include: Add PiSmmMemoryAttributesTable.h Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 2/6] MdeModulePkg/dec: Add gEdkiiPiSmmMemoryAttributesTableGuid Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 3/6] MdeModulePkg/PiSmmCore: Add MemoryAttributes support Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 4/6] UefiCpuPkg/dec: Add PcdCpuSmmStaticPageTable Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 5/6] UefiCpuPkg/PiSmmCpuDxeSmm: Add paging protection Jiewen Yao
2016-11-04  9:30 ` [PATCH V2 6/6] QuarkPlatformPkg/dsc: enable Smm " Jiewen Yao
2016-11-04 22:40 ` [PATCH V2 0/6] Enable SMM page level protection Laszlo Ersek
2016-11-04 22:46   ` Yao, Jiewen
2016-11-04 23:08     ` Laszlo Ersek
2016-11-08  1:22 ` Laszlo Ersek [this message]
2016-11-08 12:59   ` Yao, Jiewen
2016-11-08 13:22     ` Laszlo Ersek
2016-11-08 13:41       ` Yao, Jiewen
2016-11-09  6:25   ` Yao, Jiewen
2016-11-09 11:30     ` Paolo Bonzini
2016-11-09 15:01       ` Yao, Jiewen
2016-11-09 15:54         ` Paolo Bonzini
2016-11-09 16:06           ` Paolo Bonzini
2016-11-09 22:28           ` Laszlo Ersek
2016-11-09 22:59             ` Paolo Bonzini
2016-11-09 23:27               ` Laszlo Ersek
2016-11-10  1:13                 ` Yao, Jiewen
2016-11-10  6:30                   ` Fan, Jeff
2016-11-10  0:49               ` Yao, Jiewen
2016-11-10  0:50               ` Yao, Jiewen
2016-11-10  1:02                 ` Fan, Jeff
2016-11-09 20:46     ` Laszlo Ersek
2016-11-10 10:41       ` Yao, Jiewen
2016-11-10 12:01         ` Laszlo Ersek
2016-11-10 14:48           ` Yao, Jiewen
2016-11-10 14:53             ` Paolo Bonzini
2016-11-10 16:22               ` Laszlo Ersek
2016-11-10 16:39                 ` Paolo Bonzini
2016-11-10 16:25             ` Laszlo Ersek
2016-11-10 12:27         ` Paolo Bonzini
2016-11-09 11:23   ` Paolo Bonzini
2016-11-09 15:16     ` Yao, Jiewen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=08406bf5-4377-63a1-8dd9-34479c015d4b@redhat.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox