public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
* [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
@ 2023-01-19 11:01 Laszlo Ersek
  2023-01-19 11:01 ` [PATCH v3 1/2] OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck() Laszlo Ersek
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-19 11:01 UTC (permalink / raw)
  To: lersek, devel
  Cc: Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

Repo:       https://pagure.io/lersek/edk2.git
Branch:     cpuhp-reg-catch-4250-v3
Test build: https://github.com/tianocore/edk2/pull/3930
Bugzilla:   https://bugzilla.tianocore.org/show_bug.cgi?id=4250

v2 was posted at:
- http://mid.mail-archive.com/20230112082845.128463-1-lersek@redhat.com
- https://edk2.groups.io/g/devel/message/98336
- https://listman.redhat.com/archives/edk2-devel-archive/2023-January/057634.html

Please see the Notes sections of the patches, regarding the updates.

The "PlatformCI_OvmfPkg_Windows_VS2019_PR" checks in test build linked
above time out again (as expected); now with the following debug log:

> PlatformCpuCountBugCheck: Present=0 Possible=1
> PlatformCpuCountBugCheck: Broken CPU hotplug register block found. Update QEMU to version 8+, or
> PlatformCpuCountBugCheck: to a stable release with commit dab30fbef389 backported. Refer to
> PlatformCpuCountBugCheck: <https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.
> PlatformCpuCountBugCheck: Consequences of the QEMU bug may include, but are not limited to:
> PlatformCpuCountBugCheck: - all firmware logic, dependent on the CPU hotplug register block,
> PlatformCpuCountBugCheck:   being confused, for example, multiprocessing-related logic;
> PlatformCpuCountBugCheck: - guest OS data loss, including filesystem corruption, due to crash or
> PlatformCpuCountBugCheck:   hang during ACPI S3 resume;
> PlatformCpuCountBugCheck: - SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI
> PlatformCpuCountBugCheck:   agent, against the platform firmware.
> PlatformCpuCountBugCheck: These symptoms need not necessarily be limited to the QEMU user
> PlatformCpuCountBugCheck: attempting to hot(un)plug a CPU.
> PlatformCpuCountBugCheck: The firmware will now stop (hang) deliberately, in order to prevent the
> PlatformCpuCountBugCheck: above symptoms.
> PlatformCpuCountBugCheck: You can forcibly override the hang, *at your own risk*, with the
> PlatformCpuCountBugCheck: following *experimental* QEMU command line option:
> PlatformCpuCountBugCheck:   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
> PlatformCpuCountBugCheck: Please only report such bugs that you can reproduce *without* the
> PlatformCpuCountBugCheck: override.
> Select Item: 0x19
> ASSERT d:\a\1\s\OvmfPkg\Library\PlatformInitLib\Platform.c(520): ((BOOLEAN)(0==1))

The "PlatformCI_OvmfPkg_Ubuntu_GCC5_PR" checks succed, probably due to
edk2 commits ef0916009843 ("CI: Use Fedora 35 container (Linux only)",
2023-01-17) and 7fab007f33e9 ("OvmfPkg: CI: Use Fedora 35 container
(Linux only)", 2023-01-17), and due to
<https://github.com/tianocore/containers/commit/47addc9a4f20> applying
the upstream QEMU patch.

Laszlo

Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Jiewen Yao <jiewen.yao@intel.com>
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Michael Brown <mcb30@ipxe.org>
Cc: Min Xu <min.m.xu@intel.com>
Cc: Oliver Steffen <osteffen@redhat.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>

Laszlo Ersek (2):
  OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck()
  OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression

 OvmfPkg/Library/PlatformInitLib/Platform.c | 168 +++++++++++++++++---
 1 file changed, 145 insertions(+), 23 deletions(-)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/2] OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck()
  2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
@ 2023-01-19 11:01 ` Laszlo Ersek
  2023-01-19 11:01 ` [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-19 11:01 UTC (permalink / raw)
  To: lersek, devel
  Cc: Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

Move the QEMU v2.7 reset bug check/workaround to a separate function, as
we'll need to detect further issues.

Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Jiewen Yao <jiewen.yao@intel.com>
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Michael Brown <mcb30@ipxe.org>
Cc: Min Xu <min.m.xu@intel.com>
Cc: Oliver Steffen <osteffen@redhat.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v3:
    - new patch

 OvmfPkg/Library/PlatformInitLib/Platform.c | 81 ++++++++++++++------
 1 file changed, 58 insertions(+), 23 deletions(-)

diff --git a/OvmfPkg/Library/PlatformInitLib/Platform.c b/OvmfPkg/Library/PlatformInitLib/Platform.c
index 9ab0342fd8c0..d1be5c2d7970 100644
--- a/OvmfPkg/Library/PlatformInitLib/Platform.c
+++ b/OvmfPkg/Library/PlatformInitLib/Platform.c
@@ -404,6 +404,61 @@ PlatformMiscInitialization (
   }
 }
 
+/**
+  Check for various QEMU bugs concerning CPU numbers.
+
+  Compensate for those bugs if various conditions are satisfied, by updating a
+  suitable subset of the input-output parameters. The function may not return
+  (it may hang deliberately), even in RELEASE builds, if the QEMU bug is
+  impossible to cover up.
+
+  @param[in,out] BootCpuCount  On input, the boot CPU count reported by QEMU via
+                               fw_cfg (QemuFwCfgItemSmpCpuCount). The caller is
+                               responsible for ensuring (BootCpuCount > 0); that
+                               is, if QEMU does not provide the boot CPU count
+                               via fw_cfg *at all*, then this function must not
+                               be called.
+
+  @param[in,out] Present       On input, the number of present-at-boot CPUs, as
+                               reported by QEMU through the modern CPU hotplug
+                               register block.
+
+  @param[in,out] Possible      On input, the number of possible CPUs, as
+                               reported by QEMU through the modern CPU hotplug
+                               register block.
+**/
+STATIC
+VOID
+PlatformCpuCountBugCheck (
+  IN OUT UINT16  *BootCpuCount,
+  IN OUT UINT32  *Present,
+  IN OUT UINT32  *Possible
+  )
+{
+  ASSERT (*BootCpuCount > 0);
+
+  //
+  // Sanity check: fw_cfg and the modern CPU hotplug interface should expose the
+  // same boot CPU count.
+  //
+  if (*BootCpuCount != *Present) {
+    DEBUG ((
+      DEBUG_WARN,
+      "%a: QEMU v2.7 reset bug: BootCpuCount=%d Present=%u\n",
+      __FUNCTION__,
+      *BootCpuCount,
+      *Present
+      ));
+    //
+    // The handling of QemuFwCfgItemSmpCpuCount, across CPU hotplug plus
+    // platform reset (including S3), was corrected in QEMU commit e3cadac073a9
+    // ("pc: fix FW_CFG_NB_CPUS to account for -device added CPUs", 2016-11-16),
+    // part of release v2.8.0.
+    //
+    *BootCpuCount = (UINT16)*Present;
+  }
+}
+
 /**
   Fetch the boot CPU count and the possible CPU count from QEMU, and expose
   them to UefiCpuPkg modules.
@@ -518,8 +573,8 @@ PlatformMaxCpuCountInitialization (
         UINT8  CpuStatus;
 
         //
-        // Read the status of the currently selected CPU. This will help with a
-        // sanity check against "BootCpuCount".
+        // Read the status of the currently selected CPU. This will help with
+        // various CPU count sanity checks.
         //
         CpuStatus = IoRead8 (CpuHpBase + QEMU_CPUHP_R_CPU_STAT);
         if ((CpuStatus & QEMU_CPUHP_STAT_ENABLED) != 0) {
@@ -540,27 +595,7 @@ PlatformMaxCpuCountInitialization (
         ASSERT (Selected == Possible || Selected == 0);
       } while (Selected > 0);
 
-      //
-      // Sanity check: fw_cfg and the modern CPU hotplug interface should
-      // return the same boot CPU count.
-      //
-      if (BootCpuCount != Present) {
-        DEBUG ((
-          DEBUG_WARN,
-          "%a: QEMU v2.7 reset bug: BootCpuCount=%d "
-          "Present=%u\n",
-          __FUNCTION__,
-          BootCpuCount,
-          Present
-          ));
-        //
-        // The handling of QemuFwCfgItemSmpCpuCount, across CPU hotplug plus
-        // platform reset (including S3), was corrected in QEMU commit
-        // e3cadac073a9 ("pc: fix FW_CFG_NB_CPUS to account for -device added
-        // CPUs", 2016-11-16), part of release v2.8.0.
-        //
-        BootCpuCount = (UINT16)Present;
-      }
+      PlatformCpuCountBugCheck (&BootCpuCount, &Present, &Possible);
 
       MaxCpuCount = Possible;
     }


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
  2023-01-19 11:01 ` [PATCH v3 1/2] OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck() Laszlo Ersek
@ 2023-01-19 11:01 ` Laszlo Ersek
  2023-01-19 11:27   ` Ard Biesheuvel
  2023-01-19 11:25 ` [edk2-devel] [PATCH v3 0/2] " Michael Brown
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-19 11:01 UTC (permalink / raw)
  To: lersek, devel
  Cc: Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
protocol is (effectively) broken such that it suggests that switching from
the legacy interface to the modern interface works, but in reality the
switch never happens. The symptom has been witnessed when using TCG
acceleration; KVM seems to mask the issue. The issue persists with the
following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
there is no stable release that addresses the problem.

The QEMU bug confuses the Present and Possible counting in function
PlatformMaxCpuCountInitialization(), in
"OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
the privilege level of SMM) is not that great.

Detect the issue in PlatformCpuCountBugCheck(), and print an error message
and *hang* if the issue is present.

Users willing to take risks can override the hang with the experimental
QEMU command line option

  -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes

(The "-fw_cfg" QEMU option itself is not experimental; its above argument,
as far it concerns the firmware, is experimental.)

The problem was originally reported by Ard [0]. We analyzed it at [1] and
[2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
block", 2023-01-08), to be included in QEMU v8.0.0.

[0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2

[1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3

[2] IO port write width clamping differs between TCG and KVM
    http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html

[3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
    http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
    https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html

NOTE: PlatformInitLib is used in the following platform DSCs:

  OvmfPkg/AmdSev/AmdSevX64.dsc
  OvmfPkg/CloudHv/CloudHvX64.dsc
  OvmfPkg/IntelTdx/IntelTdxX64.dsc
  OvmfPkg/Microvm/MicrovmX64.dsc
  OvmfPkg/OvmfPkgIa32.dsc
  OvmfPkg/OvmfPkgIa32X64.dsc
  OvmfPkg/OvmfPkgX64.dsc

but I can only test this change with the last three platforms, running on
QEMU.

Test results:

  TCG  QEMU     OVMF     override  result
       patched  patched
  ---  -------  -------  --------  --------------------------------------
  0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
  0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
  0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
                                   the QEMU bug anyway)
  0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
                                   the QEMU bug anyway)
  1    0        0        0         boot with broken CPU counts (original
                                   QEMU bug)
  1    0        1        0         broken CPU count caught (boot hangs)
  1    0        1        1         broken CPU count caught, bug check
                                   overridden, boot continues
  1    1        0        0         CPU counts OK (QEMU fix)
  1    1        1        0         CPU counts OK (QEMU fix)

Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Erdem Aktas <erdemaktas@google.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: Jiewen Yao <jiewen.yao@intel.com>
Cc: Jordan Justen <jordan.l.justen@intel.com>
Cc: Michael Brown <mcb30@ipxe.org>
Cc: Min Xu <min.m.xu@intel.com>
Cc: Oliver Steffen <osteffen@redhat.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---

Notes:
    v3:
    
    - reimplement the bug check in the factored out function
      PlatformCpuCountBugCheck()
    
    - add override [Ard, Gerd, Michael]
    
    - assert "0 < Present == BootCpuCount <= Possible" whenever
      PlatformCpuCountBugCheck() returns
    
    - update commit message
    
    - update test matrix in commit message
    
    - (re)test all "OVMF patched = 1" cases (sigh)
    
    v2:
    
    - V1 was at
      <http://mid.mail-archive.com/20230104151234.286030-1-lersek@redhat.com>.
    
    - Repo: <https://pagure.io/lersek/edk2.git>, branch:
      cpuhp-reg-catch-4250-v2
    
    - Remove KVM as a proposed workaround from the error message, because in
      the QEMU discussion, we had found that the KVM accelerator's behavior
      in QEMU (masking the problem) was not right, and that a fix for that
      had been in progress for quite some time.
    
    - Add the QEMU commit hash to the commit message, the code comment, and
      the error message.
    
    - Pick up Gerd's R-b; add Oliver to the Cc list.

 OvmfPkg/Library/PlatformInitLib/Platform.c | 87 ++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/OvmfPkg/Library/PlatformInitLib/Platform.c b/OvmfPkg/Library/PlatformInitLib/Platform.c
index d1be5c2d7970..9fee6e481038 100644
--- a/OvmfPkg/Library/PlatformInitLib/Platform.c
+++ b/OvmfPkg/Library/PlatformInitLib/Platform.c
@@ -36,6 +36,9 @@
 
 #include <Library/PlatformInitLib.h>
 
+#define CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE \
+  "opt/org.tianocore/X-Cpuhp-Bugcheck-Override"
+
 VOID
 EFIAPI
 PlatformAddIoMemoryBaseSizeHob (
@@ -437,6 +440,87 @@ PlatformCpuCountBugCheck (
 {
   ASSERT (*BootCpuCount > 0);
 
+  //
+  // Sanity check: we need at least 1 present CPU (CPU#0 is always present).
+  //
+  // The legacy-to-modern switching of the CPU hotplug register block got broken
+  // (for TCG) in QEMU v5.1.0. Refer to "IO port write width clamping differs
+  // between TCG and KVM" at
+  // <http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com>
+  // or at
+  // <https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html>.
+  //
+  // QEMU received the fix in commit dab30fbef389 ("acpi: cpuhp: fix
+  // guest-visible maximum access size to the legacy reg block", 2023-01-08), to
+  // be included in QEMU v8.0.0.
+  //
+  // If we're affected by this QEMU bug, then we must not continue: it confuses
+  // the multiprocessing in UefiCpuPkg/Library/MpInitLib, and breaks CPU
+  // hot(un)plug with SMI in OvmfPkg/CpuHotplugSmm.
+  //
+  if (*Present == 0) {
+    UINTN                      Idx;
+    STATIC CONST CHAR8 *CONST  Message[] = {
+      "Broken CPU hotplug register block found. Update QEMU to version 8+, or",
+      "to a stable release with commit dab30fbef389 backported. Refer to",
+      "<https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.",
+      "Consequences of the QEMU bug may include, but are not limited to:",
+      "- all firmware logic, dependent on the CPU hotplug register block,",
+      "  being confused, for example, multiprocessing-related logic;",
+      "- guest OS data loss, including filesystem corruption, due to crash or",
+      "  hang during ACPI S3 resume;",
+      "- SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI",
+      "  agent, against the platform firmware.",
+      "These symptoms need not necessarily be limited to the QEMU user",
+      "attempting to hot(un)plug a CPU.",
+      "The firmware will now stop (hang) deliberately, in order to prevent the",
+      "above symptoms.",
+      "You can forcibly override the hang, *at your own risk*, with the",
+      "following *experimental* QEMU command line option:",
+      "  -fw_cfg name=" CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE ",string=yes",
+      "Please only report such bugs that you can reproduce *without* the",
+      "override.",
+    };
+    RETURN_STATUS              ParseStatus;
+    BOOLEAN                    Override;
+
+    DEBUG ((
+      DEBUG_ERROR,
+      "%a: Present=%u Possible=%u\n",
+      __FUNCTION__,
+      *Present,
+      *Possible
+      ));
+    for (Idx = 0; Idx < ARRAY_SIZE (Message); ++Idx) {
+      DEBUG ((DEBUG_ERROR, "%a: %a\n", __FUNCTION__, Message[Idx]));
+    }
+
+    ParseStatus = QemuFwCfgParseBool (
+                    CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE,
+                    &Override
+                    );
+    if (!RETURN_ERROR (ParseStatus) && Override) {
+      DEBUG ((
+        DEBUG_WARN,
+        "%a: \"%a\" active. You've been warned.\n",
+        __FUNCTION__,
+        CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE
+        ));
+      //
+      // The bug is in QEMU v5.1.0+, where we're not affected by the QEMU v2.7
+      // reset bug, so BootCpuCount from fw_cfg is reliable. Assume a fully
+      // populated topology, like when the modern CPU hotplug interface is
+      // unavailable.
+      //
+      *Present  = *BootCpuCount;
+      *Possible = *BootCpuCount;
+      return;
+    }
+
+    ASSERT (FALSE);
+    CpuDeadLoop ();
+  }
+
   //
   // Sanity check: fw_cfg and the modern CPU hotplug interface should expose the
   // same boot CPU count.
@@ -596,6 +680,9 @@ PlatformMaxCpuCountInitialization (
       } while (Selected > 0);
 
       PlatformCpuCountBugCheck (&BootCpuCount, &Present, &Possible);
+      ASSERT (Present > 0);
+      ASSERT (Present <= Possible);
+      ASSERT (BootCpuCount == Present);
 
       MaxCpuCount = Possible;
     }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [edk2-devel] [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
  2023-01-19 11:01 ` [PATCH v3 1/2] OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck() Laszlo Ersek
  2023-01-19 11:01 ` [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
@ 2023-01-19 11:25 ` Michael Brown
  2023-01-19 12:05 ` Gerd Hoffmann
  2023-01-20 13:48 ` Laszlo Ersek
  4 siblings, 0 replies; 12+ messages in thread
From: Michael Brown @ 2023-01-19 11:25 UTC (permalink / raw)
  To: devel, lersek
  Cc: Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

On 19/01/2023 11:01, Laszlo Ersek wrote:
>> PlatformCpuCountBugCheck: Present=0 Possible=1
>> PlatformCpuCountBugCheck: Broken CPU hotplug register block found. Update QEMU to version 8+, or
>> PlatformCpuCountBugCheck: to a stable release with commit dab30fbef389 backported. Refer to
>> PlatformCpuCountBugCheck: <https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.
>> PlatformCpuCountBugCheck: Consequences of the QEMU bug may include, but are not limited to:
>> PlatformCpuCountBugCheck: - all firmware logic, dependent on the CPU hotplug register block,
>> PlatformCpuCountBugCheck:   being confused, for example, multiprocessing-related logic;
>> PlatformCpuCountBugCheck: - guest OS data loss, including filesystem corruption, due to crash or
>> PlatformCpuCountBugCheck:   hang during ACPI S3 resume;
>> PlatformCpuCountBugCheck: - SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI
>> PlatformCpuCountBugCheck:   agent, against the platform firmware.
>> PlatformCpuCountBugCheck: These symptoms need not necessarily be limited to the QEMU user
>> PlatformCpuCountBugCheck: attempting to hot(un)plug a CPU.
>> PlatformCpuCountBugCheck: The firmware will now stop (hang) deliberately, in order to prevent the
>> PlatformCpuCountBugCheck: above symptoms.
>> PlatformCpuCountBugCheck: You can forcibly override the hang, *at your own risk*, with the
>> PlatformCpuCountBugCheck: following *experimental* QEMU command line option:
>> PlatformCpuCountBugCheck:   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>> PlatformCpuCountBugCheck: Please only report such bugs that you can reproduce *without* the
>> PlatformCpuCountBugCheck: override.

Laszlo: thank you for taking the time to deal with this so thoroughly 
and to see it through to its conclusion.

For what it's worth:

Hugely-appreciated-by: Michael Brown <mcb30@ipxe.org>

Thanks,

Michael


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:01 ` [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
@ 2023-01-19 11:27   ` Ard Biesheuvel
  2023-01-20  8:50     ` Laszlo Ersek
  0 siblings, 1 reply; 12+ messages in thread
From: Ard Biesheuvel @ 2023-01-19 11:27 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek <lersek@redhat.com> wrote:
>
> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
> protocol is (effectively) broken such that it suggests that switching from
> the legacy interface to the modern interface works, but in reality the
> switch never happens. The symptom has been witnessed when using TCG
> acceleration; KVM seems to mask the issue. The issue persists with the
> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
> there is no stable release that addresses the problem.
>
> The QEMU bug confuses the Present and Possible counting in function
> PlatformMaxCpuCountInitialization(), in
> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
> the privilege level of SMM) is not that great.
>
> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
> and *hang* if the issue is present.
>
> Users willing to take risks can override the hang with the experimental
> QEMU command line option
>
>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>
> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
> as far it concerns the firmware, is experimental.)
>
> The problem was originally reported by Ard [0]. We analyzed it at [1] and
> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
> block", 2023-01-08), to be included in QEMU v8.0.0.
>
> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
>
> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
>
> [2] IO port write width clamping differs between TCG and KVM
>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>
> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
>
> NOTE: PlatformInitLib is used in the following platform DSCs:
>
>   OvmfPkg/AmdSev/AmdSevX64.dsc
>   OvmfPkg/CloudHv/CloudHvX64.dsc
>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
>   OvmfPkg/Microvm/MicrovmX64.dsc
>   OvmfPkg/OvmfPkgIa32.dsc
>   OvmfPkg/OvmfPkgIa32X64.dsc
>   OvmfPkg/OvmfPkgX64.dsc
>
> but I can only test this change with the last three platforms, running on
> QEMU.
>
> Test results:
>
>   TCG  QEMU     OVMF     override  result
>        patched  patched
>   ---  -------  -------  --------  --------------------------------------
>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
>                                    the QEMU bug anyway)
>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
>                                    the QEMU bug anyway)
>   1    0        0        0         boot with broken CPU counts (original
>                                    QEMU bug)
>   1    0        1        0         broken CPU count caught (boot hangs)
>   1    0        1        1         broken CPU count caught, bug check
>                                    overridden, boot continues
>   1    1        0        0         CPU counts OK (QEMU fix)
>   1    1        1        0         CPU counts OK (QEMU fix)
>
> Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
> Cc: Brijesh Singh <brijesh.singh@amd.com>
> Cc: Erdem Aktas <erdemaktas@google.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: Jiewen Yao <jiewen.yao@intel.com>
> Cc: Jordan Justen <jordan.l.justen@intel.com>
> Cc: Michael Brown <mcb30@ipxe.org>
> Cc: Min Xu <min.m.xu@intel.com>
> Cc: Oliver Steffen <osteffen@redhat.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
> Signed-off-by: Laszlo Ersek <lersek@redhat.com>

Thanks a lot for taking the time and investing the effort. I'm quite
happy that we have this 'escape hatch' now, which we could arguably
use temporarily in the VS2019 platform CI until its QEMU binary gets
updated, right?

For the series,

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>



> ---
>
> Notes:
>     v3:
>
>     - reimplement the bug check in the factored out function
>       PlatformCpuCountBugCheck()
>
>     - add override [Ard, Gerd, Michael]
>
>     - assert "0 < Present == BootCpuCount <= Possible" whenever
>       PlatformCpuCountBugCheck() returns
>
>     - update commit message
>
>     - update test matrix in commit message
>
>     - (re)test all "OVMF patched = 1" cases (sigh)
>
>     v2:
>
>     - V1 was at
>       <http://mid.mail-archive.com/20230104151234.286030-1-lersek@redhat.com>.
>
>     - Repo: <https://pagure.io/lersek/edk2.git>, branch:
>       cpuhp-reg-catch-4250-v2
>
>     - Remove KVM as a proposed workaround from the error message, because in
>       the QEMU discussion, we had found that the KVM accelerator's behavior
>       in QEMU (masking the problem) was not right, and that a fix for that
>       had been in progress for quite some time.
>
>     - Add the QEMU commit hash to the commit message, the code comment, and
>       the error message.
>
>     - Pick up Gerd's R-b; add Oliver to the Cc list.
>
>  OvmfPkg/Library/PlatformInitLib/Platform.c | 87 ++++++++++++++++++++
>  1 file changed, 87 insertions(+)
>
> diff --git a/OvmfPkg/Library/PlatformInitLib/Platform.c b/OvmfPkg/Library/PlatformInitLib/Platform.c
> index d1be5c2d7970..9fee6e481038 100644
> --- a/OvmfPkg/Library/PlatformInitLib/Platform.c
> +++ b/OvmfPkg/Library/PlatformInitLib/Platform.c
> @@ -36,6 +36,9 @@
>
>  #include <Library/PlatformInitLib.h>
>
> +#define CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE \
> +  "opt/org.tianocore/X-Cpuhp-Bugcheck-Override"
> +
>  VOID
>  EFIAPI
>  PlatformAddIoMemoryBaseSizeHob (
> @@ -437,6 +440,87 @@ PlatformCpuCountBugCheck (
>  {
>    ASSERT (*BootCpuCount > 0);
>
> +  //
> +  // Sanity check: we need at least 1 present CPU (CPU#0 is always present).
> +  //
> +  // The legacy-to-modern switching of the CPU hotplug register block got broken
> +  // (for TCG) in QEMU v5.1.0. Refer to "IO port write width clamping differs
> +  // between TCG and KVM" at
> +  // <http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com>
> +  // or at
> +  // <https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html>.
> +  //
> +  // QEMU received the fix in commit dab30fbef389 ("acpi: cpuhp: fix
> +  // guest-visible maximum access size to the legacy reg block", 2023-01-08), to
> +  // be included in QEMU v8.0.0.
> +  //
> +  // If we're affected by this QEMU bug, then we must not continue: it confuses
> +  // the multiprocessing in UefiCpuPkg/Library/MpInitLib, and breaks CPU
> +  // hot(un)plug with SMI in OvmfPkg/CpuHotplugSmm.
> +  //
> +  if (*Present == 0) {
> +    UINTN                      Idx;
> +    STATIC CONST CHAR8 *CONST  Message[] = {
> +      "Broken CPU hotplug register block found. Update QEMU to version 8+, or",
> +      "to a stable release with commit dab30fbef389 backported. Refer to",
> +      "<https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.",
> +      "Consequences of the QEMU bug may include, but are not limited to:",
> +      "- all firmware logic, dependent on the CPU hotplug register block,",
> +      "  being confused, for example, multiprocessing-related logic;",
> +      "- guest OS data loss, including filesystem corruption, due to crash or",
> +      "  hang during ACPI S3 resume;",
> +      "- SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI",
> +      "  agent, against the platform firmware.",
> +      "These symptoms need not necessarily be limited to the QEMU user",
> +      "attempting to hot(un)plug a CPU.",
> +      "The firmware will now stop (hang) deliberately, in order to prevent the",
> +      "above symptoms.",
> +      "You can forcibly override the hang, *at your own risk*, with the",
> +      "following *experimental* QEMU command line option:",
> +      "  -fw_cfg name=" CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE ",string=yes",
> +      "Please only report such bugs that you can reproduce *without* the",
> +      "override.",
> +    };
> +    RETURN_STATUS              ParseStatus;
> +    BOOLEAN                    Override;
> +
> +    DEBUG ((
> +      DEBUG_ERROR,
> +      "%a: Present=%u Possible=%u\n",
> +      __FUNCTION__,
> +      *Present,
> +      *Possible
> +      ));
> +    for (Idx = 0; Idx < ARRAY_SIZE (Message); ++Idx) {
> +      DEBUG ((DEBUG_ERROR, "%a: %a\n", __FUNCTION__, Message[Idx]));
> +    }
> +
> +    ParseStatus = QemuFwCfgParseBool (
> +                    CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE,
> +                    &Override
> +                    );
> +    if (!RETURN_ERROR (ParseStatus) && Override) {
> +      DEBUG ((
> +        DEBUG_WARN,
> +        "%a: \"%a\" active. You've been warned.\n",
> +        __FUNCTION__,
> +        CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE
> +        ));
> +      //
> +      // The bug is in QEMU v5.1.0+, where we're not affected by the QEMU v2.7
> +      // reset bug, so BootCpuCount from fw_cfg is reliable. Assume a fully
> +      // populated topology, like when the modern CPU hotplug interface is
> +      // unavailable.
> +      //
> +      *Present  = *BootCpuCount;
> +      *Possible = *BootCpuCount;
> +      return;
> +    }
> +
> +    ASSERT (FALSE);
> +    CpuDeadLoop ();
> +  }
> +
>    //
>    // Sanity check: fw_cfg and the modern CPU hotplug interface should expose the
>    // same boot CPU count.
> @@ -596,6 +680,9 @@ PlatformMaxCpuCountInitialization (
>        } while (Selected > 0);
>
>        PlatformCpuCountBugCheck (&BootCpuCount, &Present, &Possible);
> +      ASSERT (Present > 0);
> +      ASSERT (Present <= Possible);
> +      ASSERT (BootCpuCount == Present);
>
>        MaxCpuCount = Possible;
>      }

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
                   ` (2 preceding siblings ...)
  2023-01-19 11:25 ` [edk2-devel] [PATCH v3 0/2] " Michael Brown
@ 2023-01-19 12:05 ` Gerd Hoffmann
  2023-01-20 13:48 ` Laszlo Ersek
  4 siblings, 0 replies; 12+ messages in thread
From: Gerd Hoffmann @ 2023-01-19 12:05 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

On Thu, Jan 19, 2023 at 12:01:29PM +0100, Laszlo Ersek wrote:
> Repo:       https://pagure.io/lersek/edk2.git
> Branch:     cpuhp-reg-catch-4250-v3
> Test build: https://github.com/tianocore/edk2/pull/3930
> Bugzilla:   https://bugzilla.tianocore.org/show_bug.cgi?id=4250

Having the quirks contained in the new PlatformCpuCountBugCheck()
is a nice cleanup.

Series:
Acked-by: Gerd Hoffmann <kraxel@redhat.com>

thanks & take care,
  Gerd


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:27   ` Ard Biesheuvel
@ 2023-01-20  8:50     ` Laszlo Ersek
  2023-01-20  9:10       ` Ard Biesheuvel
  2023-01-20  9:17       ` Laszlo Ersek
  0 siblings, 2 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-20  8:50 UTC (permalink / raw)
  To: Ard Biesheuvel, Oliver Steffen
  Cc: devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Sebastien Boeuf, Tom Lendacky

a couple of requests to Oliver below:

On 1/19/23 12:27, Ard Biesheuvel wrote:
> On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
>> protocol is (effectively) broken such that it suggests that switching from
>> the legacy interface to the modern interface works, but in reality the
>> switch never happens. The symptom has been witnessed when using TCG
>> acceleration; KVM seems to mask the issue. The issue persists with the
>> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
>> there is no stable release that addresses the problem.
>>
>> The QEMU bug confuses the Present and Possible counting in function
>> PlatformMaxCpuCountInitialization(), in
>> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
>> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
>> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
>> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
>> the privilege level of SMM) is not that great.
>>
>> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
>> and *hang* if the issue is present.
>>
>> Users willing to take risks can override the hang with the experimental
>> QEMU command line option
>>
>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>
>> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
>> as far it concerns the firmware, is experimental.)
>>
>> The problem was originally reported by Ard [0]. We analyzed it at [1] and
>> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
>> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
>> block", 2023-01-08), to be included in QEMU v8.0.0.
>>
>> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
>>
>> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
>>
>> [2] IO port write width clamping differs between TCG and KVM
>>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>>
>> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
>>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
>>
>> NOTE: PlatformInitLib is used in the following platform DSCs:
>>
>>   OvmfPkg/AmdSev/AmdSevX64.dsc
>>   OvmfPkg/CloudHv/CloudHvX64.dsc
>>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
>>   OvmfPkg/Microvm/MicrovmX64.dsc
>>   OvmfPkg/OvmfPkgIa32.dsc
>>   OvmfPkg/OvmfPkgIa32X64.dsc
>>   OvmfPkg/OvmfPkgX64.dsc
>>
>> but I can only test this change with the last three platforms, running on
>> QEMU.
>>
>> Test results:
>>
>>   TCG  QEMU     OVMF     override  result
>>        patched  patched
>>   ---  -------  -------  --------  --------------------------------------
>>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
>>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
>>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
>>                                    the QEMU bug anyway)
>>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
>>                                    the QEMU bug anyway)
>>   1    0        0        0         boot with broken CPU counts (original
>>                                    QEMU bug)
>>   1    0        1        0         broken CPU count caught (boot hangs)
>>   1    0        1        1         broken CPU count caught, bug check
>>                                    overridden, boot continues
>>   1    1        0        0         CPU counts OK (QEMU fix)
>>   1    1        1        0         CPU counts OK (QEMU fix)
>>
>> Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
>> Cc: Brijesh Singh <brijesh.singh@amd.com>
>> Cc: Erdem Aktas <erdemaktas@google.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: James Bottomley <jejb@linux.ibm.com>
>> Cc: Jiewen Yao <jiewen.yao@intel.com>
>> Cc: Jordan Justen <jordan.l.justen@intel.com>
>> Cc: Michael Brown <mcb30@ipxe.org>
>> Cc: Min Xu <min.m.xu@intel.com>
>> Cc: Oliver Steffen <osteffen@redhat.com>
>> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> 
> Thanks a lot for taking the time and investing the effort. I'm quite
> happy that we have this 'escape hatch' now, which we could arguably
> use temporarily in the VS2019 platform CI until its QEMU binary gets
> updated, right?

Yes, I have to agree there.

Right now, because those QEMU binaries are affected by the regression,
and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
the interference of Present=0 with the QEMU v2.7 reset bug workaround,
we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).

Then, in the "predictable subset" of consequences of the QEMU
regression, we can say that MpInitLib interprets the above PCD values as
"uniprocessor system with the boot CPU count not exposed by the
platform". This (i.e., *just this*) does not fall outside of MpInitLib's
domain (again, note my qualification "predictable subset").

Now, if we apply the patch and also add the -fw_cfg switch to the
Windows CI, *and* we also don't add any -smp flags (as far as I can
tell, no -smp flag is used now), then the new PCD state will be

PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
PcdCpuMaxLogicalProcessorNumber=1 (stays the same)

As far as I can tell, *right now* this change should have no effect *in
MpInitLib*, IOW nothing gets worse or better there. Namely,
PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
only when InitFlag == ApInitConfig. InitFlag is set like that only in
CollectProcessorCount(). However, CollectProcessorCount() is only called
if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
in MpInitLibInitialize()). Meaning in effect that
PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.

Oliver:

(1) can you please post a patch for the Windows CI so that the following
option be passed to QEMU:

  -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes

(This option is harmless when the firmware does not determine the QEMU
bug, so it can be passed in advance; it will have no consequence at all.)

In the patch, please reference

  https://bugzilla.tianocore.org/show_bug.cgi?id=4250

(2) Please file a separate TianoCore BZ for *backing out* the change (=
for removing the -fw_cfg switch), and assign it to yourself :)

Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
should be shut welded down.

(3) Please give me a hint when the CI patch (1) has been merged; then I
can go ahead and merge this v3 series as well.

Thanks!
Laszlo

> 
> For the series,
> 
> Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
> 
> 
> 
>> ---
>>
>> Notes:
>>     v3:
>>
>>     - reimplement the bug check in the factored out function
>>       PlatformCpuCountBugCheck()
>>
>>     - add override [Ard, Gerd, Michael]
>>
>>     - assert "0 < Present == BootCpuCount <= Possible" whenever
>>       PlatformCpuCountBugCheck() returns
>>
>>     - update commit message
>>
>>     - update test matrix in commit message
>>
>>     - (re)test all "OVMF patched = 1" cases (sigh)
>>
>>     v2:
>>
>>     - V1 was at
>>       <http://mid.mail-archive.com/20230104151234.286030-1-lersek@redhat.com>.
>>
>>     - Repo: <https://pagure.io/lersek/edk2.git>, branch:
>>       cpuhp-reg-catch-4250-v2
>>
>>     - Remove KVM as a proposed workaround from the error message, because in
>>       the QEMU discussion, we had found that the KVM accelerator's behavior
>>       in QEMU (masking the problem) was not right, and that a fix for that
>>       had been in progress for quite some time.
>>
>>     - Add the QEMU commit hash to the commit message, the code comment, and
>>       the error message.
>>
>>     - Pick up Gerd's R-b; add Oliver to the Cc list.
>>
>>  OvmfPkg/Library/PlatformInitLib/Platform.c | 87 ++++++++++++++++++++
>>  1 file changed, 87 insertions(+)
>>
>> diff --git a/OvmfPkg/Library/PlatformInitLib/Platform.c b/OvmfPkg/Library/PlatformInitLib/Platform.c
>> index d1be5c2d7970..9fee6e481038 100644
>> --- a/OvmfPkg/Library/PlatformInitLib/Platform.c
>> +++ b/OvmfPkg/Library/PlatformInitLib/Platform.c
>> @@ -36,6 +36,9 @@
>>
>>  #include <Library/PlatformInitLib.h>
>>
>> +#define CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE \
>> +  "opt/org.tianocore/X-Cpuhp-Bugcheck-Override"
>> +
>>  VOID
>>  EFIAPI
>>  PlatformAddIoMemoryBaseSizeHob (
>> @@ -437,6 +440,87 @@ PlatformCpuCountBugCheck (
>>  {
>>    ASSERT (*BootCpuCount > 0);
>>
>> +  //
>> +  // Sanity check: we need at least 1 present CPU (CPU#0 is always present).
>> +  //
>> +  // The legacy-to-modern switching of the CPU hotplug register block got broken
>> +  // (for TCG) in QEMU v5.1.0. Refer to "IO port write width clamping differs
>> +  // between TCG and KVM" at
>> +  // <http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com>
>> +  // or at
>> +  // <https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html>.
>> +  //
>> +  // QEMU received the fix in commit dab30fbef389 ("acpi: cpuhp: fix
>> +  // guest-visible maximum access size to the legacy reg block", 2023-01-08), to
>> +  // be included in QEMU v8.0.0.
>> +  //
>> +  // If we're affected by this QEMU bug, then we must not continue: it confuses
>> +  // the multiprocessing in UefiCpuPkg/Library/MpInitLib, and breaks CPU
>> +  // hot(un)plug with SMI in OvmfPkg/CpuHotplugSmm.
>> +  //
>> +  if (*Present == 0) {
>> +    UINTN                      Idx;
>> +    STATIC CONST CHAR8 *CONST  Message[] = {
>> +      "Broken CPU hotplug register block found. Update QEMU to version 8+, or",
>> +      "to a stable release with commit dab30fbef389 backported. Refer to",
>> +      "<https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.",
>> +      "Consequences of the QEMU bug may include, but are not limited to:",
>> +      "- all firmware logic, dependent on the CPU hotplug register block,",
>> +      "  being confused, for example, multiprocessing-related logic;",
>> +      "- guest OS data loss, including filesystem corruption, due to crash or",
>> +      "  hang during ACPI S3 resume;",
>> +      "- SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI",
>> +      "  agent, against the platform firmware.",
>> +      "These symptoms need not necessarily be limited to the QEMU user",
>> +      "attempting to hot(un)plug a CPU.",
>> +      "The firmware will now stop (hang) deliberately, in order to prevent the",
>> +      "above symptoms.",
>> +      "You can forcibly override the hang, *at your own risk*, with the",
>> +      "following *experimental* QEMU command line option:",
>> +      "  -fw_cfg name=" CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE ",string=yes",
>> +      "Please only report such bugs that you can reproduce *without* the",
>> +      "override.",
>> +    };
>> +    RETURN_STATUS              ParseStatus;
>> +    BOOLEAN                    Override;
>> +
>> +    DEBUG ((
>> +      DEBUG_ERROR,
>> +      "%a: Present=%u Possible=%u\n",
>> +      __FUNCTION__,
>> +      *Present,
>> +      *Possible
>> +      ));
>> +    for (Idx = 0; Idx < ARRAY_SIZE (Message); ++Idx) {
>> +      DEBUG ((DEBUG_ERROR, "%a: %a\n", __FUNCTION__, Message[Idx]));
>> +    }
>> +
>> +    ParseStatus = QemuFwCfgParseBool (
>> +                    CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE,
>> +                    &Override
>> +                    );
>> +    if (!RETURN_ERROR (ParseStatus) && Override) {
>> +      DEBUG ((
>> +        DEBUG_WARN,
>> +        "%a: \"%a\" active. You've been warned.\n",
>> +        __FUNCTION__,
>> +        CPUHP_BUGCHECK_OVERRIDE_FWCFG_FILE
>> +        ));
>> +      //
>> +      // The bug is in QEMU v5.1.0+, where we're not affected by the QEMU v2.7
>> +      // reset bug, so BootCpuCount from fw_cfg is reliable. Assume a fully
>> +      // populated topology, like when the modern CPU hotplug interface is
>> +      // unavailable.
>> +      //
>> +      *Present  = *BootCpuCount;
>> +      *Possible = *BootCpuCount;
>> +      return;
>> +    }
>> +
>> +    ASSERT (FALSE);
>> +    CpuDeadLoop ();
>> +  }
>> +
>>    //
>>    // Sanity check: fw_cfg and the modern CPU hotplug interface should expose the
>>    // same boot CPU count.
>> @@ -596,6 +680,9 @@ PlatformMaxCpuCountInitialization (
>>        } while (Selected > 0);
>>
>>        PlatformCpuCountBugCheck (&BootCpuCount, &Present, &Possible);
>> +      ASSERT (Present > 0);
>> +      ASSERT (Present <= Possible);
>> +      ASSERT (BootCpuCount == Present);
>>
>>        MaxCpuCount = Possible;
>>      }
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-20  8:50     ` Laszlo Ersek
@ 2023-01-20  9:10       ` Ard Biesheuvel
  2023-01-20 12:55         ` Laszlo Ersek
  2023-01-20  9:17       ` Laszlo Ersek
  1 sibling, 1 reply; 12+ messages in thread
From: Ard Biesheuvel @ 2023-01-20  9:10 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Oliver Steffen, devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas,
	Gerd Hoffmann, James Bottomley, Jiewen Yao, Jordan Justen,
	Michael Brown, Min Xu, Sebastien Boeuf, Tom Lendacky

On Fri, 20 Jan 2023 at 09:50, Laszlo Ersek <lersek@redhat.com> wrote:
>
> a couple of requests to Oliver below:
>
> On 1/19/23 12:27, Ard Biesheuvel wrote:
> > On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek <lersek@redhat.com> wrote:
> >>
> >> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
> >> protocol is (effectively) broken such that it suggests that switching from
> >> the legacy interface to the modern interface works, but in reality the
> >> switch never happens. The symptom has been witnessed when using TCG
> >> acceleration; KVM seems to mask the issue. The issue persists with the
> >> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
> >> there is no stable release that addresses the problem.
> >>
> >> The QEMU bug confuses the Present and Possible counting in function
> >> PlatformMaxCpuCountInitialization(), in
> >> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
> >> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
> >> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
> >> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
> >> the privilege level of SMM) is not that great.
> >>
> >> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
> >> and *hang* if the issue is present.
> >>
> >> Users willing to take risks can override the hang with the experimental
> >> QEMU command line option
> >>
> >>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
> >>
> >> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
> >> as far it concerns the firmware, is experimental.)
> >>
> >> The problem was originally reported by Ard [0]. We analyzed it at [1] and
> >> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
> >> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
> >> block", 2023-01-08), to be included in QEMU v8.0.0.
> >>
> >> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
> >>
> >> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
> >>
> >> [2] IO port write width clamping differs between TCG and KVM
> >>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> >>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> >>
> >> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
> >>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
> >>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
> >>
> >> NOTE: PlatformInitLib is used in the following platform DSCs:
> >>
> >>   OvmfPkg/AmdSev/AmdSevX64.dsc
> >>   OvmfPkg/CloudHv/CloudHvX64.dsc
> >>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
> >>   OvmfPkg/Microvm/MicrovmX64.dsc
> >>   OvmfPkg/OvmfPkgIa32.dsc
> >>   OvmfPkg/OvmfPkgIa32X64.dsc
> >>   OvmfPkg/OvmfPkgX64.dsc
> >>
> >> but I can only test this change with the last three platforms, running on
> >> QEMU.
> >>
> >> Test results:
> >>
> >>   TCG  QEMU     OVMF     override  result
> >>        patched  patched
> >>   ---  -------  -------  --------  --------------------------------------
> >>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
> >>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
> >>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
> >>                                    the QEMU bug anyway)
> >>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
> >>                                    the QEMU bug anyway)
> >>   1    0        0        0         boot with broken CPU counts (original
> >>                                    QEMU bug)
> >>   1    0        1        0         broken CPU count caught (boot hangs)
> >>   1    0        1        1         broken CPU count caught, bug check
> >>                                    overridden, boot continues
> >>   1    1        0        0         CPU counts OK (QEMU fix)
> >>   1    1        1        0         CPU counts OK (QEMU fix)
> >>
> >> Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
> >> Cc: Brijesh Singh <brijesh.singh@amd.com>
> >> Cc: Erdem Aktas <erdemaktas@google.com>
> >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >> Cc: James Bottomley <jejb@linux.ibm.com>
> >> Cc: Jiewen Yao <jiewen.yao@intel.com>
> >> Cc: Jordan Justen <jordan.l.justen@intel.com>
> >> Cc: Michael Brown <mcb30@ipxe.org>
> >> Cc: Min Xu <min.m.xu@intel.com>
> >> Cc: Oliver Steffen <osteffen@redhat.com>
> >> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> >> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> >> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
> >> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
> >
> > Thanks a lot for taking the time and investing the effort. I'm quite
> > happy that we have this 'escape hatch' now, which we could arguably
> > use temporarily in the VS2019 platform CI until its QEMU binary gets
> > updated, right?
>
> Yes, I have to agree there.
>
> Right now, because those QEMU binaries are affected by the regression,
> and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
> the interference of Present=0 with the QEMU v2.7 reset bug workaround,
> we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
> Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
> PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
> PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).
>
> Then, in the "predictable subset" of consequences of the QEMU
> regression, we can say that MpInitLib interprets the above PCD values as
> "uniprocessor system with the boot CPU count not exposed by the
> platform". This (i.e., *just this*) does not fall outside of MpInitLib's
> domain (again, note my qualification "predictable subset").
>
> Now, if we apply the patch and also add the -fw_cfg switch to the
> Windows CI, *and* we also don't add any -smp flags (as far as I can
> tell, no -smp flag is used now), then the new PCD state will be
>
> PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
> PcdCpuMaxLogicalProcessorNumber=1 (stays the same)
>
> As far as I can tell, *right now* this change should have no effect *in
> MpInitLib*, IOW nothing gets worse or better there. Namely,
> PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
> only when InitFlag == ApInitConfig. InitFlag is set like that only in
> CollectProcessorCount(). However, CollectProcessorCount() is only called
> if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
> in MpInitLibInitialize()). Meaning in effect that
> PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
> irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.
>
> Oliver:
>
> (1) can you please post a patch for the Windows CI so that the following
> option be passed to QEMU:
>
>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>
> (This option is harmless when the firmware does not determine the QEMU
> bug, so it can be passed in advance; it will have no consequence at all.)
>
> In the patch, please reference
>
>   https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>

Can I take the above as an ack on

https://edk2.groups.io/g/devel/message/98899

?

> (2) Please file a separate TianoCore BZ for *backing out* the change (=
> for removing the -fw_cfg switch), and assign it to yourself :)
>
> Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
> should be shut welded down.
>
> (3) Please give me a hint when the CI patch (1) has been merged; then I
> can go ahead and merge this v3 series as well.
>

I'll merge the whole lot once you're happy with the CI patch.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-20  8:50     ` Laszlo Ersek
  2023-01-20  9:10       ` Ard Biesheuvel
@ 2023-01-20  9:17       ` Laszlo Ersek
  2023-01-20  9:19         ` Laszlo Ersek
  1 sibling, 1 reply; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-20  9:17 UTC (permalink / raw)
  To: Ard Biesheuvel, Oliver Steffen
  Cc: devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Sebastien Boeuf, Tom Lendacky

On 1/20/23 09:50, Laszlo Ersek wrote:

> Oliver:
>
> (1) can you please post a patch for the Windows CI so that the
> following option be passed to QEMU:
>
>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>
> (This option is harmless when the firmware does not determine the QEMU
> bug, so it can be passed in advance; it will have no consequence at
> all.)
>
> In the patch, please reference
>
>   https://bugzilla.tianocore.org/show_bug.cgi?id=4250

I *think* the call chain is something like

Platform_CI              [edk2/OvmfPkg/PlatformCI/.azurepipelines/Windows-VS2019.yml]
  FlashRomImage          [edk2-pytool-extensions/edk2toolext/environment/uefi_build.py]
    FlashRomImage        [edk2/OvmfPkg/PlatformCI/PlatformBuildLib.py]
      qemu-system-x86_64 [edk2/OvmfPkg/PlatformCI/PlatformBuildLib.py]

So I believe a new variable, similar to QEMU_HEADLESS, should be
introduced, in "OvmfPkg/PlatformCI/PlatformBuildLib.py" and
"OvmfPkg/PlatformCI/.azurepipelines/Windows-VS2019.yml", for controlling
the above command line switch. Maybe it should be documented even, in
"OvmfPkg/PlatformCI/ReadMe.md".

Call the new variable QEMU_X_CPUHP_BUGCHECK_OVERRIDE?

Thank you!
Laszlo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-20  9:17       ` Laszlo Ersek
@ 2023-01-20  9:19         ` Laszlo Ersek
  0 siblings, 0 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-20  9:19 UTC (permalink / raw)
  To: Ard Biesheuvel, Oliver Steffen
  Cc: devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Sebastien Boeuf, Tom Lendacky

On 1/20/23 10:17, Laszlo Ersek wrote:
> On 1/20/23 09:50, Laszlo Ersek wrote:
> 
>> Oliver:
>>
>> (1) can you please post a patch for the Windows CI so that the
>> following option be passed to QEMU:
>>
>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>
>> (This option is harmless when the firmware does not determine the QEMU
>> bug, so it can be passed in advance; it will have no consequence at
>> all.)
>>
>> In the patch, please reference
>>
>>   https://bugzilla.tianocore.org/show_bug.cgi?id=4250
> 
> I *think* the call chain is something like
> 
> Platform_CI              [edk2/OvmfPkg/PlatformCI/.azurepipelines/Windows-VS2019.yml]
>   FlashRomImage          [edk2-pytool-extensions/edk2toolext/environment/uefi_build.py]
>     FlashRomImage        [edk2/OvmfPkg/PlatformCI/PlatformBuildLib.py]
>       qemu-system-x86_64 [edk2/OvmfPkg/PlatformCI/PlatformBuildLib.py]
> 
> So I believe a new variable, similar to QEMU_HEADLESS, should be
> introduced, in "OvmfPkg/PlatformCI/PlatformBuildLib.py" and
> "OvmfPkg/PlatformCI/.azurepipelines/Windows-VS2019.yml", for controlling
> the above command line switch. Maybe it should be documented even, in
> "OvmfPkg/PlatformCI/ReadMe.md".
> 
> Call the new variable QEMU_X_CPUHP_BUGCHECK_OVERRIDE?

Haha, I'm sooo late; Ard has already posted such patches ;)

Laszlo


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-20  9:10       ` Ard Biesheuvel
@ 2023-01-20 12:55         ` Laszlo Ersek
  0 siblings, 0 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-20 12:55 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Oliver Steffen, devel, Ard Biesheuvel, Brijesh Singh, Erdem Aktas,
	Gerd Hoffmann, James Bottomley, Jiewen Yao, Jordan Justen,
	Michael Brown, Min Xu, Sebastien Boeuf, Tom Lendacky

On 1/20/23 10:10, Ard Biesheuvel wrote:
> On Fri, 20 Jan 2023 at 09:50, Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> a couple of requests to Oliver below:
>>
>> On 1/19/23 12:27, Ard Biesheuvel wrote:
>>> On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek <lersek@redhat.com> wrote:
>>>>
>>>> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
>>>> protocol is (effectively) broken such that it suggests that switching from
>>>> the legacy interface to the modern interface works, but in reality the
>>>> switch never happens. The symptom has been witnessed when using TCG
>>>> acceleration; KVM seems to mask the issue. The issue persists with the
>>>> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
>>>> there is no stable release that addresses the problem.
>>>>
>>>> The QEMU bug confuses the Present and Possible counting in function
>>>> PlatformMaxCpuCountInitialization(), in
>>>> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
>>>> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
>>>> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
>>>> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
>>>> the privilege level of SMM) is not that great.
>>>>
>>>> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
>>>> and *hang* if the issue is present.
>>>>
>>>> Users willing to take risks can override the hang with the experimental
>>>> QEMU command line option
>>>>
>>>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>>>
>>>> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
>>>> as far it concerns the firmware, is experimental.)
>>>>
>>>> The problem was originally reported by Ard [0]. We analyzed it at [1] and
>>>> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
>>>> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
>>>> block", 2023-01-08), to be included in QEMU v8.0.0.
>>>>
>>>> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
>>>>
>>>> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
>>>>
>>>> [2] IO port write width clamping differs between TCG and KVM
>>>>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>>>>
>>>> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
>>>>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
>>>>
>>>> NOTE: PlatformInitLib is used in the following platform DSCs:
>>>>
>>>>   OvmfPkg/AmdSev/AmdSevX64.dsc
>>>>   OvmfPkg/CloudHv/CloudHvX64.dsc
>>>>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
>>>>   OvmfPkg/Microvm/MicrovmX64.dsc
>>>>   OvmfPkg/OvmfPkgIa32.dsc
>>>>   OvmfPkg/OvmfPkgIa32X64.dsc
>>>>   OvmfPkg/OvmfPkgX64.dsc
>>>>
>>>> but I can only test this change with the last three platforms, running on
>>>> QEMU.
>>>>
>>>> Test results:
>>>>
>>>>   TCG  QEMU     OVMF     override  result
>>>>        patched  patched
>>>>   ---  -------  -------  --------  --------------------------------------
>>>>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
>>>>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
>>>>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
>>>>                                    the QEMU bug anyway)
>>>>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
>>>>                                    the QEMU bug anyway)
>>>>   1    0        0        0         boot with broken CPU counts (original
>>>>                                    QEMU bug)
>>>>   1    0        1        0         broken CPU count caught (boot hangs)
>>>>   1    0        1        1         broken CPU count caught, bug check
>>>>                                    overridden, boot continues
>>>>   1    1        0        0         CPU counts OK (QEMU fix)
>>>>   1    1        1        0         CPU counts OK (QEMU fix)
>>>>
>>>> Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
>>>> Cc: Brijesh Singh <brijesh.singh@amd.com>
>>>> Cc: Erdem Aktas <erdemaktas@google.com>
>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>> Cc: James Bottomley <jejb@linux.ibm.com>
>>>> Cc: Jiewen Yao <jiewen.yao@intel.com>
>>>> Cc: Jordan Justen <jordan.l.justen@intel.com>
>>>> Cc: Michael Brown <mcb30@ipxe.org>
>>>> Cc: Min Xu <min.m.xu@intel.com>
>>>> Cc: Oliver Steffen <osteffen@redhat.com>
>>>> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
>>>> Cc: Tom Lendacky <thomas.lendacky@amd.com>
>>>> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>>> Signed-off-by: Laszlo Ersek <lersek@redhat.com>
>>>
>>> Thanks a lot for taking the time and investing the effort. I'm quite
>>> happy that we have this 'escape hatch' now, which we could arguably
>>> use temporarily in the VS2019 platform CI until its QEMU binary gets
>>> updated, right?
>>
>> Yes, I have to agree there.
>>
>> Right now, because those QEMU binaries are affected by the regression,
>> and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
>> the interference of Present=0 with the QEMU v2.7 reset bug workaround,
>> we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
>> Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
>> PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
>> PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).
>>
>> Then, in the "predictable subset" of consequences of the QEMU
>> regression, we can say that MpInitLib interprets the above PCD values as
>> "uniprocessor system with the boot CPU count not exposed by the
>> platform". This (i.e., *just this*) does not fall outside of MpInitLib's
>> domain (again, note my qualification "predictable subset").
>>
>> Now, if we apply the patch and also add the -fw_cfg switch to the
>> Windows CI, *and* we also don't add any -smp flags (as far as I can
>> tell, no -smp flag is used now), then the new PCD state will be
>>
>> PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
>> PcdCpuMaxLogicalProcessorNumber=1 (stays the same)
>>
>> As far as I can tell, *right now* this change should have no effect *in
>> MpInitLib*, IOW nothing gets worse or better there. Namely,
>> PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
>> only when InitFlag == ApInitConfig. InitFlag is set like that only in
>> CollectProcessorCount(). However, CollectProcessorCount() is only called
>> if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
>> in MpInitLibInitialize()). Meaning in effect that
>> PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
>> irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.
>>
>> Oliver:
>>
>> (1) can you please post a patch for the Windows CI so that the following
>> option be passed to QEMU:
>>
>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>
>> (This option is harmless when the firmware does not determine the QEMU
>> bug, so it can be passed in advance; it will have no consequence at all.)
>>
>> In the patch, please reference
>>
>>   https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>
> 
> Can I take the above as an ack on
> 
> https://edk2.groups.io/g/devel/message/98899
> 
> ?
> 
>> (2) Please file a separate TianoCore BZ for *backing out* the change (=
>> for removing the -fw_cfg switch), and assign it to yourself :)
>>
>> Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
>> should be shut welded down.
>>
>> (3) Please give me a hint when the CI patch (1) has been merged; then I
>> can go ahead and merge this v3 series as well.
>>
> 
> I'll merge the whole lot once you're happy with the CI patch.
> 

(/me checks the timestamps of messages :) my tendency to work in batches
has its downsides as well, alas. Sorry about the confusion; I'll proceed
with the merge in the other thread.)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
  2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
                   ` (3 preceding siblings ...)
  2023-01-19 12:05 ` Gerd Hoffmann
@ 2023-01-20 13:48 ` Laszlo Ersek
  4 siblings, 0 replies; 12+ messages in thread
From: Laszlo Ersek @ 2023-01-20 13:48 UTC (permalink / raw)
  To: devel
  Cc: Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann,
	James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu,
	Oliver Steffen, Sebastien Boeuf, Tom Lendacky

On 1/19/23 12:01, Laszlo Ersek wrote:
> Repo:       https://pagure.io/lersek/edk2.git
> Branch:     cpuhp-reg-catch-4250-v3
> Test build: https://github.com/tianocore/edk2/pull/3930
> Bugzilla:   https://bugzilla.tianocore.org/show_bug.cgi?id=4250
> 
> v2 was posted at:
> - http://mid.mail-archive.com/20230112082845.128463-1-lersek@redhat.com
> - https://edk2.groups.io/g/devel/message/98336
> - https://listman.redhat.com/archives/edk2-devel-archive/2023-January/057634.html
> 
> Please see the Notes sections of the patches, regarding the updates.
> 
> The "PlatformCI_OvmfPkg_Windows_VS2019_PR" checks in test build linked
> above time out again (as expected); now with the following debug log:
> 
>> PlatformCpuCountBugCheck: Present=0 Possible=1
>> PlatformCpuCountBugCheck: Broken CPU hotplug register block found. Update QEMU to version 8+, or
>> PlatformCpuCountBugCheck: to a stable release with commit dab30fbef389 backported. Refer to
>> PlatformCpuCountBugCheck: <https://bugzilla.tianocore.org/show_bug.cgi?id=4250>.
>> PlatformCpuCountBugCheck: Consequences of the QEMU bug may include, but are not limited to:
>> PlatformCpuCountBugCheck: - all firmware logic, dependent on the CPU hotplug register block,
>> PlatformCpuCountBugCheck:   being confused, for example, multiprocessing-related logic;
>> PlatformCpuCountBugCheck: - guest OS data loss, including filesystem corruption, due to crash or
>> PlatformCpuCountBugCheck:   hang during ACPI S3 resume;
>> PlatformCpuCountBugCheck: - SMM privilege escalation, by a malicious guest OS or 3rd partty UEFI
>> PlatformCpuCountBugCheck:   agent, against the platform firmware.
>> PlatformCpuCountBugCheck: These symptoms need not necessarily be limited to the QEMU user
>> PlatformCpuCountBugCheck: attempting to hot(un)plug a CPU.
>> PlatformCpuCountBugCheck: The firmware will now stop (hang) deliberately, in order to prevent the
>> PlatformCpuCountBugCheck: above symptoms.
>> PlatformCpuCountBugCheck: You can forcibly override the hang, *at your own risk*, with the
>> PlatformCpuCountBugCheck: following *experimental* QEMU command line option:
>> PlatformCpuCountBugCheck:   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>> PlatformCpuCountBugCheck: Please only report such bugs that you can reproduce *without* the
>> PlatformCpuCountBugCheck: override.
>> Select Item: 0x19
>> ASSERT d:\a\1\s\OvmfPkg\Library\PlatformInitLib\Platform.c(520): ((BOOLEAN)(0==1))
> 
> The "PlatformCI_OvmfPkg_Ubuntu_GCC5_PR" checks succed, probably due to
> edk2 commits ef0916009843 ("CI: Use Fedora 35 container (Linux only)",
> 2023-01-17) and 7fab007f33e9 ("OvmfPkg: CI: Use Fedora 35 container
> (Linux only)", 2023-01-17), and due to
> <https://github.com/tianocore/containers/commit/47addc9a4f20> applying
> the upstream QEMU patch.
> 
> Laszlo
> 
> Cc: Ard Biesheuvel <ardb+tianocore@kernel.org>
> Cc: Brijesh Singh <brijesh.singh@amd.com>
> Cc: Erdem Aktas <erdemaktas@google.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: James Bottomley <jejb@linux.ibm.com>
> Cc: Jiewen Yao <jiewen.yao@intel.com>
> Cc: Jordan Justen <jordan.l.justen@intel.com>
> Cc: Michael Brown <mcb30@ipxe.org>
> Cc: Min Xu <min.m.xu@intel.com>
> Cc: Oliver Steffen <osteffen@redhat.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> 
> Laszlo Ersek (2):
>   OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck()
>   OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
> 
>  OvmfPkg/Library/PlatformInitLib/Platform.c | 168 +++++++++++++++++---
>  1 file changed, 145 insertions(+), 23 deletions(-)
> 

Merged as the last two commits in range 51411435d559..bf5678b58026, via
<https://github.com/tianocore/edk2/pull/3935>.

Laszlo


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-01-20 13:48 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-01-19 11:01 [PATCH v3 0/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
2023-01-19 11:01 ` [PATCH v3 1/2] OvmfPkg/PlatformInitLib: factor out PlatformCpuCountBugCheck() Laszlo Ersek
2023-01-19 11:01 ` [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression Laszlo Ersek
2023-01-19 11:27   ` Ard Biesheuvel
2023-01-20  8:50     ` Laszlo Ersek
2023-01-20  9:10       ` Ard Biesheuvel
2023-01-20 12:55         ` Laszlo Ersek
2023-01-20  9:17       ` Laszlo Ersek
2023-01-20  9:19         ` Laszlo Ersek
2023-01-19 11:25 ` [edk2-devel] [PATCH v3 0/2] " Michael Brown
2023-01-19 12:05 ` Gerd Hoffmann
2023-01-20 13:48 ` Laszlo Ersek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox