From: "Ard Biesheuvel"
Date: Fri, 20 Jan 2023 10:10:14 +0100
Subject: Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
To: Laszlo Ersek
Cc: Oliver Steffen, devel@edk2.groups.io, Ard Biesheuvel, Brijesh Singh,
 Erdem Aktas, Gerd Hoffmann, James Bottomley, Jiewen Yao, Jordan Justen,
 Michael Brown, Min Xu, Sebastien Boeuf, Tom Lendacky
References: <20230119110131.91923-1-lersek@redhat.com>
 <20230119110131.91923-3-lersek@redhat.com>

On Fri, 20 Jan 2023 at 09:50, Laszlo Ersek wrote:
>
> a couple of requests to Oliver below:
>
> On 1/19/23 12:27, Ard Biesheuvel wrote:
> > On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek wrote:
> >>
> >> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
> >> protocol is (effectively) broken such that it suggests that switching from
> >> the legacy interface to the modern interface works, but in reality the
> >> switch never happens. The symptom has been witnessed when using TCG
> >> acceleration; KVM seems to mask the issue. The issue persists with the
> >> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
> >> there is no stable release that addresses the problem.
> >>
> >> The QEMU bug confuses the Present and Possible counting in function
> >> PlatformMaxCpuCountInitialization(), in
> >> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
> >> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
> >> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
> >> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
> >> the privilege level of SMM) is not that great.
> >>
> >> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
> >> and *hang* if the issue is present.
> >>
> >> Users willing to take risks can override the hang with the experimental
> >> QEMU command line option
> >>
> >>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
> >>
> >> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
> >> as far as it concerns the firmware, is experimental.)
> >>
> >> The problem was originally reported by Ard [0]. We analyzed it at [1] and
> >> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
> >> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
> >> block", 2023-01-08), to be included in QEMU v8.0.0.
> >>
> >> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
> >>
> >> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
> >>
> >> [2] IO port write width clamping differs between TCG and KVM
> >>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
> >>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
> >>
> >> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
> >>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
> >>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
> >>
> >> NOTE: PlatformInitLib is used in the following platform DSCs:
> >>
> >>   OvmfPkg/AmdSev/AmdSevX64.dsc
> >>   OvmfPkg/CloudHv/CloudHvX64.dsc
> >>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
> >>   OvmfPkg/Microvm/MicrovmX64.dsc
> >>   OvmfPkg/OvmfPkgIa32.dsc
> >>   OvmfPkg/OvmfPkgIa32X64.dsc
> >>   OvmfPkg/OvmfPkgX64.dsc
> >>
> >> but I can only test this change with the last three platforms, running on
> >> QEMU.
> >>
> >> Test results:
> >>
> >>   TCG  QEMU     OVMF     override  result
> >>        patched  patched
> >>   ---  -------  -------  --------  --------------------------------------
> >>   0    0        0        0         CPU counts OK (KVM masks the QEMU bug)
> >>   0    0        1        0         CPU counts OK (KVM masks the QEMU bug)
> >>   0    1        0        0         CPU counts OK (QEMU fix, but KVM masks
> >>                                    the QEMU bug anyway)
> >>   0    1        1        0         CPU counts OK (QEMU fix, but KVM masks
> >>                                    the QEMU bug anyway)
> >>   1    0        0        0         boot with broken CPU counts (original
> >>                                    QEMU bug)
> >>   1    0        1        0         broken CPU count caught (boot hangs)
> >>   1    0        1        1         broken CPU count caught, bug check
> >>                                    overridden, boot continues
> >>   1    1        0        0         CPU counts OK (QEMU fix)
> >>   1    1        1        0         CPU counts OK (QEMU fix)
> >>
> >> Cc: Ard Biesheuvel
> >> Cc: Brijesh Singh
> >> Cc: Erdem Aktas
> >> Cc: Gerd Hoffmann
> >> Cc: James Bottomley
> >> Cc: Jiewen Yao
> >> Cc: Jordan Justen
> >> Cc: Michael Brown
> >> Cc: Min Xu
> >> Cc: Oliver Steffen
> >> Cc: Sebastien Boeuf
> >> Cc: Tom Lendacky
> >> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
> >> Signed-off-by: Laszlo Ersek
> >
> > Thanks a lot for taking the time and investing the effort. I'm quite
> > happy that we have this 'escape hatch' now, which we could arguably
> > use temporarily in the VS2019 platform CI until its QEMU binary gets
> > updated, right?
>
> Yes, I have to agree there.
>
> Right now, because those QEMU binaries are affected by the regression,
> and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
> the interference of Present=0 with the QEMU v2.7 reset bug workaround,
> we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
> Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
> PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
> PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).
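
The PCD values quoted above can be reproduced with a small toy model. This is
only a sketch in Python, not the firmware code: the function names, the
BootHang exception, and the rule that the override falls back to a
uniprocessor 1/1 state are assumptions chosen to match the values stated in
this thread for a guest started without -smp. The real
PlatformMaxCpuCountInitialization() does more than this.

```python
class BootHang(Exception):
    """Stands in for the firmware printing an error and hanging."""


def read_reg_block(tcg, qemu_patched, present, possible):
    """Return the (Present, Possible) pair the firmware would count.

    Assumption taken from the thread: an unpatched QEMU under TCG makes the
    firmware see Present=0 Possible=1; otherwise the real counts come through.
    """
    if tcg and not qemu_patched:
        return 0, 1
    return present, possible


def cpu_count_pcds(present, possible, ovmf_patched=False, override=False):
    """Return (PcdCpuBootLogicalProcessorNumber, PcdCpuMaxLogicalProcessorNumber)."""
    if ovmf_patched and present == 0:
        # Patched OVMF: the bug check fires on Present=0.
        if not override:
            raise BootHang("broken CPU hotplug register block")
        # Assumption: with the override, treat the guest as uniprocessor.
        # This matches the 1/1 state quoted for the no "-smp" case only.
        return 1, 1
    # Simplified view of the v2.7 reset bug workaround: BootCpuCount follows
    # Present, and MaxCpuCount follows Possible.
    return present, possible


if __name__ == "__main__":
    seen = read_reg_block(tcg=True, qemu_patched=False, present=1, possible=1)
    print(cpu_count_pcds(*seen))                                    # (0, 1)
    print(cpu_count_pcds(*seen, ovmf_patched=True, override=True))  # (1, 1)
```

Calling cpu_count_pcds(0, 1, ovmf_patched=True) without the override raises
BootHang, which corresponds to the "boot hangs" row of the test results.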
>
> Then, in the "predictable subset" of consequences of the QEMU
> regression, we can say that MpInitLib interprets the above PCD values as
> "uniprocessor system with the boot CPU count not exposed by the
> platform". This (i.e., *just this*) does not fall outside of MpInitLib's
> domain (again, note my qualification "predictable subset").
>
> Now, if we apply the patch and also add the -fw_cfg switch to the
> Windows CI, *and* we also don't add any -smp flags (as far as I can
> tell, no -smp flag is used now), then the new PCD state will be
>
>   PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
>   PcdCpuMaxLogicalProcessorNumber=1 (stays the same)
>
> As far as I can tell, *right now* this change should have no effect *in
> MpInitLib*, IOW nothing gets worse or better there. Namely,
> PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
> only when InitFlag == ApInitConfig. InitFlag is set like that only in
> CollectProcessorCount(). However, CollectProcessorCount() is only called
> if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
> in MpInitLibInitialize()). Meaning in effect that
> PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
> irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.
>
> Oliver:
>
> (1) can you please post a patch for the Windows CI so that the following
> option be passed to QEMU:
>
>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>
> (This option is harmless when the firmware does not detect the QEMU
> bug, so it can be passed in advance; it will have no consequence at all.)
>
> In the patch, please reference
>
> https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>

Can I take the above as an ack on
https://edk2.groups.io/g/devel/message/98899 ?

> (2) Please file a separate TianoCore BZ for *backing out* the change (=
> for removing the -fw_cfg switch), and assign it to yourself :)
>
> Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
> should be welded shut.
>
> (3) Please give me a hint when the CI patch (1) has been merged; then I
> can go ahead and merge this v3 series as well.
>

I'll merge the whole lot once you're happy with the CI patch.
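
For anyone following the MpInitLib argument quoted above, the claim that the
0 -> 1 change is invisible can be illustrated with a few lines of Python. This
is a sketch of the reasoning only: mp_init_inputs() is a hypothetical helper,
not an MpInitLib function, and it encodes just the call chain described in the
thread (the boot count PCD is read only when the max count PCD is >1).

```python
def mp_init_inputs(boot_pcd, max_pcd):
    """Return the values this simplified MpInitLib model would act on."""
    if max_pcd > 1:
        # Only this path reaches CollectProcessorCount(), and from there
        # WakeUpAP() with InitFlag == ApInitConfig, where the boot count
        # PCD is consumed.
        return {"collect_processor_count": True, "boot_count_used": boot_pcd}
    # Uniprocessor path: the boot count PCD is never consulted.
    return {"collect_processor_count": False, "boot_count_used": None}


if __name__ == "__main__":
    before = mp_init_inputs(boot_pcd=0, max_pcd=1)  # current CI state
    after = mp_init_inputs(boot_pcd=1, max_pcd=1)   # patch plus override
    print(before == after)  # True
```

The same comparison with a max count above 1 does differ, which is why the
argument only holds as long as the CI does not pass -smp.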