Message-ID: <2690cc01-ae72-5ec5-eb9d-b97a5c5a8368@redhat.com>
Date: Fri, 20 Jan 2023 13:55:05 +0100
Subject: Re: [PATCH v3 2/2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression
To: Ard Biesheuvel
Cc: Oliver
 Steffen, devel@edk2.groups.io, Ard Biesheuvel, Brijesh Singh, Erdem Aktas, Gerd Hoffmann, James Bottomley, Jiewen Yao, Jordan Justen, Michael Brown, Min Xu, Sebastien Boeuf, Tom Lendacky
References: <20230119110131.91923-1-lersek@redhat.com> <20230119110131.91923-3-lersek@redhat.com>
From: "Laszlo Ersek"

On 1/20/23 10:10, Ard Biesheuvel wrote:
> On Fri, 20 Jan 2023 at 09:50, Laszlo Ersek wrote:
>>
>> a couple of requests to Oliver below:
>>
>> On 1/19/23 12:27, Ard Biesheuvel wrote:
>>> On Thu, 19 Jan 2023 at 12:01, Laszlo Ersek wrote:
>>>>
>>>> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the negotiation
>>>> protocol is (effectively) broken such that it suggests that switching from
>>>> the legacy interface to the modern interface works, but in reality the
>>>> switch never happens. The symptom has been witnessed when using TCG
>>>> acceleration; KVM seems to mask the issue. The issue persists with the
>>>> following (latest) stable QEMU releases: v5.2.0, v6.2.0, v7.2.0. Currently
>>>> there is no stable release that addresses the problem.
>>>>
>>>> The QEMU bug confuses the Present and Possible counting in function
>>>> PlatformMaxCpuCountInitialization(), in
>>>> "OvmfPkg/Library/PlatformInitLib/Platform.c". OVMF ends up with Present=0
>>>> Possible=1. This in turn further confuses MpInitLib in UefiCpuPkg (hence
>>>> firmware-time multiprocessing will be broken). Worse, CPU hot(un)plug with
>>>> SMI will be summarily broken in OvmfPkg/CpuHotplugSmm, which (considering
>>>> the privilege level of SMM) is not that great.
>>>>
>>>> Detect the issue in PlatformCpuCountBugCheck(), and print an error message
>>>> and *hang* if the issue is present.
>>>>
>>>> Users willing to take risks can override the hang with the experimental
>>>> QEMU command line option
>>>>
>>>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>>>
>>>> (The "-fw_cfg" QEMU option itself is not experimental; its above argument,
>>>> as far as it concerns the firmware, is experimental.)
>>>>
>>>> The problem was originally reported by Ard [0]. We analyzed it at [1] and
>>>> [2]. A QEMU patch was sent at [3]; now merged as commit dab30fbef389
>>>> ("acpi: cpuhp: fix guest-visible maximum access size to the legacy reg
>>>> block", 2023-01-08), to be included in QEMU v8.0.0.
>>>>
>>>> [0] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c2
>>>>
>>>> [1] https://bugzilla.tianocore.org/show_bug.cgi?id=4234#c3
>>>>
>>>> [2] IO port write width clamping differs between TCG and KVM
>>>>     http://mid.mail-archive.com/aaedee84-d3ed-a4f9-21e7-d221a28d1683@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00199.html
>>>>
>>>> [3] acpi: cpuhp: fix guest-visible maximum access size to the legacy reg block
>>>>     http://mid.mail-archive.com/20230104090138.214862-1-lersek@redhat.com
>>>>     https://lists.gnu.org/archive/html/qemu-devel/2023-01/msg00278.html
>>>>
>>>> NOTE: PlatformInitLib is used in the following platform DSCs:
>>>>
>>>>   OvmfPkg/AmdSev/AmdSevX64.dsc
>>>>   OvmfPkg/CloudHv/CloudHvX64.dsc
>>>>   OvmfPkg/IntelTdx/IntelTdxX64.dsc
>>>>   OvmfPkg/Microvm/MicrovmX64.dsc
>>>>   OvmfPkg/OvmfPkgIa32.dsc
>>>>   OvmfPkg/OvmfPkgIa32X64.dsc
>>>>   OvmfPkg/OvmfPkgX64.dsc
>>>>
>>>> but I can only test this change with the last three platforms, running on
>>>> QEMU.
>>>>
>>>> Test results:
>>>>
>>>>   TCG QEMU    OVMF    override result
>>>>       patched patched
>>>>   --- ------- ------- -------- --------------------------------------
>>>>   0   0       0       0        CPU counts OK (KVM masks the QEMU bug)
>>>>   0   0       1       0        CPU counts OK (KVM masks the QEMU bug)
>>>>   0   1       0       0        CPU counts OK (QEMU fix, but KVM masks
>>>>                                the QEMU bug anyway)
>>>>   0   1       1       0        CPU counts OK (QEMU fix, but KVM masks
>>>>                                the QEMU bug anyway)
>>>>   1   0       0       0        boot with broken CPU counts (original
>>>>                                QEMU bug)
>>>>   1   0       1       0        broken CPU count caught (boot hangs)
>>>>   1   0       1       1        broken CPU count caught, bug check
>>>>                                overridden, boot continues
>>>>   1   1       0       0        CPU counts OK (QEMU fix)
>>>>   1   1       1       0        CPU counts OK (QEMU fix)
>>>>
>>>> Cc: Ard Biesheuvel
>>>> Cc: Brijesh Singh
>>>> Cc: Erdem Aktas
>>>> Cc: Gerd Hoffmann
>>>> Cc: James Bottomley
>>>> Cc: Jiewen Yao
>>>> Cc: Jordan Justen
>>>> Cc: Michael Brown
>>>> Cc: Min Xu
>>>> Cc: Oliver Steffen
>>>> Cc: Sebastien Boeuf
>>>> Cc: Tom Lendacky
>>>> Bugzilla: https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>>> Signed-off-by: Laszlo Ersek
>>>
>>> Thanks a lot for taking the time and investing the effort. I'm quite
>>> happy that we have this 'escape hatch' now, which we could arguably
>>> use temporarily in the VS2019 platform CI until its QEMU binary gets
>>> updated, right?
>>
>> Yes, I have to agree there.
>>
>> Right now, because those QEMU binaries are affected by the regression,
>> and because they use TCG, OVMF already sees Present=0 Possible=1. Due to
>> the interference of Present=0 with the QEMU v2.7 reset bug workaround,
>> we also get BootCpuCount=0. Furthermore, MaxCpuCount gets set to 1, from
>> Possible. Thus, we exit PlatformMaxCpuCountInitialization() with
>> PcdCpuBootLogicalProcessorNumber=0 (from BootCpuCount) and
>> PcdCpuMaxLogicalProcessorNumber=1 (from MaxCpuCount).
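
The PCD values in the paragraph above can be illustrated with a small model. This is only a sketch and not the code in Platform.c: the struct `CpuPcds`, the function `DerivePcds()`, and the clamping rule that stands in for the QEMU v2.7 reset bug workaround are assumptions made for illustration. Only the input counts and the resulting values come from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two PCD values discussed in the thread. */
typedef struct {
  uint32_t BootLogicalProcessorNumber; /* models PcdCpuBootLogicalProcessorNumber */
  uint32_t MaxLogicalProcessorNumber;  /* models PcdCpuMaxLogicalProcessorNumber */
} CpuPcds;

/*
 * Simplified model (assumption, not the real function body): the boot CPU
 * count reported by QEMU is clamped to Present, standing in for the reset
 * bug workaround, and the maximum CPU count is taken from Possible.
 */
CpuPcds
DerivePcds (uint32_t ReportedBootCpuCount, uint32_t Present, uint32_t Possible)
{
  CpuPcds Pcds;

  Pcds.BootLogicalProcessorNumber =
    (ReportedBootCpuCount > Present) ? Present : ReportedBootCpuCount;
  Pcds.MaxLogicalProcessorNumber = Possible;
  return Pcds;
}

/* Walks through the miscounted state from the thread and a sane one. */
void
CheckThreadExamples (void)
{
  /* Regression under TCG: Present=0 Possible=1, one boot CPU reported. */
  CpuPcds Broken = DerivePcds (1, 0, 1);
  assert (Broken.BootLogicalProcessorNumber == 0);
  assert (Broken.MaxLogicalProcessorNumber == 1);

  /* Sane uniprocessor counts: Present=1 Possible=1. */
  CpuPcds Sane = DerivePcds (1, 1, 1);
  assert (Sane.BootLogicalProcessorNumber == 1);
  assert (Sane.MaxLogicalProcessorNumber == 1);
}
```

In this model, Present=0 alone is enough to zero the boot CPU count, while the maximum stays at 1 because it depends only on Possible.
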
>>
>> Then, in the "predictable subset" of consequences of the QEMU
>> regression, we can say that MpInitLib interprets the above PCD values as
>> "uniprocessor system with the boot CPU count not exposed by the
>> platform". This (i.e., *just this*) does not fall outside of MpInitLib's
>> domain (again, note my qualification "predictable subset").
>>
>> Now, if we apply the patch and also add the -fw_cfg switch to the
>> Windows CI, *and* we also don't add any -smp flags (as far as I can
>> tell, no -smp flag is used now), then the new PCD state will be
>>
>>   PcdCpuBootLogicalProcessorNumber=1 (changed from zero)
>>   PcdCpuMaxLogicalProcessorNumber=1 (stays the same)
>>
>> As far as I can tell, *right now* this change should have no effect *in
>> MpInitLib*, IOW nothing gets worse or better there. Namely,
>> PcdCpuBootLogicalProcessorNumber is only consumed in WakeUpAP(), and
>> only when InitFlag == ApInitConfig. InitFlag is set like that only in
>> CollectProcessorCount(). However, CollectProcessorCount() is only called
>> if PcdCpuMaxLogicalProcessorNumber is >1 (see MaxLogicalProcessorNumber
>> in MpInitLibInitialize()). Meaning in effect that
>> PcdCpuMaxLogicalProcessorNumber=1 makes PcdCpuBootLogicalProcessorNumber
>> irrelevant, so its change from 0 to 1 is invisible *to MpInitLib*.
>>
>> Oliver:
>>
>> (1) can you please post a patch for the Windows CI so that the following
>> option be passed to QEMU:
>>
>>   -fw_cfg name=opt/org.tianocore/X-Cpuhp-Bugcheck-Override,string=yes
>>
>> (This option is harmless when the firmware does not determine the QEMU
>> bug, so it can be passed in advance; it will have no consequence at all.)
>>
>> In the patch, please reference
>>
>> https://bugzilla.tianocore.org/show_bug.cgi?id=4250
>>
>
> Can I take the above as an ack on
>
> https://edk2.groups.io/g/devel/message/98899
>
> ?
>
>> (2) Please file a separate TianoCore BZ for *backing out* the change (=
>> for removing the -fw_cfg switch), and assign it to yourself :)
>>
>> Once the Windows CI advances to a fixed QEMU binary, the "escape hatch"
>> should be welded shut.
>>
>> (3) Please give me a hint when the CI patch (1) has been merged; then I
>> can go ahead and merge this v3 series as well.
>>
>
> I'll merge the whole lot once you're happy with the CI patch.
>

(/me checks the timestamps of messages :) my tendency to work in batches
has its downsides as well, alas. Sorry about the confusion; I'll proceed
with the merge in the other thread.)
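
For reference, the test matrix in the quoted commit message reduces to a small decision function. The sketch below is an illustration only: the enum `BootOutcome` and the function `ExpectedOutcome()` are hypothetical names, not edk2 code. The table does not list the combination "override given, OVMF unpatched"; the sketch assumes it behaves like any other unpatched boot, on the assumption that an unpatched firmware never looks at the fw_cfg file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the distinct results in the test matrix. */
typedef enum {
  CpuCountsOk,             /* "CPU counts OK" rows */
  BootWithBrokenCounts,    /* original QEMU bug, nothing catches it */
  BootHangs,               /* broken CPU count caught (boot hangs) */
  OverriddenBootContinues  /* bug check overridden, boot continues */
} BootOutcome;

BootOutcome
ExpectedOutcome (bool Tcg, bool QemuPatched, bool OvmfPatched, bool Override)
{
  /* Per the table, the bug is visible only with TCG and an unpatched QEMU. */
  bool BugVisible = Tcg && !QemuPatched;

  if (!BugVisible) {
    return CpuCountsOk;
  }

  /* Assumption: without the OVMF patch the override has no effect. */
  if (!OvmfPatched) {
    return BootWithBrokenCounts;
  }

  return Override ? OverriddenBootContinues : BootHangs;
}

/* Replays the nine rows of the table from the commit message. */
void
CheckTableRows (void)
{
  assert (ExpectedOutcome (false, false, false, false) == CpuCountsOk);
  assert (ExpectedOutcome (false, false, true,  false) == CpuCountsOk);
  assert (ExpectedOutcome (false, true,  false, false) == CpuCountsOk);
  assert (ExpectedOutcome (false, true,  true,  false) == CpuCountsOk);
  assert (ExpectedOutcome (true,  false, false, false) == BootWithBrokenCounts);
  assert (ExpectedOutcome (true,  false, true,  false) == BootHangs);
  assert (ExpectedOutcome (true,  false, true,  true)  == OverriddenBootContinues);
  assert (ExpectedOutcome (true,  true,  false, false) == CpuCountsOk);
  assert (ExpectedOutcome (true,  true,  true,  false) == CpuCountsOk);
}
```

The function also reflects the remark in request (1): when the bug is not visible, passing the override changes nothing.
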