From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by mx.groups.io with SMTP id smtpd.web10.62841.1673546314652488098 for ; Thu, 12 Jan 2023 09:58:34 -0800 Authentication-Results: mx.groups.io; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=JEdGJdCz; spf=pass (domain: redhat.com, ip: 170.10.133.124, mailfrom: lersek@redhat.com) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1673546313; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=6uDmuuKAGmUgbVx4FwU2ta24tSrAGgoHqHXQb7luCWc=; b=JEdGJdCzv30ynzUtXtMs5bzUahtZ7lvxLaF31UCxOdc0TMNXA167GdPyUZ9DfeovZsp+lg mbkb31Qgt9/2n5+9ERjz2ihsj9KC1V/yi4q8KdOWPcw12+d88/lYlxO+Nyn/zRimHhi/x7 HwnaTwkcUPFgBcXphPwi0+bOOheVLuc= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-652-bXQp8RxbMEmjCM0VPzwEgg-1; Thu, 12 Jan 2023 12:58:28 -0500 X-MC-Unique: bXQp8RxbMEmjCM0VPzwEgg-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 31CE487A9E0; Thu, 12 Jan 2023 17:58:28 +0000 (UTC) Received: from [10.39.192.93] (unknown [10.39.192.93]) by smtp.corp.redhat.com (Postfix) with ESMTPS id B2C041759E; Thu, 12 Jan 2023 17:58:25 +0000 (UTC) Message-ID: <49e4e8bb-3bbd-0ca8-ee59-e75560deffa7@redhat.com> Date: Thu, 12 Jan 2023 18:58:24 +0100 MIME-Version: 1.0 Subject: Re: [edk2-devel] [PATCH v2] OvmfPkg/PlatformInitLib: catch QEMU's CPU hotplug reg block regression To: Michael Brown , devel@edk2.groups.io Cc: Ard Biesheuvel , Brijesh Singh , Erdem Aktas , Gerd Hoffmann , James Bottomley , Jiewen Yao , Jordan Justen , Min Xu , Oliver Steffen , Sebastien Boeuf , Tom Lendacky References: <20230112082845.128463-1-lersek@redhat.com> <01020185a568604c-e16d8581-963a-4ff3-8566-bf0640ad327d-000000@eu-west-1.amazonses.com> <407c5cee-7a6c-cbc8-35cc-8f2c2724914c@redhat.com> <01020185a6bda78a-05d82180-4d1a-4af4-9a9b-ac78088d11ed-000000@eu-west-1.amazonses.com> From: "Laszlo Ersek" In-Reply-To: <01020185a6bda78a-05d82180-4d1a-4af4-9a9b-ac78088d11ed-000000@eu-west-1.amazonses.com> X-Scanned-By: MIMEDefang 3.1 on 10.11.54.5 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 1/12/23 17:08, Michael Brown wrote: > On 12/01/2023 13:22, Laszlo Ersek wrote: >>>> Detect the issue in PlatformMaxCpuCountInitialization(), and >>>> print an error message and *hang* if the issue is present. >>> >>> Would this mean that OVMF would refuse to start with all current >>> distro versions of qemu (when not using KVM), or am I >>> misunderstanding? >> >> Your understanding is correct. > > I apologise in advance if this is a stupid question, but: given that > we can detect the issue (as per this patch), and given also: > >>> On 12/01/2023 08:28, Laszlo Ersek wrote: >>>> In QEMU v5.1.0, the CPU hotplug register block misbehaves: the >>>> negotiation protocol is (effectively) broken such that it >>>> suggests that switching from the legacy interface to the modern >>>> interface works, but in reality the switch never happens. > ...would it work to detect the issue and treat it as "modern interface > is not supported: continue to use the legacy interface"? IOW, to > treat "Present=0" as indicating that the modern interface is not > supported. I considered that. It's not a bad idea at all, but it's more complicated than that. The code in PlatformPei (or more precisely, nowadays in PlatformInitLib) could be rearranged. It would be a really ugly patch, because now, after CmdData2 reads as 0, we're simply not prepared to renege on that determination. So the branches would have to be rearranged in a quite ugly manner. *But* it would be doable of course. However, in the CpuHotplugSmm driver, we also read the same register (QEMU_CPUHP_R_CMD_DATA2), for determining feature availability. See the QemuCpuhpReadCommandData2() call in "OvmfPkg/CpuHotplugSmm/CpuHotplug.c", function CpuHotplugEntry(). And the outcome of *that* check is final. No counting is done there anymore, because it's unnecessary in the first place. If the check succeeds, we proceed, if the check fails, we hang hard. That sanity check is also misled by the QEMU bug, and tweaking *that* check is out of question. And I didn't want to post a patch that would make PlatformPei deal with the QEMU bug one way, and CpuHotplugSmm another way. Technically, the following should be possible: in PlatformInitLib, rearrange the branches such that we can still do "something" if the QEMU bug is detected. Then, if PcdSmmSmramRequire is TRUE, hang hard, and if PcdSmmSmramRequire is FALSE, then just pretend the modern hotplug register block is not available. Unfortunately, I really dislike this. It will cause the same hang on distros that build OVMF -- possibly their *only* OVMF binary -- with -D SMM_REQUIRE. Then the request will morph into "can we refine this even further". And sure, you can refine it even further: - the CpuHotplugSmm driver exits gracefully if QEMU does not negotiate the "SMRAM at default SMBASE" feature (see edk2 commit d74d56fcfa31 ("OvmfPkg: introduce PcdQ35SmramAtDefaultSmbase", 2020-02-05), and QEMU commit f404220e279c ("q35: implement 128K SMRAM at default SMBASE address", 2020-01-22)). This can be disabled through the "mch.smbase-smram" property on the QEMU command line. - QEMU can be configured with other compat properties on the command line so that "CPU hotplug with SMI" and "CPU hot-unplug with SMI" *not* be offered to the firmware. Then QEMU will reject hotplug attempts, and the SMM hotplug code in edk2 will not be triggered by the (virtual) hardware. The case is that both QEMU and edk2 check for each other's supported features. It's a complex interwoven feature set with security impact, which is exactly why we added feature negotiation at every step -- effectively mutual negotiation wherever necessary. I cannot claim I remember every part of it, and playing tricks around feature negotiation with SMM impact makes me *extremely uncomfortable*. I absolutely don't want to author an OVMF patch, briefly before I disappear again (for good!), that "looks good" now, and then becomes a horrible SMM CVE in a year or two. I want to go for "obviously no bug", rather than "no obvious bug". In brief, my counter-argument is: once we start refining the hang, it will never be nuanced and polished enough for users, and we'll just heap on more complexity, until we introduce an obscure but nasty bug. I don't want to start down that path. I can live with this patch being rejected altogether (that's consistent with me having exited edk2 -- I can pretend I don't know about it; it's not my job anymore). I will not take responsibility for relaxing the proposed hang however, not even as a reviewer ACKing it. If someone can rearrange the code, drawing a *practical* but also secure boundary for the fallback, that's great, the patch can go in without me "being in the know". I want none of that responsibility. I don't want to force my hard-liner view on QEMU/OVMF users here; I need not participate in a relaxed approach either, however. This reads like a re-run of the "old grub in guest versus non-executable UEFI heap" discussion, and it's giving me the shivers. Your proposal is entirely justified, from a practical / user perspective, but I'm not the right person for it. Thanks Laszlo