From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail02.groups.io (mail02.groups.io [66.175.222.108]) by spool.mail.gandi.net (Postfix) with ESMTPS id 26C7F941BC4 for ; Tue, 31 Oct 2023 17:06:40 +0000 (UTC) DKIM-Signature: a=rsa-sha256; bh=giJOQmCintwcT7a1C3L2BZT0l/Ic0GEKxbp7Tq1o0u4=; c=relaxed/simple; d=groups.io; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From:In-Reply-To:Precedence:List-Subscribe:List-Help:Sender:List-Id:Mailing-List:Delivered-To:Reply-To:List-Unsubscribe-Post:List-Unsubscribe:Content-Language:Content-Type:Content-Transfer-Encoding; s=20140610; t=1698771998; v=1; b=po4le1HSF9/otvj2qcLL5tJ1R8s9PRKH0jGBPE1HgI5s4i7rleYMT7hR56202afEWIoPmVM3 yc8M40lYYspDGEvX+9yOd+d+LX5wXRzOOKJE4jwiAgkNpA5aP5bQi85R4KxSX9v81+dZAP+TFyW B1OCnFzfbg8+huZ8wnzH3QHc= X-Received: by 127.0.0.2 with SMTP id asbWYY7687511xYAKex8de0Q; Tue, 31 Oct 2023 10:06:38 -0700 X-Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by mx.groups.io with SMTP id smtpd.web10.1191.1698771998146191958 for ; Tue, 31 Oct 2023 10:06:38 -0700 X-Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-311-8Nyu8MBRMlmfwtmdjHWfBQ-1; Tue, 31 Oct 2023 13:06:25 -0400 X-MC-Unique: 8Nyu8MBRMlmfwtmdjHWfBQ-1 X-Received: from smtp.corp.redhat.com (int-mx07.intmail.prod.int.rdu2.redhat.com [10.11.54.7]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 3CA7E3822550; Tue, 31 Oct 2023 17:06:25 +0000 (UTC) X-Received: from [10.39.195.34] (unknown [10.39.195.34]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 7B9F91C060BC; Tue, 31 Oct 2023 17:06:23 +0000 (UTC) Message-ID: <0a9e392d-3c51-1992-9c18-f11a0aaf9fe6@redhat.com> Date: Tue, 31 Oct 2023 18:06:22 +0100 MIME-Version: 1.0 Subject: Re: [edk2-devel] [PATCH 0/7] Support Tdx and sev in BaseIoLibIntrinsic and remove BaseIoLibIntrinsicSev To: devel@edk2.groups.io, dun.tan@intel.com, "thomas.lendacky@amd.com" , "leo.duran@amd.com" , "brijesh.singh@amd.com" , "Chang, Abner" , "michael.roth@amd.com" , "Attar, AbdulLateef (Abdul Lateef)" Cc: "Ni, Ray" , "Yao, Jiewen" References: <20231027054300.1382-1-dun.tan@intel.com> <3692a402-fede-9534-0517-72298ff2aff8@redhat.com> From: "Laszlo Ersek" In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.7 X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Precedence: Bulk List-Subscribe: List-Help: Sender: devel@edk2.groups.io List-Id: Mailing-List: list devel@edk2.groups.io; contact devel+owner@edk2.groups.io Reply-To: devel@edk2.groups.io,lersek@redhat.com List-Unsubscribe-Post: List-Unsubscribe=One-Click List-Unsubscribe: X-Gm-Message-State: hwzcspcrNDBCwsbFVMW1U8oix7686176AA= Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-GND-Status: LEGIT Authentication-Results: spool.mail.gandi.net; dkim=pass header.d=groups.io header.s=20140610 header.b=po4le1HS; dmarc=fail reason="SPF not aligned (relaxed), DKIM not aligned (relaxed)" header.from=redhat.com (policy=none); spf=pass (spool.mail.gandi.net: domain of bounce@groups.io designates 66.175.222.108 as permitted sender) smtp.mailfrom=bounce@groups.io On 10/31/23 10:56, duntan wrote: > Hi Tom, > Thanks for your testing and comments. I'll modify the patch commit > message and code according to your comments. > Yes the patch changes the behavior for IoReadFifo/ IoWriteFifo API for > non-Tdx and non-SEV case. Looking forward to more comments from the > community. > > Hi Laszlo, > May I know the actually performance gap data in the case that you > mentioned? Good call to have me measure this. I cannot reproduce the performance hit *at all* with the isa-debugcon device (that is, through the OVMF debug log). I built OVMF X64 with PcdDebugPrintErrorLevel set to 0x8040004F, at commit a671a14e63fd. That was the baseline. I built both NOOPT and DEBUG. Then I repeated the same, with your patches applied. I cannot show any human-perceptible boot time difference. Note that the NOOPT comparison excludes LTO (i.e., compiler advances) as the potential factor that might mask the impact of your patches. I also disassembled some binaries producing debug output, and indeed, without your patches, we have rep outsb in them, and with your patches, we don't. Yet, there is no performance difference. This is striking news. I want to understand why we have thought thus far that individual IO Port traps are extremely expensive. I've now found evidence going back as far as 2012 -- the year when I joined edk2! Here are some facts: (1) in commit f1ec65ba24e1 ("OvmfPkg: Add QemuFwCfgLib library class and implementation", 2012-05-30) -- note the date! --, Jordan added the initial implementation of QemuFwCfgLib. That version already used an open-coded IoReadFifo8 function. There was no particular explanation captured in the commit message. I also happen to have the entire 2012 edk2-devel mailing list traffic saved locally (well, the messages posted after I joined), and the discussions around the various versions of the patch in question don't seem to specifically justify the REP variant. So, we cannot conclude much from this piece of evidence. (2) Then I decided to search that entire year's traffic (i.e., 2012's) for the word "fifo". I found relevant hits: - In commit 1fd376d97922 ("PcAtChipsetPkg/PciHostBridgeDxe: Improve KVM FIFO I/O read/write performance", 2012-06-01) -- note the year again! --, Jordan brought the FIFO variants to the EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL implementation in edk2. The code was duplicated. Note that Jordan clearly stated in the commit message: > KVM can substantially boost the speed of the rep insb/insw/insl > and rep outsb/outsw/outsl instructions by transferring up to > a page of data per VM trap. - On the same day when Jordan posted the patch that would become above-noted commit 1fd376d97922, Jordan also asked on the list (subject "IoLib I/O FIFO read/write routines"): > Regarding the routines now defined in PcAtChipsetPkg/PciHostBridgeDxe/IoF= ifo.h. > IoReadFifo8, IoReadFifo16, IoReadFifo32, > IoWriteFifo8, IoWriteFifo16, and IoWriteFifo32 > > Would it make sense to consider adding these to MdePkg/Include/Library/Io= Lib.h? Again, this was still in 2012. So here we can conclude that Jordan saw some specific perf boost due to employing the FIFO variants in EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL, but the use case was not documented. (3) During all this discussion here, I have been *distinctly* recalling that I've experienced an *actual performance regression* sometime, after we had unwittingly switched from the REP variant to the non-REP variant. And now I have actually found it. I'm going to quote the commit message in full: > commit fb8b54694c53e73e1b6a098686e908f54f9bb7a9 > Author: Laszlo Ersek > Date: Thu Apr 7 22:28:38 2016 +0200 > > UefiCpuPkg: CpuIo2Dxe: optimize FIFO reads and writes of IO ports > > * Short description: > > The CpuIoServiceRead() and CpuIoServiceWrite() functions transfer d= ata > between memory and IO ports with individual Io(Read|Write)(8|16|32) > function calls, each in an appropriately set up loop. > > On the Ia32 and X64 platforms however, FIFO reads and writes can be > optimized, by coding them in assembly, and delegating the loop to t= he > CPU, with the REP prefix. > > On KVM virtualization hosts, this difference has a huge performance > impact: if the loop is open-coded, then the virtual machine traps t= o the > hypervisor on every single UINT8 / UINT16 / UINT32 transfer, wherea= s > with the REP prefix, KVM can transfer up to a page of data per VM t= rap. > This is especially noticeable with IDE PIO transfers, where all the= data > are squeezed through IO ports. > > * Long description: > > The RootBridgeIoIoRW() function in > > PcAtChipsetPkg/PciHostBridgeDxe/PciRootBridgeIo.c > > used to have the exact same IO port acces optimization, dating back > verbatim to commit 1fd376d9792: > > PcAtChipsetPkg/PciHostBridgeDxe: Improve KVM FIFO I/O read/write > performance > > OvmfPkg cloned the "PcAtChipsetPkg/PciHostBridgeDxe" driver (for > unrelated reasons), and inherited the optimization from PcAtChipset= Pkg. > > The "PcAtChipsetPkg/PciHostBridgeDxe" driver was ultimately removed= in > commit 111d79db47: > > PcAtChipsetPkg/PciHostBridge: Remove PciHostBridge driver > > and OvmfPkg too was rebased to the new core Pci Host Bridge Driver,= in > commit 4014885ffd: > > OvmfPkg: switch to MdeModulePkg/Bus/Pci/PciHostBridgeDxe > > This caused the optimization to go lost. Namely, the > RootBridgeIoIoRead() and RootBridgeIoIoWrite() functions in the new= core > Pci Host Bridge Driver delegate IO port accesses to > EFI_CPU_IO2_PROTOCOL. And, in OvmfPkg (and likely most other Ia32 /= X64 > edk2 platforms), this protocol is provided by "UefiCpuPkg/CpuIo2Dxe= ", > which lacks the optimization. > > Therefore, this patch ports the C source code logic from commit > 1fd376d9792 (see above) to "UefiCpuPkg/CpuIo2Dxe", plus it ports th= e > NASM-converted assembly helper functions from OvmfPkg commits > 6026bf460037 and ace1d0517b65: > > OvmfPkg PciHostBridgeDxe: Convert Ia32/IoFifo.asm to NASM > > OvmfPkg PciHostBridgeDxe: Convert X64/IoFifo.asm to NASM > > In order to support the MSFT and INTEL toolchains as well, the *.as= m > files are ported from OvmfPkg as well, immediately from before the = above > conversion (that is, at 6026bf460037^). > > * Notes about the port: > > - The write and read branches from commit 1fd376d9792 are split to = the > separate functions CpuIoServiceWrite() and CpuIoServiceRead(). > > - The EfiPciWidthUintXX constants are replaced with EfiCpuIoWidthUi= ntXX. > > - The cast expression "(UINTN) Address" is replaced with > "(UINTN)Address" (i.e., no space), because that's how the receivi= ng > functions spell it as well. > > - The labels in the switch statements are unindented by one level, = to > match the edk2 coding style (and the rest of UefiCpuPkg) better. > > * The first signoff belongs to Jordan, because he authored all of > 1fd376d9792, 6026bf460037 and ace1d0517b65. > > Contributed-under: TianoCore Contribution Agreement 1.0 > Signed-off-by: Jordan Justen > Contributed-under: TianoCore Contribution Agreement 1.0 > Signed-off-by: Laszlo Ersek > Ref: https://www.redhat.com/archives/vfio-users/2016-April/msg00029.h= tml > Reported-by: Mark > Ref: http://thread.gmane.org/gmane.comp.bios.edk2.devel/10424/focus= =3D10432 > Reported-by: Jordan Justen > Cc: Jordan Justen > Cc: Ruiyu Ni > Cc: Jeff Fan > Cc: Mark > Tested-by: Mark > Reviewed-by: Jeff Fan This is "hard evidence". Per bullet (2), Jordan had added the FIFO / REP variants to EFI_PCI_ROOT_BRIDGE_IO_PROTOCOL, for some (non-specific) performance boost. and when that optimization was later lost, due to a mistake, we saw IDE PIO performance *tank* on QEMU/KVM. Therefore, I am now being reminded that the best performance regression test for this particular series is to boot an operating system using IDE (PIO) emulation, with OVMF! Because CpuIo2Dxe still calls the FIFO APIs. That's what I'm going to do next -- later this week, if everything goes well. I ask for your patience until I complete that regression test. (4) In commit 19c6d9feaaf8 ("MdePkg: Expand BaseIoLibIntrinsic (IoLib class) library", 2017-01-17), the duplicated FIFO variants were upstreamed to the IoLib instance(s) in MdePkg. Subsequently, commit b6d11d7c4678 ("MdePkg: BaseIoLibIntrinsic (IoLib class) library", 2017-04-13) introduced the SEV fallback. And then commit 97353a9c914d ("OvmfPkg: Update dsc to use IoLib from BaseIoLibIntrinsicSev.inf", 2017-07-10) put the SEV fallback to use in OVMF. In commit 80886a695377 ("OvmfPkg/PlatformDebugLibIoPort: write messages with IoWriteFifo8()", 2017-09-11), we switched PlatformDebugLibIoPort as well over to the REP variant (which had been by then centralized). In fact, using the REP fifo write in OVMF's DebugLib instance was recommended by Jordan already in 2012! He wrote (while reviewing the patch that would become commit b90aefa9e46c, "OvmfPkg: add support for debug console on port 0x402", 2012-07-26): > I'm not certain, but I think it might be worth-while to make the > str-len call, and then use the I/O fifo call. > > It could reduce the trapping count by ~30x for writing a debug string, > and we do write quite a lot of DEBUG... (I'm not sure it is enough to > make a noticeable difference though.) So that was what commit 80886a695377 implemented some five years later, citing that "In the general case this speeds up logging". (5) I also distinctly remember witnessing the debug logging speed difference between SEV and non-SEV guests. (Well, this is not a "fact", but a memory.) In commit c09d9571300a ("OvmfPkg: save on I/O port accesses when the debug port is not in use", 2017-11-17), we even quantified this. The commit message says, "therefore, in SEV guests, the boot time impact is huge (about 45 seconds _additional_ time spent writing to the debug port)". So, minimally commit fb8b54694c53 from bullet (3) is *hard evidence* that this difference in performance used to exist, regardless of SEV. As I said above, I will try to retest the same use case soon, with this patch set. If the performance difference turns out to be invisible now, then I can only chalk that up to QEMU/KVM technology having advanced a lot -- and in that case, I will rescind my NACK. To summarize, my operating memory was correct, and I wasn't wrong to NACK this patch set; at the same time, you are also very correct to point out that we need to measure *afresh*. Back to your email: On 10/31/23 10:56, duntan wrote: > Also, if a DEBUG binary is used, I think the boot time impact might be > acceptable? We exclusively build and ship DEBUG binaries. RELEASE binaries contain no ASSERTs. Many edk2 core modules -- unfortunately -- only employ ASSERTs as error checks, so a RELEASE build means that those core modules are built without error checks. Which is unacceptable. (But even if all those modules had proper, explicit error checking, we'd still want to preserve all the ASSERT()s.) I'll report back later with the IDE PIO performance test results. Thanks, Laszlo -=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#110436): https://edk2.groups.io/g/devel/message/110436 Mute This Topic: https://groups.io/mt/102215661/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/leave/12367111/7686176/19134562= 12/xyzzy [rebecca@openfw.io] -=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-=3D-