From: "Yao, Jiewen"
To: 'Laszlo Ersek', "Zeng, Star", edk2-devel-01
Cc: "Dong, Eric", Brijesh Singh
Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
Date: Wed, 6 Sep 2017 04:37:15 +0000
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C503A9A852C@shsmsx102.ccr.corp.intel.com>
Thanks for the clarification. Comments inline.

From: Laszlo Ersek [mailto:lersek@redhat.com]
Sent: Wednesday, September 6, 2017 1:57 AM
To: Yao, Jiewen; Zeng, Star; edk2-devel-01
Cc: Dong, Eric; Brijesh Singh
Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

On 09/05/17 15:44, Yao, Jiewen wrote:
> Good discussion. Comments inline.
>
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Tuesday, September 5, 2017 5:16 PM
> To: Yao, Jiewen; Zeng, Star; edk2-devel-01
> Cc: Dong, Eric
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>
> On 09/05/17 04:18, Yao, Jiewen wrote:
>> Hi Laszlo,
>> Thank you very much for raising this DMA topic.
>>
>> I had a chat with Star. We have a couple of questions, from high level to low level.
>> Hope you can clarify:
>>
>>
>> 1) About disabling DMA
>>
>> As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7, Adding the Exit Boot Services feature:
>> =========================
>> Examples from the EDK II that use this feature are the PCI device drivers for USB Host
>> Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a
>> memory buffer to poll for operation requests. Access to this memory buffer by a USB
>> Host Controller may be required to boot an operating system, but this activity must be
>> terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot
>> Services Event for these types of drivers is to disable the PCI bus master and place the
>> USB Host Controller into a halted state.
>> =========================
>>
>> I believe the fundamental requirement is to "disable DMA".
>> As such, the simplest action is to clear the PCI Bus Master Enable bit in the command register.
>>
>> Do you think clearing BME is enough?
>> Is there any special reason that we must unmap the DMA buffer?
>
> Modifying the device state is enough to ask the device *nicely* to stop
> doing DMA.
>
> But, I think, if we are on an IOMMU-protected system, where part of
> PciIo.Map() is to add a translation (= system memory access permission)
> to the IOMMU, then we should also revoke this permission explicitly,
> *after* we asked the device nicely to stop doing DMA.
>
> Basically, ask the device nicely first, and then enforce the protection.
>
> ... BTW, I mentioned "IOMMU faults" earlier, and I have some further
> thoughts on them. This is about tearing down IOMMU translations *first*,
> in the IOMMU driver's ExitBootServices() handler, *before* the PCI
> drivers get a chance -- in *their* ExitBootServices() handlers -- to ask
> the PCI devices nicely to stop doing DMA.
>
> I think a quite problematic failure mode is possible here. Assume that a
> disk controller has some write commands queued (which are mapped with
> Bus Master Read or Bus Master CommonBuffer operations). If the
> ExitBootServices() handler for the IOMMU driver runs first, then the
> device could act upon these queued commands with missing IOMMU
> translations, before the device's own PCI driver asked the device to
> shut down DMA. In some cases, I guess this could result in those queued
> write commands simply failing, without ill effects. However, dependent
> on the IOMMU implementation, the write commands might not fail
> explicitly, but read garbage from system memory, and write garbage to
> disk blocks. That would be bad.
>
> I think this is exactly what could happen with the SEV IOMMU we have.
> The memory would be first re-encrypted (this is equivalent to revoking a
> translation), and then the write commands (queued in the emulated AHCI
> controller, in the hypervisor) would read garbage from guest memory, and
> write garbage to the virtual disk.
>
> [Jiewen] I am not clear on how that happens with SEV here.
> Below is my thinking; please correct me if I am wrong.
>
> First, the fundamental requirement is to disable BME at exit boot services,
> or make sure the controller is in a halted state.
> If a device driver fails to meet such a requirement, it is a bug we need to fix first.
>
> Now, let's assume all device drivers are in a halted state at ExitBootServices
> -- that is my understanding for all the rest of the discussion.

OK.

> I checked OVMF and I do not see any code in the IOMMU driver to register an
> exit boot services event, which means the DMA buffer is still decrypted.

More precisely, the IOMMU driver does not take it upon itself to
re-encrypt DMA buffers at ExitBootServices(). Instead it relies on the
PCI drivers to Unmap() CommonBuffers at ExitBootServices().
> If that is the case, the DMA transfer before BME disable is safe (the DMA
> buffer is decrypted) and there will be no DMA transfer any more at
> ExitBootServices (BME disabled).
> Then how does the write command queued in the emulated AHCI controller
> start to read garbage?

It is not happening right now. When I wrote that, I was just speculating
about possible consequences; i.e., what *could* happen if the IOMMU
driver itself re-encrypted memory at ExitBootServices().

[Jiewen] OK. Got it.

*Then* the garbage case could happen. In other words, the case I
described does not happen right now; I only described it as a
counter-argument against *adding* an ExitBootServices() handler to the
IOMMU driver.

[Jiewen] I agree with you. An IOMMU driver should NOT use ExitBootServices
to stop DMA transfers.

Under OvmfPkg, the IOMMU behavior for SEV, and the PCI drivers for the
virtio devices, are fully in sync. The PCI drivers *alone* are
responsible for unmapping CommonBuffers at ExitBootServices(), and so
the IOMMU driver doesn't do it.

[Jiewen] I agree that the IOMMU does not. But I hold my opinion on whether
the PCI driver is responsible for unmapping CommonBuffers at
ExitBootServices().

The question is if the same distribution of responsibilities applies to
PCI drivers that live outside of OvmfPkg, such as MdeModulePkg.

[Jiewen] Yes. If we want to define a rule, the rule should be for *any* PCI
device driver, no matter where it is from. It could be in the OVMF package,
MdeModulePkg, a private repo from an Independent BIOS Vendor (IBV) or
Original Equipment Manufacturer (OEM), or even from an Independent Hardware
Vendor (IHV).

A special notice: the IHV driver should only follow the interfaces defined
in the UEFI specification, such as PCI_IO. The IHV driver should NOT consume
any interface defined in MdeModulePkg, such as EDKII_IOMMU.

The current patch set suggests "yes".

> The SEV IOMMU driver only modifies the page table.

Not exactly.
It modifies page table entries, and then re-writes the memory covered by
those page table entries.

[Jiewen] Yes, you are right. There is a StashBuffer associated with every
common buffer.

Simply flipping the C-bit in page table entries does not re-encrypt
memory contents; it only sets up the next read and/or write that happens
through that page mapping, for "decrypted view" or "encrypted view",
from the virtualization host side.

[Jiewen] Thanks for the info.

> Then after ExitBootServices, the OS will take control of CR3 and set the
> correct encryption bit.

This is true, the guest OS will set up new page tables, and in those
PTEs, the C-bit ("encrypt") will likely be set by default.

However, the guest OS will not *rewrite* all of the memory, with the
C-bit set. This means that any memory that the firmware didn't actually
re-encrypt (in the IOMMU driver) will still expose stale data to the
hypervisor.

[Jiewen] That is an interesting question. Is there any security concern
associated? If the C-bit is cleared for a read/write buffer, is the content
in the read/write buffer also exposed to the hypervisor?
I mean, if there is a security concern, the concern should apply to both
common buffers and read/write buffers. Then we have to figure out a way to
handle both kinds of buffer.
To be honest, that is exactly my biggest question on this patch: why do we
only handle the common buffer specially?

> NOTE: The device should still be in a halted state, because the device is
> also controlled by the OS.

This is the key point -- the device is *not* controlled by the guest OS,
in the worst case. The virtual device is ultimately implemented by the
hypervisor. We don't expect the virtual device (= the hypervisor) to
misbehave on purpose. However, if the hypervisor is compromised by an
external attacker (for example over the network, or via privilege
escalation from another guest), then all guest RAM that has not been
encrypted up to that point becomes visible to the attacker.
In other words, the hypervisor ("the device") becomes retro-actively
malicious.

[Jiewen] If that is the threat model, I have a question on the exposure:

1) If the concern is for existing data, that means all DMA data already
written. However, that DMA data has already been consumed or produced by the
virtual device or the hypervisor. If the hypervisor is attacked, it already
has all the data content.

2) If the concern is for future data, that means all user data to be
written. However, the C-bit already prevents that.

Maybe I do not fully understand the threat model defined. If you can explain
more, that would be great.

The point of SEV is to keep as much guest data encrypted at all times as
possible, so that *whenever* the hypervisor is compromised by an
attacker, the guest memory -- which the attacker can inevitably dump
from the host side -- remains unintelligible to the attacker.

[Jiewen] OK. If this is a security question, I suggest we define a clear
threat model first. Otherwise, what we are doing might be unnecessary or
insufficient.

>> 2) About the common buffer
>> If there is a special requirement that we must unmap the DMA buffer, why
>> do we only unmap the common buffer?
>>
>> I understand your statement that unmapping a read buffer or write buffer
>> may cause a free-memory action. But that is an implementation choice.
>
> I believe I disagree with this (i.e., I think it is not an
> implementation choice).
>
> In the general case, we can assume that PciIo.Map() allocates a bounce
> buffer internally, when a Read or Write action is requested.
>
> [Jiewen] When I say implementation choice, I mean Map() *may* allocate
> a bounce buffer, or it may just use the same address, if the platform is
> capable of using DRAM as a DMA buffer directly.
I have two counter-arguments:

(a) As long as we don't want to reimplement PciIo.Map() from scratch,
the current layering of protocols in edk2 implies that even if the PCI
device driver sets DUAL_ADDRESS_CYCLE, this flag can be cleared on the
way to the IOMMU protocol, in PciHostBridgeDxe, if PciHostBridgeLib says
"no DMA above 4GB".

OvmfPkg's PciHostBridgeLib currently says "no DMA above 4GB", simply
because that's always been the case, and I'm not sure if we can lift the
limitation without regressions.

(b) The more important counter-argument is that Map() and Unmap() need
auxiliary memory (a temporary buffer) for actually encrypting and
decrypting memory, so that the virtual device (implemented by the
hypervisor) can read it or write it (temporarily).

As I wrote above, encrypting or decrypting the memory contents works by
writing the original data through a new page mapping that has the
desired C-bit setting.

In a guest operating system, in-place encryption is possible by having
two virtual mappings to the same physical address range, one with the
C-bit set, and another with the C-bit clear. For encrypting, the guest
OS reads a cache-line sized chunk of data through the "C-bit clear"
mapping, and writes it back through the "C-bit set" mapping. For
decrypting, it does the inverse.

In UEFI, we don't have two virtual page mappings for the same physical
address; we only have a 1:1 mapping, on which we can flip the C-bit.
This means that, for decryption for example, we have to copy the data
from its original location (currently encrypted) to a "stash" (which is
always encrypted), modify the PTEs for the original location (clear the
C-bit, flush the TLB), then finally copy the data back from the "stash"
(still encrypted) to the original location (which will effectively
decrypt it, for the hypervisor to read).

This "stash" requires room. Currently the "stash" has an equal number of
pages to the actual region being mapped.
For CommonBuffer operations, the stash is allocated in AllocateBuffer(),
together with the normal buffer's allocation. It is freed similarly in
FreeBuffer().

For Read/Write BMDMA operations, for which the PCI driver only calls
Map()/Unmap(), we cannot avoid allocating the stash in Map(), and
freeing it in Unmap(). We are practically forced to allocate a bounce
buffer not due to address width limitations (32 vs. 64), but due to
C-bit status.

Now, earlier I had an idea to add a 1-page sized global array variable
to the IOMMU driver, and do the encryption/decryption page by page. In
other words, "unroll" the PTE management and the copying into page-sized
chunks. This would eliminate the need to allocate a "stash" dynamically.
However, the drawback of this approach is that the PTEs must be toggled
one by one even for a larger region. It could cause undesirably high
page table splitting and bad performance.

So, the SEV IOMMU could only eliminate dynamic memory alloc/dealloc for
Read/Write DMA operations if both conditions were changed:

- if we knew for sure that 64-bit DMA was safe to advertise in
PciHostBridgeLib,

- if we replaced the dynamically allocated stash, and the "batch" PTE
management + copying, with a single-page static buffer, and "unrolled"
PTE management + copying.

More generally speaking, 64-bit clean DMA might simply not be possible
on a given hardware platform, and on that platform, PciIo.Map() could
not avoid allocating bounce buffers for Read/Write DMA operations. Thus,
we shouldn't require a PCI driver to call Unmap() at ExitBootServices().

[Jiewen] I agree with what you said above. And I did not treat those as
counter-arguments. :-)

> In turn, PciIo.Unmap() has to release the same buffer, internally.
> PciIo.Unmap() does not know if it is called from an ExitBootServices()
> notification function, or from a normal context. It cannot decide if it
> is allowed to release memory, or not.
> There's no additional parameter to
> provide this hint either. If PciIo.Unmap() never releases buffers, then
> ExitBootServices() will be happy, but we're going to leak memory.
>
> Now, PciIo.Unmap() could move the bounce buffer to some internal "free
> list" instead, rather than freeing it. And, the next PciIo.Map() call
> could try to reuse a buffer from the "free list". But this is difficult
> to do, because the buffers may have any size, and recycling them would
> require an internal, general "page pool", in parallel to the Boot
> Services page pool (AllocatePages, FreePages).
>
> If PciIo.Unmap() could be informed that it is running in
> ExitBootServices() notification context -- called from PCI drivers --,
> then unmapping Read and Write operations would be simple.
>
>> If we have a strong requirement to unmap the DMA buffer, we should
>> achieve that.
>
> I totally agree that we *should* achieve that, but I don't see how the
> current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read
> and Write operations.
>
> For CommonBuffer operations, it is possible, because AllocateBuffer()
> and FreeBuffer() are separate actions, and an ExitBootServices()
> notification function can *avoid* calling FreeBuffer(). (But only for
> CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release
> a bounce buffer for Read/Write.
>
> [Jiewen] Again, it is an implementation choice. It is possible that a host
> bridge allocates a common buffer without calling any Allocate().
> But another implementation may also allocate internal data structures to
> record every Map() action, including for a common buffer. (For event
> logging, or tracing purposes, etc.)

True. My point is that

- we could reasonably require that Unmap() work for CommonBuffer
operations without releasing memory,

- but we couldn't reasonably require the same for Read and Write
operations.
This is because for CommonBuffer, a platform that right now allocates
extra stuff in Map(), could move the same allocations into
AllocateBuffer(). Similarly, move the current de-allocations from
Unmap() into FreeBuffer(). The same is not possible for Read/Write, so
we shouldn't require that.

[Jiewen] For "require that Unmap() work for CommonBuffer operations
without releasing memory", I will hold my opinion until it is documented
clearly in the UEFI spec.
My current concerns are:
First, this sentence is NOT defined in the UEFI specification.
Second, it might break an existing implementation or existing feature,
such as tracing.
Until now, I did not see any restriction in the UEFI spec saying: in this
function, you are not allowed to call memory services. The only
restrictions are 1) the TPL restriction, where memory services must run at
<= TPL_NOTIFY, and 2) the AP restriction, where no UEFI service/protocol
is allowed on an AP.

> I do not think Free() should behave differently for a common buffer
> versus a read/write buffer.
> I am a little worried that we have too many assumptions here:
> e.g., Map() of a common buffer never involves Allocate(), even for
> driver-internal data structures, and Map() of a read/write buffer always
> involves Allocate(), even when there is no need to allocate anything.
> With those assumptions, we purposely avoid Unmap() for read/write
> buffers, but force Unmap() for common buffers.

Exactly.

> I am not sure if that assumption is correct.

I'm just trying to operate with the APIs currently defined by the UEFI
spec, and these assumptions were the best I could come up with.

[Jiewen] The PCI device driver must follow, and *only* follow, the UEFI
spec. Especially the IHV Option ROM should not consume any private API.

For example, if PciIo.Unmap() took one more parameter:

  IN BOOLEAN CalledAtExitBootServices

then all this would be solved at once.
Namely, PciIo implementations would be required to call Unmap() for
*all* DMA operations (read, write, commonbuffer) in their
ExitBootServices() notification functions, with the parameter set to
TRUE. All other contexts should pass FALSE.

In turn, CalledAtExitBootServices==TRUE would require PciIo.Unmap() to
work exactly as otherwise, but all memory allocated earlier with
gBS->AllocatePool() and gBS->AllocatePages() would have to be leaked on
purpose.

This is all. Given that this parameter does not exist, I'm trying to
find other ways to:

- perform the unmapping (re-encryption) at ExitBootServices,

- without releasing memory,

- and not in the IOMMU driver (because that could remove the translation
before the PCI driver's own callback could disable the device, and that
would be wrong).

>> 3) About interaction with the IOMMU
>> I am not worried about IOMMU interaction.
>> I believe a proper IOMMU implementation should not block any action
>> taken by a PCI device driver.
>> The current IntelVTdDxe driver disables the VT-d engine at
>> ExitBootServices, which means the DMA access is no longer managed by the
>> VT-d engine. It is NOT related to disabling the DMA buffer.
>
> Ah, I understand!
>
> So, on one hand, that is totally safe, for PCI device drivers. It means
> all pending DMA can get through, until the PCI drivers ask the devices
> (nicely) to stop doing DMA.
> [Jiewen] Yes. I think any IOMMU implementation should take that
> into consideration.
>
> On the other hand, it is the opposite of the security goal of SEV. With
> SEV, the default state is "all DMA forbidden" (= all guest memory
> encrypted). This is also the status that we'd like to achieve when
> ExitBootServices() completes. When control is transferred to the OS, all
> guest memory should be encrypted again, even those areas that were once
> used as CommonBuffers.
> [Jiewen] I am not sure who will control "When control is transferred to
> the OS, all guest memory should be encrypted again, even those areas that
> were once used as CommonBuffers."
> For the SEV case, I think it should be controlled by the OS, because it
> is just CR3.

If it were indeed just CR3, then I would fully agree with you. But -- to
my understanding -- ensuring that the memory is actually encrypted
requires that the guest *write* to the memory through a virtual address
whose PTE has the C-bit set. And I don't think the guest OS can be
expected to rewrite all of its memory at launch. :(

[Jiewen] That is fine. I suggest we get clear on the threat model as the
first step. The threat model for SEV usage in OVMF.
I am sorry if that was already discussed before; I might have missed the
conversation.
If you or any SEV owner can share the insight, that would be great.
See my question above: "If that is the threat model, I have a question on
the exposure: ..."

> If so, we should not meet any problem, because we already disable BME
> in the device driver's ExitBootServices event.

I agree, but installing new page tables is not enough.

> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU
> are opposites -- VT-d permits all DMA unless configured otherwise, while
> SEV forbids all DMA unless configured otherwise.
> [Jiewen] I do not think so. The default behavior of the VT-d IOMMU is to
> disable all DMA access.
> I set up the translation table, but mark all entries as NOT-PRESENT.
> I grant DMA access only if there is a specific request by a device.
>
> I open DMA access at ExitBootServices, just to make sure it is compatible
> with an existing OS, which might not have knowledge of the IOMMU.
> I will provide a patch to make it a policy very soon. As such, the VT-d
> IOMMU may turn off the IOMMU, or keep it enabled, at ExitBootServices.
> An IOMMU-aware OS may take over the IOMMU directly and reprogram it.

Thanks for the clarification.
But, again, will this not lead to the possibility that the VT-d IOMMU's
ExitBootServices() handler disables all DMA *before* the PCI drivers get
a chance to run their own ExitBootServices() handlers, disabling the
individual devices?

[Jiewen] Sorry for the confusion. Let me explain:
No, VT-d never disables *all* DMA at ExitBootServices.
"Disable VT-d" means "disable IOMMU protection" and "allow all DMA".
"Keep VT-d enabled" means "keep IOMMU protection enabled" and "only
allow the DMA from Map() requests".

If that happens, then a series of IOMMU faults could be generated, which
I described above. (I.e., such IOMMU faults could result, at least in
the case of SEV, in garbage being written to disk, via queued commands.)

[Jiewen] You are right. That would not happen. The IOMMU driver should not
bring any compatibility problem to the PCI driver, iff the PCI driver
follows the UEFI specification.

>> 4) About the drivers you modified
>> I believe we have some other PCI device drivers (NVMe/SDMMC/UFS),
>> besides the ones you modified in this patch (ATA/UHCI/EHCI/XHCI).
>> If we need to take this approach, I assume they also need updates, right?
>
> Yes, they should be updated similarly.
>
> SDMMC and UFS are not included in OVMF, so I couldn't test their behavior.
>
> NVMe is included in OVMF, but it is used very rarely in practice (we can
> call it a corner case), so I thought I would first submit patches for
> the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used
> with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was
> difficult :)
> [Jiewen] Understood. That is OK. We can handle that in a separate patch.

Thanks!

Now, to finish up, here's an idea I just had. Are we allowed to call
gBS->SignalEvent() in an ExitBootServices() notification function? If we
are, then we could do the following:

* PciIo.Unmap() and friends would work as always (no restrictions on
dynamic memory allocation or freeing, for any kind of DMA operation).
* PCI drivers would not be expected to call PciIo.Unmap() in their
ExitBootServices() handlers.

* The IOMMU driver would have an ExitBootServices() notification
function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level
(doesn't matter which).

* In this notification function, the IOMMU driver would signal *another*
event (a private one). The notification function for this event would be
enqueued strictly at the TPL_CALLBACK level.

* The notification function for the second event (private to the IOMMU)
would iterate over all existent mappings, and unmap them, without
allocating or freeing memory.

The key point is that by signaling the second event, the IOMMU driver
could order its own cleanup code after all other ExitBootServices()
callbacks. (Assuming, at least, that no other ExitBootServices()
callback employs the same trick to defer itself!) Therefore by the time
the second callback runs, all PCI devices have been halted, and it is
safe to tear down the translations.

The ordering would be ensured by the following:

- The UEFI spec says under CreateEventEx(), "All events are guaranteed
to be signaled before the first notification action is taken." This
means that, by the time the IOMMU's ExitBootServices() handler is
entered, all other ExitBootServices() handlers have been *queued* at
least, at TPL_CALLBACK or TPL_NOTIFY.

- Therefore, when we signal our second callback from there, for
TPL_CALLBACK, the second callback will be queued at the end of the
TPL_CALLBACK queue.

- CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is
functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag for
the Type argument of CreateEvent." So it wouldn't matter if a driver
used CreateEvent() or CreateEventEx() for setting up its own handler;
the handler would be queued just the same.
I think this is ugly and brittle, but perhaps the only way to clean up
*all* translations safely, with regard to PciIo.Unmap() +
ExitBootServices(), without updating the UEFI spec.

What do you think?

[Jiewen] If the order is correct, and all PCI device drivers are halted at
ExitBootServices, that works.
We do not need to update the PCI drivers, and we do not need to update the
UEFI spec.
We only need to update the IOMMU driver that is concerned, based upon the
threat model.
That is probably the best solution. :-)
I assume you want to handle both the common buffer and the read/write
buffer, right?

And if possible, I still have an interest in getting clear on the threat
model for SEV in OVMF. If you or any SEV owner can share ...

Thanks,
Laszlo