Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

From: Laszlo Ersek <lersek@redhat.com>
To: "Yao, Jiewen" <jiewen.yao@intel.com>,
	Brijesh Singh <brijesh.singh@amd.com>,
	"Zeng, Star" <star.zeng@intel.com>,
	edk2-devel-01 <edk2-devel@lists.01.org>
Cc: "Dong, Eric" <eric.dong@intel.com>
Subject: Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
Date: Thu, 7 Sep 2017 18:40:57 +0200	[thread overview]
Message-ID: <5300a156-2244-ae24-7a66-0a90a3a33148@redhat.com> (raw)
In-Reply-To: <74D8A39837DF1E4DA445A8C0B3885C503A9A92BA@shsmsx102.ccr.corp.intel.com>

On 09/07/17 16:48, Yao, Jiewen wrote:
> Great. Thanks to confirm that. Now it is clear to me.
> 
> Then let's wait Laszlo's new patch to make all DMA buffer to private. :)

I've got the patches ready and smoke-tested. Let me look a bit more at
the logs, and re-review the patches. I hope I can post the series still
today (well, in my timezone anyway :) )

Thanks!
Laszlo

> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Brijesh Singh
> Sent: Thursday, September 7, 2017 10:40 PM
> To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
> Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com>
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
> 
> Hi Jiewen,
> 
> On 09/07/2017 06:46 AM, Laszlo Ersek wrote:
>> On 09/07/17 06:46, Yao, Jiewen wrote:
>>> Thanks for the sharing, Brijesh.
>>>
>>> "If a page was marked as "shared"
>>> then its guest responsibility to make it "private" after its done communicating with
>>> hypervisor."
>>>
>>> I believe I have same understanding - a *guest* should guarantee that.
>>>
>>> My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"?
>>>
>>> Maybe I can ask the specific question to get it more clear.
>>>
>>> 1)       If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue?
>>>
>>> 2)       If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue?
>>
>> Very good questions, totally to the point.
>>
>> On the authoritative answer, I defer to Brijesh.
>>
> 
> 
> Both the above cases (#1 and #2) are problems. Since buffers was owned by firmware
> and firmware made it "shared" hence firmware is responsible to mark them as private
> after its done with the buffer. In other words, we must call Unmap() from ExitBootServices
> to ensure that buffers mapped using BusMasterCommonBuffer/BusMasterRead/BusMasterWrite
> is marked as "private" before we make it available to the guest OS. (we do similar thing
> in Linux OS).
> 
> Having any kind of side channel to communicate the encryption status of a page
> will not work -- we should be able to support a usecase where we boot OVMF in
> 64-bit but launch 32-bit Linux guest. When Linux boots in 32-bit mode it does not have
> access to encryption bit (C-bit is bit-47 in page table) and can't mark the page as
> private (even if we provide some kind of side-channel).
> 
> thank you very much for all your help.
> 
>> (
>>
>> My personal opinion is that both situations (#1 and #2) are problems,
>> because they break the *practical* security invariant for SEV guests:
>>
>> - most memory should be encrypted at all times, *and*
>>
>> - any memory that is decrypted must have an owner that is responsible
>>    for re-encrypting the memory eventually.
>>
>> Therefore, *either* the firmware has to re-encrypt all remaining DMA
>> buffers at ExitBootServices(), *or* a new information channel must be
>> designed, from firmware to OS, to carry over the decryption status.
>>
>> I strongly prefer the first option, for the following reason: the same
>> questions apply to all EDK2 IOMMU protocol interfaces, not just the one
>> exported by the SEV driver.
>>
>> )
>>
>> Thanks,
>> Laszlo
>>
>>> From: Brijesh Singh [mailto:brijesh.singh@amd.com]
>>> Sent: Wednesday, September 6, 2017 11:40 PM
>>> To: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com>>; Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
>>> Cc: brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>; Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>
>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>>>
>>>
>>>
>>> On 09/06/2017 07:14 AM, Laszlo Ersek wrote:
>>>> On 09/06/17 06:37, Yao, Jiewen wrote:
>>>>> Thanks for the clarification. Comment in line.
>>>>>
>>>>> From: Laszlo Ersek [mailto:lersek@redhat.com]
>>>>> Sent: Wednesday, September 6, 2017 1:57 AM
>>>>> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com<mailto:jiewen.yao@intel.com%3cmailto:jiewen.yao@intel.com>>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>>
>>>>> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com<mailto:eric.dong@intel.com%3cmailto:eric.dong@intel.com>>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com<mailto:brijesh.singh@amd.com%3cmailto:brijesh.singh@amd.com>>>
>>>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
>>>>
>>>>>> Then after ExitBootService, the OS will take control of CR3 and set correct
>>>>>> encryption bit.
>>>>>
>>>>> This is true, the guest OS will set up new page tables, and in those
>>>>> PTEs, the C-bit ("encrypt") will likely be set by default.
>>>>>
>>>>> However, the guest OS will not *rewrite* all of the memory, with the
>>>>> C-bit set. This means that any memory that the firmware didn't actually
>>>>> re-encrypt (in the IOMMU driver) will still expose stale data to the
>>>>> hypervisor.
>>>>> [Jiewen] That is an interesting question.
>>>>> Is there any security concern associated?
>>>>
>>>> I can't tell for sure. Answering this question is up to the proponents
>>>> of SEV, who have designed the threat model for SEV.
>>>>
>>>> My operating assumption is that we should keep memory as tightly
>>>> encrypted as possible at the firmware --> OS control transfer. It could
>>>> be an exaggerated expectation from my side; I'd just better be safe than
>>>> sorry :)
>>>>
>>>>
>>>
>>> Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then
>>> we can discuss a security model (if needed).
>>>
>>> SEV is an extension to the AMD-V architecture which supports running encrypted
>>> virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their
>>> pages (code and data) secured such that only the guest itself has access to the
>>> unencrypted version. Each encrypted VMs is associated with a unique encryption
>>> key; if its data is accessed by a different entity using a different key the
>>> encrypted guest data will be incorrectly decrypted, leading to unintelligible data.
>>> You can also find some detail for SEV in white paper [1].
>>>
>>> SEV encrypted Vs have the concept of private and shared memory. The private memory
>>> is encrypted with the guest-specific key, while shared memory may be encrypted
>>> with hypervisor key. SEV guest VMs can choose which pages they would like to
>>> be private. But the instruction pages and guest page tables are always treated
>>> as private by the hardware. The DMA operation inside the guest must be performed
>>> on shared pages -- this is mainly because in virtualization world the device
>>> DMA needs some assistance from the hypervisor.
>>>
>>> The security model provided by the SEV ensure that hypervisor will no longer able
>>> to inspect or alter any guest code or data. The guest OS controls what it want to
>>> share with outside world (in this case with Hypervisor).
>>>
>>> In software implementation we took approach to encrypt everything possible starting
>>> early in boot. In this approach whenever OVMF wants to perform certain DMA operations
>>> it allocate a shared page, issues the command, free the shared page after the command
>>> is completed (in other word we use sw bounce buffer to complete the DMA operation).
>>>
>>> We have implemented IOMMU driver to provide the following functions:
>>>
>>> AllocateBuffer():
>>> --------------------
>>> it allocate a private pages, as per UEFI spec the driver will map the buffer allocated
>>> from this routine using BusMasterCommonBuffer hence we allocate extra stash pages
>>> in addition to requested pages.
>>>
>>>
>>> Map
>>> ----
>>> If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer
>>> and DeviceAddress point to shared buffer.
>>>
>>> If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the
>>> contents and update the page-table to clear the C-bit (basically make it shared page)
>>>
>>> Unmap
>>> ------
>>> If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free
>>> the shared buffer allocated in Map().
>>>
>>> If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit
>>> (basically make the page private)
>>>
>>> FreeBuffer:
>>> -----------
>>> Free the pages
>>>
>>>
>>> Lets run with the below scenario:
>>>
>>> 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read)
>>> 2) hypervisor completes the request operation
>>> 3) hypervisor caches the shared physical address and monitor it for information leak
>>> 4) sometime later if guest write data in its "shared" physical address then hypervisor can
>>>      easily read it without guest knowledge.
>>>
>>> SEV hardware can protect us against the attack where someone tries to inject or alter the
>>> guest code. As I noted above any instruction fetch is forced as private hence if attacker
>>> write some code into a shared buffer and point the RIP to his/her code then instruction
>>> fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared"
>>> then its guest responsibility to make it "private" after its done communicating with
>>> hypervisor.
>>>
>>> [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
>>>
>>>
>>>>> If the C-bit is cleared for a read/write buffer, is the content in the
>>>>> read/write buffer also exposed to hypervisor?
>>>>
>>>> Not immediately. Only when the guest also rewrites the memory through
>>>> the appropriate virtual address.
>>>>
>>>> Basically, in the virtual machine, the C-bit in a PTE that covers a
>>>> particular guest virtual address (GVA) controls whether a guest write
>>>> through that GVA will result in memory encryption, and if a gues read
>>>> through that GVA will result in memory decryption.
>>>>
>>>> The current state of the C-bit doesn't matter for the hypervisor, what
>>>> matters is the last guest write through a GVA whose PTE had the C-bit
>>>> set, or clear. If the C-bit was clear when the guest last wrote the
>>>> area, then the hypervisor can read the data. If the C-bit was set, then
>>>> the hypervisor can only read garbage.
>>>>
>>>>
>>>>> I means if there is security concern, the concern should be applied to
>>>>> both common buffer and read/write buffer.
>>>>> Then we have to figure a way to resolve both buffer.
>>>>
>>>> Yes, this is correct. The PciIo.Unmap operation, as currently
>>>> implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption
>>>> correctly for all operations, but it can only guarantee *not* freeing
>>>> memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices
>>>> is safe, while Unmap()ing Read/Write is not. The encryption would be
>>>> re-established just fine for Read/Write as well, but we would change the
>>>> UEFI memmap.
>>>>
>>>> In OVMF, we currently manage this problem by making all asynchronous DMA
>>>> mappings CommonBuffer, even if they could othewise be Read or Write. We
>>>> use Read and Write only if the DMA operation completes before the
>>>> higher-level protocol function returns (so we can immediately Unmap the
>>>> operation, and the ExitBootServices() handler doesn't have to care).
>>>>
>>>>
>>>>> To be honest, that is exactly my biggest question on this patch:
>>>>> why do we only handle the common buffer specially?
>>>>
>>>> For the following reason:
>>>>
>>>> - Given that CommonBuffer mappings take a separate AllocateBuffer() /
>>>> FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo
>>>> implementors -- beyond what the UEFI spec requries -- to keep *all*
>>>> dynamic memory management out of Map/Unmap. If they need dynamic memory
>>>> management, we ask them to do it in AllocateBuffer/FreeBuffer instead.
>>>> This way Unmap is safe in ExitBootServices handlers.
>>>>
>>>> - We cannot *reasonably ask* PciIo implementors to provide the same
>>>> guarantee for Read and Write mappings, simply because there are no other
>>>> APIs that could manage the dynamic memory for Map and Unmap --
>>>> AllocateBuffer and FreeBuffer are not required for Read and Write
>>>> mappings. PciIo implementors couldn't agree to our request even if they
>>>> wanted to. Therefore Unmapping Read/Write operations is inherently
>>>> unsafe in EBS handlers.
>>>>
>>>>
>>>>>> NOTE: The device should still be halt state, because the device is
>>>>>> also controlled by OS.
>>>>>
>>>>> This is the key point -- the device is *not* controlled by the guest OS,
>>>>> in the worst case.
>>>>>
>>>>> The virtual device is ultimately implemented by the hypervisor. We don't
>>>>> expect the virtual device (= the hypervisor) to mis-behave on purpose.
>>>>> However, if the hypervisor is compromised by an external attacker (for
>>>>> example over the network, or via privilege escalation from another
>>>>> guest), then all guest RAM that has not been encrypted up to that point
>>>>> becomes visible to the attacker. In other words, the hypervisor ("the
>>>>> device") becomes retro-actively malicious.
>>>>> [Jiewen] If that is the threat model, I have a question on the exposure:
>>>>> 1) If the concern is for existing data, it means all DMA data already written.
>>>>> However, the DMA data is already consumed or produced by virtual device
>>>>> or a hypervisor. If the hypervisor is attacked, it already gets all the data content.
>>>>
>>>> Maybe, maybe not. The data may have been sent to a different host over
>>>> the network, and wiped from memory.
>>>>
>>>> Or, the data may have been written to a disk image that is separately
>>>> encrypted by the host OS, and has been detached (unplugged) from the
>>>> guest, and also unmounted from the host, with the disk key securely
>>>> wiped from host memory.
>>>>
>>>> We can also imagine the reverse direction. Assume that the user of the
>>>> virtual machine goes into the UEFI shell in OVMF, starts the EDIT
>>>> program, and types some secret information into a text file on the ESP.
>>>> Further assume that the disk image is encrypted on the host OS. It is
>>>> conceivable that fragments of the secret could remain stuck in the AHCI
>>>> (disk) and USB (keyboard) DMA buffers.
>>>>
>>>> Maybe you think that these are "made up" examples, and I agree, I just
>>>> made them up. The point is, it is not my place to discount possible
>>>> attack vectors (I haven't designed SEV). Attackers will be limited by
>>>> their *own* imaginations only, not by mine :)
>>>>
>>>>
>>>>> 2) if the concern is for future data, it means all user data to be written.
>>>>> However, the C-bit already prevent that.
>>>>
>>>> As far as I understand SEV, future data is out of scope. The goal is to
>>>> prevent *retroactive* guest data leaks (= whatever is currently in host
>>>> OS memory) if an attacker compromises an otherwise non-malicious hypervisor.
>>>>
>>>> If an attacker compromises a virtualization host, then they can just sit
>>>> silent and watch data flow into and out of guests from that point
>>>> onward, because they can then monitor all DMA (which always happens in
>>>> clear-text) real-time.
>>>>
>>>>
>>>>> Maybe I do not fully understand the threat model defined.
>>>>> If you can explain more, that would be great.
>>>>
>>>> Well I tried to explain *my* understanding of SEV :) I hope Brijesh will
>>>> correct me if I'm wrong.
>>>>
>>>>
>>>>> The point of SEV is to keep as much guest data encrypted at all times as
>>>>> possible, so that *whenever* the hypervisor is compromised by an
>>>>> attacker, the guest memory -- which the attacker can inevitably dump
>>>>> from the host side -- remains un-intellegible to the attacker.
>>>>> [Jiewen] OK. If this is a security question, I suggest we define a clear
>>>>> threat model at first.
>>>>> Or what we are doing might be unnecessary or insufficient.
>>>>
>>>> I agree.
>>>>
>>>> As I said above, my operating principle has been to re-encrypt all DMA
>>>> buffers as soon as possible. For long-standing DMA buffers, re-encrypt
>>>> them at ExitBootServices at the latest.
>>>>
>>>>
>>>>> [Jiewen] For "require that Unmap() work for CommonBuffer
>>>>> operations without releasing memory", I would hold my opinion until
>>>>> it is documented clearly in UEFI spec.
>>>>
>>>> You are right, of course.
>>>>
>>>> It's just that we cannot avoid exploring ways, for this feature, that
>>>> currently fall outside of the spec. Whatever we do here will be outside
>>>> of the spec, one way or another. I suggested what I thought could be a
>>>> reasonable extension to the spec, one that could be implemented by PciIo
>>>> implementors even before the spec is modified.
>>>>
>>>> edk2's PciIo.Unmap already behaves like this, without breaking the spec.
>>>>
>>>> If there's a better way -- for example modifying CoreExitBootServices()
>>>> in edk2, to signal IOMMU drivers separately, *after* notifying other
>>>> ExitBootServices() handlers --, that might work as well.
>>>>
>>>>
>>>>> My current concern is:
>>>>> First, this sentence is NOT defined in UEFI specification.
>>>>
>>>> Correct.
>>>>
>>>>
>>>>> Second, it might break an existing implementation or existing feature, such as tracing.
>>>>
>>>> Correct.
>>>>
>>>>> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed
>>>>> to call memory services.
>>>>> The only restriction is
>>>>> 1) TPL restriction, where memory service must be <= TPL_NOTIFY.
>>>>> 2) AP restriction, where no UEFI service/protocol is allowed for AP.
>>>>
>>>> I agree.
>>>>
>>>>
>>>>> I'm just trying to operate with the APIs currently defined by the UEFI
>>>>> spec, and these assumptions were the best I could come up with.
>>>>> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec.
>>>>> Especially the IHV Option ROM should not consume any private API.
>>>>
>>>> I disagree about "only follow". If there are additional requirements
>>>> that can be satisfied without breaking the spec, driver implementors can
>>>> choose to conform to both sets of requirements.
>>>>
>>>> For example (if I understand correctly), Microsoft / Windows present a
>>>> bunch of requirements *beyond* the UEFI spec, for both platform and
>>>> add-on card firmware. Most vendors conform :)
>>>>
>>>>
>>>>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all
>>>>>> guest memory should be encrypted again, even those areas that were once
>>>>>> used as CommonBuffers."
>>>>>> For SEV case, I think it should be controlled by OS, because it is just CR3.
>>>>>
>>>>> If it was indeed just CR3, then I would fully agree with you.
>>>>>
>>>>> But -- to my understanding --, ensuring that the memory is actually
>>>>> encrypted requires that the guest *write* to the memory through a
>>>>> virtual address whose PTE has the C-bit set.
>>>>>
>>>>> And I don't think the guest OS can be expected to rewrite all of its
>>>>> memory at launch. :(
>>>>>
>>>>> [Jiewen] That is fine.
>>>>> I suggest we get clear on the threat model as the first step.
>>>>> The threat model for SEV usage in OVMF.
>>>>>
>>>>> I am sorry if that is already discussed before. I might ignore the conversation.
>>>>
>>>> No problem; it's always good to re-investigate our assumptions and
>>>> operating principles.
>>>>
>>>>
>>>>> If you or any SEV owner can share the insight, that will be great.
>>>>> See my question above "If that is the threat model, I have a question on the exposure:..."
>>>>
>>>> I shared *my* impressions of the threat model (which are somewhat
>>>> unclear, admittedly, and thus could make me overly cautious).
>>>>
>>>> I hope Brijesh can extend and/or correct my description.
>>>>
>>>>
>>>>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU
>>>>>> are opposites -- VT-d permits all DMA unless configured otherwise, while
>>>>>> SEV forbids all DMA unless configured otherwise.
>>>>>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access.
>>>>>> I setup translation table, but mark all entry to be NOT-PRESENT.
>>>>>> I grant DMA access only if there is a specific request by a device.
>>>>>>
>>>>>> I open DMA access in ExitBootServices, just want to make sure it is compatible to
>>>>>> the existing OS, which might not have knowledge on IOMMU.
>>>>>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may
>>>>>> turn off IOMMU, or keep it enabled at ExitBootService.
>>>>>> An IOMMU aware OS may take over IOMMU directly and reprogram it.
>>>>>
>>>>> Thanks for the clarification.
>>>>>
>>>>> But, again, will this not lead to the possibility that the VT-d IOMMU's
>>>>> ExitBootServices() handler disables all DMA *before* the PCI drivers get
>>>>> a chance to run their own ExitBootServices() handlers, disabling the
>>>>> individual devices?
>>>>> [Jiewen] Sorry for the confusing. Let me explain:
>>>>> No, VTd never disables *all* DMA buffer at ExitBootService.
>>>>>
>>>>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA".
>>>>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and
>>>>> "only allow the DMA from Map() request".
>>>>
>>>> Okay, but this interpretation was exactly what I thought of first (see
>>>> above): "VT-d permits all DMA unless configured otherwise". You answered
>>>> that it wasn't the case.
>>>>
>>>> So basically, if I understand it correctly now, at ExitBootServices the
>>>> VT-d IOMMU driver opens up all RAM for DMA access. Is that correct?
>>>>
>>>> That is equivalent to decrypting all memory under SEV, and is the exact
>>>> opposite of what we want. (As I understand it.)
>>>>
>>>>
>>>>> If that happens, then a series of IOMMU faults could be generated, which
>>>>> I described above. (I.e., such IOMMU faults could result, at least in
>>>>> the case of SEV, in garbage being written to disk, via queued commands.)
>>>>> [Jiewen] You are right. That would not happen.
>>>>> IOMMU driver should not bring any compatibility problem for the PCI driver,
>>>>> iff the PCI driver follows the UEFI specification.
>>>>
>>>> Agreed.
>>>>
>>>>
>>>>> Now, to finish up, here's an idea I just had.
>>>>>
>>>>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices()
>>>>> notification function?
>>>>>
>>>>> If we are, then we could do the following:
>>>>>
>>>>> * PciIo.Unmap() and friends would work as always (no restrictions on
>>>>>     dynamic memory allocation or freeing, for any kind of DMA operation).
>>>>>
>>>>> * PCI drivers would not be expected to call PciIo.Unmap() in their
>>>>>     ExitBootServices() handlers.
>>>>>
>>>>> * The IOMMU driver would have an ExitBootServices() notification
>>>>>     function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level
>>>>>     (doesn't matter which).
>>>>>
>>>>> * In this notification function, the IOMMU driver would signal *another*
>>>>>     event (a private one). The notification function for this event would
>>>>>     be enqueued strictly at the TPL_CALLBACK level.
>>>>>
>>>>> * The notification function for the second event (private to the IOMMU)
>>>>>     would iterate over all existent mappings, and unmap them, without
>>>>>     allocating or freeing memory.
>>>>>
>>>>> The key point is that by signaling the second event, the IOMMU driver
>>>>> could order its own cleanup code after all other ExitBootServices()
>>>>> callbacks. (Assuming, at least, that no other ExitBootServices()
>>>>> callback employs the same trick to defer itself!) Therefore by the time
>>>>> the second callback runs, all PCI devices have been halted, and it is
>>>>> safe to tear down the translations.
>>>>>
>>>>> The ordering would be ensured by the following:
>>>>>
>>>>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed
>>>>>     to be signaled before the first notification action is taken." This
>>>>>     means that, by the time the IOMMU's ExitBootServices() handler is
>>>>>     entered, all other ExitBootServices() handlers have been *queued* at
>>>>>     least, at TPL_CALLBACK or TPL_NOTIFY.
>>>>>
>>>>> - Therefore, when we signal our second callback from there, for
>>>>>     TPL_CALLBACK, the second callback will be queued at the end of the
>>>>>     TPL_CALLBACK queue.
>>>>>
>>>>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is
>>>>>     functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag
>>>>>     for the Type argument of CreateEvent." So it wouldn't matter if a
>>>>>     driver used CreateEvent() or CreateEventEx() for setting up its own
>>>>>     handler, the handler would be queued just the same.
>>>>>
>>>>> I think this is ugly and brittle, but perhaps the only way to clean up
>>>>> *all* translations safely, with regard to PciIo.Unmap() +
>>>>> ExitBootServices(), without updating the UEFI spec.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works.
>>>>> We do not need update PCI driver and we do not need update UEFI spec.
>>>>> We only need update IOMMU driver which is concerned, based upon the threat model.
>>>>> That probably is best solution. :-)
>>>>
>>>> I'm very glad to hear that you like the idea.
>>>>
>>>> However, it depends on whether we are permitted, by the UEFI spec, to
>>>> signal another event in an ExitBootServices() notification function.
>>>>
>>>> Are we permitted to do that?
>>>>
>>>> Does the UEFI spec guarantee that the notification function for the
>>>> *second* event will be queued just like it would be under "normal"
>>>> circumstances?
>>>>
>>>> (I know we must not allocate or free memory in the notification function
>>>> of the *second* event either; I just want to know if the second event's
>>>> handler is *queued* like it would normally be.)
>>>>
>>>>
>>>>> I assume you want to handle both common buffer and read/write buffer, right?
>>>>
>>>> Yes. Under this idea, all kinds of operations would be cleaned up.
>>>>
>>>>
>>>>> And if possible, I still have interest to get clear on the threat model for SEV in OVMF.
>>>>> If you or any SEV owner can share ...
>>>>
>>>> Absolutely. Please see above.
>>>>
>>>> Thank you!
>>>> Laszlo
>>>>
>>> _______________________________________________
>>> edk2-devel mailing list
>>> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
>>> https://lists.01.org/mailman/listinfo/edk2-devel
>>>
>>
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
> https://lists.01.org/mailman/listinfo/edk2-devel
>