Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

From: "Yao, Jiewen" <jiewen.yao@intel.com>
To: Laszlo Ersek <lersek@redhat.com>,
	"Zeng, Star" <star.zeng@intel.com>,
	edk2-devel-01 <edk2-devel@lists.01.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>,
	"Dong, Eric" <eric.dong@intel.com>
Subject: Re: [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()
Date: Thu, 7 Sep 2017 04:34:46 +0000	[thread overview]
Message-ID: <74D8A39837DF1E4DA445A8C0B3885C503A9A8EE2@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <24d66e81-336e-3924-8045-4749d98e2fbb@redhat.com>

Thanks.
It seems the discussion becomes too long. I try to make it short.

1)       As long as we use same mechanism to handle both common buffer and read/write buffer. I have no concern at all. :)

2)       I am sorry that I do not have an immediate answer for your question on event handling in ExitBootServices.
Although it seems tricky, I believe it works. Can you have a try?

3)       Looking at the code (DxeMain.c), I have another idea for your consideration only.

  //
  // Notify other drivers that we are exiting boot services.
  //
  CoreNotifySignalList (&gEfiEventExitBootServicesGuid);

  //
  // Report that ExitBootServices() has been called
  //
  REPORT_STATUS_CODE (
    EFI_PROGRESS_CODE,
    (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES)
    );

The cleanup driver may register a report status code handle to do the cleanup. :)

We already have such example in MdeModulePkg\Universal\Acpi\FirmwarePerformanceDataTableDxe\ FirmwarePerformanceDxe.c.
FpdtStatusCodeListenerDxe()

  } else if (Value == (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES)) {
......

The requirement will become: If a platform wants to add cleanup, it must use a real report status code library.

Thank you
Yao Jiewen

From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek
Sent: Wednesday, September 6, 2017 8:15 PM
To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org>
Cc: Brijesh Singh <brijesh.singh@amd.com>; Dong, Eric <eric.dong@intel.com>
Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

On 09/06/17 06:37, Yao, Jiewen wrote:
> Thanks for the clarification. Comment in line.
>
> From: Laszlo Ersek [mailto:lersek@redhat.com]
> Sent: Wednesday, September 6, 2017 1:57 AM
> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>>
> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>>
> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices()

>> Then after ExitBootService, the OS will take control of CR3 and set correct
>> encryption bit.
>
> This is true, the guest OS will set up new page tables, and in those
> PTEs, the C-bit ("encrypt") will likely be set by default.
>
> However, the guest OS will not *rewrite* all of the memory, with the
> C-bit set. This means that any memory that the firmware didn't actually
> re-encrypt (in the IOMMU driver) will still expose stale data to the
> hypervisor.
> [Jiewen] That is an interesting question.
> Is there any security concern associated?

I can't tell for sure. Answering this question is up to the proponents
of SEV, who have designed the threat model for SEV.

My operating assumption is that we should keep memory as tightly
encrypted as possible at the firmware --> OS control transfer. It could
be an exaggerated expectation from my side; I'd just better be safe than
sorry :)

> If the C-bit is cleared for a read/write buffer, is the content in the
> read/write buffer also exposed to hypervisor?

Not immediately. Only when the guest also rewrites the memory through
the appropriate virtual address.

Basically, in the virtual machine, the C-bit in a PTE that covers a
particular guest virtual address (GVA) controls whether a guest write
through that GVA will result in memory encryption, and if a gues read
through that GVA will result in memory decryption.

The current state of the C-bit doesn't matter for the hypervisor, what
matters is the last guest write through a GVA whose PTE had the C-bit
set, or clear. If the C-bit was clear when the guest last wrote the
area, then the hypervisor can read the data. If the C-bit was set, then
the hypervisor can only read garbage.

> I means if there is security concern, the concern should be applied to
> both common buffer and read/write buffer.
> Then we have to figure a way to resolve both buffer.

Yes, this is correct. The PciIo.Unmap operation, as currently
implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption
correctly for all operations, but it can only guarantee *not* freeing
memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices
is safe, while Unmap()ing Read/Write is not. The encryption would be
re-established just fine for Read/Write as well, but we would change the
UEFI memmap.

In OVMF, we currently manage this problem by making all asynchronous DMA
mappings CommonBuffer, even if they could othewise be Read or Write. We
use Read and Write only if the DMA operation completes before the
higher-level protocol function returns (so we can immediately Unmap the
operation, and the ExitBootServices() handler doesn't have to care).

> To be honest, that is exactly my biggest question on this patch:
> why do we only handle the common buffer specially?

For the following reason:

- Given that CommonBuffer mappings take a separate AllocateBuffer() /
FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo
implementors -- beyond what the UEFI spec requries -- to keep *all*
dynamic memory management out of Map/Unmap. If they need dynamic memory
management, we ask them to do it in AllocateBuffer/FreeBuffer instead.
This way Unmap is safe in ExitBootServices handlers.

- We cannot *reasonably ask* PciIo implementors to provide the same
guarantee for Read and Write mappings, simply because there are no other
APIs that could manage the dynamic memory for Map and Unmap --
AllocateBuffer and FreeBuffer are not required for Read and Write
mappings. PciIo implementors couldn't agree to our request even if they
wanted to. Therefore Unmapping Read/Write operations is inherently
unsafe in EBS handlers.

>> NOTE: The device should still be halt state, because the device is
>> also controlled by OS.
>
> This is the key point -- the device is *not* controlled by the guest OS,
> in the worst case.
>
> The virtual device is ultimately implemented by the hypervisor. We don't
> expect the virtual device (= the hypervisor) to mis-behave on purpose.
> However, if the hypervisor is compromised by an external attacker (for
> example over the network, or via privilege escalation from another
> guest), then all guest RAM that has not been encrypted up to that point
> becomes visible to the attacker. In other words, the hypervisor ("the
> device") becomes retro-actively malicious.
> [Jiewen] If that is the threat model, I have a question on the exposure:
> 1) If the concern is for existing data, it means all DMA data already written.
> However, the DMA data is already consumed or produced by virtual device
> or a hypervisor. If the hypervisor is attacked, it already gets all the data content.

Maybe, maybe not. The data may have been sent to a different host over
the network, and wiped from memory.

Or, the data may have been written to a disk image that is separately
encrypted by the host OS, and has been detached (unplugged) from the
guest, and also unmounted from the host, with the disk key securely
wiped from host memory.

We can also imagine the reverse direction. Assume that the user of the
virtual machine goes into the UEFI shell in OVMF, starts the EDIT
program, and types some secret information into a text file on the ESP.
Further assume that the disk image is encrypted on the host OS. It is
conceivable that fragments of the secret could remain stuck in the AHCI
(disk) and USB (keyboard) DMA buffers.

Maybe you think that these are "made up" examples, and I agree, I just
made them up. The point is, it is not my place to discount possible
attack vectors (I haven't designed SEV). Attackers will be limited by
their *own* imaginations only, not by mine :)

> 2) if the concern is for future data, it means all user data to be written.
> However, the C-bit already prevent that.

As far as I understand SEV, future data is out of scope. The goal is to
prevent *retroactive* guest data leaks (= whatever is currently in host
OS memory) if an attacker compromises an otherwise non-malicious hypervisor.

If an attacker compromises a virtualization host, then they can just sit
silent and watch data flow into and out of guests from that point
onward, because they can then monitor all DMA (which always happens in
clear-text) real-time.

> Maybe I do not fully understand the threat model defined.
> If you can explain more, that would be great.

Well I tried to explain *my* understanding of SEV :) I hope Brijesh will
correct me if I'm wrong.

> The point of SEV is to keep as much guest data encrypted at all times as
> possible, so that *whenever* the hypervisor is compromised by an
> attacker, the guest memory -- which the attacker can inevitably dump
> from the host side -- remains un-intellegible to the attacker.
> [Jiewen] OK. If this is a security question, I suggest we define a clear
> threat model at first.
> Or what we are doing might be unnecessary or insufficient.

I agree.

As I said above, my operating principle has been to re-encrypt all DMA
buffers as soon as possible. For long-standing DMA buffers, re-encrypt
them at ExitBootServices at the latest.

> [Jiewen] For "require that Unmap() work for CommonBuffer
> operations without releasing memory", I would hold my opinion until
> it is documented clearly in UEFI spec.

You are right, of course.

It's just that we cannot avoid exploring ways, for this feature, that
currently fall outside of the spec. Whatever we do here will be outside
of the spec, one way or another. I suggested what I thought could be a
reasonable extension to the spec, one that could be implemented by PciIo
implementors even before the spec is modified.

edk2's PciIo.Unmap already behaves like this, without breaking the spec.

If there's a better way -- for example modifying CoreExitBootServices()
in edk2, to signal IOMMU drivers separately, *after* notifying other
ExitBootServices() handlers --, that might work as well.

> My current concern is:
> First, this sentence is NOT defined in UEFI specification.

Correct.

> Second, it might break an existing implementation or existing feature, such as tracing.

Correct.

> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed
> to call memory services.
> The only restriction is
> 1) TPL restriction, where memory service must be <= TPL_NOTIFY.
> 2) AP restriction, where no UEFI service/protocol is allowed for AP.

I agree.

> I'm just trying to operate with the APIs currently defined by the UEFI
> spec, and these assumptions were the best I could come up with.
> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec.
> Especially the IHV Option ROM should not consume any private API.

I disagree about "only follow". If there are additional requirements
that can be satisfied without breaking the spec, driver implementors can
choose to conform to both sets of requirements.

For example (if I understand correctly), Microsoft / Windows present a
bunch of requirements *beyond* the UEFI spec, for both platform and
add-on card firmware. Most vendors conform :)

>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all
>> guest memory should be encrypted again, even those areas that were once
>> used as CommonBuffers."
>> For SEV case, I think it should be controlled by OS, because it is just CR3.
>
> If it was indeed just CR3, then I would fully agree with you.
>
> But -- to my understanding --, ensuring that the memory is actually
> encrypted requires that the guest *write* to the memory through a
> virtual address whose PTE has the C-bit set.
>
> And I don't think the guest OS can be expected to rewrite all of its
> memory at launch. :(
>
> [Jiewen] That is fine.
> I suggest we get clear on the threat model as the first step.
> The threat model for SEV usage in OVMF.
>
> I am sorry if that is already discussed before. I might ignore the conversation.

No problem; it's always good to re-investigate our assumptions and
operating principles.

> If you or any SEV owner can share the insight, that will be great.
> See my question above "If that is the threat model, I have a question on the exposure:..."

I shared *my* impressions of the threat model (which are somewhat
unclear, admittedly, and thus could make me overly cautious).

I hope Brijesh can extend and/or correct my description.

>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU
>> are opposites -- VT-d permits all DMA unless configured otherwise, while
>> SEV forbids all DMA unless configured otherwise.
>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access.
>> I setup translation table, but mark all entry to be NOT-PRESENT.
>> I grant DMA access only if there is a specific request by a device.
>>
>> I open DMA access in ExitBootServices, just want to make sure it is compatible to
>> the existing OS, which might not have knowledge on IOMMU.
>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may
>> turn off IOMMU, or keep it enabled at ExitBootService.
>> An IOMMU aware OS may take over IOMMU directly and reprogram it.
>
> Thanks for the clarification.
>
> But, again, will this not lead to the possibility that the VT-d IOMMU's
> ExitBootServices() handler disables all DMA *before* the PCI drivers get
> a chance to run their own ExitBootServices() handlers, disabling the
> individual devices?
> [Jiewen] Sorry for the confusing. Let me explain:
> No, VTd never disables *all* DMA buffer at ExitBootService.
>
> "disable VTd" means to "disable IOMMU protection" and "allow all DMA".
> "Keep VTd enabled" means to "keep IOMMU protection enabled" and
> "only allow the DMA from Map() request".

Okay, but this interpretation was exactly what I thought of first (see
above): "VT-d permits all DMA unless configured otherwise". You answered
that it wasn't the case.

So basically, if I understand it correctly now, at ExitBootServices the
VT-d IOMMU driver opens up all RAM for DMA access. Is that correct?

That is equivalent to decrypting all memory under SEV, and is the exact
opposite of what we want. (As I understand it.)

> If that happens, then a series of IOMMU faults could be generated, which
> I described above. (I.e., such IOMMU faults could result, at least in
> the case of SEV, in garbage being written to disk, via queued commands.)
> [Jiewen] You are right. That would not happen.
> IOMMU driver should not bring any compatibility problem for the PCI driver,
> iff the PCI driver follows the UEFI specification.

Agreed.

> Now, to finish up, here's an idea I just had.
>
> Are we allowed to call gBS->SignalEvent() in an ExitBootServices()
> notification function?
>
> If we are, then we could do the following:
>
> * PciIo.Unmap() and friends would work as always (no restrictions on
>   dynamic memory allocation or freeing, for any kind of DMA operation).
>
> * PCI drivers would not be expected to call PciIo.Unmap() in their
>   ExitBootServices() handlers.
>
> * The IOMMU driver would have an ExitBootServices() notification
>   function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level
>   (doesn't matter which).
>
> * In this notification function, the IOMMU driver would signal *another*
>   event (a private one). The notification function for this event would
>   be enqueued strictly at the TPL_CALLBACK level.
>
> * The notification function for the second event (private to the IOMMU)
>   would iterate over all existent mappings, and unmap them, without
>   allocating or freeing memory.
>
> The key point is that by signaling the second event, the IOMMU driver
> could order its own cleanup code after all other ExitBootServices()
> callbacks. (Assuming, at least, that no other ExitBootServices()
> callback employs the same trick to defer itself!) Therefore by the time
> the second callback runs, all PCI devices have been halted, and it is
> safe to tear down the translations.
>
> The ordering would be ensured by the following:
>
> - The UEFI spec says under CreateEventEx(), "All events are guaranteed
>   to be signaled before the first notification action is taken." This
>   means that, by the time the IOMMU's ExitBootServices() handler is
>   entered, all other ExitBootServices() handlers have been *queued* at
>   least, at TPL_CALLBACK or TPL_NOTIFY.
>
> - Therefore, when we signal our second callback from there, for
>   TPL_CALLBACK, the second callback will be queued at the end of the
>   TPL_CALLBACK queue.
>
> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is
>   functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag
>   for the Type argument of CreateEvent." So it wouldn't matter if a
>   driver used CreateEvent() or CreateEventEx() for setting up its own
>   handler, the handler would be queued just the same.
>
> I think this is ugly and brittle, but perhaps the only way to clean up
> *all* translations safely, with regard to PciIo.Unmap() +
> ExitBootServices(), without updating the UEFI spec.
>
> What do you think?
>
> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works.
> We do not need update PCI driver and we do not need update UEFI spec.
> We only need update IOMMU driver which is concerned, based upon the threat model.
> That probably is best solution. :-)

I'm very glad to hear that you like the idea.

However, it depends on whether we are permitted, by the UEFI spec, to
signal another event in an ExitBootServices() notification function.

Are we permitted to do that?

Does the UEFI spec guarantee that the notification function for the
*second* event will be queued just like it would be under "normal"
circumstances?

(I know we must not allocate or free memory in the notification function
of the *second* event either; I just want to know if the second event's
handler is *queued* like it would normally be.)

> I assume you want to handle both common buffer and read/write buffer, right?

Yes. Under this idea, all kinds of operations would be cleaned up.

> And if possible, I still have interest to get clear on the threat model for SEV in OVMF.
> If you or any SEV owner can share ...

Absolutely. Please see above.

Thank you!
Laszlo
_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>
https://lists.01.org/mailman/listinfo/edk2-devel