* [edk2-devel] AARCH64 Cacheability Attributes
@ 2025-06-04 17:54 Oliver Smith-Denny via groups.io
2025-06-05 17:27 ` Ard Biesheuvel via groups.io
0 siblings, 1 reply; 5+ messages in thread
From: Oliver Smith-Denny via groups.io @ 2025-06-04 17:54 UTC (permalink / raw)
To: Ard Biesheuvel, Leif Lindholm; +Cc: devel@edk2.groups.io
Hi Ard and Leif,
We have been debugging an issue on an AARCH64 platform that has led us
to believe a UEFI spec update with a new caching type may be needed,
but we wanted to get your input before proposing that.
Diving into the specific issue that led us here first: we have a
platform with an XHCI controller that connects to the PCI
hierarchy through NonDiscoverablePciDeviceDxe, registering as
a non-coherent MMIO device, which prompts that driver to set
up its PciIo structure with the noncoherent routines:
https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/MdeModulePkg/Bus/Pci/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceIo.c#L1098.
When attempting to PXE boot over a USB NIC, SnpDxe ends up having
many different alignment faults trying to access the host DMA
buffers it has set up:
https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/NetworkPkg/SnpDxe/Snp.c#L364
(SnpDxe is poorly architected and we are putting up a PR to resolve
some of the glaring issues, like allocating its internal driver
structure in DMA-able memory).
Tracking this down, this is because the non-coherent APIs exposed
in the PciIo->AllocateBuffer routine of NonDiscoverablePciDeviceDxe
set the attributes of the buffers to what the caller has provided
or if the caller has not provided attributes (as SnpDxe does not,
and can't, it doesn't know what the underlying bus is and what the
cacheability state must be), it will set UC.
ArmMmuLib translates UC to device memory:
https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/ArmPkg/Library/ArmMmuLib/AArch64/ArmMmuLibCore.c#L451
Which causes the AP to have stricter alignment requirements (among other
things) and then causes faults with various code patterns in SnpDxe and
the closed source vendor provided UNDI driver, including some patterns
that GCC has created where it tries to optimize writing 0 to the
structure and so does unaligned writes.
We explored various workarounds: setting -mstrict-align (which is a big
hammer, though I don't have exact numbers), using the coherent APIs (I
suspect the platform XHCI driver was copied from somewhere and that it
is actually coherent on this platform), and rearchitecting parts of
SnpDxe.
However, while we have a solution there, this led us to the greater
problem: edk2 (and the UEFI spec) were built for x86 and the
cacheability attributes are x86 ones that we are attempting to
shoehorn into aarch64 ones. I am guessing (though certainly the two
of you would have much better knowledge here) that when ArmPkg was
created, there were many such decisions made to avoid changing the rest
of edk2 and just bolting Arm onto the side. Because some UC memory must
be device memory, all of UC is made into device so that things "just
work", until they don't. I think that having all UC being set to
device memory is a recipe for more alignment faults because we have no
guarantees in higher level drivers that accesses are aligned (or
compiler guarantees).
OSes do of course make the distinction between normal non-cacheable and
device memory. My proposal is to follow this model (and the ARM ARM) and
update the UEFI spec to include a new cacheability attribute,
EFI_MEMORY_UC_IO (or some such name, I don't really care) which for x86
will just map to the same thing as EFI_MEMORY_UC. On aarch64,
EFI_MEMORY_UC_IO maps to device memory and EFI_MEMORY_UC maps to
normal non-cacheable (an aside is that EFI_MEMORY_WC probably continues
to also map to normal non-cacheable). Then, drivers and the cores are
updated to reflect whether they are setting attributes on true MMIO or
host side DMA buffers (or whatever else).
I believe this alleviates the issues with widespread device memory and
the alignment faults while bringing the UEFI spec and edk2 into better
alignment with the aarch64 architecture.
We have explored a few other possibilities that are more of the bolted
on the side variety, such as the GCD knowning what is mapped MMIO and
so ArmCpuDxe can query the GCD if it sees EFI_MEMORY_UC being passed in
and query if this is actually MMIO or normal memory and map accordingly,
but besides being additional overhead, this feels like another band aid
instead of the holistic solution of actually integrating aarch64 into
the UEFI spec/edk2. Also, there is concern that because the UEFI spec
defines these attributes (and the PI spec references them) we would be
unable to reflect the real state of memory when returning the results of
a query, etc. Furthermore, adding a heuristic for the core to follow
is always a little troubling, in the end, the calling driver knows what
this memory region is being used for and what the cacheability should
be.
Please let us know your thoughts, we may well be missing some important
nuances here, but if you agree with the general direction, we'll take
this to the USWG.
Thanks,
Oliver
-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#121391): https://edk2.groups.io/g/devel/message/121391
Mute This Topic: https://groups.io/mt/113471239/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes
2025-06-04 17:54 [edk2-devel] AARCH64 Cacheability Attributes Oliver Smith-Denny via groups.io
@ 2025-06-05 17:27 ` Ard Biesheuvel via groups.io
2025-06-05 17:55 ` Oliver Smith-Denny via groups.io
0 siblings, 1 reply; 5+ messages in thread
From: Ard Biesheuvel via groups.io @ 2025-06-05 17:27 UTC (permalink / raw)
To: Oliver Smith-Denny; +Cc: Leif Lindholm, devel@edk2.groups.io
On Wed, 4 Jun 2025 at 19:54, Oliver Smith-Denny
<osde@linux.microsoft.com> wrote:
>
> Hi Ard and Leif,
>
> We have been debugging an issue on an AARCH64 platform that has led us
> to believe a UEFI spec update with a new caching type may be needed,
> but we wanted to get your input before proposing that.
>
> Diving into the specific issue that led us here first: we have a
> platform with an XHCI controller that connects to the PCI
> hierarchy through NonDiscoverablePciDeviceDxe, registering as
> a non-coherent MMIO device, which prompts that driver to set
> up its PciIo structure with the noncoherent routines:
> https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/MdeModulePkg/Bus/Pci/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceIo.c#L1098.
>
> When attempting to PXE boot over a USB NIC, SnpDxe ends up having
> many different alignment faults trying to access the host DMA
> buffers it has set up:
> https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/NetworkPkg/SnpDxe/Snp.c#L364
>
> (SnpDxe is poorly architected and we are putting up a PR to resolve
> some of the glaring issues, like allocating its internal driver
> structure in DMA-able memory).
>
> Tracking this down, this is because the non-coherent APIs exposed
> in the PciIo->AllocateBuffer routine of NonDiscoverablePciDeviceDxe
> set the attributes of the buffers to what the caller has provided
> or if the caller has not provided attributes (as SnpDxe does not,
> and can't, it doesn't know what the underlying bus is and what the
> cacheability state must be), it will set UC.
>
> ArmMmuLib translates UC to device memory:
> https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/ArmPkg/Library/ArmMmuLib/AArch64/ArmMmuLibCore.c#L451
>
> Which causes the AP to have stricter alignment requirements (among other
> things) and then causes faults with various code patterns in SnpDxe and
> the closed source vendor provided UNDI driver, including some patterns
> that GCC has created where it tries to optimize writing 0 to the
> structure and so does unaligned writes.
>
> We explored various workarounds: setting -mstrict-align (which is a big
> hammer, though I don't have exact numbers), using the coherent APIs (I
> suspect the platform XHCI driver was copied from somewhere and that it
> is actually coherent on this platform), and rearchitecting parts of
> SnpDxe.
>
> However, while we have a solution there, this led us to the greater
> problem: edk2 (and the UEFI spec) were built for x86 and the
> cacheability attributes are x86 ones that we are attempting to
> shoehorn into aarch64 ones. I am guessing (though certainly the two
> of you would have much better knowledge here) that when ArmPkg was
> created, there were many such decisions made to avoid changing the rest
> of edk2 and just bolting Arm onto the side. Because some UC memory must
> be device memory, all of UC is made into device so that things "just
> work", until they don't. I think that having all UC being set to
> device memory is a recipe for more alignment faults because we have no
> guarantees in higher level drivers that accesses are aligned (or
> compiler guarantees).
>
> OSes do of course make the distinction between normal non-cacheable and
> device memory. My proposal is to follow this model (and the ARM ARM) and
> update the UEFI spec to include a new cacheability attribute,
> EFI_MEMORY_UC_IO (or some such name, I don't really care) which for x86
> will just map to the same thing as EFI_MEMORY_UC. On aarch64,
> EFI_MEMORY_UC_IO maps to device memory and EFI_MEMORY_UC maps to
> normal non-cacheable (an aside is that EFI_MEMORY_WC probably continues
> to also map to normal non-cacheable). Then, drivers and the cores are
> updated to reflect whether they are setting attributes on true MMIO or
> host side DMA buffers (or whatever else).
>
> I believe this alleviates the issues with widespread device memory and
> the alignment faults while bringing the UEFI spec and edk2 into better
> alignment with the aarch64 architecture.
>
> We have explored a few other possibilities that are more of the bolted
> on the side variety, such as the GCD knowning what is mapped MMIO and
> so ArmCpuDxe can query the GCD if it sees EFI_MEMORY_UC being passed in
> and query if this is actually MMIO or normal memory and map accordingly,
> but besides being additional overhead, this feels like another band aid
> instead of the holistic solution of actually integrating aarch64 into
> the UEFI spec/edk2. Also, there is concern that because the UEFI spec
> defines these attributes (and the PI spec references them) we would be
> unable to reflect the real state of memory when returning the results of
> a query, etc. Furthermore, adding a heuristic for the core to follow
> is always a little troubling, in the end, the calling driver knows what
> this memory region is being used for and what the cacheability should
> be.
>
> Please let us know your thoughts, we may well be missing some important
> nuances here, but if you agree with the general direction, we'll take
> this to the USWG.
>
Hi,
Thanks for elaborating.
First of all, I would assume that the device is actually non-coherent,
or it wouldn't work at all using the non-coherent routines, given that
these perform cache invalidation on buffers used for inbound DMA,
which would nuke any data the device would have put there via its
cacheable mapping (line 1453 in the same file).
I think it was a mistake for ARM platforms to ever use UC|WC||WT|WB by
default for describing conventional memory. On ARM, only WC and WB
make sense for RAM, and UC should be used for MMIO (i.e., memory
ranges that are not backed any kind of RAM but by device registers
where both reads and writes may have side effects). Note that not only
misaligned accesses are problematic, also DC ZVA instructions will
fail, and these are heavily used to accelerate SetMem()
Simply dropping UC from the code that declares the DRAM in the
platform startup code should fix the issue entirely, afaict. (Note
that ArmVirtQemu et al declare their system memory as only WB capable
and nothing else - this is needed because KVM is fundamentally
coherent, as the VMM uses cacheable mappings to access guest memory)
Adding more memory types to the spec is not going to solve anything,
I'm afraid. We already have some dubious ISA mask types that were
added without proper motivation, and I'd rather have fewer types
(e.g., on ARM, WT is also just normal non-cacheable under the hood in
most cases and I don't have a clue how it is intended to be used).
Instead, we should stop using UC for conventional memory.
-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#121398): https://edk2.groups.io/g/devel/message/121398
Mute This Topic: https://groups.io/mt/113471239/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes
2025-06-05 17:27 ` Ard Biesheuvel via groups.io
@ 2025-06-05 17:55 ` Oliver Smith-Denny via groups.io
2025-06-05 18:19 ` Ard Biesheuvel via groups.io
0 siblings, 1 reply; 5+ messages in thread
From: Oliver Smith-Denny via groups.io @ 2025-06-05 17:55 UTC (permalink / raw)
To: devel, ardb; +Cc: Leif Lindholm
On 6/5/2025 10:27 AM, Ard Biesheuvel via groups.io wrote:
> Hi,
>
> Thanks for elaborating.
>
> First of all, I would assume that the device is actually non-coherent,
> or it wouldn't work at all using the non-coherent routines, given that
> these perform cache invalidation on buffers used for inbound DMA,
> which would nuke any data the device would have put there via its
> cacheable mapping (line 1453 in the same file).
Well, that's the odd thing, the device works when we have it use the
coherent and the non-coherent routines :). We aren't the platform team
though, so I am following up with them and the silicon vendor to get a
real answer here other than "well it worked."
>
> I think it was a mistake for ARM platforms to ever use UC|WC||WT|WB by
> default for describing conventional memory. On ARM, only WC and WB
> make sense for RAM, and UC should be used for MMIO (i.e., memory
> ranges that are not backed any kind of RAM but by device registers
> where both reads and writes may have side effects). Note that not only
> misaligned accesses are problematic, also DC ZVA instructions will
> fail, and these are heavily used to accelerate SetMem()
I agree that it was a mistake for ARM platforms to use all the x86
cache types, definitely creates a confusing mess, but that leads me
to the belief that the spec should be updated to be able to fit the
architecture (which I agree, may not be another caching type, I need
to think on this more).
>
> Simply dropping UC from the code that declares the DRAM in the
> platform startup code should fix the issue entirely, afaict. (Note
> that ArmVirtQemu et al declare their system memory as only WB capable
> and nothing else - this is needed because KVM is fundamentally
> coherent, as the VMM uses cacheable mappings to access guest memory)
That's a good point and we'll test with that. This would work for this
case because NonDiscoverablePciDxe toggles between WC and UC and chooses
WC if UC isn't supported, but there is nothing that specifies that
behavior globally. It would be valid for a driver to say I need to do
non-coherent DMA (or some other use case) and to say so I will set the
attributes to UC. On ARM, you would have to set that to WC, but now the
driver needs to know architecture level details, which is not ideal. For
x86, you would want to set UC.
>
> Adding more memory types to the spec is not going to solve anything,
> I'm afraid. We already have some dubious ISA mask types that were
> added without proper motivation, and I'd rather have fewer types
> (e.g., on ARM, WT is also just normal non-cacheable under the hood in
> most cases and I don't have a clue how it is intended to be used).
> Instead, we should stop using UC for conventional memory.
>
Yeah, I agree that already the invalid cacheability types are troubling
and adding more may just make the situation worse. We'll think on this
some more, I still believe this is a spec problem, that the spec does
not envision the ARM64 architecture and very explicitly envisions the
x86 architecture, which leads to issues like this (of course in many
areas).
I worry about relying on code workarounds for spec issues creates an
untenable position where a driver thinks it is doing the right thing,
per spec, but then the juggling under the hood to figure out how to
translate it to an aarch64 concept breaks down somewhere.
Thanks,
Oliver
-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#121400): https://edk2.groups.io/g/devel/message/121400
Mute This Topic: https://groups.io/mt/113471239/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes
2025-06-05 17:55 ` Oliver Smith-Denny via groups.io
@ 2025-06-05 18:19 ` Ard Biesheuvel via groups.io
2025-06-05 20:21 ` Oliver Smith-Denny via groups.io
0 siblings, 1 reply; 5+ messages in thread
From: Ard Biesheuvel via groups.io @ 2025-06-05 18:19 UTC (permalink / raw)
To: Oliver Smith-Denny; +Cc: devel, Leif Lindholm
On Thu, 5 Jun 2025 at 19:55, Oliver Smith-Denny
<osde@linux.microsoft.com> wrote:
>
> On 6/5/2025 10:27 AM, Ard Biesheuvel via groups.io wrote:
> > Hi,
...
> >
> > Simply dropping UC from the code that declares the DRAM in the
> > platform startup code should fix the issue entirely, afaict. (Note
> > that ArmVirtQemu et al declare their system memory as only WB capable
> > and nothing else - this is needed because KVM is fundamentally
> > coherent, as the VMM uses cacheable mappings to access guest memory)
>
> That's a good point and we'll test with that. This would work for this
> case because NonDiscoverablePciDxe toggles between WC and UC and chooses
> WC if UC isn't supported, but there is nothing that specifies that
> behavior globally. It would be valid for a driver to say I need to do
> non-coherent DMA (or some other use case) and to say so I will set the
> attributes to UC. On ARM, you would have to set that to WC, but now the
> driver needs to know architecture level details, which is not ideal. For
> x86, you would want to set UC.
>
No. Whether DMA is coherent or not is /not/ a property of the device,
it is a property of how that device was integrated into the system.
So a device driver should never have to reason about the difference
between coherent or non-coherent DMA, unless it is tightly coupled
with the platform, in which case this would not be an issue.
Note that the NonCoherentDmaLib in EmbeddedPkg already prefers WC over
UC. I don't remember why I didn't do the same when I wrote
NonDiscoverablePciDxe, because it is the obvious correct thing to do
on ARM (and x86 doesn't use this driver at all afaik)
> >
> > Adding more memory types to the spec is not going to solve anything,
> > I'm afraid. We already have some dubious ISA mask types that were
> > added without proper motivation, and I'd rather have fewer types
> > (e.g., on ARM, WT is also just normal non-cacheable under the hood in
> > most cases and I don't have a clue how it is intended to be used).
> > Instead, we should stop using UC for conventional memory.
> >
>
> Yeah, I agree that already the invalid cacheability types are troubling
> and adding more may just make the situation worse. We'll think on this
> some more, I still believe this is a spec problem, that the spec does
> not envision the ARM64 architecture and very explicitly envisions the
> x86 architecture, which leads to issues like this (of course in many
> areas).
>
> I worry about relying on code workarounds for spec issues creates an
> untenable position where a driver thinks it is doing the right thing,
> per spec, but then the juggling under the hood to figure out how to
> translate it to an aarch64 concept breaks down somewhere.
>
The driver should not care about the difference. If there are
currently drivers that need to, we should fix the abstractions
instead.
-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#121401): https://edk2.groups.io/g/devel/message/121401
Mute This Topic: https://groups.io/mt/113471239/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes
2025-06-05 18:19 ` Ard Biesheuvel via groups.io
@ 2025-06-05 20:21 ` Oliver Smith-Denny via groups.io
0 siblings, 0 replies; 5+ messages in thread
From: Oliver Smith-Denny via groups.io @ 2025-06-05 20:21 UTC (permalink / raw)
To: devel, ardb; +Cc: Leif Lindholm
On 6/5/2025 11:19 AM, Ard Biesheuvel via groups.io wrote:
> On Thu, 5 Jun 2025 at 19:55, Oliver Smith-Denny
> <osde@linux.microsoft.com> wrote:
>> That's a good point and we'll test with that. This would work for this
>> case because NonDiscoverablePciDxe toggles between WC and UC and chooses
>> WC if UC isn't supported, but there is nothing that specifies that
>> behavior globally. It would be valid for a driver to say I need to do
>> non-coherent DMA (or some other use case) and to say so I will set the
>> attributes to UC. On ARM, you would have to set that to WC, but now the
>> driver needs to know architecture level details, which is not ideal. For
>> x86, you would want to set UC.
>>
>
> No. Whether DMA is coherent or not is /not/ a property of the device,
> it is a property of how that device was integrated into the system.
I agree with that, I was thinking more of the PciNonDiscoverable case,
but more on that below.
>
> So a device driver should never have to reason about the difference
> between coherent or non-coherent DMA, unless it is tightly coupled
> with the platform, in which case this would not be an issue.
>
> Note that the NonCoherentDmaLib in EmbeddedPkg already prefers WC over
> UC. I don't remember why I didn't do the same when I wrote
> NonDiscoverablePciDxe, because it is the obvious correct thing to do
> on ARM (and x86 doesn't use this driver at all afaik)
>
I think you are right here, that we should move to prefer WC over UC in
NonDiscoverablePciDxe since these are host buffers and may need to be
normal uncacheable but should never be device memory (so to that end,
should it only set WC and not UC at all?).
>
>>>
>>> Adding more memory types to the spec is not going to solve anything,
>>> I'm afraid. We already have some dubious ISA mask types that were
>>> added without proper motivation, and I'd rather have fewer types
>>> (e.g., on ARM, WT is also just normal non-cacheable under the hood in
>>> most cases and I don't have a clue how it is intended to be used).
>>> Instead, we should stop using UC for conventional memory.
>>>
>>
>> Yeah, I agree that already the invalid cacheability types are troubling
>> and adding more may just make the situation worse. We'll think on this
>> some more, I still believe this is a spec problem, that the spec does
>> not envision the ARM64 architecture and very explicitly envisions the
>> x86 architecture, which leads to issues like this (of course in many
>> areas).
>>
>> I worry about relying on code workarounds for spec issues creates an
>> untenable position where a driver thinks it is doing the right thing,
>> per spec, but then the juggling under the hood to figure out how to
>> translate it to an aarch64 concept breaks down somewhere.
>>
>
> The driver should not care about the difference. If there are
> currently drivers that need to, we should fix the abstractions
> instead.
>
I will go through edk2 and see if anything is breaking this assumption,
I agree that the platform is the only one who knows the cacheability.
I still think it is unfortunate that we have to do some mapping between
x86 attributes and aarch64 attributes, but if we can ensure edk2 is
clean w.r.t. class drivers not setting cacheability attributes, I think
we can get away with it.
If you are amenable to it, I'll put up a PR for NonDiscoverablePciDxe to
prefer WC over UC (or someone on my team will), though want your input
on whether we allow UC at all there.
Thanks,
Oliver
-=-=-=-=-=-=-=-=-=-=-=-
Groups.io Links: You receive all messages sent to this group.
View/Reply Online (#121402): https://edk2.groups.io/g/devel/message/121402
Mute This Topic: https://groups.io/mt/113471239/7686176
Group Owner: devel+owner@edk2.groups.io
Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io]
-=-=-=-=-=-=-=-=-=-=-=-
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-06-05 20:21 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04 17:54 [edk2-devel] AARCH64 Cacheability Attributes Oliver Smith-Denny via groups.io
2025-06-05 17:27 ` Ard Biesheuvel via groups.io
2025-06-05 17:55 ` Oliver Smith-Denny via groups.io
2025-06-05 18:19 ` Ard Biesheuvel via groups.io
2025-06-05 20:21 ` Oliver Smith-Denny via groups.io
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox