* [edk2-devel] AARCH64 Cacheability Attributes @ 2025-06-04 17:54 Oliver Smith-Denny via groups.io 2025-06-05 17:27 ` Ard Biesheuvel via groups.io 0 siblings, 1 reply; 5+ messages in thread From: Oliver Smith-Denny via groups.io @ 2025-06-04 17:54 UTC (permalink / raw) To: Ard Biesheuvel, Leif Lindholm; +Cc: devel@edk2.groups.io Hi Ard and Leif, We have been debugging an issue on an AARCH64 platform that has led us to believe a UEFI spec update with a new caching type may be needed, but we wanted to get your input before proposing that. Diving into the specific issue that led us here first: we have a platform with an XHCI controller that connects to the PCI hierarchy through NonDiscoverablePciDeviceDxe, registering as a non-coherent MMIO device, which prompts that driver to set up its PciIo structure with the noncoherent routines: https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/MdeModulePkg/Bus/Pci/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceIo.c#L1098. When attempting to PXE boot over a USB NIC, SnpDxe ends up having many different alignment faults trying to access the host DMA buffers it has set up: https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/NetworkPkg/SnpDxe/Snp.c#L364 (SnpDxe is poorly architected and we are putting up a PR to resolve some of the glaring issues, like allocating its internal driver structure in DMA-able memory). Tracking this down, this is because the non-coherent APIs exposed in the PciIo->AllocateBuffer routine of NonDiscoverablePciDeviceDxe set the attributes of the buffers to what the caller has provided or if the caller has not provided attributes (as SnpDxe does not, and can't, it doesn't know what the underlying bus is and what the cacheability state must be), it will set UC. ArmMmuLib translates UC to device memory: https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/ArmPkg/Library/ArmMmuLib/AArch64/ArmMmuLibCore.c#L451 Which causes the AP to have stricter alignment requirements (among other things) and then causes faults with various code patterns in SnpDxe and the closed source vendor provided UNDI driver, including some patterns that GCC has created where it tries to optimize writing 0 to the structure and so does unaligned writes. We explored various workarounds: setting -mstrict-align (which is a big hammer, though I don't have exact numbers), using the coherent APIs (I suspect the platform XHCI driver was copied from somewhere and that it is actually coherent on this platform), and rearchitecting parts of SnpDxe. However, while we have a solution there, this led us to the greater problem: edk2 (and the UEFI spec) were built for x86 and the cacheability attributes are x86 ones that we are attempting to shoehorn into aarch64 ones. I am guessing (though certainly the two of you would have much better knowledge here) that when ArmPkg was created, there were many such decisions made to avoid changing the rest of edk2 and just bolting Arm onto the side. Because some UC memory must be device memory, all of UC is made into device so that things "just work", until they don't. I think that having all UC being set to device memory is a recipe for more alignment faults because we have no guarantees in higher level drivers that accesses are aligned (or compiler guarantees). OSes do of course make the distinction between normal non-cacheable and device memory. My proposal is to follow this model (and the ARM ARM) and update the UEFI spec to include a new cacheability attribute, EFI_MEMORY_UC_IO (or some such name, I don't really care) which for x86 will just map to the same thing as EFI_MEMORY_UC. On aarch64, EFI_MEMORY_UC_IO maps to device memory and EFI_MEMORY_UC maps to normal non-cacheable (an aside is that EFI_MEMORY_WC probably continues to also map to normal non-cacheable). Then, drivers and the cores are updated to reflect whether they are setting attributes on true MMIO or host side DMA buffers (or whatever else). I believe this alleviates the issues with widespread device memory and the alignment faults while bringing the UEFI spec and edk2 into better alignment with the aarch64 architecture. We have explored a few other possibilities that are more of the bolted on the side variety, such as the GCD knowning what is mapped MMIO and so ArmCpuDxe can query the GCD if it sees EFI_MEMORY_UC being passed in and query if this is actually MMIO or normal memory and map accordingly, but besides being additional overhead, this feels like another band aid instead of the holistic solution of actually integrating aarch64 into the UEFI spec/edk2. Also, there is concern that because the UEFI spec defines these attributes (and the PI spec references them) we would be unable to reflect the real state of memory when returning the results of a query, etc. Furthermore, adding a heuristic for the core to follow is always a little troubling, in the end, the calling driver knows what this memory region is being used for and what the cacheability should be. Please let us know your thoughts, we may well be missing some important nuances here, but if you agree with the general direction, we'll take this to the USWG. Thanks, Oliver -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#121391): https://edk2.groups.io/g/devel/message/121391 Mute This Topic: https://groups.io/mt/113471239/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io] -=-=-=-=-=-=-=-=-=-=-=- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes 2025-06-04 17:54 [edk2-devel] AARCH64 Cacheability Attributes Oliver Smith-Denny via groups.io @ 2025-06-05 17:27 ` Ard Biesheuvel via groups.io 2025-06-05 17:55 ` Oliver Smith-Denny via groups.io 0 siblings, 1 reply; 5+ messages in thread From: Ard Biesheuvel via groups.io @ 2025-06-05 17:27 UTC (permalink / raw) To: Oliver Smith-Denny; +Cc: Leif Lindholm, devel@edk2.groups.io On Wed, 4 Jun 2025 at 19:54, Oliver Smith-Denny <osde@linux.microsoft.com> wrote: > > Hi Ard and Leif, > > We have been debugging an issue on an AARCH64 platform that has led us > to believe a UEFI spec update with a new caching type may be needed, > but we wanted to get your input before proposing that. > > Diving into the specific issue that led us here first: we have a > platform with an XHCI controller that connects to the PCI > hierarchy through NonDiscoverablePciDeviceDxe, registering as > a non-coherent MMIO device, which prompts that driver to set > up its PciIo structure with the noncoherent routines: > https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/MdeModulePkg/Bus/Pci/NonDiscoverablePciDeviceDxe/NonDiscoverablePciDeviceIo.c#L1098. > > When attempting to PXE boot over a USB NIC, SnpDxe ends up having > many different alignment faults trying to access the host DMA > buffers it has set up: > https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/NetworkPkg/SnpDxe/Snp.c#L364 > > (SnpDxe is poorly architected and we are putting up a PR to resolve > some of the glaring issues, like allocating its internal driver > structure in DMA-able memory). > > Tracking this down, this is because the non-coherent APIs exposed > in the PciIo->AllocateBuffer routine of NonDiscoverablePciDeviceDxe > set the attributes of the buffers to what the caller has provided > or if the caller has not provided attributes (as SnpDxe does not, > and can't, it doesn't know what the underlying bus is and what the > cacheability state must be), it will set UC. > > ArmMmuLib translates UC to device memory: > https://github.com/tianocore/edk2/blob/8c04bcc7ed0efbb2e3fde23f787ff0249e62f874/ArmPkg/Library/ArmMmuLib/AArch64/ArmMmuLibCore.c#L451 > > Which causes the AP to have stricter alignment requirements (among other > things) and then causes faults with various code patterns in SnpDxe and > the closed source vendor provided UNDI driver, including some patterns > that GCC has created where it tries to optimize writing 0 to the > structure and so does unaligned writes. > > We explored various workarounds: setting -mstrict-align (which is a big > hammer, though I don't have exact numbers), using the coherent APIs (I > suspect the platform XHCI driver was copied from somewhere and that it > is actually coherent on this platform), and rearchitecting parts of > SnpDxe. > > However, while we have a solution there, this led us to the greater > problem: edk2 (and the UEFI spec) were built for x86 and the > cacheability attributes are x86 ones that we are attempting to > shoehorn into aarch64 ones. I am guessing (though certainly the two > of you would have much better knowledge here) that when ArmPkg was > created, there were many such decisions made to avoid changing the rest > of edk2 and just bolting Arm onto the side. Because some UC memory must > be device memory, all of UC is made into device so that things "just > work", until they don't. I think that having all UC being set to > device memory is a recipe for more alignment faults because we have no > guarantees in higher level drivers that accesses are aligned (or > compiler guarantees). > > OSes do of course make the distinction between normal non-cacheable and > device memory. My proposal is to follow this model (and the ARM ARM) and > update the UEFI spec to include a new cacheability attribute, > EFI_MEMORY_UC_IO (or some such name, I don't really care) which for x86 > will just map to the same thing as EFI_MEMORY_UC. On aarch64, > EFI_MEMORY_UC_IO maps to device memory and EFI_MEMORY_UC maps to > normal non-cacheable (an aside is that EFI_MEMORY_WC probably continues > to also map to normal non-cacheable). Then, drivers and the cores are > updated to reflect whether they are setting attributes on true MMIO or > host side DMA buffers (or whatever else). > > I believe this alleviates the issues with widespread device memory and > the alignment faults while bringing the UEFI spec and edk2 into better > alignment with the aarch64 architecture. > > We have explored a few other possibilities that are more of the bolted > on the side variety, such as the GCD knowning what is mapped MMIO and > so ArmCpuDxe can query the GCD if it sees EFI_MEMORY_UC being passed in > and query if this is actually MMIO or normal memory and map accordingly, > but besides being additional overhead, this feels like another band aid > instead of the holistic solution of actually integrating aarch64 into > the UEFI spec/edk2. Also, there is concern that because the UEFI spec > defines these attributes (and the PI spec references them) we would be > unable to reflect the real state of memory when returning the results of > a query, etc. Furthermore, adding a heuristic for the core to follow > is always a little troubling, in the end, the calling driver knows what > this memory region is being used for and what the cacheability should > be. > > Please let us know your thoughts, we may well be missing some important > nuances here, but if you agree with the general direction, we'll take > this to the USWG. > Hi, Thanks for elaborating. First of all, I would assume that the device is actually non-coherent, or it wouldn't work at all using the non-coherent routines, given that these perform cache invalidation on buffers used for inbound DMA, which would nuke any data the device would have put there via its cacheable mapping (line 1453 in the same file). I think it was a mistake for ARM platforms to ever use UC|WC||WT|WB by default for describing conventional memory. On ARM, only WC and WB make sense for RAM, and UC should be used for MMIO (i.e., memory ranges that are not backed any kind of RAM but by device registers where both reads and writes may have side effects). Note that not only misaligned accesses are problematic, also DC ZVA instructions will fail, and these are heavily used to accelerate SetMem() Simply dropping UC from the code that declares the DRAM in the platform startup code should fix the issue entirely, afaict. (Note that ArmVirtQemu et al declare their system memory as only WB capable and nothing else - this is needed because KVM is fundamentally coherent, as the VMM uses cacheable mappings to access guest memory) Adding more memory types to the spec is not going to solve anything, I'm afraid. We already have some dubious ISA mask types that were added without proper motivation, and I'd rather have fewer types (e.g., on ARM, WT is also just normal non-cacheable under the hood in most cases and I don't have a clue how it is intended to be used). Instead, we should stop using UC for conventional memory. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#121398): https://edk2.groups.io/g/devel/message/121398 Mute This Topic: https://groups.io/mt/113471239/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io] -=-=-=-=-=-=-=-=-=-=-=- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes 2025-06-05 17:27 ` Ard Biesheuvel via groups.io @ 2025-06-05 17:55 ` Oliver Smith-Denny via groups.io 2025-06-05 18:19 ` Ard Biesheuvel via groups.io 0 siblings, 1 reply; 5+ messages in thread From: Oliver Smith-Denny via groups.io @ 2025-06-05 17:55 UTC (permalink / raw) To: devel, ardb; +Cc: Leif Lindholm On 6/5/2025 10:27 AM, Ard Biesheuvel via groups.io wrote: > Hi, > > Thanks for elaborating. > > First of all, I would assume that the device is actually non-coherent, > or it wouldn't work at all using the non-coherent routines, given that > these perform cache invalidation on buffers used for inbound DMA, > which would nuke any data the device would have put there via its > cacheable mapping (line 1453 in the same file). Well, that's the odd thing, the device works when we have it use the coherent and the non-coherent routines :). We aren't the platform team though, so I am following up with them and the silicon vendor to get a real answer here other than "well it worked." > > I think it was a mistake for ARM platforms to ever use UC|WC||WT|WB by > default for describing conventional memory. On ARM, only WC and WB > make sense for RAM, and UC should be used for MMIO (i.e., memory > ranges that are not backed any kind of RAM but by device registers > where both reads and writes may have side effects). Note that not only > misaligned accesses are problematic, also DC ZVA instructions will > fail, and these are heavily used to accelerate SetMem() I agree that it was a mistake for ARM platforms to use all the x86 cache types, definitely creates a confusing mess, but that leads me to the belief that the spec should be updated to be able to fit the architecture (which I agree, may not be another caching type, I need to think on this more). > > Simply dropping UC from the code that declares the DRAM in the > platform startup code should fix the issue entirely, afaict. (Note > that ArmVirtQemu et al declare their system memory as only WB capable > and nothing else - this is needed because KVM is fundamentally > coherent, as the VMM uses cacheable mappings to access guest memory) That's a good point and we'll test with that. This would work for this case because NonDiscoverablePciDxe toggles between WC and UC and chooses WC if UC isn't supported, but there is nothing that specifies that behavior globally. It would be valid for a driver to say I need to do non-coherent DMA (or some other use case) and to say so I will set the attributes to UC. On ARM, you would have to set that to WC, but now the driver needs to know architecture level details, which is not ideal. For x86, you would want to set UC. > > Adding more memory types to the spec is not going to solve anything, > I'm afraid. We already have some dubious ISA mask types that were > added without proper motivation, and I'd rather have fewer types > (e.g., on ARM, WT is also just normal non-cacheable under the hood in > most cases and I don't have a clue how it is intended to be used). > Instead, we should stop using UC for conventional memory. > Yeah, I agree that already the invalid cacheability types are troubling and adding more may just make the situation worse. We'll think on this some more, I still believe this is a spec problem, that the spec does not envision the ARM64 architecture and very explicitly envisions the x86 architecture, which leads to issues like this (of course in many areas). I worry about relying on code workarounds for spec issues creates an untenable position where a driver thinks it is doing the right thing, per spec, but then the juggling under the hood to figure out how to translate it to an aarch64 concept breaks down somewhere. Thanks, Oliver -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#121400): https://edk2.groups.io/g/devel/message/121400 Mute This Topic: https://groups.io/mt/113471239/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io] -=-=-=-=-=-=-=-=-=-=-=- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes 2025-06-05 17:55 ` Oliver Smith-Denny via groups.io @ 2025-06-05 18:19 ` Ard Biesheuvel via groups.io 2025-06-05 20:21 ` Oliver Smith-Denny via groups.io 0 siblings, 1 reply; 5+ messages in thread From: Ard Biesheuvel via groups.io @ 2025-06-05 18:19 UTC (permalink / raw) To: Oliver Smith-Denny; +Cc: devel, Leif Lindholm On Thu, 5 Jun 2025 at 19:55, Oliver Smith-Denny <osde@linux.microsoft.com> wrote: > > On 6/5/2025 10:27 AM, Ard Biesheuvel via groups.io wrote: > > Hi, ... > > > > Simply dropping UC from the code that declares the DRAM in the > > platform startup code should fix the issue entirely, afaict. (Note > > that ArmVirtQemu et al declare their system memory as only WB capable > > and nothing else - this is needed because KVM is fundamentally > > coherent, as the VMM uses cacheable mappings to access guest memory) > > That's a good point and we'll test with that. This would work for this > case because NonDiscoverablePciDxe toggles between WC and UC and chooses > WC if UC isn't supported, but there is nothing that specifies that > behavior globally. It would be valid for a driver to say I need to do > non-coherent DMA (or some other use case) and to say so I will set the > attributes to UC. On ARM, you would have to set that to WC, but now the > driver needs to know architecture level details, which is not ideal. For > x86, you would want to set UC. > No. Whether DMA is coherent or not is /not/ a property of the device, it is a property of how that device was integrated into the system. So a device driver should never have to reason about the difference between coherent or non-coherent DMA, unless it is tightly coupled with the platform, in which case this would not be an issue. Note that the NonCoherentDmaLib in EmbeddedPkg already prefers WC over UC. I don't remember why I didn't do the same when I wrote NonDiscoverablePciDxe, because it is the obvious correct thing to do on ARM (and x86 doesn't use this driver at all afaik) > > > > Adding more memory types to the spec is not going to solve anything, > > I'm afraid. We already have some dubious ISA mask types that were > > added without proper motivation, and I'd rather have fewer types > > (e.g., on ARM, WT is also just normal non-cacheable under the hood in > > most cases and I don't have a clue how it is intended to be used). > > Instead, we should stop using UC for conventional memory. > > > > Yeah, I agree that already the invalid cacheability types are troubling > and adding more may just make the situation worse. We'll think on this > some more, I still believe this is a spec problem, that the spec does > not envision the ARM64 architecture and very explicitly envisions the > x86 architecture, which leads to issues like this (of course in many > areas). > > I worry about relying on code workarounds for spec issues creates an > untenable position where a driver thinks it is doing the right thing, > per spec, but then the juggling under the hood to figure out how to > translate it to an aarch64 concept breaks down somewhere. > The driver should not care about the difference. If there are currently drivers that need to, we should fix the abstractions instead. -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#121401): https://edk2.groups.io/g/devel/message/121401 Mute This Topic: https://groups.io/mt/113471239/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io] -=-=-=-=-=-=-=-=-=-=-=- ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [edk2-devel] AARCH64 Cacheability Attributes 2025-06-05 18:19 ` Ard Biesheuvel via groups.io @ 2025-06-05 20:21 ` Oliver Smith-Denny via groups.io 0 siblings, 0 replies; 5+ messages in thread From: Oliver Smith-Denny via groups.io @ 2025-06-05 20:21 UTC (permalink / raw) To: devel, ardb; +Cc: Leif Lindholm On 6/5/2025 11:19 AM, Ard Biesheuvel via groups.io wrote: > On Thu, 5 Jun 2025 at 19:55, Oliver Smith-Denny > <osde@linux.microsoft.com> wrote: >> That's a good point and we'll test with that. This would work for this >> case because NonDiscoverablePciDxe toggles between WC and UC and chooses >> WC if UC isn't supported, but there is nothing that specifies that >> behavior globally. It would be valid for a driver to say I need to do >> non-coherent DMA (or some other use case) and to say so I will set the >> attributes to UC. On ARM, you would have to set that to WC, but now the >> driver needs to know architecture level details, which is not ideal. For >> x86, you would want to set UC. >> > > No. Whether DMA is coherent or not is /not/ a property of the device, > it is a property of how that device was integrated into the system. I agree with that, I was thinking more of the PciNonDiscoverable case, but more on that below. > > So a device driver should never have to reason about the difference > between coherent or non-coherent DMA, unless it is tightly coupled > with the platform, in which case this would not be an issue. > > Note that the NonCoherentDmaLib in EmbeddedPkg already prefers WC over > UC. I don't remember why I didn't do the same when I wrote > NonDiscoverablePciDxe, because it is the obvious correct thing to do > on ARM (and x86 doesn't use this driver at all afaik) > I think you are right here, that we should move to prefer WC over UC in NonDiscoverablePciDxe since these are host buffers and may need to be normal uncacheable but should never be device memory (so to that end, should it only set WC and not UC at all?). > >>> >>> Adding more memory types to the spec is not going to solve anything, >>> I'm afraid. We already have some dubious ISA mask types that were >>> added without proper motivation, and I'd rather have fewer types >>> (e.g., on ARM, WT is also just normal non-cacheable under the hood in >>> most cases and I don't have a clue how it is intended to be used). >>> Instead, we should stop using UC for conventional memory. >>> >> >> Yeah, I agree that already the invalid cacheability types are troubling >> and adding more may just make the situation worse. We'll think on this >> some more, I still believe this is a spec problem, that the spec does >> not envision the ARM64 architecture and very explicitly envisions the >> x86 architecture, which leads to issues like this (of course in many >> areas). >> >> I worry about relying on code workarounds for spec issues creates an >> untenable position where a driver thinks it is doing the right thing, >> per spec, but then the juggling under the hood to figure out how to >> translate it to an aarch64 concept breaks down somewhere. >> > > The driver should not care about the difference. If there are > currently drivers that need to, we should fix the abstractions > instead. > I will go through edk2 and see if anything is breaking this assumption, I agree that the platform is the only one who knows the cacheability. I still think it is unfortunate that we have to do some mapping between x86 attributes and aarch64 attributes, but if we can ensure edk2 is clean w.r.t. class drivers not setting cacheability attributes, I think we can get away with it. If you are amenable to it, I'll put up a PR for NonDiscoverablePciDxe to prefer WC over UC (or someone on my team will), though want your input on whether we allow UC at all there. Thanks, Oliver -=-=-=-=-=-=-=-=-=-=-=- Groups.io Links: You receive all messages sent to this group. View/Reply Online (#121402): https://edk2.groups.io/g/devel/message/121402 Mute This Topic: https://groups.io/mt/113471239/7686176 Group Owner: devel+owner@edk2.groups.io Unsubscribe: https://edk2.groups.io/g/devel/unsub [rebecca@openfw.io] -=-=-=-=-=-=-=-=-=-=-=- ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-06-05 20:21 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-06-04 17:54 [edk2-devel] AARCH64 Cacheability Attributes Oliver Smith-Denny via groups.io 2025-06-05 17:27 ` Ard Biesheuvel via groups.io 2025-06-05 17:55 ` Oliver Smith-Denny via groups.io 2025-06-05 18:19 ` Ard Biesheuvel via groups.io 2025-06-05 20:21 ` Oliver Smith-Denny via groups.io
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox