Thanks.
 
> That is extremely slow.  How does /proc/iomem look like?  Anything overlapping the ECAM maybe?
 
Slow and fast guests' and host's /proc/iomem outputs are attached. For the fast guest, I also included the mapping after a reboot with `pci=realloc pci=nocrs` set, since that is the config that actually allows the driver to load. I don't see any regions labeled "PCI ECAM"; I'm not sure whether that's an issue or whether the ECAM region just shows up under a different name on some configs.
 
> Could be in the guest, could also be the host (which creates the EPT page tables).
 
Noted. In any case, it seems like an upstream report is warranted. I'll probably start with the kvm and vfio-pci lists.
 
> If the dynamic mmio window would be larger it will simply use the dynamic mmio window, otherwise the PcdPciMmio64Size + PcdPciMmio64Base values
 
Understood. It seems that won't let me override the dynamic behavior then, since the window size that resulted in fast boot for us was evidently smaller than what the dynamic MMIO window computes. (I'm basing that on the fact that BARs initially fail to assign with the classic MMIO window but get assigned after the recomputation triggered by `pci=realloc`, whereas they are all assigned correctly on the first try with the slow config.)
 
> Another way to check the boot time is to hot-plug the cards after booting
 
I tried this today as well, both with and without `pci=realloc pci=nocrs`:
1. If I boot the slow config, where the GPUs are passed through at boot, I can unplug and hot-replug them, and they become usable again within a few seconds after re-plug, so that path seems normal.
2. However, if I start the VM without the GPUs attached initially, I can hot-plug them, but their BARs fail to get assigned and the devices are unusable. I tried increasing the X-PciMmio64Mb knob up to the 16TB max, booting the guest with pci=realloc pci=nocrs pci=big_root_window, and removing/rescanning via /sys/bus/pci/devices/<addr>/remove and /sys/bus/pci/rescan in the guest; none of these resulted in usable GPUs. I also tried pci=realloc,assign-busses and saw the same:
[  +0.002317] pci 0000:08:00.0: reg 0x10: [mem 0x00000000-0x00ffffff 64bit pref]
[  +0.000341] pci 0000:08:00.0: reg 0x18: [mem 0x00000000-0x1fffffffff 64bit pref]
[  +0.000266] pci 0000:08:00.0: reg 0x20: [mem 0x00000000-0x01ffffff 64bit pref]
[  +0.000424] pci 0000:08:00.0: Max Payload Size set to 128 (was 256, max 256)
[  +0.000457] pci 0000:08:00.0: Enabling HDA controller
[  +0.003461] pci 0000:08:00.0: 252.048 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x16 link at 0000:00:01.7 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[  +0.001435] pci 0000:08:00.0: BAR 2: no space for [mem size 0x2000000000 64bit pref]
[  +0.000003] pci 0000:08:00.0: BAR 2: failed to assign [mem size 0x2000000000 64bit pref]
[  +0.000002] pci 0000:08:00.0: BAR 4: assigned [mem 0x383800000000-0x383801ffffff 64bit pref]
[  +0.000263] pci 0000:08:00.0: BAR 0: assigned [mem 0x383802000000-0x383802ffffff 64bit pref]
 
With a sufficiently sized MMIO window via the OVMF knob, it looks like there is enough space for at least one GPU at the root:
80000000000-8dfffffffff : PCI Bus 0000:00
but bus 0000:08 (which is what my hotplugged GPU gets assigned to) only has 32GB space, which isn't enough for the 128GB BAR:
383800000000-383fffffffff : PCI Bus 0000:08
 
I am not sure if there's a way to force bus 0000:08 to use more of the overall MMIO window space after boot, but I assume there must be, given that its window is appropriately sized in the working configs. (Currently, the rest of the window is taken up by PCI slots that aren't in use, and even if I remove those from my libvirt config, the window for bus 0000:08 stays the same size.) Even better would be a way to trigger exactly what pci=realloc does after I hotplug, since even the memory topology of that config ends up quite different.
 
> Not directly.
 
Thinking ahead here: hypothetically, if I were to propose a patch adding a knob for this to MemDetect.c, similar to X-PciMmio64Mb, do you think it could be acceptable? The immediately viable workaround for our specific use case would be to disable PlatformDynamicMmioWindow via a QEMU option, and if this issue affects many large-BAR Nvidia GPUs, such a knob could be broadly useful until the root cause is fixed in the kernel. I already patched and tested a knob like this in a local build, and it works (and shouldn't introduce any regressions, since omitting the flag just means PDMW gets called as it does today). A rough sketch of what I have locally is at the end of this mail.
 
(This may not be necessary if I can figure out how to get the bus window resized to work with hotplug, though.)
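
For reference, here is roughly what my local change looks like. This is only a sketch: the fw_cfg name "opt/ovmf/X-DisableDynamicMmioWindow" is just a placeholder I picked, and the exact placement inside OvmfPkg/Library/PlatformInitLib/MemDetect.c may differ a bit from what I show here.

#include <Library/QemuFwCfgSimpleParserLib.h>  // already included for X-PciMmio64Mb, IIRC

//
// Returns TRUE if the (placeholder) fw_cfg knob asks us to skip the dynamic
// MMIO window and fall back to the classic PcdPciMmio64Base/Size behavior.
// It would be set from the QEMU command line with e.g.
//   -fw_cfg name=opt/ovmf/X-DisableDynamicMmioWindow,string=1
//
STATIC
BOOLEAN
PlatformDynamicMmioWindowDisabled (
  VOID
  )
{
  RETURN_STATUS  Status;
  BOOLEAN        Disabled;

  Disabled = FALSE;
  Status   = QemuFwCfgParseBool (
               "opt/ovmf/X-DisableDynamicMmioWindow",
               &Disabled
               );
  if (RETURN_ERROR (Status)) {
    //
    // Knob absent or unparsable: keep today's behavior (PDMW runs).
    //
    Disabled = FALSE;
  }

  return Disabled;
}

...and then the existing call site (in PlatformAddressWidthInitialization, if I remember the function name right) becomes:

  if (!PlatformDynamicMmioWindowDisabled ()) {
    PlatformDynamicMmioWindow (PlatformInfoHob);
  }

Omitting the -fw_cfg option leaves the flag at FALSE, so PDMW is called exactly as it is today.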
 
Thanks,
Mitchell Augustin