Hello, Mitchell.

> Thanks for the suggestion. I'm not necessarily saying this patch itself has an issue, just that it is the point in the git history at which this slow boot time issue manifests for us. This may be because the patch does actually fix the other issue I described above, related to BAR assignment not working correctly in versions before that patch, despite boot being faster back then. (In those earlier versions the PCI devices for the GPUs were passed through, but the BAR assignment was erroneous, so we couldn't actually use them - the Nvidia GPU driver would just throw errors.)

tl;dr: GPU instances need a very large MMIO aperture so the VM can map their BARs. If that aperture is too small, OVMF rejects some PCI devices during the initialisation phase. To work around this there is an opt/ovmf/X-PciMmio64Mb option that increases the 64-bit MMIO window size, and this patch adds logic that sizes that window automatically based on the number of physical address bits. As a starting point, I would run an old build of OVMF and grep its debug log for 'rejected' to make sure that no GPUs were taken out of service while OVMF was running.

> After I initially posted here, we also discovered another kernel issue that was contributing to the boot times for this config exceeding 5 minutes - so with that isolated, I can say that my config only takes about 5 minutes for a full boot: 1-2 minutes for `virsh start` (which scales with guest memory allocation), and about 2-3 minutes spent on PCIe initialization / BAR assignment for the 2 to 4 attached GPUs. This was still the case when I tried with my GPUs attached in the way you suggested. I'll attach the xml config for that and for my original VM in case I have configured something incorrectly there.

> With that said, I have a more basic question - do you expect that it should take upwards of 30 seconds after `virsh start` completes before I see any output in `virsh console`, or that PCI devices' memory window assignments in the VM should take 45-90 seconds per passed-through GPU? (Given that when the same kernel on the host initializes these devices, it doesn't take nearly this long?)

I'm not sure I can help you there, as we don't use virsh. But the Linux kernel also takes a long time to initialise NVIDIA GPUs under SeaBIOS. Another way to isolate the boot time is to hot-plug the cards after the guest has booted; I don't know how to do this in virsh. I made an expect script to emulate hot-plug:

```
#!/bin/bash
CWD="$(dirname "$(realpath "$0")")"
/usr/bin/expect <
```

> I'm going to attempt to profile ovmf next to see what part of the code path is taking up the most time, but if you already have an idea of what that might be (and whether it is actually a bug or expected to take that long), that insight would be appreciated.

We have only just started migrating from SeaBIOS to UEFI/Secure Boot, so I only know the parts of the OVMF code that handle enumeration and initialisation of PCI devices. I'm not a core edk2 developer, just someone solving the same problems with starting VMs with GPUs.
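
For reference, a minimal sketch of how the opt/ovmf/X-PciMmio64Mb knob mentioned above is typically handed to OVMF via QEMU's fw_cfg interface. The 65536 MiB value, firmware image paths, and machine options here are placeholders chosen for the example, not values taken from this thread:

```
# Sketch: ask OVMF for a larger 64-bit PCI MMIO aperture via fw_cfg.
# The value is a size in MiB; 65536 (64 GiB) is only an example, and the
# pflash paths below are assumptions about where OVMF is installed.
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -m 16G \
  -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
  -drive if=pflash,format=raw,file=OVMF_VARS.fd \
  -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
```

With libvirt, the same -fw_cfg argument can usually be passed through a <qemu:commandline> block in the domain XML rather than on a raw command line.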
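
The quoted hot-plug script above arrives cut off, so here is a rough sketch of one way to do the same thing: drive the QEMU HMP monitor with expect through socat. The monitor socket path, host PCI address, and device id are assumptions for the example, and the GPU is assumed to be already bound to vfio-pci on the host:

```
#!/bin/bash
# Sketch: hot-plug a VFIO GPU into a running VM via the QEMU HMP monitor.
# MON_SOCK and HOST_BDF are placeholders, not values from this thread.
MON_SOCK=/var/run/qemu/gpu-vm.monitor   # VM started with -monitor unix:...,server,nowait
HOST_BDF=0000:81:00.0                   # GPU already bound to vfio-pci on the host

/usr/bin/expect <<EOF
spawn socat - UNIX-CONNECT:$MON_SOCK
expect "(qemu)"
send "device_add vfio-pci,host=$HOST_BDF,id=hotplug-gpu0\r"
expect "(qemu)"
EOF
```

Under libvirt, a similar effect can normally be had with virsh attach-device on a hostdev XML snippet, or with virsh qemu-monitor-command --hmp.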