@Gerd:
> Do you also see the slowdown without the GPU in a otherwise identical guest configuration?
> Looks quite high to me. What amount of guest memory we are talking about?
It is a pretty large memory allocation - over 900 GB - so I'm not surprised that the initial allocation during `virsh start` takes a while when PCIe devices are passed through, since that allocation has to happen at init time. `virsh start` also takes the same amount of time with or without the dynamic MMIO window size patch; its time does scale with the amount of memory allocated, although I expect that, given that the time-consuming part is just the memory allocation itself.
> More details would be helpful indeed. Is that a general overall slowdown? Is it some specific part which takes alot of time?
The part of the kernel boot that I highlighted in https://edk2.groups.io/g/devel/attachment/120801/2/this-part-takes-2-3-minutes.txt (which I think is PCIe device initialization and BAR assignment) is the part that seems slower than it should be. Each section of that log starting with "acpiphp: Slot <slot> registered" takes probably 15 seconds, so the whole thing adds up to a few minutes. That part also does not scale with the amount of memory allocated, just with the number of GPUs passed through (in this log I had 4 GPUs attached, IIRC).
Without the dynamic MMIO window size patch, if I set my guest kernel to use `pci=nocrs pci=realloc`, this boot slowdown disappears and I am able to use the GPUs, with some conditions (details below).
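As an aside, my rough mental model of the dynamic MMIO window size patch is simply "grow the 64-bit MMIO window with the address space implied by the guest's phys-bits". The sketch below illustrates that idea only; the 1/8 fraction and the 32 GiB floor are my assumptions for illustration, not the actual OVMF code:

```python
# Rough sketch, NOT the actual OVMF code: size the 64-bit PCI MMIO window
# from the guest CPU's physical address width instead of using a small
# fixed-size window. The 1/8 fraction and 32 GiB floor are assumptions.
GIB = 1 << 30

def mmio64_window_size(phys_bits: int, default_size: int = 32 * GIB) -> int:
    addr_space = 1 << phys_bits        # total guest-physical address space
    size = addr_space // 8             # reserve a fraction of it for 64-bit BARs
    return max(size, default_size)

for bits in (36, 40, 46):
    print(f"phys-bits={bits}: 64-bit MMIO window ~{mmio64_window_size(bits) // GIB} GiB")
```

With a large phys-bits value the window ends up far bigger than a small fixed default, which is consistent with the "no space" BAR errors further down only showing up when the patch is reverted.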
@xpahos:
> This patch adds functionality that automatically adjusts the MMIO size based on the number of physical bits. As a starting point, I would try running an old build of OVMF and running grep on 'rejected' to make sure that no GPUs were taken out of service while OVMF was running.
I haven't looked for this in the OVMF debug output, but what you say seems plausible, given that my VMs without the dynamic MMIO window size patch throw many errors like this during guest kernel boot:
[ 4.650955] pci 0000:00:01.5: BAR 15: no space for [mem size 0x3000000000 64bit pref]
[ 4.651700] pci 0000:00:01.5: BAR 15: failed to assign [mem size 0x3000000000 64bit pref]
(and subsequently the GPUs are not usable in the VMs, although the PCI devices are still present). So it would make sense if the fast boot time in those versions simply comes from the kernel "giving up" on all of those BARs right away, before the slow path starts. The only confusing part to me, then, is why I do not see this part going so slowly when I use a version of OVMF with the dynamic MMIO window size patch reverted but with my guest kernel booted with `pci=realloc pci=nocrs`. Under those circumstances, I get a fast boot time and my passed-through GPUs work. I do still see some outputs like this during Linux boot:
[ 4.592009] pci 0000:06:00.0: can't claim BAR 0 [mem 0xffffffffff000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
[ 4.593477] pci 0000:06:00.0: can't claim BAR 2 [mem 0xffffffe000000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
[ 4.593817] pci 0000:06:00.0: can't claim BAR 4 [mem 0xfffffffffe000000-0xffffffffffffffff 64bit pref]: no compatible bridge window
and sometimes the loading of the Nvidia driver does introduce some brief lockups.
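For what it's worth, just decoding the hex in those log lines (nothing here beyond arithmetic on the values the kernel printed):

```python
# Decode the BAR/window sizes from the kernel log lines quoted above.
GIB, MIB = 1 << 30, 1 << 20

def region_size(start: int, end: int) -> int:
    return end - start + 1

print(f"bridge window (BAR 15): {0x3000000000 // GIB} GiB")
print(f"GPU BAR 0: {region_size(0xffffffffff000000, 0xffffffffffffffff) // MIB} MiB")
print(f"GPU BAR 2: {region_size(0xffffffe000000000, 0xffffffffffffffff) // GIB} GiB")
print(f"GPU BAR 4: {region_size(0xfffffffffe000000, 0xffffffffffffffff) // MIB} MiB")
```

So each GPU's bridge wants a 192 GiB prefetchable window, and with 4 GPUs that is roughly 768 GiB of 64-bit MMIO space. The "can't claim" lines look to me like BARs that were simply left unassigned (hence the near-all-ones base addresses), which I assume is why `pci=realloc` is able to fix them up.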
> But the linux kernel also takes a long time to initialise NVIDIA GPU using SeaBIOS
This is good to know... given this and the above, I'm starting to wonder if it might actually be a kernel issue...