On Thu, Mar 2, 2023 at 11:50 AM Ard Biesheuvel wrote:

> On Thu, 9 Feb 2023 at 16:15, Ard Biesheuvel wrote:
> >
> > On Tue, 7 Feb 2023 at 13:58, Oliver Steffen wrote:
> > >
> > > On Tue, Feb 7, 2023 at 12:57 PM Ard Biesheuvel wrote:
> > >>
> > >> On Tue, 7 Feb 2023 at 11:51, Oliver Steffen wrote:
> > >> >
> > >> > On Thu, Feb 2, 2023 at 12:09 PM Oliver Steffen wrote:
> > >> >>
> > >> >>
> > >> >> On Wed, Feb 1, 2023 at 2:29 PM Ard Biesheuvel wrote:
> > >> >>>
> > >> >>> On Wed, 1 Feb 2023 at 13:59, Oliver Steffen wrote:
> > >> >>> >
> > >> >>> > On Wed, Feb 1, 2023 at 12:52 PM Ard Biesheuvel wrote:
> > >> >>> >>
> > >> >>> >> On Wed, 1 Feb 2023 at 10:14, Oliver Steffen <osteffen@redhat.com> wrote:
> > >> >>> >> >
> > >> >>
> > >> >> [...]
> > >> >>>
> > >> >>> >> > I am sorry, this story does not seem to be over yet.
> > >> >>> >> >
> > >> >>> >> > We are using the Erratum patch and also included the commit 406504c7 in
> > >> >>> >> > the kernel.
> > >> >>> >> > Now the firmware crashes sometimes (10 out of 89 tests).
> > >> >>> >> >
> > >> >>> >>
> > >> >>> >> Thanks for the report. Is this still on ThunderX2?
> > >> >>> >>
> > >> >>> >> > Any hints are very welcome!
> > >> >>> >> >
> > >> >>> >>
> > >> >>> >> Do you have access to those build artifacts?
> > >> >>> >
> > >> >>> >
> > >> >>> > https://kojihub.stream.centos.org/kojifiles/work/tasks/5251/1835251/edk2-aarch64-20221207gitfff6d81270b5-4.el9.test.noarch.rpm
> > >> >>> >
> > >> >>> > and/or here:
> > >> >>> >
> > >> >>> > https://kojihub.stream.centos.org/koji/taskinfo?taskID=1835251
> > >> >>> >
> > >> >>> > Source for reference:
> > >> >>> > https://gitlab.com/redhat/centos-stream/src/edk2/-/merge_requests/24
> > >> >>> >
> > >> >>>
> > >> >>> Any chance the .dll files (which are actually ELF executables) have
> > >> >>> been preserved somewhere?
> > >> >>
> > >> >> Here is the build folder (~90MB):
> > >> >> https://gitlab.com/osteffen/thunderx2-debug/-/raw/main/armvirt-thunderx2-issue.tar.xz
> > >> >>
> > >> >> I am waiting for the tests with the additional debug output to run.
> > >> >
> > >> >
> > >> > We reran the test suite with the Erratum and the additional debug
> > >> > output enabled. Strangely, the problem does not occur anymore, the
> > >> > firmware boots up normally.
> > >> >
> > >> > We retried the tests without the additional debug output.
> > >> > RHEL ships two firmware flavors for AARCH64: a silent and a verbose
> > >> > version.
> > >>
> > >> Are these RELEASE vs DEBUG builds?
> > >
> > >
> > > All builds are DEBUG, just the amount of information printed on
> > > the serial is different (almost zero for the "silent" one.)
> > >
> > >>
> > >> > Both were tried. We see no problems with the verbose
> > >> > one. The silent one fails noticeably more often if a software TPM device
> > >> > is present.
> > >> >
> > >>
> > >> This smells like some missing cache or TLB maintenance - the verbose
> > >> one exits to the host much more often, and likely relies on cache/TLB
> > >> maintenance occurring in the hypervisor.
> > >>
> > >> So the build always includes TPM support but the issue only occurs
> > >> when the sw TPM is actually exposed by QEMU?
> > >
> > >
> > > Yes.
> > > All builds include support for TPM, but the issue occurs more frequently
> > > if a sw TPM is exposed by QEMU.
> > >
> >
> > Any chance you could provide a specific command line for launching
> > QEMU? I am trying to reproduce this, but I am not making any progress.
> >
> > >>
> > >> > Could this be related to how much stuff is going on in the early phase
> > >> > of the firmware (when logging is enabled: formatting of messages and
> > >> > sending to serial port...) ?
> > >> >
> > >>
> > >> I'll try to see if I can rig something up that logs into a buffer
> > >> rather than straight to the serial, and dump it all out when handling
> > >> the crash
> > >>
> >
> > This takes a bit more time than I can afford to spend on this atm, and
> > I'd like to be able to reproduce before I go down this rabbit hole.
>
> Have there been any developments regarding this issue?
>

Nothing from my side. I tried to come up with a more reliable/faster
reproducer but then stopped because of other stuff.
If you have any idea what I could try next let me know.

-Oliver