On Tue, Feb 7, 2023 at 12:57 PM Ard Biesheuvel <ardb@kernel.org> wrote:
On Tue, 7 Feb 2023 at 11:51, Oliver Steffen <osteffen@redhat.com> wrote:
>
> On Thu, Feb 2, 2023 at 12:09 PM Oliver Steffen <osteffen@redhat.com> wrote:
>>
>>
>> On Wed, Feb 1, 2023 at 2:29 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>>>
>>> On Wed, 1 Feb 2023 at 13:59, Oliver Steffen <osteffen@redhat.com> wrote:
>>> >
>>> > On Wed, Feb 1, 2023 at 12:52 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>>> >>
>>> >> On Wed, 1 Feb 2023 at 10:14, Oliver Steffen <osteffen@redhat.com> wrote:
>>> >> >
>>
>> [...]
>>>
>>> >> > I am sorry, this story does not seem to be over yet.
>>> >> >
>>> >> > We are using the Erratum patch and also included the commit 406504c7 in
>>> >> > the kernel.
>>> >> > Now the firmware crashes sometimes (10 out of 89 tests).
>>> >> >
>>> >>
>>> >> Thanks for the report. Is this still on ThunderX2?
>>> >>
>>> >> > Any hints are very welcome!
>>> >> >
>>> >>
>>> >> Do you have access to those build artifacts?
>>> >
>>> >
>>> > https://kojihub.stream.centos.org/kojifiles/work/tasks/5251/1835251/edk2-aarch64-20221207gitfff6d81270b5-4.el9.test.noarch.rpm
>>> >
>>> > and/or here:
>>> >
>>> > https://kojihub.stream.centos.org/koji/taskinfo?taskID=1835251
>>> >
>>> > Source for reference:
>>> > https://gitlab.com/redhat/centos-stream/src/edk2/-/merge_requests/24
>>> >
>>>
>>> Any chance the .dll files (which are actually ELF executables) have
>>> been preserved somewhere?
>>
>> Here is the build folder (~90MB):
>> https://gitlab.com/osteffen/thunderx2-debug/-/raw/main/armvirt-thunderx2-issue.tar.xz
>>
>> I am waiting for the tests with the additional debug output to run.
>
>
> We reran the test suite with the Erratum and the additional debug
> output enabled. Strangely, the problem no longer occurs; the
> firmware boots up normally.
>
> We retried the tests without the additional debug output.
> RHEL ships two firmware flavors for AARCH64: a silent and a verbose
> version.

Are these RELEASE vs DEBUG builds?

All builds are DEBUG; only the amount of information printed on
the serial port differs (almost none for the "silent" one).
 
> Both were tried. We see no problems with the verbose
> one. The silent one fails noticeably more often if a software TPM device
> is present.
>

This smells like some missing cache or TLB maintenance - the verbose
one exits to the host much more often, and likely relies on cache/TLB
maintenance occurring in the hypervisor.

So the build always includes TPM support but the issue only occurs
when the sw TPM is actually exposed by QEMU?
 
Yes.
All builds include support for TPM, but the issue occurs more frequently
if a sw TPM is exposed by QEMU.
 
> Could this be related to how much stuff is going on in the early phase
> of the firmware (when logging is enabled: formatting of messages and
> sending to serial port...) ?
>

I'll try to see if I can rig something up that logs into a buffer
rather than straight to the serial, and dump it all out when handling
the crash.

Awesome.
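
For illustration only, here is a minimal sketch of what logging into a
buffer and dumping it later could look like. It is not the actual edk2
change and uses no edk2 API: every name in it (log_ring, log_ring_write,
log_ring_dump, log_capture_sink, LOG_BUF_SIZE) is hypothetical, and the
buffer is kept tiny so the wrap-around is easy to see. Messages are
appended to a fixed-size ring without touching the serial port, and a
crash handler would later replay the surviving bytes, oldest first,
through a sink of its choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not an edk2 interface. */
#define LOG_BUF_SIZE 64u /* deliberately small for illustration */

typedef struct {
    char   data[LOG_BUF_SIZE];
    size_t head;  /* next write position */
    size_t count; /* number of valid bytes, at most LOG_BUF_SIZE */
} log_ring;

static void log_ring_init(log_ring *r)
{
    r->head = 0;
    r->count = 0;
}

/* Append a NUL-terminated message. No I/O happens here; once the ring
 * is full, the oldest bytes are overwritten. */
static void log_ring_write(log_ring *r, const char *msg)
{
    for (; *msg != '\0'; msg++) {
        r->data[r->head] = *msg;
        r->head = (r->head + 1) % LOG_BUF_SIZE;
        if (r->count < LOG_BUF_SIZE)
            r->count++;
    }
}

/* Receives one character at a time, e.g. a serial write in a crash handler. */
typedef void (*log_sink)(char c, void *ctx);

/* Replay the buffered bytes, oldest first. Returns the number emitted. */
static size_t log_ring_dump(const log_ring *r, log_sink sink, void *ctx)
{
    assert(sink != NULL);
    size_t start = (r->head + LOG_BUF_SIZE - r->count) % LOG_BUF_SIZE;
    for (size_t i = 0; i < r->count; i++)
        sink(r->data[(start + i) % LOG_BUF_SIZE], ctx);
    return r->count;
}

/* Example sink that collects into memory instead of a real UART. */
typedef struct {
    char  *out;
    size_t cap;
    size_t len;
} log_capture;

static void log_capture_sink(char c, void *ctx)
{
    log_capture *cap = (log_capture *)ctx;
    if (cap->len + 1 < cap->cap) {
        cap->out[cap->len++] = c;
        cap->out[cap->len] = '\0';
    }
}
```

With something along these lines, the silent build's timing stays close
to what it is today (no exits to the host for serial output), while the
most recent messages can still be recovered after a crash.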

Thanks,
 Oliver