From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: aaron.young@oracle.com, Leif Lindholm <leif.lindholm@linaro.org>
Cc: Laszlo Ersek <lersek@redhat.com>,
"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>,
Brijesh Singh <brijesh.singh@amd.com>
Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system
Date: Wed, 3 Oct 2018 19:35:09 +0200 [thread overview]
Message-ID: <CAKv+Gu87iumO0yTjEBNBY6ukZXkWwFN6beFmhMkwTmcwB7PWyw@mail.gmail.com> (raw)
In-Reply-To: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com>
(+ Leif)
On 3 October 2018 at 18:56, <aaron.young@oracle.com> wrote:
>
> Hello,
>
> I have some more information on this issue that I wanted to share.
>
> The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just
> after PXE code has loaded GRUB, presented the GRUB menu and a boot option is
> selected.
>
> And, I have confirmed that it, as predicted, attempts to call
> SNP->Transmit() twice with the same Buffer - without successfully reaping
> the Buffer via SNP->GetStatus() in between calls - thus tripping the
> ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf().
>
> This was quite easy to see since GRUB appears to use only a single Buffer
> for which to transmit packets a few bytes at a time. i.e. it just calls
> SNP->Transmit() and then calls SNP->GetStatus() over and over again with the
> same buffer to transmit its packets.
>
> From my debug, I can see it work successfully most of the time (i.e.
> SNP->Transmit() is called successfully with a corresponding subsequent call
> to SNP->GetStatus() which returns the Buffer). However, occasionally,
> SNP->GetStatus() will return TxBuf=NULL (indicating that the Buffer has not
> yet been transmitted). In this case, GRUB appears to attempt to call
> SNP->GetStatus() *twice* before giving up (possibly due to a timer which I
> can see in the GRUB code) and then attempts to transfer the Buffer again -
> resulting in the ASSERT().
>
> So, this explains the regression since the old code would allow the same
> Buffer to to be re-transmitted.
>
> As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() to
> walk through Dev->TxBufCollection to see if the Buffer is present - i.e. has
> already been transmitted (and not yet reaped via SNP->GetStatus()) - and if
> so, just return EFI_SUCCESS from VirtioNetTransmit(). This allowed the PXE
> boot/installation to succeed. I'm not advocating this as a solution, but is
> an interesting data point.
>
> Any suggestion where we go from here?
>
> Is there any way to make the code more tolerant of these re-transmissions
> (like the code was before)?
>
> I'm not confident that declaring GRUB as not spec compliant is the answer.
>
> Thanks,
>
> -Aaron
>
>
>
> On 08/31/18 06:17, Laszlo Ersek wrote:
>>
>> On 08/30/18 22:17, aaron.young@oracle.com wrote:
>>>
>>> We have encountered what we believe to be a regression in the
>>> OvmfPkg/VirtioNetDxe driver.
>>>
>>> This regression causes PXE boot to fail with the following ASSERT:
>>>
>>> ASSERT
>>>
>>> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184):
>>> ((BOOLEAN)(0==1))
>>>
>>> The failure was encountered on a aarch64/Armv8 system (Ampere server)
>>> using the latest mainline (edk2.git/master) code as of 8-28-2018.
>>>
>>> On a hunch, I reverted the following patchset which resulted in PXE
>>> boot being successful - which suggests this patchset is likely the
>>> culprit:
>>>
>>> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html
>>>
>>> The command used to initiate the VM PXE boot/install is:
>>>
>>> virt-install --name guest1 --vcpus 16 --ram 32000 --arch aarch64
>>> --disk guest1.img,size=100 --network
>>> type=direct,source=enP4p1s0,model=virtio --pxe
>>>
>>> NOTE that I also tried simply removing the offending ASSERT() and PXE
>>> boot still fails with multiple errors (exceptions of some sort that I
>>> didn't log).
>>>
>>>
>>> What I'd like to know:
>>>
>>> 1. Is this a known failure?
>>
>> Please read the comment in the context of the failed assertion. If this
>> assertion is triggered, then that's an error in the code that consumes
>> the SNP instance provided by VirtioNetDxe. The client code invokes
>> behavior that is undefined by the UEFI spec.
>>
>> You can find more details at:
>>
>>
>> http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com
>>
>>> 2. Was aarch64 given much testing with this new patchset?
>>
>> Yes. I described the tests that I had done before pushing the patch set
>> here:
>>
>>
>> http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com
>>
>>
>> http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com
>>
>>
>> And most recently (independently of the patch set you refer to), I
>> successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6,
>> HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9
>> ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13):
>>
>> http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com
>>
>>> 3. Is reverting this patchset a safe workaround for now until a fix is
>>> found? Or is there another simple workaround perhaps?
>>
>> Reverting the set should make the symptom go away, but that just papers
>> over the undefined behavior in the protocol caller.
>>
>> The goal of the series was to enable VirtioNetDxe to function correctly
>> on top of:
>> - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that
>> - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL
>> underlying the virtio transport, such that
>> - the platform provides an IOMMU protocol implementation for
>> PciHostBridgeDxe, which underlies PciBusDxe.
>>
>> If this use case does not apply to your platform, then you can revert
>> the patch set.
>>
>> In particular, in the ArmVirtQemu platform, we don't hook
>> "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into
>> "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe
>> there will not wait for another platform driver to decide dynamically
>> whether the platform has an actual IOMMU protocol or not. And,
>> accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu.
>>
>> (Obviously the patch set was extensively regression-tested to confirm it
>> was a no-op on platforms where the above use case wouldn't apply.)
>>
>>> 4. Any hunches what the issue could be?
>>
>> Code that uses VirtioNetDxe's SNP protocol interface invokes undefined
>> behavior; please see my answer to your Q1.
>>
>>> Thanks for any info/help/suggestions,
>>
>> I'd like to give you a hint on debugging this further as well:
>>
>> (1) in the VirtioNetTransmit() function, please add DEBUG messages both
>> before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer"
>> (with %p). After, log "DeviceAddress" (with "%Lx").
>>
>> (2) In the VirtioNetGetStatus() function, please add DEBUG messages both
>> before and after the call to VirtioNetUnmapTxBuf(). Before, log
>> "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p").
>>
>> In each of (1) and (2), the values printed before/after should be
>> identical (= identity-mapping for DMA).
>>
>> Please write a python script that parses the log generated by (1) and
>> (2). The python script should maintain an associative array where the
>> key type is "address", and the value type is "integer" (starting value 0).
>>
>> For each "Buffer" address printed by (1), the python script should
>> increase the counter for that address.
>>
>> For each "*TxBuf" address printed by (2), the python script should
>> decrease the counter for that address.
>>
>> If any counter goes above 1, then the client code invokes undefined
>> behavior. If any counter goes under 0, then we messed up something. If
>> all counters only take values 0 and 1, but you still hit the assert,
>> then again we messed up something.
>>
>> The idea is to check whether:
>>
>> - the SNP consumer code tries to Transmit() the same buffer for a second
>> time *before* it gets it back from GetStatus() -- this is the error that
>> I'm currently suspecting,
>>
>> - or we messed up something (return a buffer that hasn't been queued for
>> transmission, or else we mistake a not-yet-queued buffer for an already
>> queued buffer when it is Transmit()ted for the first time).
>>
>>
>> ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT()
>> will disappear from VirtioNetMapTxBuf(), and Transmit() will simply
>> return EFI_DEVICE_ERROR. This is consistent with your last paragraph --
>> i.e., "boot still fails with multiple errors" -- and the reason for
>> returning this error is that the SNP consumer really tries to invoke
>> undefined behavior. It's not always possible to detect UB, but whenever
>> we can, without a huge performance hit, then in DEBUG builds it's
>> appropriate to trip an assert, and in RELEASE builds, to return an error.
>>
>> Thanks
>> Laszlo
>>
>>> -Aaron Young
>>>
>> Thanks
>> Laszlo
>
>
next prev parent reply other threads:[~2018-10-03 17:35 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-08-30 20:17 Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system aaron.young
2018-08-31 13:17 ` Laszlo Ersek
2018-10-03 16:56 ` aaron.young
2018-10-03 17:35 ` Ard Biesheuvel [this message]
2018-10-03 23:59 ` aaron.young
2018-10-04 9:24 ` Leif Lindholm
2018-10-04 13:06 ` Laszlo Ersek
2018-10-15 20:52 ` aaron.young
2018-10-16 1:55 ` Ard Biesheuvel
2018-10-17 9:50 ` Laszlo Ersek
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-list from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAKv+Gu87iumO0yTjEBNBY6ukZXkWwFN6beFmhMkwTmcwB7PWyw@mail.gmail.com \
--to=devel@edk2.groups.io \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox