public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: aaron.young@oracle.com
To: Laszlo Ersek <lersek@redhat.com>, edk2-devel@lists.01.org
Cc: brijesh.singh@amd.com, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system
Date: Wed, 3 Oct 2018 09:56:36 -0700	[thread overview]
Message-ID: <3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com> (raw)
In-Reply-To: <b034c163-ea74-b059-6186-5ffda946b536@redhat.com>


Hello,

  I have some more information on this issue that I wanted to share.

  The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just 
after PXE code has loaded GRUB, presented the GRUB menu and a boot 
option is selected.

  And, I have confirmed that it, as predicted, attempts to call 
SNP->Transmit() twice with the same Buffer - without successfully 
reaping the Buffer via SNP->GetStatus() in between calls - thus tripping 
the ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf().

  This was quite easy to see since GRUB appears to use only a single 
Buffer for which to transmit packets a few bytes at a time. i.e. it just 
calls SNP->Transmit() and then calls SNP->GetStatus() over and over 
again with the same buffer to transmit its packets.

  From my debug, I can see it work successfully most of the time (i.e. 
SNP->Transmit() is called successfully with a corresponding subsequent 
call to SNP->GetStatus() which returns the Buffer). However, 
occasionally, SNP->GetStatus() will return TxBuf=NULL (indicating that 
the Buffer has not yet been transmitted). In this case, GRUB appears to 
attempt to call SNP->GetStatus() *twice* before giving up (possibly due 
to a timer which I can see in the GRUB code) and then attempts to 
transfer the Buffer again - resulting in the ASSERT().

  So, this explains the regression since the old code would allow the 
same Buffer to to be re-transmitted.

  As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() 
to walk through Dev->TxBufCollection to see if the Buffer is present - 
i.e. has already been transmitted (and not yet reaped via 
SNP->GetStatus()) - and if so, just return EFI_SUCCESS from 
VirtioNetTransmit(). This allowed the PXE boot/installation to succeed. 
I'm not advocating this as a solution, but is an interesting data point.

  Any suggestion where we go from here?

  Is there any way to make the code more tolerant of these 
re-transmissions (like the code was before)?

  I'm not confident that declaring GRUB as not spec compliant is the answer.

  Thanks,

  -Aaron


On 08/31/18 06:17, Laszlo Ersek wrote:
> On 08/30/18 22:17, aaron.young@oracle.com wrote:
>>    We have encountered what we believe to be a regression in the
>> OvmfPkg/VirtioNetDxe driver.
>>
>>    This regression causes PXE boot to fail with the following ASSERT:
>>
>> ASSERT
>> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184):
>> ((BOOLEAN)(0==1))
>>
>>    The failure was encountered on a aarch64/Armv8 system (Ampere server)
>> using the latest mainline (edk2.git/master) code as of 8-28-2018.
>>
>>    On a hunch, I reverted the following patchset which resulted in PXE
>> boot being successful - which suggests this patchset is likely the culprit:
>>
>> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html
>>
>>    The command used to initiate the VM PXE boot/install is:
>>
>>    virt-install --name guest1  --vcpus 16 --ram 32000 --arch aarch64
>> --disk guest1.img,size=100 --network
>> type=direct,source=enP4p1s0,model=virtio --pxe
>>
>>    NOTE that I also tried simply removing the offending ASSERT() and PXE
>> boot still fails with multiple errors (exceptions of some sort that I
>> didn't log).
>>
>>
>>    What I'd like to know:
>>
>>    1. Is this a known failure?
> Please read the comment in the context of the failed assertion. If this
> assertion is triggered, then that's an error in the code that consumes
> the SNP instance provided by VirtioNetDxe. The client code invokes
> behavior that is undefined by the UEFI spec.
>
> You can find more details at:
>
> http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com
>
>>    2. Was aarch64 given much testing with this new patchset?
> Yes. I described the tests that I had done before pushing the patch set
> here:
>
> http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com
>
> http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com
>
>
> And most recently (independently of the patch set you refer to), I
> successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6,
> HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9
> ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13):
>
> http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com
>
>>    3. Is reverting this patchset a safe workaround for now until a fix is
>> found? Or is there another simple workaround perhaps?
> Reverting the set should make the symptom go away, but that just papers
> over the undefined behavior in the protocol caller.
>
> The goal of the series was to enable VirtioNetDxe to function correctly
> on top of:
> - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that
> - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL
>    underlying the virtio transport, such that
> - the platform provides an IOMMU protocol implementation for
>    PciHostBridgeDxe, which underlies PciBusDxe.
>
> If this use case does not apply to your platform, then you can revert
> the patch set.
>
> In particular, in the ArmVirtQemu platform, we don't hook
> "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into
> "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe
> there will not wait for another platform driver to decide dynamically
> whether the platform has an actual IOMMU protocol or not. And,
> accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu.
>
> (Obviously the patch set was extensively regression-tested to confirm it
> was a no-op on platforms where the above use case wouldn't apply.)
>
>>    4. Any hunches what the issue could be?
> Code that uses VirtioNetDxe's SNP protocol interface invokes undefined
> behavior; please see my answer to your Q1.
>
>>    Thanks for any info/help/suggestions,
> I'd like to give you a hint on debugging this further as well:
>
> (1) in the VirtioNetTransmit() function, please add DEBUG messages both
> before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer"
> (with %p). After, log "DeviceAddress" (with "%Lx").
>
> (2) In the VirtioNetGetStatus() function, please add DEBUG messages both
> before and after the call to VirtioNetUnmapTxBuf(). Before, log
> "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p").
>
> In each of (1) and (2), the values printed before/after should be
> identical (= identity-mapping for DMA).
>
> Please write a python script that parses the log generated by (1) and
> (2). The python script should maintain an associative array where the
> key type is "address", and the value type is "integer" (starting value 0).
>
> For each "Buffer" address printed by (1), the python script should
> increase the counter for that address.
>
> For each "*TxBuf" address printed by (2), the python script should
> decrease the counter for that address.
>
> If any counter goes above 1, then the client code invokes undefined
> behavior. If any counter goes under 0, then we messed up something. If
> all counters only take values 0 and 1, but you still hit the assert,
> then again we messed up something.
>
> The idea is to check whether:
>
> - the SNP consumer code tries to Transmit() the same buffer for a second
> time *before* it gets it back from GetStatus() -- this is the error that
> I'm currently suspecting,
>
> - or we messed up something (return a buffer that hasn't been queued for
> transmission, or else we mistake a not-yet-queued buffer for an already
> queued buffer when it is Transmit()ted for the first time).
>
>
> ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT()
> will disappear from VirtioNetMapTxBuf(), and Transmit() will simply
> return EFI_DEVICE_ERROR. This is consistent with your last paragraph --
> i.e., "boot still fails with multiple errors" -- and the reason for
> returning this error is that the SNP consumer really tries to invoke
> undefined behavior. It's not always possible to detect UB, but whenever
> we can, without a huge performance hit, then in DEBUG builds it's
> appropriate to trip an assert, and in RELEASE builds, to return an error.
>
> Thanks
> Laszlo
>
>>    -Aaron Young
>>
> Thanks
> Laszlo



  reply	other threads:[~2018-10-03 16:56 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-30 20:17 Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system aaron.young
2018-08-31 13:17 ` Laszlo Ersek
2018-10-03 16:56   ` aaron.young [this message]
2018-10-03 17:35     ` Ard Biesheuvel
2018-10-03 23:59       ` aaron.young
2018-10-04  9:24         ` Leif Lindholm
2018-10-04 13:06           ` Laszlo Ersek
2018-10-15 20:52             ` aaron.young
2018-10-16  1:55               ` Ard Biesheuvel
2018-10-17  9:50                 ` Laszlo Ersek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3b4a4980-42d0-51a0-eccd-ceba83b6c78f@oracle.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox