public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: aaron.young@oracle.com
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	Leif Lindholm <leif.lindholm@linaro.org>
Cc: Laszlo Ersek <lersek@redhat.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>,
	Brijesh Singh <brijesh.singh@amd.com>
Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system
Date: Wed, 3 Oct 2018 16:59:54 -0700	[thread overview]
Message-ID: <eb5d2f4d-1b05-3d88-ad1e-d0184857c692@oracle.com> (raw)
In-Reply-To: <CAKv+Gu87iumO0yTjEBNBY6ukZXkWwFN6beFmhMkwTmcwB7PWyw@mail.gmail.com>


  I am suspecting that this patch to GRUB is the cause of a Buffer being 
re-transmitted before reaping the Buffer via SNP->GetStatus():

https://git.centos.org/blob/rpms!grub2.git/1065bd29e776aef83f927747882140dcb6fd5cde/SOURCES!0183-efinet-retransmit-if-our-device-is-busy.patch

  So, to reproduce the issue, the GRUB used via PXE boot needs to 
include this patch.

  Thanks,

  -Aaron


On 10/03/18 10:35, Ard Biesheuvel wrote:
> (+ Leif)
>
> On 3 October 2018 at 18:56,  <aaron.young@oracle.com> wrote:
>> Hello,
>>
>>   I have some more information on this issue that I wanted to share.
>>
>>   The SNP client appears to be GRUB - i.e. the ASSERT() is asserted just
>> after PXE code has loaded GRUB, presented the GRUB menu and a boot option is
>> selected.
>>
>>   And, I have confirmed that it, as predicted, attempts to call
>> SNP->Transmit() twice with the same Buffer - without successfully reaping
>> the Buffer via SNP->GetStatus() in between calls - thus tripping the
>> ASSERT() in SnpSharedHelpers.c:VirtioNetMapTxBuf().
>>
>>   This was quite easy to see since GRUB appears to use only a single Buffer
>> for which to transmit packets a few bytes at a time. i.e. it just calls
>> SNP->Transmit() and then calls SNP->GetStatus() over and over again with the
>> same buffer to transmit its packets.
>>
>>   From my debug, I can see it work successfully most of the time (i.e.
>> SNP->Transmit() is called successfully with a corresponding subsequent call
>> to SNP->GetStatus() which returns the Buffer). However, occasionally,
>> SNP->GetStatus() will return TxBuf=NULL (indicating that the Buffer has not
>> yet been transmitted). In this case, GRUB appears to attempt to call
>> SNP->GetStatus() *twice* before giving up (possibly due to a timer which I
>> can see in the GRUB code) and then attempts to transfer the Buffer again -
>> resulting in the ASSERT().
>>
>>   So, this explains the regression since the old code would allow the same
>> Buffer to to be re-transmitted.
>>
>>   As a quick test, I coded up a quick workaround in VirtioNetMapTxBuf() to
>> walk through Dev->TxBufCollection to see if the Buffer is present - i.e. has
>> already been transmitted (and not yet reaped via SNP->GetStatus()) - and if
>> so, just return EFI_SUCCESS from VirtioNetTransmit(). This allowed the PXE
>> boot/installation to succeed. I'm not advocating this as a solution, but is
>> an interesting data point.
>>
>>   Any suggestion where we go from here?
>>
>>   Is there any way to make the code more tolerant of these re-transmissions
>> (like the code was before)?
>>
>>   I'm not confident that declaring GRUB as not spec compliant is the answer.
>>
>>   Thanks,
>>
>>   -Aaron
>>
>>
>>
>> On 08/31/18 06:17, Laszlo Ersek wrote:
>>> On 08/30/18 22:17, aaron.young@oracle.com wrote:
>>>>     We have encountered what we believe to be a regression in the
>>>> OvmfPkg/VirtioNetDxe driver.
>>>>
>>>>     This regression causes PXE boot to fail with the following ASSERT:
>>>>
>>>> ASSERT
>>>>
>>>> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184):
>>>> ((BOOLEAN)(0==1))
>>>>
>>>>     The failure was encountered on a aarch64/Armv8 system (Ampere server)
>>>> using the latest mainline (edk2.git/master) code as of 8-28-2018.
>>>>
>>>>     On a hunch, I reverted the following patchset which resulted in PXE
>>>> boot being successful - which suggests this patchset is likely the
>>>> culprit:
>>>>
>>>> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html
>>>>
>>>>     The command used to initiate the VM PXE boot/install is:
>>>>
>>>>     virt-install --name guest1  --vcpus 16 --ram 32000 --arch aarch64
>>>> --disk guest1.img,size=100 --network
>>>> type=direct,source=enP4p1s0,model=virtio --pxe
>>>>
>>>>     NOTE that I also tried simply removing the offending ASSERT() and PXE
>>>> boot still fails with multiple errors (exceptions of some sort that I
>>>> didn't log).
>>>>
>>>>
>>>>     What I'd like to know:
>>>>
>>>>     1. Is this a known failure?
>>> Please read the comment in the context of the failed assertion. If this
>>> assertion is triggered, then that's an error in the code that consumes
>>> the SNP instance provided by VirtioNetDxe. The client code invokes
>>> behavior that is undefined by the UEFI spec.
>>>
>>> You can find more details at:
>>>
>>>
>>> http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com
>>>
>>>>     2. Was aarch64 given much testing with this new patchset?
>>> Yes. I described the tests that I had done before pushing the patch set
>>> here:
>>>
>>>
>>> http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com
>>>
>>>
>>> http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com
>>>
>>>
>>> And most recently (independently of the patch set you refer to), I
>>> successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6,
>>> HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9
>>> ("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13):
>>>
>>> http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com
>>>
>>>>     3. Is reverting this patchset a safe workaround for now until a fix is
>>>> found? Or is there another simple workaround perhaps?
>>> Reverting the set should make the symptom go away, but that just papers
>>> over the undefined behavior in the protocol caller.
>>>
>>> The goal of the series was to enable VirtioNetDxe to function correctly
>>> on top of:
>>> - a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that
>>> - explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL
>>>     underlying the virtio transport, such that
>>> - the platform provides an IOMMU protocol implementation for
>>>     PciHostBridgeDxe, which underlies PciBusDxe.
>>>
>>> If this use case does not apply to your platform, then you can revert
>>> the patch set.
>>>
>>> In particular, in the ArmVirtQemu platform, we don't hook
>>> "OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into
>>> "PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe
>>> there will not wait for another platform driver to decide dynamically
>>> whether the platform has an actual IOMMU protocol or not. And,
>>> accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu.
>>>
>>> (Obviously the patch set was extensively regression-tested to confirm it
>>> was a no-op on platforms where the above use case wouldn't apply.)
>>>
>>>>     4. Any hunches what the issue could be?
>>> Code that uses VirtioNetDxe's SNP protocol interface invokes undefined
>>> behavior; please see my answer to your Q1.
>>>
>>>>     Thanks for any info/help/suggestions,
>>> I'd like to give you a hint on debugging this further as well:
>>>
>>> (1) in the VirtioNetTransmit() function, please add DEBUG messages both
>>> before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer"
>>> (with %p). After, log "DeviceAddress" (with "%Lx").
>>>
>>> (2) In the VirtioNetGetStatus() function, please add DEBUG messages both
>>> before and after the call to VirtioNetUnmapTxBuf(). Before, log
>>> "DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p").
>>>
>>> In each of (1) and (2), the values printed before/after should be
>>> identical (= identity-mapping for DMA).
>>>
>>> Please write a python script that parses the log generated by (1) and
>>> (2). The python script should maintain an associative array where the
>>> key type is "address", and the value type is "integer" (starting value 0).
>>>
>>> For each "Buffer" address printed by (1), the python script should
>>> increase the counter for that address.
>>>
>>> For each "*TxBuf" address printed by (2), the python script should
>>> decrease the counter for that address.
>>>
>>> If any counter goes above 1, then the client code invokes undefined
>>> behavior. If any counter goes under 0, then we messed up something. If
>>> all counters only take values 0 and 1, but you still hit the assert,
>>> then again we messed up something.
>>>
>>> The idea is to check whether:
>>>
>>> - the SNP consumer code tries to Transmit() the same buffer for a second
>>> time *before* it gets it back from GetStatus() -- this is the error that
>>> I'm currently suspecting,
>>>
>>> - or we messed up something (return a buffer that hasn't been queued for
>>> transmission, or else we mistake a not-yet-queued buffer for an already
>>> queued buffer when it is Transmit()ted for the first time).
>>>
>>>
>>> ... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT()
>>> will disappear from VirtioNetMapTxBuf(), and Transmit() will simply
>>> return EFI_DEVICE_ERROR. This is consistent with your last paragraph --
>>> i.e., "boot still fails with multiple errors" -- and the reason for
>>> returning this error is that the SNP consumer really tries to invoke
>>> undefined behavior. It's not always possible to detect UB, but whenever
>>> we can, without a huge performance hit, then in DEBUG builds it's
>>> appropriate to trip an assert, and in RELEASE builds, to return an error.
>>>
>>> Thanks
>>> Laszlo
>>>
>>>>     -Aaron Young
>>>>
>>> Thanks
>>> Laszlo
>>



  reply	other threads:[~2018-10-04  0:00 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-30 20:17 Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system aaron.young
2018-08-31 13:17 ` Laszlo Ersek
2018-10-03 16:56   ` aaron.young
2018-10-03 17:35     ` Ard Biesheuvel
2018-10-03 23:59       ` aaron.young [this message]
2018-10-04  9:24         ` Leif Lindholm
2018-10-04 13:06           ` Laszlo Ersek
2018-10-15 20:52             ` aaron.young
2018-10-16  1:55               ` Ard Biesheuvel
2018-10-17  9:50                 ` Laszlo Ersek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eb5d2f4d-1b05-3d88-ad1e-d0184857c692@oracle.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox