public inbox for devel@edk2.groups.io
 help / color / mirror / Atom feed
From: Laszlo Ersek <lersek@redhat.com>
To: aaron.young@oracle.com, edk2-devel@lists.01.org
Cc: brijesh.singh@amd.com, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system
Date: Fri, 31 Aug 2018 15:17:54 +0200	[thread overview]
Message-ID: <b034c163-ea74-b059-6186-5ffda946b536@redhat.com> (raw)
In-Reply-To: <ccea8a82-1696-84ac-1cb6-0f1b62e3c9b1@oracle.com>

On 08/30/18 22:17, aaron.young@oracle.com wrote:
>   We have encountered what we believe to be a regression in the
> OvmfPkg/VirtioNetDxe driver.
> 
>   This regression causes PXE boot to fail with the following ASSERT:
> 
> ASSERT
> [VirtioNetDxe]/builddir/build/BUILD/ovmf-92d07e48907f/OvmfPkg/VirtioNetDxe/SnpSharedHelpers.c(184):
> ((BOOLEAN)(0==1))
> 
>   The failure was encountered on a aarch64/Armv8 system (Ampere server)
> using the latest mainline (edk2.git/master) code as of 8-28-2018.
> 
>   On a hunch, I reverted the following patchset which resulted in PXE
> boot being successful - which suggests this patchset is likely the culprit:
> 
> https://lists.01.org/pipermail/edk2-devel/2017-September/014679.html
> 
>   The command used to initiate the VM PXE boot/install is:
> 
>   virt-install --name guest1  --vcpus 16 --ram 32000 --arch aarch64
> --disk guest1.img,size=100 --network
> type=direct,source=enP4p1s0,model=virtio --pxe
> 
>   NOTE that I also tried simply removing the offending ASSERT() and PXE
> boot still fails with multiple errors (exceptions of some sort that I
> didn't log).
> 
> 
>   What I'd like to know:
> 
>   1. Is this a known failure?

Please read the comment in the context of the failed assertion. If this
assertion is triggered, then that's an error in the code that consumes
the SNP instance provided by VirtioNetDxe. The client code invokes
behavior that is undefined by the UEFI spec.

You can find more details at:

http://mid.mail-archive.com/2adc266a-949d-c546-9bed-d78fea19d969@redhat.com

>   2. Was aarch64 given much testing with this new patchset?

Yes. I described the tests that I had done before pushing the patch set
here:

http://mid.mail-archive.com/581a568d-b0ce-34a1-8e52-95adf14c03e0@redhat.com

http://mid.mail-archive.com/1e4267dc-0133-e833-f805-8381ef82c52e@redhat.com


And most recently (independently of the patch set you refer to), I
successfully tested VirtioNetDxe on aarch64 as part of PXEv4, PXEv6,
HTTPv4, HTTPv6 net-boots, for commit ae08ea246fe9
("ArmVirtPkg/ArmVirtQemu: enable the IPv6 stack", 2018-07-13):

http://mid.mail-archive.com/20180712234112.3566-1-lersek@redhat.com

>   3. Is reverting this patchset a safe workaround for now until a fix is
> found? Or is there another simple workaround perhaps?

Reverting the set should make the symptom go away, but that just papers
over the undefined behavior in the protocol caller.

The goal of the series was to enable VirtioNetDxe to function correctly
on top of:
- a virtio-1.0 PCI transport, provided by Virtio10Dxe, such that
- explicit DMA mapping is performed using the EFI_PCI_IO_PROTOCOL
  underlying the virtio transport, such that
- the platform provides an IOMMU protocol implementation for
  PciHostBridgeDxe, which underlies PciBusDxe.

If this use case does not apply to your platform, then you can revert
the patch set.

In particular, in the ArmVirtQemu platform, we don't hook
"OvmfPkg/Library/PlatformHasIoMmuLib/PlatformHasIoMmuLib.inf" into
"PciHostBridgeDxe" via NULL class resolution, thus PciHostBridgeDxe
there will not wait for another platform driver to decide dynamically
whether the platform has an actual IOMMU protocol or not. And,
accordingly, we never produce an IOMMU protocol instance in ArmVirtQemu.

(Obviously the patch set was extensively regression-tested to confirm it
was a no-op on platforms where the above use case wouldn't apply.)

>   4. Any hunches what the issue could be?

Code that uses VirtioNetDxe's SNP protocol interface invokes undefined
behavior; please see my answer to your Q1.

>   Thanks for any info/help/suggestions,

I'd like to give you a hint on debugging this further as well:

(1) in the VirtioNetTransmit() function, please add DEBUG messages both
before and after the call to VirtioNetMapTxBuf(). Before, log "Buffer"
(with %p). After, log "DeviceAddress" (with "%Lx").

(2) In the VirtioNetGetStatus() function, please add DEBUG messages both
before and after the call to VirtioNetUnmapTxBuf(). Before, log
"DeviceAddress" (with "%Lx"). After, log "*TxBuf" (with "%p").

In each of (1) and (2), the values printed before/after should be
identical (= identity-mapping for DMA).

Please write a python script that parses the log generated by (1) and
(2). The python script should maintain an associative array where the
key type is "address", and the value type is "integer" (starting value 0).

For each "Buffer" address printed by (1), the python script should
increase the counter for that address.

For each "*TxBuf" address printed by (2), the python script should
decrease the counter for that address.

If any counter goes above 1, then the client code invokes undefined
behavior. If any counter goes under 0, then we messed up something. If
all counters only take values 0 and 1, but you still hit the assert,
then again we messed up something.

The idea is to check whether:

- the SNP consumer code tries to Transmit() the same buffer for a second
time *before* it gets it back from GetStatus() -- this is the error that
I'm currently suspecting,

- or we messed up something (return a buffer that hasn't been queued for
transmission, or else we mistake a not-yet-queued buffer for an already
queued buffer when it is Transmit()ted for the first time).


... BTW, if you build ArmVirtQemu for the RELEASE target, the ASSERT()
will disappear from VirtioNetMapTxBuf(), and Transmit() will simply
return EFI_DEVICE_ERROR. This is consistent with your last paragraph --
i.e., "boot still fails with multiple errors" -- and the reason for
returning this error is that the SNP consumer really tries to invoke
undefined behavior. It's not always possible to detect UB, but whenever
we can, without a huge performance hit, then in DEBUG builds it's
appropriate to trip an assert, and in RELEASE builds, to return an error.

Thanks
Laszlo

>   -Aaron Young
> 

Thanks
Laszlo


  reply	other threads:[~2018-08-31 13:17 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-30 20:17 Regression with PXE boot on OvmfPkg/VirtioNetDxe driver on aarch64 system aaron.young
2018-08-31 13:17 ` Laszlo Ersek [this message]
2018-10-03 16:56   ` aaron.young
2018-10-03 17:35     ` Ard Biesheuvel
2018-10-03 23:59       ` aaron.young
2018-10-04  9:24         ` Leif Lindholm
2018-10-04 13:06           ` Laszlo Ersek
2018-10-15 20:52             ` aaron.young
2018-10-16  1:55               ` Ard Biesheuvel
2018-10-17  9:50                 ` Laszlo Ersek

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b034c163-ea74-b059-6186-5ffda946b536@redhat.com \
    --to=devel@edk2.groups.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox