Re: Network Stack Budgeting

From: "Tomas Pilar (tpilar)" <tpilar@solarflare.com>
To: "Fu, Siyuan" <siyuan.fu@intel.com>,
	"Wu, Jiaxin" <jiaxin.wu@intel.com>,
	Laszlo Ersek <lersek@redhat.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Ye, Ting" <ting.ye@intel.com>
Subject: Re: Network Stack Budgeting
Date: Tue, 29 Jan 2019 13:12:41 +0000	[thread overview]
Message-ID: <d2b66fcb-a743-3f17-0503-22146a992fa5@solarflare.com> (raw)
In-Reply-To: <B1FF2E9001CE9041BD10B825821D5BC58B6E155B@SHSMSX103.ccr.corp.intel.com>

Hi Siyuan,

Thanks for the insights, hopefully removing the moderator will fix my current issue and I'll have a look at how to do the recycling better.

Cheers,
Tom

On 29/01/2019 13:06, Fu, Siyuan wrote:
> Hi, Tomas
>
>> -----Original Message-----
>> From: Tomas Pilar (tpilar) [mailto:tpilar@solarflare.com]
>> Sent: Tuesday, January 29, 2019 6:54 PM
>> To: Fu, Siyuan <siyuan.fu@intel.com>; Wu, Jiaxin <jiaxin.wu@intel.com>; Laszlo
>> Ersek <lersek@redhat.com>; edk2-devel@lists.01.org
>> Cc: Ye, Ting <ting.ye@intel.com>
>> Subject: Re: [edk2] Network Stack Budgeting
>>
>>
>>
>> On 29/01/2019 03:20, Fu, Siyuan wrote:
>>> Hi, Tomas
>>>
>>>> -----Original Message-----
>>>> From: Tomas Pilar (tpilar) [mailto:tpilar@solarflare.com]
>>>> Sent: Monday, January 28, 2019 7:24 PM
>>>> To: Fu, Siyuan <siyuan.fu@intel.com>; Wu, Jiaxin <jiaxin.wu@intel.com>;
>> Laszlo
>>>> Ersek <lersek@redhat.com>; edk2-devel@lists.01.org
>>>> Cc: Ye, Ting <ting.ye@intel.com>
>>>> Subject: Re: [edk2] Network Stack Budgeting
>>>>
>>>> Hi Siyuan,
>>>>
>>>> I had to do some tweaks due to the fact that people in the wild want to PXE
>>>> boot whole WDS installations which are upwards of 300MB.
>>>>
>>>> I have an internal set of queues where the NIC receives into. Snp.Receive()
>>>> will get a packet from this store. I had to implement a separate queue for
>>>> unicast/multicast/broadcast traffic so that I could reprioritise and
>> deliver
>>>> unicast traffic while under ARP spam. I tied internal poll to receive and
>>>> transmit, so Snp.Receive() will poll the NIC and then dequeue a packet
>> while
>>>> Snp.Transmit() will transmit the packet and then poll the NIC to get a
>>>> completion. I also implemented a busy poll for 100us or until the next
>> packet
>>>> comes in when the NIC is polled.
>>> I see some problems here. Both Snp.Receive() and Snp.Transmit() should be
>> non-blocking
>>> interface in current design and UEFI spec, but it's not true in your driver.
>>> For Receive(), it could poll the NIC only once and return either a received
>> packet
>>> or an error, it should not do any blocking poll to wait for a packet.
>> On some platforms (tftp clients), if the Snp.Receive() returns EFI_NOT_READY
>> they do not poll the NIC until the next MnpSystemPoll(). Hence the delay
>> before returning EFI_NOT_READY in case the packet is still incoming.
> OK, that's really not good practice.
>
>>> For Transmit(), it just need to put the packet into the TX queue and return
>> directly,
>>> without checking if the TX is completed.
>>> Instead of that, Snp.GetStatus() should be
>>> used to check if the packet has really been sent out.
>> I've seen quite a few platforms where GetStatus was only called by the
>> MnpSystemPoll() and the application would not call it. This means our hardware
>> rings are stuffed with completions and we stop being able to transmit. Yes,
>> you could say that we should solve the application, but unfortunately I can't
>> do that.
> The application doesn't need to call Snp.GetStatus() to recycle the TX buffer, the
> Mnp.Transmit() will do this automatically when it found the SNP's TX buffer is full.
> See MnpSyncSendPacket() and MnpRecycleTxBuf(). 
>
>>> May I know how did you control the 100us timeout of the busy poll (by which
>> API?),
>>> and what's the TPL of this busy poll?
>>>
>> Create a timeout timer with value of 1000. Then busy poll in a loop until
>> packet is received or a timer is triggered.
>> This runs at TPL_CALLBACK while the EfxPoll function itself runs at TPL_NOTIFY.
>>
>> --
>>   /* Set the poll moderation timeout to 100us */
>>   gBS->SetTimer(NetDev->PollModeratorTimer, TimerRelative, 1000);
>>
>>   while (!InterruptStatus) {
>>     EfxPoll(NULL, NetDev->Efx);
>>     InterruptStatus = EfxStatus(NetDev->Efx);
>>
>>     if (gBS->CheckEvent(NetDev->PollModeratorTimer))
>>       return InterruptStatus;
>>   }
>>
>>   return InterruptStatus;
>> --
> This is exactly the case what I worried about. Some platform has a minimum timer
> interval (like 1ms or 10ms), so if you set a timer interval shorter than that,
> it will actually be signaled after the minimum value. Please check if your
> platform could support a such a short timeout (100us).
>
>>>> Otherwise I have seen a pathological case where the TFTP client would pull
>> one
>>>> packet per second but the TFTP_RETRANSMIT interval on the server was also
>> one
>>>> second and everything was awful.
>>> Why TFTP client just pull one packet per second? I think it’s not correct
>> and it
>>> could use the poll() function to accelerate the receive. Why you trying to
>> solve a
>>> TFTP layer problem in SNP layer? It breaks the design principle of network
>> layer model.
>> Yes, I appreciate the principle. However, in practice we don't get to sell
>> adapters that do not PXE boot and it's pointless to argue with customers that
>> they have a rubbish TFTP client implementation. So we put in workarounds into
>> the driver.
> OK I understand your situation now. Technically it will be better to fix the
> problem in TFTP client, so I still suggest you to confirm this with your customer
> or the TFTP application owner. This workaround in your SNP driver will put you
> in the risk that a normal application which follow UEFI spec may not work correctly
> with you, so I think you should only consider this method if you can confirm
> the TFTP client is the only application layer client you need to serve.
>
> Best Regards,
> Siyuan
>
>>>> In theory the busy poll moderator might cause an issue in a pathological
>> case,
>>>> if it stalls the machine for 100us for each packet - with 100 packets per
>>>> second this would eat only 10ms.
>>> Not only 10ms, it's at least 10ms, and in theory. If you have multiple NICs
>> in the
>>> system, this time will double. If someone tries to poll the NIC, the time
>> will increase
>>> significantly. It also depends on who you measure the 100us timeout, I did
>> see some
>>> platform which has a minimum timer interval limitation, that you thought you
>> are using
>>> a 100us timeout but it’s actually 1ms or even 10ms.
>>>
>>> So my suggestion is do NOT add any busy wait in SNP layer API, keep it as
>> unblocking,
>>> and the try to solve the application layer problem in application layer.
>> I'll try dropping the busy poll and see whether that helps
>>
>> Cheers,
>> Tom
>>> Best Regards,
>>> Siyuan
>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>> On 27/01/2019 14:28, Fu, Siyuan wrote:
>>>>> Hi, Tom
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
>>>> Tomas
>>>>>> Pilar (tpilar)
>>>>>> Sent: Friday, January 25, 2019 8:09 PM
>>>>>> To: Wu, Jiaxin <jiaxin.wu@intel.com>; Laszlo Ersek <lersek@redhat.com>;
>>>> edk2-
>>>>>> devel@lists.01.org
>>>>>> Cc: Ye, Ting <ting.ye@intel.com>
>>>>>> Subject: Re: [edk2] Network Stack Budgeting
>>>>>>
>>>>>> Yeah, that makes sense I suppose. The end result is however that the
>>>> network
>>>>>> device is 'opened' as soon as ConnectController() is called rather than
>>>> when
>>>>>> someone wants to do networking. This seems wrong and directly leads to a
>>>> DoS
>>>>>> of a platform in case of heavy network load (unless we implement
>> budgeting
>>>> in
>>>>>> the network driver, which is a really not what it should be doing).
>>>>> It's true that a configured MNP will start a background system poll to
>>>> receive
>>>>> network packets, but normally it won't lead to a DoS, because the receive
>>>> frequency
>>>>> is actually very low. The MNP driver only tries to receive 1 packet from
>> SNP
>>>> for every
>>>>> 10 milliseconds (see MNP_SYS_POLL_INTERVAL, MnpSystemPoll() and related
>>>> timer event).
>>>>> So no matter how heavy the network is, MNP driver (and upper layer stack)
>>>> will
>>>>> only process at most 100 packets per second, all other packets should be
>>>> dropped
>>>>> by UNDI driver directly.
>>>>>
>>>>> So if there is nobody doing busy network receiving, the network stack cost
>>>> will be
>>>>> at most 100 packets per NIC device and per second.
>>>>>
>>>>> In your first email you mentioned "some performance improvements to my
>>>> network driver",
>>>>> may I know what's the improvement method you are using?
>>>>>
>>>>> Best Regards,
>>>>> Siyuan
>>>>>
>>>>>> How do you suggest we solve this problem?
>>>>>>
>>>>>> Cheers,
>>>>>> Tom
>>>>>>
>>>>>> On 25/01/2019 08:44, Wu, Jiaxin wrote:
>>>>>>> Hi Tom,
>>>>>>>
>>>>>>> One thing I want to highlight is that our design of network stack is not
>>>>>> only for the PXE/HTTP boot trigged in BootManager, but also to make sure
>>>> it's
>>>>>> workable once there is any MNP instance configured by upper drivers
>>>>>> (ARP/IPv4/IPv6).
>>>>>>> Take ARP/IP as an example, once ARP/IP are started, we need a heartbeat
>> to
>>>>>> process any ARP requests, which is required by ARP functionality. In such
>> a
>>>>>> case, SNP must be started to make ARP/IP drivers works well.
>>>>>>> Thanks,
>>>>>>> Jiaxin
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
>>>>>>>> Tomas Pilar (tpilar)
>>>>>>>> Sent: Friday, January 25, 2019 1:43 AM
>>>>>>>> To: Laszlo Ersek <lersek@redhat.com>; edk2-devel@lists.01.org
>>>>>>>> Subject: Re: [edk2] Network Stack Budgeting
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 24/01/2019 16:49, Laszlo Ersek wrote:
>>>>>>>>> On 01/24/19 14:25, Tomas Pilar (tpilar) wrote:
>>>>>>>>>> Hmm,
>>>>>>>>>>
>>>>>>>>>> Mnp->Configure() will eventually call MnpStartSnp().
>>>>>>>>>>
>>>>>>>>>> A grep for Mnp->Configure shows that:
>>>>>>>>>> * ArpDxe performs this on DriverBinding.Start()
>>>>>>>>>> * Ip6Dxe performs this on DriverBinding.Start()
>>>>>>>>>>
>>>>>>>>>> Ipv4 and DnsDhcp do this as a part of their Configure() they expose
>> in
>>>>>> the
>>>>>>>> API.
>>>>>>>>> Yes, that makes sense. All of the above drivers are UEFI drivers that
>>>>>>>>> follow the UEFI driver model, AIUI. As long as the controller is not
>>>>>>>>> connected from BDS, no BindingStart() function should be called in
>> these.
>>>>>>>> Ah, but I would expect the BDS to call ConnectController() on the NIC,
>>>> but
>>>>>> I
>>>>>>>> would not expect the network stack to be started unless the device is
>>>>>>>> actually chosen for PXE booting. In other words, the above protocols
>>>> should
>>>>>>>> follow the example of EFI_DNS4_PROTOCOL that binds against the device
>>>>>>>> during BDS but .Configure() is not automatically called by
>>>>>>>> DriverBinding.Start().
>>>>>>>>
>>>>>>>> .Configure() should be called by the BootManager if networking is
>>>> actually
>>>>>> to
>>>>>>>> be done. That in turn will configure Mnp and start Snp.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> edk2-devel mailing list
>>>>>>>> edk2-devel@lists.01.org
>>>>>>>> https://lists.01.org/mailman/listinfo/edk2-devel
>>>>>> _______________________________________________
>>>>>> edk2-devel mailing list
>>>>>> edk2-devel@lists.01.org
>>>>>> https://lists.01.org/mailman/listinfo/edk2-devel