From mboxrd@z Thu Jan 1 00:00:00 1970
To: Ard Biesheuvel
Cc: edk2-devel-01, star.zeng@intel.com
References: <1540561286-112684-1-git-send-email-star.zeng@intel.com>
 <1540561286-112684-5-git-send-email-star.zeng@intel.com>
 <20181030125006.4deveknlhrwehllb@bivouac.eciton.net>
 <962a2a90-2783-5fd1-25d2-6a834daa3f26@intel.com>
 <42216d80-d5c0-d071-aa54-932138a05078@intel.com>
From: "Zeng, Star"
Message-ID: <57b70aa4-f2c7-02e4-4eb5-43b0a65ba24c@intel.com>
Date: Wed, 7 Nov 2018 23:00:53 +0800
In-Reply-To: <42216d80-d5c0-d071-aa54-932138a05078@intel.com>
Subject: Re: [PATCH V3 4/4] MdeModulePkg EhciDxe: Use common buffer for AsyncInterruptTransfer
List-Id: EDK II Development

On 2018/11/6 22:37, Zeng, Star wrote:
> On 2018/11/6 17:49, Ard Biesheuvel wrote:
>> On 31 October 2018 at 05:38, Zeng, Star wrote:
>>> Good feedback.
>>>
>>> On 2018/10/30 20:50, Leif Lindholm wrote:
>>>>
>>>> On Tue, Oct 30, 2018 at 09:39:24AM -0300, Ard Biesheuvel wrote:
>>>>>
>>>>> (add back the list)
>>>>
>>>> Oi! Go back on holiday!
>>>>
>>>>> On 30 October 2018 at 09:07, Cohen, Eugene wrote:
>>>>>>
>>>>>> Has this patch been tested on a system that does not have coherent
>>>>>> DMA?
>>>>>>
>>>>>> It's not clear that this change would actually be faster on a system
>>>>>> of that type, since using common buffers implies access to uncached
>>>>>> memory. Depending on the access patterns, the uncached memory access
>>>>>> could be more time consuming than cache maintenance operations.
>>>
>>> The change/idea was based on the statement below.
>>>   ///
>>>   /// Provides both read and write access to system memory by both the
>>>   /// processor and a bus master. The buffer is coherent from both the
>>>   /// processor's and the bus master's point of view.
>>>   ///
>>>   EfiPciIoOperationBusMasterCommonBuffer,
>>>
>>> Thanks for raising the case about uncached memory access.
>>> But after checking the code, for the Intel VTd case
>>> https://github.com/tianocore/edk2/blob/master/IntelSiliconPkg/Feature/VTd/IntelVTdDxe/BmDma.c#L460
>>> (or the no-IOMMU case
>>> https://github.com/tianocore/edk2/blob/master/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciRootBridgeIo.c#L1567),
>>> the common buffer is just a normal memory buffer.
>>> If someone can help run some tests and collect some data on a system
>>> where using common buffers implies access to uncached memory, that
>>> would be great.
>>>
>>
>> OK, so first of all, can anyone explain to me under which
>> circumstances interrupt transfers are a bottleneck? I'd assume that
>> anything throughput bound would use bulk endpoints.
>>
>> Also, since the Map/Unmap calls are only costly when using an IOMMU,
>> could we simply revert to the old behavior if mIoMmu == NULL?
>>
>>>>>
>>>>> I haven't had time to look at these patches yet.
>>>>>
>>>>> I agree with Eugene's concern: the directional DMA routines are much
>>>>> more performant on implementations with non-coherent DMA, and so
>>>>> common buffers should be avoided unless we are dealing with data
>>>>> structures that are truly shared between the CPU and the device.
>>>>>
>>>>> Since this is obviously not the case here, could we please have some
>>>>> numbers about the performance improvement we are talking about here?
>>>>> Would it be possible to improve the IOMMU handling code instead?
>>>
>>> We collected the data below on a platform with a release image and
>>> Intel VTd enabled.
>>>
>>> The image size of EhciDxe or XhciDxe is reduced by about 120+ bytes.
>>>
>>> EHCI without the patch:
>>> ==[ Cumulative ]========
>>> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>>>    Name         Count     Duration    Duration    Duration    Duration
>>> -------------------------------------------------------------------------------
>>> S0000B00D1DF0        446        2150           4           2         963
>>>
>>> EHCI with the patch:
>>> ==[ Cumulative ]========
>>> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>>>    Name         Count     Duration    Duration    Duration    Duration
>>> -------------------------------------------------------------------------------
>>> S0000B00D1DF0        270         742           2           2          41
>>>
>>> XHCI without the patch:
>>> ==[ Cumulative ]========
>>> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>>>    Name         Count     Duration    Duration    Duration    Duration
>>> -------------------------------------------------------------------------------
>>> S0000B00D14F0        215         603           2           2          52
>>>
>>> XHCI with the patch:
>>> ==[ Cumulative ]========
>>> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>>>    Name         Count     Duration    Duration    Duration    Duration
>>> -------------------------------------------------------------------------------
>>> S0000B00D14F0         95         294           3           2          52
>>>
>>> I believe the performance data really depends on
>>> 1. How many AsyncInterruptTransfer handlers there are (the number of
>>>    USB keyboards and/or USB Bluetooth keyboards?)
>>> 2. Data size (for flushing data from the PCI controller specific
>>>    address to the mapped system memory address *in the original code*)
>>> 3. The performance of IoMmu->SetAttribute (for example, the
>>>    SetAttribute operation on the Intel VTd engine caused by the unmap
>>>    and map for flushing data *in the original code*; the SetAttribute
>>>    operation on the Intel VTd engine will involve FlushPageTableMemory,
>>>    InvalidatePageEntry, etc.)
>>>
>>
>> OK, so there is room for improvement here: there is no reason the
>> IOMMU driver couldn't cache mappings, or do some other optimizations
>> that would make mapping the same memory repeatedly less costly.
>
> The unmap/map with an IOMMU goes through SetAttribute, which
> disallows/allows DMA memory access. It is hard for the IOMMU driver to
> predict the sequence of unmap/map operations. Do you have more detail
> about the optimizations?
>
> Could you give the patch a try on the platform for the case you and
> Eugene mentioned?
>
> Anyway, I am going to revert the patches (3/4 and 4/4, since 1/4 and 2/4
> have no functionality impact) since the timing is a little sensitive
> as it is near edk2-stable201811.

I have reverted patches 3/4 and 4/4 at
https://github.com/tianocore/edk2/compare/1ed6498...d98fc9a, and we can
continue the discussion.

>
> Thanks,
> Star
>
>>
>>>>
>>>> On an unrelated note to the concerns above:
>>>> Why has a fundamental change to the behaviour of one of the industry
>>>> standard drivers been pushed at the very end of the stable cycle?
>>>
>>> We thought it was a simple improvement rather than a fundamental change
>>> before Eugene and Ard raised the concern.
>>>
>>> Thanks,
>>> Star
>>>
>>>> Regards,
>>>>
>>>> Leif
>>>>
>>>
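For readers following the thread, the two DMA strategies being compared map
onto the standard EFI_PCI_IO_PROTOCOL calls roughly as sketched below. This
is a minimal illustration only, not the actual EhciDxe code; the function and
variable names (PollWithMapUnmap, AllocateCommonBuffer, DataBuffer,
DataLength) are hypothetical, and error handling is reduced to early returns.

  #include <Uefi.h>
  #include <Protocol/PciIo.h>

  //
  // Original style: the transfer buffer is mapped for BusMasterWrite, and on
  // every poll of the async interrupt transfer the driver does Unmap + Map
  // again so the CPU sees what the controller wrote.  With an IOMMU, each
  // Unmap/Map pair turns into SetAttribute work (page-table updates and
  // invalidations), which is the cost discussed above.
  //
  EFI_STATUS
  PollWithMapUnmap (
    IN     EFI_PCI_IO_PROTOCOL   *PciIo,
    IN     VOID                  *DataBuffer,
    IN     UINTN                 DataLength,
    IN OUT VOID                  **Mapping,
    OUT    EFI_PHYSICAL_ADDRESS  *DeviceAddress
    )
  {
    EFI_STATUS  Status;
    UINTN       Bytes;

    Status = PciIo->Unmap (PciIo, *Mapping);
    if (EFI_ERROR (Status)) {
      return Status;
    }

    Bytes  = DataLength;
    Status = PciIo->Map (
                      PciIo,
                      EfiPciIoOperationBusMasterWrite,
                      DataBuffer,
                      &Bytes,
                      DeviceAddress,
                      Mapping
                      );
    return Status;
  }

  //
  // Patched style: allocate a BusMasterCommonBuffer once.  Both the CPU and
  // the controller access it coherently, so polling needs no further
  // Map/Unmap.  On non-coherent platforms this buffer may be uncached, which
  // is the concern Eugene and Ard raise.
  //
  EFI_STATUS
  AllocateCommonBuffer (
    IN  EFI_PCI_IO_PROTOCOL   *PciIo,
    IN  UINTN                 DataLength,
    OUT VOID                  **HostAddress,
    OUT EFI_PHYSICAL_ADDRESS  *DeviceAddress,
    OUT VOID                  **Mapping
    )
  {
    EFI_STATUS  Status;
    UINTN       Bytes;

    Status = PciIo->AllocateBuffer (
                      PciIo,
                      AllocateAnyPages,
                      EfiBootServicesData,
                      EFI_SIZE_TO_PAGES (DataLength),
                      HostAddress,
                      0
                      );
    if (EFI_ERROR (Status)) {
      return Status;
    }

    Bytes  = DataLength;
    Status = PciIo->Map (
                      PciIo,
                      EfiPciIoOperationBusMasterCommonBuffer,
                      *HostAddress,
                      &Bytes,
                      DeviceAddress,
                      Mapping
                      );
    return Status;
  }

The first pattern pays a per-poll mapping cost whenever an IOMMU is active;
the second pays nothing per poll but may hand back uncached memory on
non-coherent platforms, which is exactly the trade-off debated in this
thread.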