Re: [Patch 2/2] UefiCpuPkg/MpInitLib: Enhance waiting for AP initialization logic.

From: "Dong, Eric" <eric.dong@intel.com>
To: "Brian J. Johnson" <brian.johnson@hpe.com>,
	Laszlo Ersek <lersek@redhat.com>,
	"edk2-devel@lists.01.org" <edk2-devel@lists.01.org>
Cc: "Ni, Ruiyu" <ruiyu.ni@intel.com>, Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [Patch 2/2] UefiCpuPkg/MpInitLib: Enhance waiting for AP initialization logic.
Date: Fri, 27 Oct 2017 01:31:26 +0000	[thread overview]
Message-ID: <ED077930C258884BBCB450DB737E66224AA3EC30@shsmsx102.ccr.corp.intel.com> (raw)
In-Reply-To: <7c0023c7-01cb-e4a2-2c6a-c93372e651a4@hpe.com>

Brian,

> -----Original Message-----
> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of
> Brian J. Johnson
> Sent: Friday, October 27, 2017 4:48 AM
> To: Dong, Eric <eric.dong@intel.com>; Laszlo Ersek <lersek@redhat.com>;
> edk2-devel@lists.01.org
> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>; Paolo Bonzini <pbonzini@redhat.com>
> Subject: Re: [edk2] [Patch 2/2] UefiCpuPkg/MpInitLib: Enhance waiting for
> AP initialization logic.
> 
> On 10/25/2017 08:13 PM, Dong, Eric wrote:
> > Laszlo,
> >
> >
> >> -----Original Message-----
> >> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf
> >> Of Laszlo Ersek
> >> Sent: Wednesday, October 25, 2017 11:08 PM
> >> To: Dong, Eric <eric.dong@intel.com>; edk2-devel@lists.01.org
> >> Cc: Ni, Ruiyu <ruiyu.ni@intel.com>; Paolo Bonzini
> >> <pbonzini@redhat.com>
> >> Subject: Re: [edk2] [Patch 2/2] UefiCpuPkg/MpInitLib: Enhance waiting
> >> for AP initialization logic.
> >>
> >> Hi Eric,
> >>
> >> On 10/25/17 07:42, Dong, Eric wrote:
> >>> Hi Laszlo,
> >>>
> >>> I think I have clear about your raised issues for Ovmf platform. In
> >>> this case, I
> >> think your platform not suit for this code change.  How about I do
> >> below change based on the new code:
> >>>
> >>> -      if (CpuMpData->CpuCount == 0) {
> >>>          TimedWaitForApFinish (
> >>>            CpuMpData,
> >>>            PcdGet32 (PcdCpuMaxLogicalProcessorNumber) - 1,
> >>>            PcdGet32 (PcdCpuApInitTimeOutInMicroSeconds)
> >>>            );
> >>> -      }
> >>>
> >>> After this change, for Ovmf can still set
> >>> PcdCpuApInitTimeOutInMicroSeconds to MAX_UINT32 to keep old
> >> solution.
> >>> For the real platform, it can set a small value to follow the new
> >>> solution. For this new change, it just wait one more
> >>> PcdCpuApInitTimeOutInMicroSeconds timeout, should not have
> >> performance
> >>> impact (This time may also been cost in later while
> >>> (CpuMpData->MpCpuExchangeInfo->NumApsExecuting != 0) loop) or
> have
> >>> litter performance impact (not cover by the later loop).
> >> The suggested change will work for OVMF, thanks!
> >>
> >> (Just please don't forget to un-indent the TimedWaitForApFinish()
> >> call that will become unconditional again.)
> >>
> >> --o--
> >>
> >> Do you think Brian's concern should be investigated as well (perhaps
> >> in a separate Bugzilla entry)?
> >
> > As Jeff has mentioned, only Ovmf can know the exist Ap numbers before
> > send the Init-sipi-sipi.  So we don't know how many bits needed for this
> bitmap.
> > In case we can create the bitmap, we still don't know when all Aps have
> been
> >   found(If the begin map bit value same as finish map bit value, we
> > still can't get the conclusion that all Aps have been found, because
> > maybe other Aps not started yet).
> >
> 
> Right, physical platforms don't usually know the number of CPUs up front, so
> they still need a timeout.  That's unavoidable.  But we've seen cases where
> interrupts aren't getting delivered reliably, so not all the expected CPUs start
> up (based on the core counts in the physical sockets, as known by the
> developing engineer, not the firmware.)  To debug this failure, it's very
> useful to have a list of exactly which CPUs did start.
> 
> Similarly, we've seen cases where a CPU starts, but crashes due to h/w or
> s/w bugs before signaling the BSP that it has finished.  With only a counter to
> indicate how many CPUs are still running, it's impossible to tell which CPUs
> failed.  That makes debugging much more difficult.
> 
> The bitmaps would need to be sized according to the maximum number of
> CPUs supported by the platform (or the maximum APIC ID, depending on
> how they are indexed), like other data structures in the MpCpu code.
> 
> Bitmaps aren't the only possible implementation of course....  My point is
> that it's useful to have a list of which CPUs started executing, and which
> finished.
> 

Seems this is not a must have item for this patch related issue. I think you can submit
A bugz for this debug feature.

Thanks,
Eric

> > Thanks,
> > Eric
> >>
> >> Because, I believe, the scheduling pattern that I described earlier
> >> could be possible on physical platforms as well, at least in theory:
> >>
> >>>> (2) After at least one AP has started running, the logic expects
> >>>> "NumApsExecuting" to monotonically grow for a while, and then to
> >>>> monotonically decrease, back to zero. For example, if we have 1 BSP
> >>>> and
> >>>> 7 APs, the BSP logic more or less expects the following values in
> >>>> "NumApsExecuting":
> >>>>
> >>>> 1; 2; 3; 4; 5; 6; 7;
> >>>> 6; 5; 4; 3; 2; 1; 0
> >>>>
> >>>>
> >>>> While this may be a valid expectation for physical processors
> >>>> (which actually run in parallel, in the physical world), in a
> >>>> virtual machine, it is not
> >> guaranteed.
> >>>> Dependent on hypervisor scheduling artifacts, it is possible that,
> >>>> say, three APs start up *and finish* before the remaining four APs
> >>>> start up *at all*. The INIT-SIPI-SIPI marks all APs for execution /
> >>>> scheduling in the hypervisor, yes, but the actual code execution
> >>>> may commence a lot later. For example, the BSP may witness the
> >>>> following
> >> series of values in "NumApsExecuting":
> >>>>
> >>>> 1; 2; 3;
> >>>> 2; 1; 0;
> >>>> 1; 2; 3; 4;
> >>>> 3; 2; 1; 0
> >>>>
> >>>> and the BSP could think that there are 3 APs only, when it sees the
> >>>> first 0 value.
> >>
> >> I don't know if such a pattern is "likely", "unlikely", or "extremely
> >> unlikely" on physical platforms. I think the pattern is "theoretically
> possible" at least.
> >>
> >> I suggest that we go ahead with the small patch for the OVMF use case
> >> first, and then open a new BZ if Brian thinks there's a real safety
> >> problem with the code.
> >>
> >
> > We also notice this theoretical problem when we implement this change,
> > but We think this is rarely to happen on physical platforms. We think
> > the value for the change is much bigger than not do this change
> > because of this theoretical problem.
> 
> I strongly disagree.  Any chance for it to happen on physical platforms is too
> large of a chance.  There's simply too much variance between physical
> platforms to make assumptions about interrupt delivery and the interleaving
> of atomic operations among dozens (or thousands!) of CPUs.
> 
> With the latest change from Eric above, we can restore the old behavior by
> setting PcdCpuApInitTimeOutInMicroSeconds large enough to cover all APs.
> So the new behavior is just an optimization which a platform can enable if its
> timing characteristics are suitable for it.  That should work.
> 
> Thanks,
> Brian
> 
> >
> > Thanks,
> > Eric
> >> ... I should note that, with the small patch that's going to fix the
> >> OVMF use case, the previous behavior will be restored for *all*
> >> platforms that continue using their previous PCD values.
> >>
> >> The cumulative effect of commit 0594ec417c89 and the upcoming patch
> >> will be that a platform may significantly decrease
> >> "PcdCpuApInitTimeOutInMicroSeconds", if it chooses to, and then the
> >> "NumApsExecuting" loop will take the role of the preexisting
> >> TimedWaitForApFinish() call. If a platform doesn't want that, then it
> >> should keep its "PcdCpuApInitTimeOutInMicroSeconds" as high as
> >> before, and then any implementation details in the "NumApsExecuting"
> loop will be irrelevant.
> >>
> >> Thanks!
> >> Laszlo
> >> _______________________________________________
> >> edk2-devel mailing list
> >> edk2-devel@lists.01.org
> >> https://lists.01.org/mailman/listinfo/edk2-devel
> > _______________________________________________
> > edk2-devel mailing list
> > edk2-devel@lists.01.org
> > https://lists.01.org/mailman/listinfo/edk2-devel
> >
> 
> 
> --
> 
>                                                  Brian
> 
> --------------------------------------------------------------------
> 
>    "This isn't a design, it's a hack."
>                                             -- Stephen Gunn
> _______________________________________________
> edk2-devel mailing list
> edk2-devel@lists.01.org
> https://lists.01.org/mailman/listinfo/edk2-devel