From mboxrd@z Thu Jan 1 00:00:00 1970 Authentication-Results: mx.groups.io; dkim=missing; spf=pass (domain: hpe.com, ip: 148.163.143.35, mailfrom: prvs=01164d2842=brian.johnson@hpe.com) Received: from mx0b-002e3701.pphosted.com (mx0b-002e3701.pphosted.com [148.163.143.35]) by groups.io with SMTP; Thu, 01 Aug 2019 13:20:28 -0700 Received: from pps.filterd (m0134423.ppops.net [127.0.0.1]) by mx0b-002e3701.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x71KFcUN021149; Thu, 1 Aug 2019 20:20:27 GMT Received: from g9t5009.houston.hpe.com (g9t5009.houston.hpe.com [15.241.48.73]) by mx0b-002e3701.pphosted.com with ESMTP id 2u43dy1mu9-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 01 Aug 2019 20:20:27 +0000 Received: from g9t2301.houston.hpecorp.net (g9t2301.houston.hpecorp.net [16.220.97.129]) by g9t5009.houston.hpe.com (Postfix) with ESMTP id 7350358; Thu, 1 Aug 2019 20:20:26 +0000 (UTC) Received: from [10.33.152.19] (bjj-laptop2.americas.hpqcorp.net [10.33.152.19]) by g9t2301.houston.hpecorp.net (Postfix) with ESMTP id 0904F48; Thu, 1 Aug 2019 20:20:25 +0000 (UTC) Subject: Re: [edk2-devel] [Patch 0/2] UefiCpuPkg: Default avoid print. To: Laszlo Ersek , devel@edk2.groups.io, Eric Dong Cc: Ray Ni , Michael Kinney References: <20190731073502.24640-1-eric.dong@intel.com> <3a28f2c6-6ef1-c830-b3c6-3cf69c5ca60f@hpe.com> <93d4a7e7-9b86-b3c9-a476-ac1a40dc4723@redhat.com> From: "Brian J. Johnson" Message-ID: Date: Thu, 1 Aug 2019 15:20:25 -0500 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.8.0 MIME-Version: 1.0 In-Reply-To: <93d4a7e7-9b86-b3c9-a476-ac1a40dc4723@redhat.com> X-HPE-SCL: -1 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-08-01_08:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1906280000 definitions=main-1908010213 Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit On 7/31/19 1:58 PM, Laszlo Ersek wrote: > On 07/31/19 18:34, Brian J. Johnson wrote: > >> I do wonder if there would be a clean way to let a DebugLib instance >> itself declare that AP_DEBUG() is safe. That way a platform would >> only need to override the DebugLib instance in the DSC file, rather >> than both the instance and the PCD. (I know, I'm nitpicking.) A >> library can't override PCDs in its calling modules, of course. I >> suppose the AP_DEBUG() macro could call a new DebugLib entry point to >> test for AP safety before doing anything else, say >> DebugPrintOnApIsSafe(). Or it could even be a global CONST BOOLEAN >> defined by the library. But that would require all DebugLib instances >> to change, which is something you were trying to avoid. > > That's right -- I tried to imagine some other approaches, but they'd > need new DebugLib functions, and likely force platforms to write new > code. > > >> However, it's not always practical to track down all uses of DEBUG(). >> An AP can easily call a library routine which uses DEBUG() rather than >> AP_DEBUG(), buried under several layers of transitive library >> dependencies. In other words, it's not always practical to determine >> ahead of time if a given DEBUG() call may be done on an AP. > > This problem is valid IMO, but I think its scope is a lot wider than > just DebugLib. Assume the programmer is looking at a function that may > be invoked on an AP, and they are about to call a function for taking > care of a specific sub-task. If the programmer cannot prove the > thread-safety of the *entire* call tree underneath the function call > they are about to add, they simply must not add the call. The > thread-safety of the DebugLib instance in use is just a part of the > thread-safety of said call tree. > > Put differently, code that runs APs must be extremely self-contained; > I'd rule out any and all lib classes from direct use unless a specific > library instance, advertizing thread safety, would be chosen in the > platform DSC file. But, if we adopted this approach, we could even > introduce a new AP-oriented library *class* for debug messages, offer a > Null implementation in edk2, and ask platforms to bring their own. > > >> I know that AP code runs in a very restricted environment and that >> people who use MpServices are supposed to understand the >> repercussions, but it gets very difficult when libraries are >> involved. :( > > Exactly -- the first restriction people should understand is, "stay away > from libraries as much as you can". > > >> So would a better solution be to modify the common unsafe DebugLib >> instances to have DebugPrintEnabled() return FALSE on APs? That would >> probably require a new BaseLib interface to determine if the caller is >> running on the BSP or an AP. > > I agree that "AmIAnAP()" would be a pre-requisite. > > >> (For IA32/X64 this isn't too hard -- it just needs to check a bit in >> the local APIC. > > Still not trivial, as some DebugLib instances might want to target > runtime drivers (or even SMM drivers). For runtime drivers the > complication is that a runtime (virtual address) mapping for the LAPIC > MMIO range would be needed (if I understand correctly anyway). And for > both runtime and SMM drivers, it could be a problem that on physical > hardware, the MMIO range of the LAPIC can be moved (reprogrammed) to a > different base address, possibly by the OS too. > > I could be quite confused about this, of course; I don't eat LAPICs for > breakfast :) I just recall an SMM firmware vulnerability that was in > part based on moving the LAPIC base address elsewhere. Hm... googling > suggests the attack was called "The Memory Sinkhole". > I finally looked at the documentation (imagine!), and the BSP flag is bit 8 in the IA32_APIC_BASE (0x1B) MSR. So no local APIC access is necessary, only reading an MSR. This MSR has been present since the Pentium Pro (family/model 06_01H), and is considered "architectural." > >> I have no idea about other architectures.) That wouldn't solve the >> problem everywhere -- anyone using a custom DebugLib would have to >> update it themselves. But it would solve it solidly in the majority >> of cases. >> >> Thoughts? > > My fear might not be reasonable, but I feel quite uncomfortable about > LAPIC accesses in DebugLib APIs. The information ("BSP or AP") is safer > to determine at the call site, I think, even if it takes more human > work. > I understand the concern. DebugLib is exposed in a lot of awkward end cases and odd environments. Caution is warranted. But checking a bit in a MSR is about as simple a test as you can do. It sounds reasonable to me. Actually reading the local APIC would, as you said, be a lot more problematic. The observation remains that the problem is a lot bigger than DebugLib. We can't add AmIAnAP() calls on every single library interface. Are there any other spots people know of which tend to cause a lot of grief for APs? Maybe the memory allocation boot services? At least if there's a simple test for AP-ness which doesn't have additional dependencies, it's easy to add checks as needed. I don't know if this discussion will go anywhere... but it has been an interesting one to have. > I could very well be biased. In OvmfPkg we have a minuscule amount of > code that runs on APs, and even that code is written with total > minimalism in mind. > > Leaping to a different topic... Years ago I was tracking down an MTRR > setup bug in the Xen hypervisor (as it was shipped as a part of RHEL5). > It is necessary to setup MTRRs identically on all CPUs, plus it has to > be done while all CPUs are in a "pen" doing nothing but setting up > MTRRs. The bug was exactly in that part of the code (running > simultaneously on more than a hundred CPUs). It was impossible to print > anything to the serial console -- first because it would be unreadable > for humans, and second because the delays would perturb the buggy > behavior. In the end I had to introduce an array of per-CPU debug > structures where each CPU would record its own view (snapshot) of a > shared resource -- a shared resource that should have been protected by > mutual exclusion between the CPUs. After the CPUs left the "pen" (with > the invalid MTRR configuration established), I'd use the BSP to dump the > array. That showed me that some CPUs had overlapping / inconsistent > views of the shared resource between each other. This proved that the > mutual exclusion primitive didn't work as expected -- it turns out that > the semaphore (or spinlock, not sure) in question used an INT8 counter, > which overflowed when more than 127 CPUs contended for the resource. > I too have done battle with MTRR initialization, with over a thousand processors. I feel your pain. > I'm sure I'm misremembering parts of this story (from several years > distance), the moral is that debugging in a multiprocessing environment > may easily require its own dedicated infrastructure. In edk2, we don't > have anything like that, I think. Could we build it, sufficiently > generally? Like, prepare a log buffer for each CPU before calling > StartupAllAps(), log only to that buffer during the concurrent > execution, and finally dump the buffers? I guess if we don't *reach* the > "finally" part, we could still dump the RAM and investigate the log > buffers that way... Dumping the RAM is certainly an option for virtual > machines, but it might be viable for physical setups too (JTAG, debug > agent... dunno). > > Sorry about the wild speculation :) > > Thanks > Laszlo > -- Brian -------------------------------------------------------------------- "Are we going to push the edge of the envelope, Brain?" "No, Pinky, but we may get to the sticky part." -- Quoted on the Net