From mboxrd@z Thu Jan 1 00:00:00 1970 Subject: Re: [edk2-devel] [PATCH 1/1] MdeModulePkg/Ata: Fix command status reporting To: Albecki, Mateusz ,devel@edk2.groups.io From: "Albecki, Mateusz" X-Originating-Location: Olsztyn, Warmia-Masuria, PL (134.191.221.84) X-Originating-Platform: Windows Chrome 108 User-Agent: GROUPS.IO Web Poster MIME-Version: 1.0 Date: Mon, 02 Jan 2023 08:30:46 -0800 References: <20221018155419.2808-2-mateusz.albecki@intel.com> In-Reply-To: <20221018155419.2808-2-mateusz.albecki@intel.com> Message-ID: <17259.1672677046390223455@groups.io> Content-Type: multipart/alternative; boundary="6MCAFPVii9CkvkMHkMvw" --6MCAFPVii9CkvkMHkMvw Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Hello, I've done some investigation into the password issue and I think I have the= proposition on how to solve it. The issue arises due to the driver checkin= g the PxIS.TFES(task file error status) and if TFES is set it performs reco= very steps(this is correct according to AHCI spec 6.2.2) and restarts the c= ommand(this is left up to the SW according to AHCI spec). In the case of SE= CURITY UNLOCK command if the password is incorrect device will set the abor= t bit in ERROR register and error bit in status register which will cause t= he controller to set the PxIS.TFES bit which will=C2=A0 trigger command ret= ires. In case of security commands retries are particularly bad since every= incorrect password unlock attempt is decreasing the retry counter on the d= evice side which eventually leads to device locking itself and aborting fur= ther unlock commands(power cycle is required to recover). My proposition for a fix is as follows: 1. The logic in AhciRecoverPortError seems to be largely correct. Even if t= he TFES error bit shouldn't necessarily mean that the command should be ret= ried, according to AHCI spec when TFES is set controller will enter the ERR= :Fatal state and PxCMD.CR needs to be restarted. The only thing that we nee= d to change in this function is to return EFI_DEVICE_ERROR when the port re= start failed(as discussed below). 2. We need to add a logic which will decide if the cmd should be retried. T= his logic should check following conditions: a) If HBDS is set command should be attempted again. HBDS indicates that th= ere was a CRC error during accessing the system memory during DMA transfer(= AHCI spec 6.1.1). b) If HBFS is set do not attempt to send the command again. HBFS indicates = that the pointer given to the AHCI controller is incorrect. Retries won't s= olve that(AHCI spec 6.1.1) c) If IFS or INFS is set further check PxSERR.ERR and PxSERR.DIAG(based on = AHCI spec 6.1.2). i) for PxSERR.DIAG.C(CRC error) -> restart command ii) for PxSERR.DIAG.B(disparity error) -> CRC error should also be reported= so it is covered by above iii) for PxSERR.ERR.P(protocol error) -> do not restart. Protocol errors ar= e unlikely to be transient although I am not sure if Internal buffer overfl= ows can't be caused by transient conditions on the devices side. iv) for PxSERR.DIAG.H(handshake errors) -> do not restart. Acording to spec= handshake errors might be caused by transient conditions such as CRC error= s but I hope in such case HBA will also signal CRC in PxSERR.DIAG.C(spec is= not super clear on that point though) d) If TFES is set further check PxTFD.ERR if PxTFD.ERR bit 7 is set(INTERFA= CE CRC according to ACS-3 spec). For others do not retry. Hopefully this will catch majority of errors caused by transient conditions= while also avoiding restarts on commands that are simply malformed. I will= start working on a patch and if anyone has any suggestions/objections plea= se respond. I will also update bz with this data: https://bugzilla.tianocore.org/show_b= ug.cgi?id=3D4011 ( https://bugzilla.tianocore.org/show_bug.cgi?id=3D4011 ) Thanks, Mateusz --6MCAFPVii9CkvkMHkMvw Content-Type: text/html; charset="utf-8" Content-Transfer-Encoding: quoted-printable

Hello,

I've done some investigation into the password issue a= nd I think I have the proposition on how to solve it. The issue arises due = to the driver checking the PxIS.TFES(task file error status) and if TFES is= set it performs recovery steps(this is correct according to AHCI spec 6.2.= 2) and restarts the command(this is left up to the SW according to AHCI spe= c). In the case of SECURITY UNLOCK command if the password is incorrect dev= ice will set the abort bit in ERROR register and error bit in status regist= er which will cause the controller to set the PxIS.TFES bit which will = ; trigger command retires. In case of security commands retries are particu= larly bad since every incorrect password unlock attempt is decreasing the r= etry counter on the device side which eventually leads to device locking it= self and aborting further unlock commands(power cycle is required to recove= r).

My proposition for a fix is as follows:

1. The lo= gic in AhciRecoverPortError seems to be largely correct. Even if the TFES e= rror bit shouldn't necessarily mean that the command should be retried, acc= ording to AHCI spec when TFES is set controller will enter the ERR:Fatal st= ate and PxCMD.CR needs to be restarted. The only thing that we need to chan= ge in this function is to return EFI_DEVICE_ERROR when the port restart fai= led(as discussed below).
2. We need to add a logic which will decide i= f the cmd should be retried. This logic should check following conditions:<= br />
    a) If HBDS is set command should be attempted agai= n. HBDS indicates that there was a CRC error during accessing the system me= mory during DMA transfer(AHCI spec 6.1.1).
    b) If HBFS is= set do not attempt to send the command again. HBFS indicates that the poin= ter given to the AHCI controller is incorrect. Retries won't solve that(AHC= I spec 6.1.1)
    c) If IFS or INFS is set further check PxS= ERR.ERR and PxSERR.DIAG(based on AHCI spec 6.1.2).
     = ;      i) for PxSERR.DIAG.C(CRC error) -> restart command=
           ii) for PxSERR.DIAG.B(dispar= ity error) -> CRC error should also be reported so it is covered by abov= e
           iii) for PxSERR.ERR.P(proto= col error) -> do not restart. Protocol errors are unlikely to be transie= nt although I am not sure if Internal buffer overflows can't be caused by t= ransient conditions on the devices side.
        &= nbsp;  iv) for PxSERR.DIAG.H(handshake errors) -> do not restart. A= cording to spec handshake errors might be caused by transient conditions su= ch as CRC errors but I hope in such case HBA will also signal CRC in PxSERR= .DIAG.C(spec is not super clear on that point though)
    d)= If TFES is set further check PxTFD.ERR if PxTFD.ERR bit 7 is set(INTERFACE= CRC according to ACS-3 spec). For others do not retry.

Hopefull= y this will catch majority of errors caused by transient conditions while a= lso avoiding restarts on commands that are simply malformed. I will start w= orking on a patch and if anyone has any suggestions/objections please respo= nd.

I will also update bz with this data:  https://bugzilla.tianocore= .org/show_bug.cgi?id=3D4011

Thanks,
Mateusz

--6MCAFPVii9CkvkMHkMvw--