diff --git a/blog/content/post/2023-07-28-optimizing-spi-on-stm32.md b/blog/content/post/2023-07-28-optimizing-spi-on-stm32.md
new file mode 100644
index 00000000..18a5ac26
--- /dev/null
+++ b/blog/content/post/2023-07-28-optimizing-spi-on-stm32.md
@@ -0,0 +1,914 @@
+---
+title: "Optimizing SPI communication on STM32 MCUs: a comprehensive guide to high-frequency communication"
+abstract: 'In this blog post, we delve into the fascinating world of Serial
+  Peripheral Interface (SPI) on STM32 microcontrollers (MCUs). Specifically, we
+  explore how SPI is utilized in TPM chips for communication with PC
+  motherboards. We encounter the challenges of achieving high-frequency SPI
+  communication, especially when operating as a slave, and the specific
+  limitations of STM32L476 MCUs. The article further uncovers the limitations of
+  existing approaches in platforms like Zephyr and STM32 HAL. We then embark on a
+  journey to fix SPI issues, improve DMA performance, and optimize the firmware
+  for high-speed SPI communication. Throughout the blog post, we provide in-depth
+  technical insights and share valuable test results. Join us as we push the
+  boundaries of SPI communication and unlock new possibilities in the world of
+  microcontrollers.'
+cover: /img/nucleo-l476rg.jpg
+author: artur.kowalski
+layout: post
+published: true
+date: 2023-07-28
+archives: "2023"
+
+tags:
+  - embedded
+  - spi
+  - Zephyr
+  - STM32
+  - STMicroelectronics
+  - TPM
+  - TwPM
+categories:
+  - Firmware
+  - Miscellaneous
+  - Security
+
+---
+
+STM32 MCUs come with various peripherals, one of them is SPI (Serial Peripheral
+Interface) which is a simple serial bus interface commonly used for
+short-distance communication between various devices. SPI is one of the
+interfaces used by TPM chips for communication with PC motherboard. SPI uses 4
+lines for communication: MOSI, MISO, SCK, SS, which are described down below.
+
+The device must implement the TPM protocol to work as a TPM module. TPM protocol
+works by transmitting variable-length frames over SPI. The 4-byte TPM header
+contains fields describing the length of the data payload, the address of the
+target TPM register, and the transfer direction (read or write). TPM  protocol
+has its own means of handling flow control (as there isn't a standard flow
+control mechanism on SPI) and for doing bus aborts.
+
+TPM must be able to operate at frequencies from 10 MHz to 24 MHz to comply with
+the [TCG PTP specification](https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/)
+(see section 7.4.1 and [TwPM Documentation](https://twpm.dasharo.com/explanation/hardware-requirements/#spi-interface)).
+Getting SPI right on such high frequencies is a significant challenge,
+especially when operating as a slave.TPM-specific features complicate things
+further. Some platforms require TPMs to support higher frequencies. PTP
+specification encourages support for the 33-66 MHz range in addition to the
+required range of 10-24 MHz, and future versions of the specification may
+mandate higher frequencies, so the platform should be capable of handling them.
+
+## Limitations of STM32L476
+
+STM32L476 has SPI capable of frequencies up to a (theoretical) limit of 40 MHz,
+which is half of the maximum clock that can be provided to Cortex-M and AHB/APB
+buses. SPI capabilities are described in the STM32 Programming Manual
+([RM0351](https://www.st.com/resource/en/reference_manual/rm0351-stm32l47xxx-stm32l48xxx-stm32l49xxx-and-stm32l4axxx-advanced-armbased-32bit-mcus-stmicroelectronics.pdf))
+section 42.2.
+
+Another limiting factor is maximum GPIO speed, which depends on operating
+conditions such as the voltage provided to the MCU, ambient temperature, and
+parameters of cables used to connect the master and slave. GPIO limitations are
+described in the [STM32L476RG](https://www.st.com/resource/en/datasheet/stm32l476rg.pdf)
+datasheet in section 6.4. Table 72 describes the maximum frequency of GPIO
+outputs.
+
+More problematic may be DMA limitations. The hard limit of DMA transfer speed
+would be 80 Mbits/s as 80 MHz is the maximum frequency that can be provided to
+the AHB bus and the MCU. The actual transfer speed may be lower due to AHB and
+APB protocol overhead, bus contention, etc. Unfortunately, the datasheet does
+not provide information about DMA transfer limitations.
+
+Last but not least, the performance of the firmware itself is significant.
+Achieving target frequency may require extensive optimizations.
+
+## Creating SPI slave on Zephyr
+
+Zephyr is our platform of choice primarily due to its portability (we will
+target non-STM32 platforms too). I will briefly describe the outcome of my early
+tests done on Zephyr and why it was terrible.
+
+The application just transmits a static sequence of bytes:
+
+```c
+const struct device *spi_dev = NULL;
+struct spi_config spi_cfg = {
+	.frequency = 25000000,
+	.operation = SPI_WORD_SET(8) | SPI_TRANSFER_MSB | SPI_OP_MODE_SLAVE,
+};
+
+const uint8_t testdata[32] = {
+	0x3d, 0x7f, 0x85, 0xc3, 0x86, 0x2f, 0x14, 0x2a, 0xa2, 0x67, 0x1d, 0xd7, 0xfa, 0xa8, 0x3a, 0x42,
+	0xf8, 0x12, 0xd2, 0xa1, 0x04, 0xcc, 0xe2, 0xc6, 0x78, 0x73, 0x09, 0xe6, 0xd8, 0xc5, 0x0e, 0xba,
+};
+
+void main(void) {
+	spi_dev = device_get_binding(DEVICE_DT_NAME(DT_NODELABEL(spi2)));
+	if (!device_is_ready(spi_dev)) {
+		printk("SPI controller is not ready!\n");
+		return;
+	}
+
+	struct spi_buf spi_tx_buf = {
+		.buf = (void *)testdata,
+		.len = sizeof testdata,
+	};
+
+	struct spi_buf_set spi_tx_buf_set = {
+		.buffers = &spi_tx_buf,
+		.count = 1,
+	};
+
+	while (true) {
+		int ret = spi_write(spi_dev, &spi_cfg, &spi_tx_buf_set);
+		if (ret < 0)
+			printk("spi_write failed: %d\n", ret);
+		else
+			printk("spi transfer complete\n");
+	}
+}
+```
+
+It is necessary to enable some configs:
+
+```shell
+CONFIG_GPIO=y
+CONFIG_SPI=y
+CONFIG_SPI_SLAVE=y
+CONFIG_SPI_STM32_DMA=y
+```
+
+And configure DMA channels:
+
+```shell
+&dma1 {
+	status = "okay";
+};
+
+&spi1 {
+	status = "disabled";
+};
+
+&spi2 {
+	/*
+	 * See https://docs.zephyrproject.org/3.0.0/reference/devicetree/bindings/dma/st%2Cstm32-dma-v2.html
+	 */
+	dmas = <&dma1 5 1 0x20440>,
+	       <&dma1 4 1 0x20480>;
+	dma-names = "tx", "rx";
+};
+
+&spi3 {
+	status = "disabled";
+};
+```
+
+If we run this code we will see SPI errors running down the console:
+
+```shell
+spi_write failed: -11
+spi_write failed: -11
+spi_write failed: -11
+spi_write failed: -11
+spi_write failed: -11
+```
+
+This is the first problem with Zephyr's SPI driver - each transfer has a
+one-second timeout. While this may be desirable behavior for SPI master (it
+could be used for error recovery, for example, to power cycle the slave if it
+doesn't respond), it breaks SPI slave. The slave must be ready to give a
+response when the transfer commences - appropriate data must already be loaded
+in FIFO. Here we get stuck in an endless loop, queuing the transfer, aborting
+it, and queuing it again.
+
+The problem can be worked around by patching the
+[wait_dma_rx_tx_done](https://github.com/zephyrproject-rtos/zephyr/blob/39391b4a160a4e23a2b7f213f94cf04b2c250ad7/drivers/spi/spi_ll_stm32.c#L692)
+function in `spi_ll_stm32.c`. The original function looks like this:
+
+```c
+static int wait_dma_rx_tx_done(const struct device *dev)
+{
+	struct spi_stm32_data *data = dev->data;
+	int res = -1;
+
+	while (1) {
+		res = k_sem_take(&data->status_sem, K_MSEC(1000));
+		if (res != 0) {
+			return res;
+		}
+...
+```
+
+Just replace `K_MSEC(1000)` with `K_FOREVER`.
+
+Now running `spitest` at 100 KHz yields the following result:
+
+![STM32 SPI slave at 100 KHz](/img/stm32-spislave-100khz.png)
+
+The transfer works properly at 100 KHz. At 10 MHz the transfer sometimes works,
+sometimes does not:
+
+![STM32 SPI slave at 10 MHz](/img/stm32-spislave-10mhz.png)
+
+At 24 MHz transfer is completely corrupted. We have been looking for a solution
+in Zephyr [Issues](https://github.com/zephyrproject-rtos/zephyr/issues) and
+[Pull Requests](https://github.com/zephyrproject-rtos/zephyr/pulls) but found
+nothing useful.
+
+Looking at Zephyr's
+SPI driver code, we discovered that every call to `spi_write` causes many things
+to happen. Among others, the SPI controller is reconfigured
+[every single time](https://github.com/zephyrproject-rtos/zephyr/blob/39391b4a160a4e23a2b7f213f94cf04b2c250ad7/drivers/spi/spi_ll_stm32.c#L751).
+During this process, the SPI controller is disabled and re-enabled, which is
+quite suspicious.
+
+## Reading STM32 documentation
+
+I've been searching through STM32 documentation for information about high-speed
+SPI. The most helpful were the STM32L4 series programming manual and
+[AN5543](https://www.st.com/resource/en/application_note/an5543-enhanced-methods-to-handle-spi-communication-on-stm32-devices-stmicroelectronics.pdf).
+Section 4.2 of [AN5543](https://www.st.com/resource/en/application_note/an5543-enhanced-methods-to-handle-spi-communication-on-stm32-devices-stmicroelectronics.pdf)
+describes various aspects of handling high-speed communication, and section
+4.2.2 describes what happens when SPI is disabled.
+
+The main problem here is that Zephyr (as well as STM32 HAL) re-configures SPI
+before each transaction, doing configure-enable-transmit-disable cycle on each
+SPI session. While this is ok for master, slave must respect timings imposed
+by master, so SPI disabling should be avoided if not needed.
+
+The problem becomes even more evident when we want to implement TPM protocol
+as we don't know size (and direction) of data payload. Each TPM frame starts
+with a 4 byte header which tells us what is the size of transfer and what is the
+direction (read from or write to a register):
+
+![](/img/tpm2_spi_protocol.png)
+
+After we read the header, we disable SPI, causing a few things:
+
+- MISO is left floating (we have SPI v1.3 on STM32L4)
+- we introduce additional delay by re-configuring SPI
+
+## Fixing SPI
+
+We decided to continue the tests using only HAL and STM32CubeIDE (we plan to
+port the solution back to Zephyr). From earlier tests, we already know that HAL
+also does not work correctly, but it is easier to roll out a custom solution.
+
+So, I created a new STM32CubeMX project and set up the SPI2 controller through
+the graphical configuration manager. Basic settings involve configuring SPI as a
+Full-Duplex Slave, configuring NSS (Chip Select) pin as input, setting 8-bit
+frame length (as required by TPM spec), and setting up DMA channels. All other
+settings are left at their defaults.
+
+
+![](/img/stm32cube_spi2_setup.png)
+
+![](/img/stm32cube_spi_setup_dma.png)
+
+STM32CubeMX generates code that performs hardware initialization, and we are
+ready to do SPI transactions using the `HAL_SPI_TransmitReceive_DMA` function.
+Let's look at the implementation:
+
+```c
+HAL_StatusTypeDef HAL_SPI_TransmitReceive_DMA(SPI_HandleTypeDef *hspi, uint8_t *pTxData, uint8_t *pRxData, uint16_t Size)
+{
+  ...
+
+  /* Reset the threshold bit */
+  CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX | SPI_CR2_LDMARX);
+
+  /* The packing mode management is enabled by the DMA settings according the spi data size */
+  if (hspi->Init.DataSize > SPI_DATASIZE_8BIT)
+  {
+    /* Set fiforxthreshold according the reception data length: 16bit */
+    CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+  }
+  else
+  {
+    /* Set RX Fifo threshold according the reception data length: 8bit */
+    SET_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+
+    if (hspi->hdmatx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
+    {
+      if ((hspi->TxXferSize & 0x1U) == 0x0U)
+      {
+        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
+        hspi->TxXferCount = hspi->TxXferCount >> 1U;
+      }
+      else
+      {
+        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMATX);
+        hspi->TxXferCount = (hspi->TxXferCount >> 1U) + 1U;
+      }
+    }
+
+    if (hspi->hdmarx->Init.MemDataAlignment == DMA_MDATAALIGN_HALFWORD)
+    {
+      /* Set RX Fifo threshold according the reception data length: 16bit */
+      CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+
+      if ((hspi->RxXferCount & 0x1U) == 0x0U)
+      {
+        CLEAR_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
+        hspi->RxXferCount = hspi->RxXferCount >> 1U;
+      }
+      else
+      {
+        SET_BIT(hspi->Instance->CR2, SPI_CR2_LDMARX);
+        hspi->RxXferCount = (hspi->RxXferCount >> 1U) + 1U;
+      }
+    }
+  }
+
+  /* Check if we are in Rx only or in Rx/Tx Mode and configure the DMA transfer complete callback */
+  if (hspi->State == HAL_SPI_STATE_BUSY_RX)
+  {
+    /* Set the SPI Rx DMA Half transfer complete callback */
+    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfReceiveCplt;
+    hspi->hdmarx->XferCpltCallback     = SPI_DMAReceiveCplt;
+  }
+  else
+  {
+    /* Set the SPI Tx/Rx DMA Half transfer complete callback */
+    hspi->hdmarx->XferHalfCpltCallback = SPI_DMAHalfTransmitReceiveCplt;
+    hspi->hdmarx->XferCpltCallback     = SPI_DMATransmitReceiveCplt;
+  }
+
+  /* Set the DMA error callback */
+  hspi->hdmarx->XferErrorCallback = SPI_DMAError;
+
+  /* Set the DMA AbortCpltCallback */
+  hspi->hdmarx->XferAbortCallback = NULL;
+
+  /* Enable the Rx DMA Stream/Channel  */
+  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)hspi->pRxBuffPtr,
+                                 hspi->RxXferCount))
+  {
+    /* Update SPI error code */
+    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
+    errorcode = HAL_ERROR;
+
+    hspi->State = HAL_SPI_STATE_READY;
+    goto error;
+  }
+
+  /* Enable Rx DMA Request */
+  SET_BIT(hspi->Instance->CR2, SPI_CR2_RXDMAEN);
+
+  /* Set the SPI Tx DMA transfer complete callback as NULL because the communication closing
+  is performed in DMA reception complete callback  */
+  hspi->hdmatx->XferHalfCpltCallback = NULL;
+  hspi->hdmatx->XferCpltCallback     = NULL;
+  hspi->hdmatx->XferErrorCallback    = NULL;
+  hspi->hdmatx->XferAbortCallback    = NULL;
+
+  /* Enable the Tx DMA Stream/Channel  */
+  if (HAL_OK != HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)hspi->pTxBuffPtr, (uint32_t)&hspi->Instance->DR,
+                                 hspi->TxXferCount))
+  {
+    /* Update SPI error code */__HAL_SPI_ENABLE
+    SET_BIT(hspi->ErrorCode, HAL_SPI_ERROR_DMA);
+    errorcode = HAL_ERROR;
+
+    hspi->State = HAL_SPI_STATE_READY;
+    goto error;
+  }
+
+  /* Check if the SPI is already enabled */
+  if ((hspi->Instance->CR1 & SPI_CR1_SPE) != SPI_CR1_SPE)
+  {
+    /* Enable SPI peripheral */
+    __HAL_SPI_ENABLE(hspi);
+  }
+  /* Enable the SPI Error Interrupt Bit */
+  __HAL_SPI_ENABLE_IT(hspi, (SPI_IT_ERR));
+
+  /* Enable Tx DMA Request */
+  SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);
+
+  ...
+}
+```
+
+What this code does:
+
+- Initialize callbacks (like transfer complete callbacks)
+- Configure SPI registers
+- Initialize DMA channels and enable DMA on SPI controller (RXDMAEN and TXDMAEN
+  bits)
+- Enable SPI interrupts
+- Enable SPI controller
+
+Many of these things could be done only once and never changed. Doing this every
+time introduces additional overhead. Moreover, SPI is re-enabled before each
+transaction and disabled after the transaction. This worsens the overhead and
+causes other problems described in
+[AN5543](https://www.st.com/resource/en/application_note/an5543-enhanced-methods-to-handle-spi-communication-on-stm32-devices-stmicroelectronics.pdf):
+
+> SPI versions 1.x.x: the peripheral takes no control of the associated GPIOs
+> when it is disabled. The SPI signals float if they are not supported by
+> external resistor and if they are not reconfigured and they are kept at
+> alternate function configuration.
+
+> At principle, the SPI must not be disabled before the communication is fully
+> completed and it should be as short as possible at slave, especially between
+> sessions, to avoid missing any communication.
+
+On Nucleo L476RG we use, we have SPI v1.3, which does not drive MISO when
+disabled. We have observed MISO line changing unexpectedly during SPI idle
+periods, presumably caused by this.
+
+`HAL_SPI_TransmitReceive_DMA` setups interrupt callbacks which handle error
+detection and the end-of-transaction condition (`SPI_EndRxTxTransaction`), which
+involves waiting for the master to stop sending data and the SPI bus to become
+idle. This causes more unnecessary overhead, as we don't have to wait for SPI
+idle. We can process data as soon as RX DMA completes and queue more data as
+soon as TX DMA completes.
+
+A transaction in the TPM protocol consists of three steps: TPM header
+transmission, flow control, and data payload transmission. After receiving the
+header, we know the size of the entire transaction, removing the need for
+end-of-transaction checking.
+
+I created a stripped-down version of `HAL_SPI_TransmitReceive_DMA`:
+
+```c
+static uint8_t tx_buf[4] = {0xaa, 0xe0, 0xbb, 0xa5};
+static uint8_t rx_buf[4] = {0};
+
+static void rxdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+	HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)rx_buf, 4);
+}
+
+static void txdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+	HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)tx_buf, (uint32_t)&hspi->Instance->DR, 4);
+}
+
+void app_main() {
+	SPI_HandleTypeDef *hspi = &hspi2;
+
+	// Initialize callbacks
+	hspi->hdmatx->XferCpltCallback = txdma_complete;
+	hspi->hdmarx->XferCpltCallback = rxdma_complete;
+
+	// One-time SPI configuration
+	// Clear SPI_RXFIFO_THRESHOLD to trigger DMA on each byte available.
+	CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+
+	// Start the transfer
+	SET_BIT(hspi->Instance->CR2, SPI_CR2_RXDMAEN);
+	HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)rx_buf, 4);
+
+	HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)tx_buf, (uint32_t)&hspi->Instance->DR, 4);
+	SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);
+
+	__HAL_SPI_ENABLE(hspi);
+}
+```
+
+The code size is reduced to almost a minimum - still, some optimizations could
+be done in `HAL_DMA_Start_IT`. Currently, we transmit only 4 bytes of static
+data to test whether MCU can handle this before going further.
+
+I'm using a bit different initialization sequence than HAL: HAL enables
+`RXDMAEN` after programming the channel and `TXDMAEN` after enabling SPI. Our
+code follows the sequence described in the STM32 Programming Manual (**rm0351**).
+
+![](/img/rm0351_spi_dma_enable_procedure.png)
+
+For testing purposes, I'm using Raspberry PI 3B as SPI host. Configuration is
+pretty straightforward, you can enable `spidev` by uncommenting
+`dtoverlay=spi0-1cs` in `/boot/config.txt` and rebooting. For communicating with
+`spidev` I'm using a custom Python script:
+
+```python
+from spidev import SpiDev
+
+class Spi:
+    def __init__(self):
+        self.device = SpiDev()
+        self.device.open(0, 0)
+        self.device.bits_per_word = 8
+        self.device.mode = 0b00
+        self.device.max_speed_hz = 24000000
+        self.freq = 24000000
+
+    def get_frequency(self):
+        return self.freq
+
+    def set_frequency(self, freq):
+        self.freq = freq
+
+    def xfer(self, data: bytes) -> bytes:
+        return self.device.xfer(data, self.freq)
+
+
+def test(func):
+    def wrapper():
+        freq = spi.freq
+        iteration = 0
+
+        try:
+            for i in range(10):
+                iteration = i
+                func()
+            print(f'OK: {func.__name__} @ {freq} Hz')
+        except AssertionError:
+            print(f'FAIL: {func.__name__} @ {freq} Hz (iteration {iteration})')
+    return wrapper
+
+
+@test
+def test_read_constant_4byte():
+    expected_data = [0xaa, 0xe0, 0xbb, 0xa5]
+    result = spi.xfer([0xff] * len(expected_data))
+    print('result = {:x} {:x} {:x} {:x}'.format(result[0], result[1], result[2], result[3]))
+    assert result == expected_data
+
+
+def main():
+    global spi
+
+    try:
+        spi = Spi()
+        spi.set_frequency(24000000)
+        test_read_constant_4byte()
+    finally:
+        spi.device.close()
+
+if __name__ == '__main__':
+    main()
+```
+
+After running the test code, I saw the transmitted data was correct through the
+logic analyzer, but Raspberry PI didn't receive the right data. This was a
+problem with the connection between Raspberry PI and Nucleo. I could achieve
+stable transmission at frequencies up to 18 MHz. After changing cable
+connections, I got stable transmission at 22 MHz. Before, I was using two 20 cm
+male-to-female jumper wires for each SPI line. The cables and the logic analyzer
+probes were connected to a breadboard. Now, I have a direct connection between
+Nucleo and Raspberry using a single 20 cm female jumper wire for each line.
+Nucleo pins stretch into two sides of the board, so I can attach the probes
+directly on the backside of Nucleo.
+
+## The work continues - implementing TPM protocol
+
+While 22 MHz is not the frequency we aim for, I continued tests on the highest
+frequency I could afford for now (in the meantime planning to replace the cables
+with better ones). I extended the code to speak over the TPM protocol
+
+```c
+#define STATE_WAIT_HEADER 0
+#define STATE_WAIT_STATE 1
+#define STATE_WAIT_STATE_LAST 2
+#define STATE_PAYLOAD_TRANSFER 3
+
+static uint8_t ff_buffer[64];
+static uint8_t waitstate_insert[4] = {0xff, 0xff, 0xff, 0xfe};
+static uint8_t waitstate_hold[1] = {0x00};
+
+static uint8_t waitstate_cancel[1] = {0x01};
+static uint8_t trash[1] = {0};
+static uint8_t header[4] = {0};
+static bool b_txdma_complete = false;
+static bool b_rxdma_complete = false;
+static uint8_t state = STATE_WAIT_HEADER;
+
+static uint8_t scratch_buffer[64] = {0};
+
+static uint8_t transfer_is_read = 0;
+static uint32_t transfer_length = 0;
+
+static void update_state(SPI_HandleTypeDef *hspi)
+{
+	b_rxdma_complete = false;
+	b_txdma_complete = false;
+
+	switch (state) {
+	case STATE_WAIT_HEADER:
+		// We don't care what host sends during wait state, but we start DMA anyway to avoid overrun errors.
+		HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)trash, sizeof trash);
+		// Wait state got inserted while reading header.
+		HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_cancel, (uint32_t)&hspi->Instance->DR, sizeof waitstate_cancel);
+
+    // This follows a real TPM protocol, except we ignore addr currently.
+		transfer_is_read = !!(header[0] & (1 << 7));
+		transfer_length = (header[0] & 0x3f) + 1;
+    // Remaining bytes contain TPM register offset. Currently we have only one
+    // "register" so we just ignore that.
+
+		state = STATE_WAIT_STATE_LAST;
+		break;
+
+	case STATE_WAIT_STATE_LAST:
+		if (transfer_is_read) {
+			HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)scratch_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
+			HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)ff_buffer, transfer_length);
+		} else {
+			HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)ff_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
+			HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)scratch_buffer, transfer_length);
+		}
+
+		state = STATE_PAYLOAD_TRANSFER;
+		break;
+
+	case STATE_PAYLOAD_TRANSFER:
+		HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)header, sizeof header);
+		HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_insert, (uint32_t)&hspi->Instance->DR, sizeof waitstate_insert);
+
+		state = STATE_WAIT_HEADER;
+		break;
+	}
+}
+
+static void rxdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+	if (b_txdma_complete)
+		update_state(hspi);
+	else
+		b_rxdma_complete = true;
+}
+
+static void txdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+	if (b_rxdma_complete)
+		update_state(hspi);
+	else
+		b_txdma_complete = true;
+}
+```
+
+On Python side I introduce a new function:
+
+```python
+def tpm_read(size: int) -> bytes:
+    assert size > 0 and size <= 64
+
+    header = bytes([
+        (1 << 7) | (size - 1),
+        0, 0, 0
+    ])
+
+    waitstate = spi.xfer(header)
+    assert waitstate == [0xff, 0xff, 0xff, 0xfe]
+
+    while True:
+        status = spi.xfer([0])
+        if status[0] == 1:
+            break
+        assert status[0] == 0
+
+    empty = [0] * size
+    return spi.xfer(empty)
+```
+
+And update the test:
+
+```python
+@test
+def test_read():
+    x = tpm_read(0, 8)
+    assert x == [0] * 8
+```
+
+After running the test code, I immediately got an error, the logic analyzer
+showing:
+
+![](/img/stm32-spi-failure.png)
+
+There are two problems here. The first problem is that the CS pin goes high
+between the header, wait states, and payload. This was my oversight, but fixing
+it is not critical as it currently does not affect communication - deasserting
+the CS pin should abort the transaction, but we don't handle this yet. Linux's
+`spidev` drivers can be instructed not to deassert CS, but this is not supported
+by the bindings I'm using, so let's just postpone the fix.
+
+The other problem is with the transmission itself - Nucleo transmits wrong data
+(0xff) instead of 0x01 during the wait state.
+
+To solve the problem, I went a step back. I hardcoded a few data patterns to
+replicate the transfer sequence:
+
+```c
+struct pattern {
+	uint8_t *data;
+	uint8_t len;
+};
+
+#define ARRAY_SIZE(x) (sizeof((x)) / sizeof((x)[0]))
+
+static uint8_t pattern_0[] = {0xff, 0xff, 0xff, 0xfe};
+static uint8_t pattern_1[] = {1};
+static uint8_t pattern_2[] = {
+	0x3e, 0x60, 0xc3, 0x4f, 0x35, 0x2e, 0xa6, 0xaa, 0xa6, 0x61, 0x64, 0xcb,
+	0x10, 0xd7, 0x45, 0x35, 0x82, 0xc9, 0x91, 0xbc, 0x35, 0x43, 0xbb, 0xe1,
+	0xea, 0x08, 0xdf, 0xdd, 0x4d, 0xd8, 0xd5, 0x94, 0x71, 0x75, 0xfd, 0x23,
+	0x24, 0xf8, 0x95, 0x85, 0x7b, 0x11, 0xf9, 0xdd, 0xa0, 0xaa, 0x60, 0xc5,
+	0xd2, 0x07, 0x6b, 0x3a, 0xd4, 0xd2, 0xac, 0xac, 0x1b, 0x54, 0xfe, 0x2f,
+	0xa2
+};
+
+static struct pattern patterns[] = {
+	{ .data = pattern_0, .len = sizeof pattern_0 },
+	{ .data = pattern_1, .len = sizeof pattern_1 },
+	{ .data = pattern_2, .len = sizeof pattern_2 },
+};
+int current_pattern = 1;
+
+static void txdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+	HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)patterns[current_pattern].data, (uint32_t)&hspi->Instance->DR, patterns[current_pattern].len);
+
+	if (++current_pattern == ARRAY_SIZE(patterns))
+		current_pattern = 0;
+}
+
+void app_main() {
+	hspi->hdmatx->XferCpltCallback = txdma_complete;
+
+	HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)pattern_0, (uint32_t)&hspi->Instance->DR, sizeof pattern_0);
+	SET_BIT(hspi->Instance->CR2, SPI_CR2_TXDMAEN);
+
+	__HAL_SPI_ENABLE(hspi);
+}
+```
+
+And on Python side:
+
+```python
+patterns = [
+    [0xff, 0xff, 0xff, 0xfe],
+    [0x01],
+    [
+        0x3e, 0x60, 0xc3, 0x4f, 0x35, 0x2e, 0xa6, 0xaa, 0xa6, 0x61, 0x64, 0xcb,
+        0x10, 0xd7, 0x45, 0x35, 0x82, 0xc9, 0x91, 0xbc, 0x35, 0x43, 0xbb, 0xe1,
+        0xea, 0x08, 0xdf, 0xdd, 0x4d, 0xd8, 0xd5, 0x94, 0x71, 0x75, 0xfd, 0x23,
+        0x24, 0xf8, 0x95, 0x85, 0x7b, 0x11, 0xf9, 0xdd, 0xa0, 0xaa, 0x60, 0xc5,
+        0xd2, 0x07, 0x6b, 0x3a, 0xd4, 0xd2, 0xac, 0xac, 0x1b, 0x54, 0xfe, 0x2f,
+        0xa2
+    ]
+]
+
+spi = Spi()
+spi.set_frequency(22000000)
+
+for i in range(100000):
+    for pattern in patterns:
+        data = spi.xfer([0] * len(pattern))
+        print(f'iter {i} test {data} == {pattern}')
+        assert data == pattern
+```
+
+The code works just fine (100k iterations)
+
+```
+...
+iter 99999 test [255, 255, 255, 254] == [255, 255, 255, 254]
+iter 99999 test [1] == [1]
+iter 99999 test [62, 96, 195, 79, 53, 46, 166, 170, 166, 97, 100, 203, 16, 215, 69, 53, 130, 201, 145, 188, 53, 67, 187, 225, 234, 8, 223, 221, 77, 216, 213, 148, 113, 117, 253, 35, 36, 248, 149, 133, 123, 17, 249, 221, 160, 170, 96, 197, 210, 7, 107, 58, 212, 210, 172, 172, 27, 84, 254, 47, 162] == [62, 96, 195, 79, 53, 46, 166, 170, 166, 97, 100, 203, 16, 215, 69, 53, 130, 201, 145, 188, 53, 67, 187, 225, 234, 8, 223, 221, 77, 216, 213, 148, 113, 117, 253, 35, 36, 248, 149, 133, 123, 17, 249, 221, 160, 170, 96, 197, 210, 7, 107, 58, 212, 210, 172, 172, 27, 84, 254, 47, 162]
+```
+
+The main difference is that the full code performs reading and writing, contrary
+to only writing. Currently, we wait for both TX and RX DMA to complete before
+re-programming DMA channels and updating the state machine. TX and RX are always
+the same size, so they should complete in a similar time. So, instead of using
+interrupts for both channels, I changed the code so that interrupts are used for
+TX and polling for RX (tests showed that TX DMA usually completes first).
+
+```c
+static void txdma_complete(DMA_HandleTypeDef *hdma)
+{
+	SPI_HandleTypeDef *hspi = (SPI_HandleTypeDef *)(((DMA_HandleTypeDef *)hdma)->Parent);
+
+	switch (state) {
+	case STATE_WAIT_HEADER:
+		// Wait state got inserted while reading header.
+		HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_cancel, (uint32_t)&hspi->Instance->DR, sizeof waitstate_cancel);
+
+		// We don't care what host sends during wait state, but we start DMA anyway to avoid overrun errors.
+		HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
+		HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)trash, sizeof trash);
+
+		transfer_is_read = !!(header[0] & (1 << 7));
+		transfer_length = (header[0] & 0x3f) + 1;
+
+		state = STATE_WAIT_STATE_LAST;
+		break;
+
+	case STATE_WAIT_STATE_LAST:
+		if (transfer_is_read) {
+			HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)scratch_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
+			HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
+			HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)ff_buffer, transfer_length);
+		} else {
+			HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)ff_buffer, (uint32_t)&hspi->Instance->DR, transfer_length);
+			HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
+			HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)scratch_buffer, transfer_length);
+		}
+
+		state = STATE_PAYLOAD_TRANSFER;
+		break;
+
+	case STATE_PAYLOAD_TRANSFER:
+		HAL_DMA_Start_IT(hspi->hdmatx, (uint32_t)waitstate_insert, (uint32_t)&hspi->Instance->DR, sizeof waitstate_insert);
+		HAL_DMA_PollForTransfer(hspi->hdmarx, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
+		HAL_DMA_Start_IT(hspi->hdmarx, (uint32_t)&hspi->Instance->DR, (uint32_t)header, sizeof header);
+
+		state = STATE_WAIT_HEADER;
+		break;
+	}
+}
+```
+
+I start the TX transfer first, then poll for RX DMA completion before
+re-programming the DMA channel. Now, the test succeeds.
+
+## Extending tests
+
+I have basic code that can read and write data over SPI, but I have tested only
+read of a zeroed register. Now, it is time to extend the tests so that we write
+random data of random lengths, then read the data back and check whether it is
+as expected. I already got shorter cables - 10 cm instead of 20 cm,
+and I have stable communication at 24 MHz:
+
+![](/img/stm32-spi-connections.jpg)
+
+I started with something simple
+
+```python
+tpm_write(0, bytes([1,2,3,4,5,6,7,8]))
+x = tpm_read(0, 8)
+assert x == [1,2,3,4,5,6,7,8]
+```
+
+and failed. The first transfer succeeded, but the second did not:
+
+![](/img/stm32-spi-failed-second-transfer.png)
+
+I hooked the debugger and saw that app was still polling for RX DMA completion.
+Looking again at the original code, I found that I incorrectly cleared
+`SPI_RXFIFO_THRESHOLD` bit - it should be clear for 16-bit frame length and set
+for 8-bit frame length.
+
+Changing
+
+```c
+CLEAR_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+```
+
+to
+
+```c
+SET_BIT(hspi->Instance->CR2, SPI_RXFIFO_THRESHOLD);
+```
+
+solved the problem, however I got another one.
+
+![](/img/stm32-spi-readback-wrong-data.png)
+
+The wait state is properly inserted and terminated, but the payload is invalid.
+I split the test into two to pause the app between write and read from the
+register. Peeking at the `scratch_buffer` reveals that DMA went wrong, as the
+first three bytes were lost entirely.
+
+![](/img/stm32-scratch-state.png)
+
+Moreover, we are again stuck polling for DMA completion (DMA is still waiting
+for the remaining three bytes). The issue could be caused by too high delays
+between restarting of DMA transfers, so I lowered the SPI frequency to 100 KHz,
+but to my surprise, the result was exactly the same. I tested different data
+sizes, and the result was always the same (3 bytes lost). So, the
+`SPI_RXFIFO_THRESHOLD` fix only moved the problem a bit further. The outcome is
+still the same.
+
+## Summary
+
+That's all for this blog post. I got SPI working at 24 MHz when writing, but
+reading is broken. This is a significant improvement. I can't tell whether SPI
+could work at 24 MHz on that platform - even though it works for write, it
+doesn't mean it would work for reads and writes simultaneously. Further work
+could include fixing RX on the current platform, but we could also try using
+different platforms. We could try newer STM32 CPUs with a more recent SPI
+version (and possibly a higher clock frequency), such as the STM32L5 series or
+the latest STM32U5 series.
+
+Further work will surely include implementing missing features, such as SPI bus
+aborts, SPI synchronization (using CS pin), and error recovery. Also, some
+patches may be needed for Zephyr due to suboptimal handling of SPI transactions.
+Possibly, on a faster CPU, we could achieve 24 MHz without any problems, but we
+could run into similar issues trying to work at the optional 66 MHz frequency.
+Also, Zephyr's SPI API currently doesn't support transmission of variable-length
+frames. There is an open RFC
+[issue](https://github.com/zephyrproject-rtos/zephyr/issues/56091) that covers
+API changes and optimized SPI handling to enable the usage of protocols
+requiring them, so our future work could also include working on API
+improvements. Lastly, we plan to upstream all Zephyr patches (if any).
+
+The code for STM32 application is available
+[here](https://github.com/arturkow2/MySPITest).
diff --git a/blog/static/img/nucleo-l476rg.jpg b/blog/static/img/nucleo-l476rg.jpg
new file mode 100644
index 00000000..9f653323
Binary files /dev/null and b/blog/static/img/nucleo-l476rg.jpg differ
diff --git a/blog/static/img/rm0351_spi_dma_enable_procedure.png b/blog/static/img/rm0351_spi_dma_enable_procedure.png
new file mode 100644
index 00000000..db90ccb4
Binary files /dev/null and b/blog/static/img/rm0351_spi_dma_enable_procedure.png differ
diff --git a/blog/static/img/stm32-scratch-state.png b/blog/static/img/stm32-scratch-state.png
new file mode 100644
index 00000000..7eef3b5c
Binary files /dev/null and b/blog/static/img/stm32-scratch-state.png differ
diff --git a/blog/static/img/stm32-spi-connections.jpg b/blog/static/img/stm32-spi-connections.jpg
new file mode 100644
index 00000000..d259a80e
Binary files /dev/null and b/blog/static/img/stm32-spi-connections.jpg differ
diff --git a/blog/static/img/stm32-spi-failed-second-transfer.png b/blog/static/img/stm32-spi-failed-second-transfer.png
new file mode 100644
index 00000000..0c6505a5
Binary files /dev/null and b/blog/static/img/stm32-spi-failed-second-transfer.png differ
diff --git a/blog/static/img/stm32-spi-failure.png b/blog/static/img/stm32-spi-failure.png
new file mode 100644
index 00000000..82b2586c
Binary files /dev/null and b/blog/static/img/stm32-spi-failure.png differ
diff --git a/blog/static/img/stm32-spi-readback-wrong-data.png b/blog/static/img/stm32-spi-readback-wrong-data.png
new file mode 100644
index 00000000..1497da94
Binary files /dev/null and b/blog/static/img/stm32-spi-readback-wrong-data.png differ
diff --git a/blog/static/img/stm32-spislave-100khz.png b/blog/static/img/stm32-spislave-100khz.png
new file mode 100644
index 00000000..25b4eb6b
Binary files /dev/null and b/blog/static/img/stm32-spislave-100khz.png differ
diff --git a/blog/static/img/stm32-spislave-10mhz.png b/blog/static/img/stm32-spislave-10mhz.png
new file mode 100644
index 00000000..0b6c03b5
Binary files /dev/null and b/blog/static/img/stm32-spislave-10mhz.png differ
diff --git a/blog/static/img/stm32cube_spi2_setup.png b/blog/static/img/stm32cube_spi2_setup.png
new file mode 100644
index 00000000..e1369d61
Binary files /dev/null and b/blog/static/img/stm32cube_spi2_setup.png differ
diff --git a/blog/static/img/stm32cube_spi_setup_dma.png b/blog/static/img/stm32cube_spi_setup_dma.png
new file mode 100644
index 00000000..ca005036
Binary files /dev/null and b/blog/static/img/stm32cube_spi_setup_dma.png differ
diff --git a/blog/static/img/tpm2_spi_protocol.png b/blog/static/img/tpm2_spi_protocol.png
new file mode 100644
index 00000000..3642941e
Binary files /dev/null and b/blog/static/img/tpm2_spi_protocol.png differ