Best Practices for Building Resilience in Remote IoT Devices

Market reports indicate that by 2019, there were 1.2 billion IoT devices connected to cellular networks. This number is expected to reach 4.7 billion by the end of 2030.

Most of these devices are connected to the internet for remote deployments, such as monitoring tree growth in a forest or monitoring water levels in streams, rivers, and lakes.

These devices pose unique challenges. In the IoT software development life cycle, initial development and testing are done indoors or in other controlled environments. In this phase, most firmware and hardware problems are solved. In the second stage, devices are deployed to the field for pilot testing. This phase is a challenge because the devices are deployed remotely, which makes it hard to access a device physically to debug and fix problems. This article discusses some of the best practices for developing IoT devices to support the piloting phase and beyond.

Developing a Basic IoT Device Application

AWS IoT can be used to build an IoT device app based on MQTT. Your application must at least support the following:

Provisioning with AWS IoT Core.
Configuration using your AWS IoT Core endpoint address.
Configuration of credentials for connecting to the endpoint address.
Integration with an MQTT client compatible with your chosen protocol, programming language, and runtime environment.
Connection to AWS IoT Core using the correct protocol (MQTT or MQTT over WebSocket).
Subscription to MQTT topics, and publishing and receiving messages.

You should integrate your device app with an AWS IoT Device SDK and use the MQTT client from your selected SDK. The AWS IoT Device SDKs are equipped with resilience features and integrate closely with AWS IoT Core's resilience functionality.

Once you have built your IoT application, you can run it. With the endpoint and credentials configured correctly, the application can connect to AWS IoT Core.

So far, so good. You have created a basic IoT application that works. What if there is a problem? What happens if you lose the network connection? What happens if your application crashes?

Your device application may exit if it does not handle negative scenarios. Here are some recommendations for building resilience into your IoT application.

Best Practices for Building Resilience in Remote IoT Devices

This section discusses how to diagnose device problems efficiently and how to make devices more durable in the field.

Initial Packet

It is a good idea to send an initial data packet after every boot containing fixed information, such as the device ID, the firmware version, and modem-related information such as the IMEI and the modem's firmware version.

In addition to the fixed information, the packet should carry reset-related data, such as the cause of the microcontroller's reset. On the back end, the initial packet's data can be used to identify a device that has been reset unusually, for example by the watchdog or by a brownout.
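As a rough illustration of what an initial packet could look like, the sketch below builds one as JSON and shows a back-end check for unusual resets. It is written in Python for readability, while real firmware would more likely do this in C. The field names, the ResetCause values, and the example device data are all assumptions made for this sketch, not part of any SDK.

```python
import json
from enum import Enum


class ResetCause(Enum):
    """Hypothetical reset causes; real values come from the MCU's reset-cause register."""
    POWER_ON = "power_on"
    WATCHDOG = "watchdog"
    BROWNOUT = "brownout"
    SOFTWARE = "software"


# Reset causes the back end should flag for investigation.
ABNORMAL_CAUSES = {ResetCause.WATCHDOG, ResetCause.BROWNOUT}


def build_initial_packet(device_id, firmware_version, imei,
                         modem_firmware_version, reset_cause):
    """Build the packet that is sent once after every boot."""
    return {
        "type": "initial",
        "device_id": device_id,
        "firmware_version": firmware_version,
        "modem": {"imei": imei, "firmware_version": modem_firmware_version},
        "reset_cause": reset_cause.value,
    }


def is_abnormal_reset(packet):
    """Back-end check: was the device reset by the watchdog or a brownout?"""
    return ResetCause(packet["reset_cause"]) in ABNORMAL_CAUSES


packet = build_initial_packet(
    device_id="sensor-0001",                  # example values only
    firmware_version="1.4.2",
    imei="356938035643809",
    modem_firmware_version="modem-fw-example",
    reset_cause=ResetCause.WATCHDOG,
)
payload = json.dumps(packet)                  # what would be published over MQTT
flagged = is_abnormal_reset(json.loads(payload))
```

Keeping the reset cause as a short string makes it easy to filter for unusual resets on the back end without decoding register values there.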

Health Information

Periodically sharing the health status of devices is a great idea. It is best to include the health information with the regular telemetry rather than sending it in a separate packet, because this saves battery life. The health data could include:

Battery-related parameters, including voltage and charging or discharging current.
Communication-related data, such as signal strength and connection type (2G or LTE-M).
The internal temperature of the device, if a sensor is present.
Error counters, for example to track communication problems with an I2C temperature sensor.

On the back end, this data can be used to identify issues with battery power, the RF connection, and more.
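The sketch below shows one way the health fields listed above could ride along with a telemetry packet, plus a simple back-end check. The field names, the thresholds, and the example reading are assumptions chosen for illustration, and the thresholds would need tuning for a real battery and modem.

```python
def with_health(telemetry, battery_mv, battery_ma, rssi_dbm,
                connection_type, internal_temp_c=None, i2c_errors=0):
    """Return a copy of the telemetry dict with a 'health' section attached."""
    health = {
        "battery_mv": battery_mv,
        "battery_ma": battery_ma,        # positive = charging, negative = discharging
        "rssi_dbm": rssi_dbm,
        "connection_type": connection_type,
        "i2c_errors": i2c_errors,
    }
    if internal_temp_c is not None:      # only on devices that have the sensor
        health["internal_temp_c"] = internal_temp_c
    return {**telemetry, "health": health}


# Assumed thresholds for this sketch.
LOW_BATTERY_MV = 3300
WEAK_SIGNAL_DBM = -110


def health_warnings(packet):
    """Back-end check that lists anything worth a closer look."""
    health = packet["health"]
    found = []
    if health["battery_mv"] < LOW_BATTERY_MV:
        found.append("low_battery")
    if health["rssi_dbm"] < WEAK_SIGNAL_DBM:
        found.append("weak_signal")
    if health["i2c_errors"] > 0:
        found.append("sensor_bus_errors")
    return found


packet = with_health({"water_level_cm": 142}, battery_mv=3250, battery_ma=-18,
                     rssi_dbm=-97, connection_type="LTE-M", i2c_errors=2)
found = health_warnings(packet)
```

Because the health section travels inside the telemetry packet, the modem is powered up only once per reporting cycle.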

Watchdog & Crash Dump Creation

A watchdog is used in firmware to recover from unintended hangs or runaway code. Such a hang could result from a hardware malfunction or from a firmware bug that was not detected earlier. The watchdog is typically enabled at the beginning of the firmware or, on microcontrollers that allow it, at power-up through fuses or user settings.

The watchdog timer's timeout period can be adjusted. It depends on the microcontroller and on how quickly a response is needed. For IoT applications, four to eight seconds is usually sufficient. After the watchdog has been enabled and its parameters have been set, the firmware clears the watchdog timer in each function or loop so that it never expires during normal operation.

Although the watchdog resets the microcontroller to recover from the bug or issue, the real reason behind the reset remains hidden. The initial packet tells the server that the device has rebooted, but not the root cause. You can overcome this by introducing a concept similar to a crash dump.

Early warning interrupts are available on the watchdogs of some microcontrollers. They fire before the watchdog reset. For example, depending on the register configuration, the watchdog timeout could be set to 8 seconds and the early warning interrupt to 4 seconds. If the watchdog timer has not been cleared within 4 seconds in this configuration, the early warning interrupt is executed.
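The timing described above can be modelled on a host machine. The following is a minimal simulation, not real watchdog driver code: the class name, the callbacks, and the advance method are invented for this sketch, and time is advanced manually instead of by hardware.

```python
class SimulatedWatchdog:
    """Host-side model of a watchdog with an early warning interrupt."""

    def __init__(self, timeout_s=8.0, early_warning_s=4.0,
                 on_early_warning=None, on_reset=None):
        if not 0 < early_warning_s < timeout_s:
            raise ValueError("early warning must fire before the timeout")
        self.timeout_s = timeout_s
        self.early_warning_s = early_warning_s
        self.on_early_warning = on_early_warning or (lambda: None)
        self.on_reset = on_reset or (lambda: None)
        self._elapsed = 0.0
        self._warned = False

    def clear(self):
        """Called by healthy firmware; restarts the countdown."""
        self._elapsed = 0.0
        self._warned = False

    def advance(self, seconds):
        """Advance simulated time and fire whatever has become due."""
        self._elapsed += seconds
        if not self._warned and self._elapsed >= self.early_warning_s:
            self._warned = True
            self.on_early_warning()       # this is where a crash dump would be saved
        if self._elapsed >= self.timeout_s:
            self.on_reset()
            self.clear()                  # the counter restarts after a reset


events = []
wdt = SimulatedWatchdog(on_early_warning=lambda: events.append("early_warning"),
                        on_reset=lambda: events.append("reset"))
wdt.advance(3)      # healthy loop: 3 s pass ...
wdt.clear()         # ... and the watchdog is cleared in time, nothing fires
wdt.advance(5)      # firmware hangs: the early warning fires (4 s threshold passed)
wdt.advance(3)      # still hung: the reset fires at 8 s
```

In this model the early warning always runs before the reset, which is the window the firmware has for saving diagnostic data.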

This early warning interrupt triggers a non-maskable interrupt (NMI), whose handler stores the stack data and the trace buffer in flash. This crash dump can then be sent to the developer remotely as part of the first packet after the reboot. During debugging, the NMI handler is the entry point from which the stack data is restored. The following is a list of the data that can be stored as part of a crash dump:

Stack: This could range from a few hundred to a few thousand bytes, depending on the typical firmware flow. How much of the stack memory is copied depends on the size available for the crash dump payload.
Stack pointer: In debugging, the address of the stack pointer is used to restore the RAM data in the debug system.
Timestamp: A timestamp is useful if the device has an RTC. It helps identify when the crash dump was saved.
Micro trace buffer data: The micro trace buffer is a feature of some Arm microcontrollers. It records the addresses of the most recently executed instructions in a local RAM buffer. This gives a second way of identifying the exact instructions that were running just before the watchdog reset.
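To make the crash dump contents concrete, the sketch below packs the items listed above into a compact binary record and parses it again on the back end. The layout (little-endian header, magic number, trailing CRC32) is an assumption made for this example and not a standard format. On a device, the equivalent packing would happen in the interrupt handler in C.

```python
import struct
import zlib

MAGIC = 0x43445031                  # arbitrary marker chosen for this sketch
# magic, timestamp, stack pointer, stack length, trace length
HEADER = struct.Struct("<IIIHH")


def pack_crash_dump(timestamp, stack_pointer, stack, trace):
    """Serialize a crash dump as header + stack + trace + CRC32."""
    body = HEADER.pack(MAGIC, timestamp, stack_pointer, len(stack), len(trace))
    body += stack + trace
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_crash_dump(blob):
    """Parse a dump on the back end; raise ValueError if it is corrupt."""
    if len(blob) < HEADER.size + 4:
        raise ValueError("crash dump too short")
    body = blob[:-4]
    (crc,) = struct.unpack("<I", blob[-4:])
    if zlib.crc32(body) != crc:
        raise ValueError("crash dump CRC mismatch")
    magic, timestamp, stack_pointer, stack_len, trace_len = HEADER.unpack_from(body)
    if magic != MAGIC or len(body) != HEADER.size + stack_len + trace_len:
        raise ValueError("crash dump header invalid")
    stack_end = HEADER.size + stack_len
    return {
        "timestamp": timestamp,
        "stack_pointer": stack_pointer,
        "stack": body[HEADER.size:stack_end],
        "trace": body[stack_end:],
    }


# Example values only: 64 bytes of stack and 32 bytes of trace data.
blob = pack_crash_dump(1700000000, 0x20003F80, bytes(range(64)), b"\x00" * 32)
dump = unpack_crash_dump(blob)
```

The CRC lets the back end reject a dump that was only partially written to flash before the reset occurred.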

When a crash dump appears in the first packet received after a reset, open-source tools such as OpenOCD and GDB, together with a matching microcontroller setup, can be used to analyze the data.

Global Timeout

You can also use the watchdog for another trick: restarting the device when a malfunctioning hardware module stalls it. Usually, the watchdog is disabled before the device is put to sleep and enabled again after it wakes up. We can estimate how long a wake cycle should take by measuring the average duration of a data transmission cycle (for example, reading the temperature and sending it to the back end) and then adding buffer time.

Assuming, for example, that a typical transmission cycle lasts 60 seconds and a buffer is kept, the flow should be complete within 5 minutes. If it is not, we can assume that the code is still running but is not moving forward in the flow. It could be waiting for a trigger that never arrives. This could be because a timeout was never implemented, or because of another untested condition in which the program remains at the same stage while still clearing the watchdog, signaling that the code is working properly.

When the functions for enabling, disabling, and clearing the watchdog timer are wrapped, a global timeout can be implemented that considers the overall flow. Record the current time when the watchdog is enabled. Each time the watchdog is cleared, the wrapper checks whether the global timeout has passed. If it has, the wrapper stops clearing the watchdog timer, and the watchdog reboots the device.
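The wrapper idea can be sketched as follows. This is an illustrative Python model rather than firmware code: the class name GuardedWatchdog and the three hw_* callables are placeholders for whatever the real watchdog driver provides. The default of 300 seconds mirrors the 5-minute example above, and the clock is injectable so the logic can be exercised without waiting.

```python
import time


class GuardedWatchdog:
    """Wraps enable/disable/clear and enforces a global timeout."""

    def __init__(self, hw_enable, hw_disable, hw_clear,
                 global_timeout_s=300.0, clock=time.monotonic):
        self._hw_enable = hw_enable
        self._hw_disable = hw_disable
        self._hw_clear = hw_clear
        self._global_timeout_s = global_timeout_s
        self._clock = clock
        self._enabled_at = None

    def enable(self):
        """Call after wake-up: note the time and start the watchdog."""
        self._enabled_at = self._clock()
        self._hw_enable()

    def disable(self):
        """Call just before going to sleep."""
        self._hw_disable()
        self._enabled_at = None

    def clear(self):
        """Clear the watchdog unless the whole cycle has overrun.

        Returns True if the hardware watchdog was actually cleared.
        """
        if self._enabled_at is None:
            return False
        if self._clock() - self._enabled_at > self._global_timeout_s:
            return False      # stop clearing; the hardware watchdog will reboot the device
        self._hw_clear()
        return True


# Usage with a fake clock so the example runs instantly.
now = [0.0]
kicks = []
wdt = GuardedWatchdog(hw_enable=lambda: None, hw_disable=lambda: None,
                      hw_clear=lambda: kicks.append(now[0]),
                      clock=lambda: now[0])
wdt.enable()
now[0] = 60.0
cleared_in_time = wdt.clear()     # normal cycle: the watchdog is cleared
now[0] = 301.0
cleared_overrun = wdt.clear()     # stuck flow: the watchdog is no longer cleared
```

Because the check lives inside clear, existing call sites do not change, and a stuck loop that keeps clearing the watchdog still ends in a reboot.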

Reset Individual/Submodules at the Startup

Resetting any attached sensors or modules at startup is another good practice. It ensures that the peripherals associated with the device are in a known state from the start. The LTE modem can, for example, be switched off before the switch-on procedure begins. If a submodule has failed, turning it off and on again in this way can revive it.
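A startup reset along these lines might look like the sketch below. The module names, the delay values, and the set_power callables are hypothetical stand-ins for real GPIO or power-rail control, and the correct off and settle times would come from each module's datasheet.

```python
import time


def power_cycle(set_power, off_time_s=1.0, settle_time_s=1.0, sleep=time.sleep):
    """Switch a module off, wait, switch it back on, and let it settle."""
    set_power(False)
    sleep(off_time_s)
    set_power(True)
    sleep(settle_time_s)


def reset_peripherals(modules, sleep=time.sleep):
    """Power-cycle every module at startup so each begins in a known state."""
    for set_power in modules.values():
        power_cycle(set_power, sleep=sleep)


# Example: record the power transitions instead of driving real hardware.
log = []
modules = {
    "lte_modem": lambda on: log.append(("lte_modem", on)),
    "temp_sensor": lambda on: log.append(("temp_sensor", on)),
}
reset_peripherals(modules, sleep=lambda seconds: None)   # skip real delays here
```

The same power_cycle helper can be reused at runtime when a single submodule stops responding.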

Over-the-Air Updates

Last but not least are over-the-air (OTA) updates. OTA updates allow you to push new firmware reliably and securely to your device. Because the device may be miles away, pushing the firmware binary to it remotely is a necessity.
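One small piece of a reliable update is checking that the downloaded image is intact before applying it. The sketch below compares a SHA-256 digest against an expected value. It assumes the expected digest arrives through some trusted channel, such as an update manifest, which is not described in this article. A digest alone detects corruption only, so a real deployment would also verify a signature.

```python
import hashlib
import hmac


def firmware_matches(image, expected_sha256_hex):
    """Return True if the image's SHA-256 digest equals the expected hex digest."""
    digest = hashlib.sha256(image).hexdigest()
    return hmac.compare_digest(digest, expected_sha256_hex.lower())


# Example with a stand-in image; a real image would be the downloaded binary.
image = b"\x7fEXAMPLE-FIRMWARE-IMAGE"
expected = hashlib.sha256(image).hexdigest()     # would come from the manifest

intact_ok = firmware_matches(image, expected)          # complete download
truncated_ok = firmware_matches(image[:-1], expected)  # download cut short
```

A device would apply the update only when the check passes and would otherwise keep running the current firmware and retry the download.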

The Key Takeaway

This blog post contains several recommendations and detailed techniques to help you create resilient IoT device applications using AWS IoT Core and the AWS IoT Device SDKs. It is your responsibility as the IoT application developer to mitigate negative scenarios that may occur with your device application. By following the recommendations above, you can make your device application more resilient so that it remains active even in negative scenarios.

IoT devices deployed in remote locations should be able to communicate their health status, recover from a hang, and resume normal operation. They also need to be able to receive firmware updates remotely.