Fault Tolerant ESP32

Well, that didn't work, now what?
By Allen Edwards

Overview

My well-tested ESP32 boat monitoring system ran up against the real world and the real world won. I think I am on version 50 of the software. This article goes over some of the issues I found and solved. Some would apply to anyone using an ESP32 over WiFi. Others are more specific to my boat but are relevant to any system that takes real world data, makes decisions about it, and changes its behavior based on those decisions.

Review

My Boat Monitor is based on an ESP32. It monitors the power and manages the bilge pump. My boat is wood and leaks. After having the monitor in place for many months and after refining the software many times, I now know how much. After a hard sail it leaks about two and a half gallons a day. After a couple of weeks, it is down to about one eighth of a gallon a day. I run the pump with a goal of pumping just under half a gallon at a time, with a maximum time between pumps that I can set on a web page that is read each time the monitor reports. The reports go out over the marina WiFi system. See https://L-36.com/DIY-boat-monitoring-system.php

One might rightly ask why I don't just let the pump, which is an automatic Rule 500GPH pump, do what it is intended to do. The reason is that I have had many of these pumps and they consistently fail after 3 to 5 years. I get an extended warranty so the cost is not a large issue, but I would rather not have the pumps fail. My thought is that if they run every few hours instead of every 2.5 minutes, they will last longer. It turned out that making this happen was the hardest part of writing the software for the monitor. More on that later.

Avoiding Software Lock-Up

It wasn't long until my daily reports quit coming and I had to drive to my boat to see what was going on. What I found was that the light was on but nobody was home. The light being on means the software was running but as nothing was happening clearly it was hung up.
The standard way that an ESP32 connects to WiFi can be found using Google. Do that and you will find this code, which I initially used:

void initWiFi() {
  WiFi.mode(WIFI_STA);
  WiFi.begin(ssid, password);
  Serial.print("Connecting to WiFi ..");
  while (WiFi.status() != WL_CONNECTED) {
    Serial.print('.');
    delay(1000);
  }
  Serial.println(WiFi.localIP());
}

Clearly the code will sit there forever if for some reason the WiFi doesn't connect. This is very difficult to troubleshoot as it always connects at home but once every few weeks fails at the marina. The solution is simple. NEVER HAVE A WHILE LOOP WITHOUT A TIMER EXIT. Here is the code now:

int connectToWiFi(){
  int counter = 0;
  int i = 0;
  // Outer loop: reset the WiFi and retry, at most 10 times
  while (WiFi.status() != WL_CONNECTED && counter < 10) {
    WiFiReset();
    counter++;
    // Inner loop: give each attempt up to 10 x 3 seconds to connect
    for(i = 0; i < 10; i++){
      if (WiFi.status() == WL_CONNECTED) break;
      delay(3000);
      Serial.println("Connecting");
    }
  }
  Serial.println();
  if( WiFi.status() != WL_CONNECTED ) {
    Serial.println("Failed to connect");
    WiFiFail+=1000000;
    return FALSE;// failed will try again in 5 minutes
  }
  Serial.println("WiFi connected");
  return TRUE;
}

The key here is that the while loop now has a second way to exit in case of failure. I had several places where there were while loops, all copied from examples found using Google in otherwise excellent tutorials. A global search of my code for "while" and the insertion of a counter exit in each loop solved the hang-up problem.
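The same rule can be packaged as a small helper so that no wait loop is written without a limit. This is a minimal sketch and not the monitor's actual code: the name waitWithLimit and its parameters are made up for illustration. It is written in plain C++ so it can be tried off the board; on an ESP32 the pause argument would typically wrap delay().

```cpp
#include <functional>

// Polls `ready` until it returns true or `maxAttempts` polls have been made.
// `pause` runs between polls (hypothetical hook; on an ESP32 it might call delay()).
// Returns true if `ready` succeeded, false if the attempts ran out.
bool waitWithLimit(const std::function<bool()>& ready,
                   int maxAttempts,
                   const std::function<void()>& pause) {
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (ready()) return true;
    if (attempt + 1 < maxAttempts) pause();  // no pause after the final poll
  }
  return false;  // the caller decides what to do about the failure
}
```

A call such as waitWithLimit(isConnected, 10, pauseThreeSeconds), where both arguments are functions you supply, can never block longer than nine pauses, whatever the network does.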

Of course the above code is only part of the story. You will notice it returns either true or false. On a failure the actual sending of the message is skipped. A failure code is also set, and in that case, instead of sleeping for my normal half hour, it sleeps for five minutes and tries again. But the point is, if there is a failure, you decide what to do about it, and hopefully what you do won't cause another loop. I chose to sleep so that everything would be reset when the timer interrupt woke the system.
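The sleep decision described above can be sketched as a single function. The two intervals come from the text; the function name is hypothetical. The sketch assumes the sleep is an ESP32 timer wakeup, for which the ESP-IDF call esp_sleep_enable_timer_wakeup() takes its argument in microseconds.

```cpp
#include <cstdint>

// Intervals from the article: 30 minutes normally, 5 minutes after a failure.
constexpr uint64_t kNormalSleepMinutes = 30;
constexpr uint64_t kRetrySleepMinutes = 5;

// Returns the sleep time in microseconds, the unit expected by
// esp_sleep_enable_timer_wakeup() (assumed here to be the wakeup source).
constexpr uint64_t sleepMicros(bool lastReportFailed) {
  const uint64_t minutes =
      lastReportFailed ? kRetrySleepMinutes : kNormalSleepMinutes;
  return minutes * 60ULL * 1000000ULL;
}
```

On the board the result would be passed to esp_sleep_enable_timer_wakeup() before calling esp_deep_sleep_start(). Waking from deep sleep restarts the program from the top, which fits the goal of having everything reset.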

I have several while loops in the code, all copied from examples. The one above connects to the local WiFi. One reads the time from an NTP server. One reads a web page, which is how I pass a variable to the code. And the final one sends the email. If any of these fail, the error code is incremented by an amount specific to that routine. The error code is sent in the next successful email. That way I can see which routines failed and how many times.
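One way to read such a combined error code is to give every routine its own decimal position. In the sketch below only the WiFi weight of 1000000 comes from the code above. The other three weights, the struct, and the function name are hypothetical and chosen only to show the idea.

```cpp
#include <cstdint>

// Failure counts recovered from one combined error code.
struct FailureCounts {
  int wifi;
  int ntp;
  int webPage;
  int email;
};

constexpr int32_t kWiFiWeight = 1000000;  // from the article (WiFiFail += 1000000)
constexpr int32_t kNtpWeight = 10000;     // hypothetical
constexpr int32_t kWebPageWeight = 100;   // hypothetical
// Email failures are assumed to add 1 each (hypothetical).

// Splits the code into per-routine counts, two decimal digits per routine.
FailureCounts decodeErrorCode(int32_t code) {
  FailureCounts counts{};
  counts.wifi = code / kWiFiWeight;
  code %= kWiFiWeight;
  counts.ntp = code / kNtpWeight;
  code %= kNtpWeight;
  counts.webPage = code / kWebPageWeight;
  counts.email = code % kWebPageWeight;
  return counts;
}
```

With these weights a code of 2030001 reads as two WiFi failures, three NTP failures, and one email failure. The scheme only stays readable while each lower counter is below 100, because a larger count spills into the next field.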

Missing Inputs

Not all errors are catastrophic. From some of them you can recover and continue. A good example of this is the NTP time reading. It is really only needed because I want to get a status email every morning around 6AM. The internal clock on an ESP32 is not very good, so over the course of weeks it will drift and eventually I might get the email at night. And of course, any clock needs to at least be initialized. But if the connection to the NTP server is missed one time, that is not a problem as long as the internal clock is not set to zero or some other nonsense value. Just skip and go on.

Another thing the monitor does is read a maximum time between pump runs, which is a variable I set remotely. I have a webpage that has that value and the ESP32 reads the value off that page. If that fails, the important thing is to not change the value. The reason I have that variable is to lower the interval if it is going to rain. I am not sure where all the rain comes in, but any rain that falls on my cockpit sole hatch (over the engine) ends up in the bilge, and any rain that gets in the mast does the same. When it rains there is a significant amount of water, and I don't want to wait a day with all that fresh water sitting in the bilge.
Here is the code that reads the webpage with that value.

int forecast(){
  int number = -99; // sentinel meaning "no valid reading"
  if ((WiFi.status() == WL_CONNECTED)) { //Check the current connection status
    HTTPClient http;
    const char *url = "https://L-36.com/papooseRain.html?x=1";
    http.begin(url);
    int httpCode = http.GET(); //Make the request
    if (httpCode > 0) { //Check for the returning code
      String payload = http.getString();
      number = payload.toInt();
    }
    http.end(); //Free the resources
  }
  return number;
}

The calling routine might call this twice but doesn't update the data unless it gets a valid result.

int number = forecast();// get rain code
if (number >= -1) max_pump_cycle = number;
else{
  delay(2000);
  number = forecast();// get rain code again
  if (number >= -1) max_pump_cycle = number;
}
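One gap may be worth closing, depending on how the web server behaves. HTTPClient's GET() returns a positive number for any HTTP status, including 404, and Arduino's String::toInt() returns 0 when the text does not start with a number. An error page could therefore come back as 0, which passes the >= -1 test. The sketch below shows a stricter check in plain C++ so it can be tested off the board. The function name and the upper limit of 1000 are assumptions, not part of the monitor's code.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

constexpr int kNoReading = -99;   // same sentinel value as forecast()
constexpr long kMaxSetting = 1000;  // hypothetical upper limit for the setting

// Returns the setting only if the request succeeded and the whole payload
// is a single integer in range; otherwise returns kNoReading.
int parseSetting(int httpCode, const std::string& payload) {
  if (httpCode != 200) return kNoReading;  // reject 404, 500, and so on
  const char* begin = payload.c_str();
  char* end = nullptr;
  errno = 0;
  const long value = std::strtol(begin, &end, 10);
  if (end == begin || errno == ERANGE) return kNoReading;  // no digits or overflow
  // Allow trailing whitespace such as a newline, but nothing else.
  while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') ++end;
  if (*end != '\0') return kNoReading;
  if (value < -1 || value > kMaxSetting) return kNoReading;
  return static_cast<int>(value);
}
```

Because the function returns the same -99 sentinel, the calling code that tests number >= -1 would not need to change.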

I should mention that there are two bilge pumps. The small one is on the monitor timer and the big one is on a float switch. If the big one goes off, I get an immediate email, as it means the small one has failed or something is very wrong with the boat.

Bad Data

One of the biggest issues was dealing with bad data. When the pump is powered on, it stays on until it is done pumping, at which point it turns itself off. I monitor that wire so I know when I turned the pump on and when it turned itself off. I then turn off the master power until the next time, which depends on how much the boat is leaking. To determine that, I keep track of how long the pump runs. The hose from the pump runs uphill to a point where there is a vacuum break valve. After each run, the water in that length of hose flows back into the sump. It takes about 5 seconds for the pump to move that water back up to the valve, so it will always run for ABOUT 5 seconds. I try to wait long enough between pump runs that I get a meaningful amount of water in the sump, but not so much that it puts my large pump's seal under water. That is a run time of about 15 seconds. The problem is that the pump often keeps running after the bilge is empty, so I might get 4 or 6 seconds of run time with no water. Basically, there is noise in the conversion from pump run time to gallons. To combat that, I ended up putting in a filter and then using the output of the filter to decide how long to wait between pump runs. More on that in the next section.

I use a simple filter.

pumpFilteredGallons = pumpFilteredGallons + FILTER_TIME_CONSTANT * (gph - pumpFilteredGallons);
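To see what this filter does to noisy readings, here is the same update written as a standalone function and applied to a series of samples. The function names are made up for illustration. The article does not give the value of FILTER_TIME_CONSTANT, so any value used with this sketch is an assumption.

```cpp
#include <vector>

// One update of the first-order filter from the article:
// move the filtered value a fraction of the way toward the new sample.
constexpr double filterStep(double filtered, double sample, double timeConstant) {
  return filtered + timeConstant * (sample - filtered);
}

// Runs the filter over a series of samples, starting from `initial`.
double filterSeries(const std::vector<double>& samples,
                    double initial,
                    double timeConstant) {
  double filtered = initial;
  for (double sample : samples) {
    filtered = filterStep(filtered, sample, timeConstant);
  }
  return filtered;
}
```

With a time constant of 0.25, a jump in the input from 0 to 1 moves the output to 0.25 after one sample and 0.4375 after two. A smaller constant smooths more but reacts more slowly, and a constant of 1 passes the raw reading straight through.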

Algorithm Improvements

When first powered on, the monitor runs the pump every half hour. The unit is initialized when the 12 volt power is on, which typically means I am going sailing. If I am going sailing, the boat is going to leak. The pump is left powered on while sailing, and as it is an auto pump it will just do its own thing while sailing. After the sail is over, the pump is going to run and find 0.05 gallons, but given the bad data from the pump, this might read as 0 or 0.1 gallons. Without a way to know yet how much the boat is leaking, the time between runs is increased gradually. The counter goes up, doubling the time interval. At least that is what I did at first. If the pump ran for less than 15 seconds, double the interval. But that put the big pump under water, and as I have been told that my 30 year old pump should only last about 4 years, I decided not to put it under water so often. That meant running the pump between 10 and 15 seconds. I needed data with less noise, so I filtered it.

The algorithm now does a calculation of what the pump interval should be to make the gallons pumped my 0.4 gallon target. Even when the boat is leaking it doesn't leak much in a half hour so initially the calculation of how long the interval should be is too noisy to use. Instead of using that calculation, I use the minimum of it and an increment interval of 2 initially and 1.5 after more water is being pumped. This was tested both on the bench and on the boat. The pump now stabilizes to the desired pump range in about a day.
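One possible reading of this rule is sketched below. It is not the monitor's actual code. It assumes that the leak rate is roughly constant, so the gallons pumped scale with the interval, and that the "increment interval" means the current interval multiplied by a growth factor of 2 at first and 1.5 later. The function name and parameters are hypothetical.

```cpp
#include <algorithm>

constexpr double kTargetGallons = 0.4;  // target per pump run, from the article

// currentInterval: the interval that just ended, in any time unit.
// gallonsPumped:   filtered estimate of what that interval produced.
// growthFactor:    2.0 initially, 1.5 once more water is being pumped.
// Returns the next interval in the same unit as currentInterval.
double nextInterval(double currentInterval, double gallonsPumped,
                    double growthFactor) {
  const double cap = currentInterval * growthFactor;
  if (gallonsPumped <= 0.0) return cap;  // nothing measured: just grow
  // Scale the interval so the expected volume lands on the target.
  const double calculated = currentInterval * kTargetGallons / gallonsPumped;
  return std::min(calculated, cap);  // never grow faster than the cap
}
```

Early on the measured volume is tiny and noisy, so the calculated interval is huge and the cap decides. Once the volume nears the target, the calculation takes over and can also shorten the interval when too much water was pumped.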

The green line is the unfiltered calculated leak rate per day and the blue is the filtered version. The filter is not used for the first few intervals. The goal is to have the amount pumped constant, and that is shown in the purple line. You can see one instance where a bit too much was pumped. It is just as likely that the pump simply ran a bit longer without there actually being more water, but the algorithm decreased the interval and all the readings from there on look as desired. The red line shows the interval between pump runs, called PCM. The time is that number plus one, in half hours, so a PCM of 23 would make the pump run every 12 hours.

I stopped at this point in the graph because the next day we had a bit of rain and including that just confused the story. But this brings up the next challenge, which is how to adapt the algorithm for rain. It is fine to have a nice algorithm that works its way to only running the pump once a day, but what if the boat is full of rain water? I have the rain variable but it is only read when the monitor sends an email, which might be a long time after the rain starts. The next goal is to change the unit to react more quickly to rain. I will start by reading the rain variable every half hour and then see if I can automate that based on a forecast. The quest never stops.

Summary

I hope you have enjoyed this discussion. If you have questions, please see the contact section. The biggest takeaway is to assume every external element is subject to failure and other non-ideal behavior. WiFi connections don't always connect. Web pages don't always work. Pumps don't always shut off when they should and sometimes might not run at all when there is water to pump. Everything is subject to errors and you need to put in code to detect and then handle all the things that might go wrong.
