While offline, edgeHub fails after automatic renewal of certificates #7321

Open
sejonssonr opened this issue Jul 4, 2024 · 9 comments

@sejonssonr

Our company has a large number of IoT devices that rely on the offline capabilities of IoT Edge.

We recently discovered that devices can run offline for a maximum of ~25 days.
The behaviour seems to be caused by the automatic renewal of device/workload certificates. The renewal interval can be specified by setting the edgeHub environment variable ServerCertificateRenewAfterInMs, but it maxes out at ~25 days (int32 max, i.e. 2,147,483,647 ms).
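
The ~25 day ceiling follows directly from the millisecond unit. A minimal sketch of the arithmetic, assuming the value is parsed as a signed 32-bit integer (as the int32 limit above suggests); the constant names are illustrative only:

```python
# Largest value a signed 32-bit integer can hold.
INT32_MAX = 2**31 - 1  # 2,147,483,647

# Milliseconds in one day.
MS_PER_DAY = 24 * 60 * 60 * 1000

# Longest renewal interval expressible in ServerCertificateRenewAfterInMs,
# under the int32 assumption stated above.
max_renew_days = INT32_MAX / MS_PER_DAY

print(f"{max_renew_days:.2f} days")  # prints "24.86 days"
```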

When the certificate is renewed, the edgeHub is stopped and fails to start again if the device is offline. This causes both data loss and complete failure of downstream devices to run configured modules. The edgeHub does not recover when connectivity is restored.

In the documentation found here the following is stated: "While disconnected from IoT Hub, the IoT Edge device, its deployed modules, and any downstream devices can operate indefinitely."

Expected Behavior

IoT Edge modules including the edge hub can operate indefinitely in offline mode

Current Behavior

Edge Hub stops after being offline for ~25 days, which causes data loss at the devices.

Steps to Reproduce

  1. Configure an edge device to run with specified CA certificates by updating config.toml
    [edge_ca]
    cert = "file:///etc/pki/tls/certs/<mydeviceca>.full-chain.ca.cert.pem"
    pk = "file:///etc/pki/tls/private/<mydeviceca>.key.pem"
  2. For the edgeHub module set the environment variable ServerCertificateRenewAfterInMs to 60000 ms in order to enforce renewal every minute.
  3. When the device twin is downloaded by the device, simulate an offline situation by disconnecting the device from the internet.
  4. After a few minutes, observe that the edgeHub fails to start after renewal of certificates.
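
For step 2, the environment variable is set on the edgeHub module in the deployment. The sketch below builds just the `env` entry in Python and prints it as JSON. It assumes the usual deployment manifest shape, where each environment variable maps to an object with a string `value` and the entry sits under `systemModules.edgeHub`; the issue itself does not show the manifest, and the name `edgehub_fragment` is purely illustrative.

```python
import json

# Renewal interval from step 2: 60000 ms forces a renewal every minute.
RENEW_AFTER_MS = 60_000

# Assumed manifest shape: env is a map of variable name -> {"value": "<string>"}.
edgehub_fragment = {
    "env": {
        "ServerCertificateRenewAfterInMs": {"value": str(RENEW_AFTER_MS)}
    }
}

print(json.dumps(edgehub_fragment, indent=2))
```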

Context (Environment)

Output of iotedge check


Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
× aziot-identity-service package is up-to-date - Error
    could not query https://aka.ms/latest-aziot-identity-service for latest available version
‼ host time is close to reference time - Warning
    Could not query NTP server
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
× host can connect to and perform TLS handshake with iothub AMQP port - Error
    Could not connect to <masked>.azure-devices.net : could not complete TLS handshake
× host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - Error
    Could not connect to <masked>.azure-devices.net : could not complete TLS handshake
× host can connect to and perform TLS handshake with iothub MQTT port - Error
    Could not connect to <masked>.azure-devices.net : could not complete TLS handshake

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
√ container engine is installed and functional - OK
√ configuration has correct URIs for daemon mgmt endpoint - OK
× aziot-edge package is up-to-date - Error
    Error while fetching latest versions of edge components: could not send HTTP request
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
√ production readiness: logs policy - OK
√ production readiness: Edge Agent's storage directory is persisted on the host filesystem - OK
√ production readiness: Edge Hub's storage directory is persisted on the host filesystem - OK
× Agent image is valid and can be pulled from upstream - Error
    Failed to login to ta01iotcrd01.azurecr.io
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
× container on the default network can connect to upstream AMQP port - Error
    Container on the default network could not connect to <masked>.azure-devices.net:5671
× container on the default network can connect to upstream HTTPS / WebSockets port - Error
    Container on the default network could not connect to <masked>.azure-devices.net:443
× container on the default network can connect to upstream MQTT port - Error
    Container on the default network could not connect to <masked>.azure-devices.net:8883
× container on the IoT Edge module network can connect to upstream AMQP port - Error
    Container on the azure-iot-edge network could not connect to <masked>.azure-devices.net:5671
× container on the IoT Edge module network can connect to upstream HTTPS / WebSockets port - Error
    Container on the azure-iot-edge network could not connect to <masked>.azure-devices.net:443
× container on the IoT Edge module network can connect to upstream MQTT port - Error
    Container on the azure-iot-edge network could not connect to <masked>.azure-devices.net:8883
23 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.
12 check(s) raised errors. Re-run with --verbose for more details.


Device Information

  • Host OS: CentOS 7 (but reproducible on other OSes too)
  • Architecture: amd64
  • Container OS: Linux containers

Runtime Versions

  • aziot-edged: 1.4.20
  • Edge Agent: 1.4.38
  • Edge Hub: 1.4.38
  • Docker/Moby: 20.10.25

Logs

aziot-edged logs [iotedge_system_logs.txt](https://github.com/user-attachments/files/16100719/iotedge_system_logs.txt)
edge-agent logs [edgeAgent_logs.txt](https://github.com/user-attachments/files/16100711/edgeAgent_logs.txt)
edge-hub logs [edgeHub_logs.txt](https://github.com/user-attachments/files/16100674/edgeHub_logs.txt)

Additional Information

Logs supplied as files due to max character limit.

@vipeller vipeller self-assigned this Jul 4, 2024
@VGDev1

VGDev1 commented Jul 4, 2024

I've encountered the same issue on my devices.

@vipeller
Contributor

Hi @sejonssonr, sorry for the late answer. I have started investigating this and looked at the logs you attached. I see the periodic certificate renewal and a number of errors about not being able to connect to IoT Hub. I assume that is expected and caused by being offline. I also see a module connecting (roger-test-240424/dispatcher), although the related upstream connection fails. Again, I assume this is expected, as the upstream network connection was disabled for this test.

The part I am not sure I understand is that the error description says "the edgeHub is stopped and fails to start again". However, in the logs I see edgeHub restarting every few minutes and, based on the logs, it accepts incoming connections.

I don't see this part in the logs, but the expected behavior is that if the "dispatcher" module sends messages, those will be forwarded once edgeHub is online again. Based on your description, "This causes both data loss and complete failure of downstream devices to run configured modules.", is this what does not happen?

To recap, I want to clarify which of these is the problem:

  • the problem is that edgeHub does not restart - but the attached logs show it restarting (and accepting connections, at least from that single module "dispatcher")
  • the problem is that messages sent during the offline session get lost
  • some other problem I missed (e.g. there are multiple modules and those cannot connect)

@silvestropomestetrapak

@vipeller Sorry for the late reply. I'm a colleague of @sejonssonr.
I'll answer point by point:

  • the problem is that edgeHub does not restart - but the attached logs show it restarting (and accepting connections, at least from that single module "dispatcher")

You can see that the edgeHub logs stop at 12:08:51, after one last attempt to start without connectivity.
You can also see in the edgeAgent logs that it stops trying to restart the edgeHub module at 12:10:14, after this error:
Error getting edge agent twin from IoTHub
From 12:08:51 on, the edgeHub remained stopped.

  • the problem is that messages sent during the offline session get lost
  • some other problem I missed (e.g. there are multiple modules and those cannot connect)

Messages get lost when edgeHub is down, of course, but the other important issue is that all IoT Edge downstream devices (nested child devices) fail to start when the edgeHub on the parent device is down.
We expect a child device to be able to start the IoT Edge services even when the edgeHub on the parent device is not available.

It looks like, in this case, the Identity Service on child devices fails indefinitely. It's stuck in a restart loop.
This is from a child device in the same situation:
[image: screenshot from the child device, not preserved]

We restored the system by manually restarting edgeHub on the parent once connectivity had been restored.
This made the identity service on all child devices start again.
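
The manual recovery described above could be automated on the parent device as a stopgap. What follows is only a sketch under stated assumptions, not a supported fix: it assumes `docker inspect` and `iotedge restart` are available on the host, that the module container is named `edgeHub`, and that a TCP connection to the hub's MQTT port (8883) is a good enough connectivity signal. The function names are hypothetical.

```python
import socket
import subprocess


def hub_reachable(hostname: str, port: int = 8883, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the IoT Hub MQTT port succeeds."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


def edgehub_running(run=subprocess.run) -> bool:
    """Ask the container engine whether the edgeHub container is running."""
    result = run(
        ["docker", "inspect", "-f", "{{.State.Running}}", "edgeHub"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def restart_edgehub_if_needed(hostname, reachable=hub_reachable,
                              running=edgehub_running, run=subprocess.run) -> bool:
    """Restart edgeHub only when the hub is reachable and edgeHub is stopped.

    Returns True if a restart was issued, False otherwise.
    """
    if not reachable(hostname):
        return False  # Still offline: a restart would not help.
    if running():
        return False  # Nothing to do.
    run(["iotedge", "restart", "edgeHub"], check=True)
    return True
```

Such a function would typically be called from a timer (for example a cron job or systemd timer) with the device's own IoT Hub hostname. The checks and the command runner are injectable so the logic can be exercised without touching a real device.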

@Lazer91

Lazer91 commented Aug 2, 2024

Hi, is there any news on this?

@bishal41
Contributor

@vipeller any update on this one?

@sejonssonr
Author

Any updates?

@jlian
Member

jlian commented Nov 12, 2024

Hey folks, we are looking into prioritizing a fix. Currently we think it might be on the SDK side. The hypothesis is that when edgeAgent is offline, it makes a call (into the SDK) that gets stuck (never returns), and because of that it stops restarting stopped modules. And because edgeHub stops in order to renew its cert, it stays stopped. But we still need to find some bandwidth on the team to get an isolated repro. Will update soon.
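
The failure mode in this hypothesis, a never-returning call starving the loop that restarts modules, can be illustrated in isolation. The sketch below is not edgeAgent or SDK code; `get_twin_stuck` and `reconcile_once` are hypothetical stand-ins, and it only shows how bounding the blocking call with a timeout lets the rest of the loop proceed.

```python
import asyncio


async def get_twin_stuck():
    """Hypothetical stand-in for an SDK call that never returns while offline."""
    await asyncio.Event().wait()  # The event is never set, so this blocks forever.


async def reconcile_once(get_twin, timeout_s: float) -> str:
    """One pass of a made-up reconcile loop with a bounded twin fetch."""
    try:
        await asyncio.wait_for(get_twin(), timeout=timeout_s)
        config_source = "fresh"
    except asyncio.TimeoutError:
        # Without the timeout, execution would never get past the fetch.
        config_source = "cached"
    # The module restart step is reached either way.
    return f"restarted stopped modules using {config_source} config"


result = asyncio.run(reconcile_once(get_twin_stuck, timeout_s=0.1))
print(result)
```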

@jlian
Member

jlian commented Nov 26, 2024

Just a quick update to say we are tracking this and should be able to line up some bandwidth for more investigation soon.

@Lazer91

Lazer91 commented Nov 27, 2024 via email
