
      Blue Matador monitors the outgoing TCP connections that a server makes and detects when connections that used to succeed no longer do. Only connections that were previously established successfully are tested, and connections to a port on the local machine are exempt.
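
The selection rule described above can be sketched in a few lines of Python. This is an illustration only, not Blue Matador's actual implementation: the function name `endpoints_to_test`, the `seen_ok` list of previously successful (IP, port) pairs, and the `local_addrs` set of the server's own addresses are all hypothetical.

```python
import ipaddress
from typing import Iterable, List, Set, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


def endpoints_to_test(seen_ok: Iterable[Endpoint], local_addrs: Set[str]) -> List[Endpoint]:
    """Keep previously successful endpoints, skipping ones on the local machine.

    Assumption: "local" means a loopback address or one of the server's own
    addresses, which the caller supplies in `local_addrs`.
    """
    selected = []
    for ip, port in seen_ok:
        if ipaddress.ip_address(ip).is_loopback or ip in local_addrs:
            continue  # connections to the local machine are exempt
        selected.append((ip, port))
    return selected
```

For example, given a database on another host, a local cache on 127.0.0.1, and a service bound to the server's own private address, only the database endpoint would be kept for testing.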

      Effects


      When a connection to a remote host fails, it usually manifests itself in an application’s logs. Possible resulting errors in your application include:

      • Unable to connect to a database
      • Unable to complete HTTP requests to another service
      • Timed out and retried connections to external services

       

      Causes


      Connection issues with a remote host can be caused by a problem at any point in the networking infrastructure between your server and the remote host. Some common causes include:

      • The remote service is no longer available
      • The remote service changed the port it listens on
      • A security group change blocking outgoing connections from your server
      • A security group change blocking incoming connections on the remote host
      • Inability to reach the internet via NAT
      • A VPN or tunnel between two private networks no longer working
      • iptables rules on either the local or remote server blocking the connection

       

      Troubleshooting


      Troubleshooting connections to remote hosts is tricky and requires a process of elimination to determine what the problem really is. The following questions are designed to help you determine where the network issue is and what can be done to resolve it.

      Can you telnet there?
      See if you can use telnet from the affected server to connect to the remote host and port. If you can, then it’s possible that the connection timed out earlier or fails intermittently. In that case, you can rule out security group issues as the cause.
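
If telnet is not installed on the affected server, the same check can be approximated with Python's standard library. This is a minimal sketch: the function name `can_connect` and the wording of the returned messages are illustrative, and the interpretations in them are common cases rather than certainties.

```python
import socket
from typing import Tuple


def can_connect(host: str, port: int, timeout: float = 5.0) -> Tuple[bool, str]:
    """Attempt one TCP connection, roughly what `telnet host port` does."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # Packets are often being dropped silently (firewall, security group).
        return False, "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection on that port.
        return False, "refused"
    except socket.gaierror as exc:
        return False, f"DNS lookup failed: {exc}"
    except OSError as exc:
        return False, f"network error: {exc}"
```

Calling `can_connect("db.internal.example", 5432)` (a placeholder host name) returns a success flag and a short reason. Running it several times helps reveal connections that fail only some of the time.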

      Can you telnet anywhere?
      Make sure the affected server has basic network connectivity. If you expect the affected server to have internet access, telnet to a host that is known to be running, such as www.google.com on port 80. Otherwise, try to connect to another working service in your private network. If you cannot connect to anything, the server may have no network connectivity. Check for iptables rules blocking outgoing connections, and investigate with your cloud provider. If it turns out that the hardware is degraded, the quickest solution may be to replace the server.

      Can you telnet there from another server?
      If you cannot connect using telnet from the affected server, but can from another server that is supposed to have access, then there is likely an issue with security groups or iptables. You can view the current iptables rules using iptables -L. Security groups can be checked using your cloud provider’s API or web console.
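
On a server with many rules, the iptables listing can be long. The sketch below scans the text of a listing for rules whose target is DROP or REJECT and for chains whose default policy is DROP. The function name `blocking_rules` and the sample listing are illustrative; the sample assumes the numeric output of `iptables -L -n`, and real output on your server will differ.

```python
from typing import List, Tuple

# Illustrative sample only, in the layout printed by `iptables -L -n`.
SAMPLE = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
DROP       tcp  --  0.0.0.0/0            10.0.0.5             tcp dpt:5432
"""


def blocking_rules(listing: str) -> List[Tuple[str, str]]:
    """Return (chain, rule) pairs that drop or reject traffic."""
    chain = ""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Chain":
            chain = fields[1]
            # A default policy of DROP blocks anything not explicitly accepted.
            if "(policy DROP)" in line:
                found.append((chain, "default policy DROP"))
        elif fields[0] in ("DROP", "REJECT"):
            found.append((chain, " ".join(fields)))
    return found


for chain_name, rule in blocking_rules(SAMPLE):
    print(chain_name, "->", rule)
```

To use it on a real server, save the command's output to a file and pass the file's contents to `blocking_rules`. A match is only a candidate: check whether the rule's destination and port actually correspond to the failing connection.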

      If you cannot connect from another server, then it is possible that the remote service is completely inaccessible or that you do not have access to the other network through NAT or a VPN.
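
The process of elimination in the three telnet questions above can be summarized as a small decision function. This is a sketch of the reasoning only; the function name `likely_cause` and its parameters are hypothetical, and each result is a starting point for investigation rather than a diagnosis.

```python
def likely_cause(reaches_target: bool, reaches_anything: bool, other_server_reaches_target: bool) -> str:
    """Map the answers to the three telnet questions to the next thing to check."""
    if reaches_target:
        return "works now: the earlier failure may have been a timeout or an intermittent fault"
    if not reaches_anything:
        return "affected server has no connectivity: check outgoing iptables rules and your cloud provider"
    if other_server_reaches_target:
        return "likely security groups or iptables affecting the affected server"
    return "remote service may be inaccessible, or the NAT or VPN path to its network is down"


# Example: the affected server fails, but it can reach other hosts and a peer server succeeds.
print(likely_cause(False, True, True))
```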

      Is the remote host publicly accessible?
      If the remote host is supposed to be publicly accessible, you can try connecting from your laptop or a server on a known good network. If this fails, then the remote server is likely to blame, and you may want to check for outage reports for the service. If you can access it publicly but not from your servers’ private network, then checking the NAT or VPN tunnel would be wise.

      Did network configuration recently change?
      If any security group, iptables, VPN, NAT, or other network changes were recently released, check with the team in charge of those changes to make sure they were intended and are working as expected. A seemingly innocuous change in one part of the network can affect systems owned by other teams and legacy clients.

      Is the service relatively new?
      A new service is often misconfigured or underprovisioned. Check with the owners of that service to ensure that access should be allowed from the affected server, and that the service is highly available.

       

      Resources