Even in the healthiest networks, issues can arise that cause a loss of network connectivity, sometimes preventing users from gaining access to critical network resources. These outages are usually unplanned and often difficult to predict. Network administrators normally have a short period of time to identify a problem and find a way to resolve it. In these situations, a common first step is to reboot networking equipment until the issue is resolved. This may work as a short-term solution, but rebooting equipment regularly may mask a larger issue that warrants further investigation. This document details steps that can be used to troubleshoot a network connectivity issue and explains what information can be gathered to discover the source of this type of problem.
Basic network connectivity troubleshooting will be the same regardless of the AOS device being used. Some troubleshooting options may not be available in all units; however, this document covers only the general troubleshooting options available on every unit.
It is important when experiencing any issue to check that your equipment is running current software. If you are unsure which firmware version you should be running, ADTRAN recommends the latest Extended Maintenance Release (EMR) to avoid possible software issues. To determine the current EMR, please see the product page for your applicable product on www.adtran.com. For more information on AOS firmware naming conventions, please see .
Without a troubleshooting plan in place for network disasters, it is very easy to panic when a problem occurs. Most businesses today primarily use network and internet connections to do a majority of their critical operations. When network resources are lost, panic can ensue among the business employees which will eventually be fed back to any network administrator. In almost every case, top priority is to get the network back to working order as quickly as possible. Often, little thought is put into troubleshooting an outage beyond simply rebooting equipment.
This approach is very understandable, but what happens if this issue occurs again? What caused the problem in the first place? The answer to these questions really depends on what you learn from the problem when it occurs with the troubleshooting steps you take. Normally problems of this nature can only be fully identified if troubleshooting is done while the problem is occurring. If data is not gathered during the issue, the administrator may not learn how to properly prevent the problem. The network may come back up quickly with the reboot of a unit, or by reconnecting to the network, but if this problem happens repeatedly, the short length of the outage becomes irrelevant compared to the overall number of times the problem is experienced.
This is why an administrator should work with the following in mind: How can I prevent this problem from happening again while getting the network back up as quickly as possible? With this line of thinking, a small amount of time may be lost while troubleshooting, but far more time will be saved in the future by preventing the same problem from recurring. With a detailed and precise troubleshooting plan in place beforehand, this troubleshooting time can be reduced even further.
This section details basic general steps that should be taken when any network problems arise dealing with loss of connectivity to network resources or the Internet. Feel free to improvise off of these basic steps as each may not be fully applicable in terms of every network situation. These should, however, serve as a general guide during each network issue.
Depending on how large your network is, a report of a "network problem" can be vague. Normally a network administrator will find out about a network problem through another employee, possibly someone who is not technically savvy. They may just say that they "can't get onto the Internet" or "can't access the file storage server". Though these problems sound simple, there may not be enough context to know exactly what is causing them. If the physical link on a user's client goes down, that user will lose connectivity to everything, so a user who reports lost Internet connectivity may have actually lost all connectivity. Similarly, someone could claim they lost access to a server because that was the only thing they were using, when in reality they have no connectivity at all.
In cases like this, it is important to ask the user troubleshooting questions to narrow down the problem, or to physically log onto their system to see if the issue can be put into context. Some of the things you want to find out in each type of situation:
As you can see, some of the steps are very similar in each of the above situations because they are all trying to achieve the central point: Narrow down where the problem is occurring and who the problem is affecting. These steps will not always be the same: there is a certain amount of improvisation that will need to be used to fully figure out the problem. However, these steps should serve as a general guide showing the thought process that should be used when a network connectivity problem occurs. An example situation is shown below:
Company.com's network administrator gets a call from users in building A complaining of "network connectivity problems". Upon arriving at building A and questioning the users further, the network administrator realizes they do not have Internet access. By asking around, it is discovered that no one in building A seems to have Internet access. A quick call over to the employees at building B confirms their Internet is up and working. Since their PCs are located in a different VLAN, it seems that building A's VLAN is somehow not getting out to the Internet.
The administrator starts by sending a PING from a Building A PC to 22.214.171.124 (a public IP that's easy to remember and always accessible on the Internet) to see if the issue is caused by a lack of DNS resolution. This fails, so it seems that there is an actual break in connectivity. The administrator then decides to traceroute to 126.96.36.199. This fails upon reaching the third hop, which is the site's main Internet router. The administrator PINGs several other internal network units to confirm that only Internet access is lost. With that done, the administrator has narrowed the problem down to the Internet router or a point further into the service provider network, and can now focus troubleshooting on that one area.
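The first check in this example (PING a known public IP, then test name resolution) can be scripted. The following is a minimal Python sketch, not an ADTRAN tool: the function names `classify`, `ping`, and `resolves` are hypothetical, and the `ping` helper assumes the operating system's own `ping` command is available and accepts `-c` (Linux/macOS) or `-n` (Windows) for the packet count.

```python
import socket
import subprocess
import sys


def classify(ip_reachable, name_resolves):
    """Interpret the two checks from the example.

    ip_reachable  -- True if a PING to a known public IP succeeded
    name_resolves -- True if a hostname could be resolved through DNS
    """
    if ip_reachable and name_resolves:
        return "ok"
    if ip_reachable:
        # Raw IP traffic works, so only name resolution is broken.
        return "dns"
    # The IP itself is unreachable: an actual break in connectivity.
    return "connectivity"


def ping(host, timeout_s=5):
    """Send one echo request using the operating system's ping command."""
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def resolves(hostname):
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except OSError:
        return False
```

From a PC in the affected VLAN, `classify(ping(public_ip), resolves("www.adtran.com"))` returns `"connectivity"` in the situation described above, which is the cue to move on to a traceroute. Here `public_ip` stands for whichever well-known public address the administrator normally tests against.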
This is just a basic example of a network connectivity complaint, but the general troubleshooting steps will apply to the majority of issues of this type. In a network with a large number of routers, switches, and other units, narrowing down the problem can take hours off of the total troubleshooting time needed before the network is back up and functioning normally. The following section discusses what to do if the unit that the issue is narrowed down to is an AOS unit.
Resuming from the example in the above section, if the issue in Company.com's network leads the administrator to the Internet router (which happens to be an AOS unit), they must now troubleshoot the AOS unit directly to see if the issue in the unit can be identified. Unfortunately, it is a very common instinct to simply reboot the unit that seems to be causing the problem. Most residential router manufacturers encourage this practice in home networks, and even in an enterprise network, it's sometimes hard to imagine how something that has been working for a period of time could just stop. However, rebooting the unit has several negative effects:
Rebooting should normally be a last resort when attempting to restore connectivity. If the troubleshooting steps are done properly beforehand and a reboot ends up resolving the issue, important information has been obtained during the troubleshooting period that could help ADTRAN support engineers identify the issue that affected the unit. The output detailed below should be gathered from an AOS unit when issues of this nature occur, before a reboot is performed (if necessary).
Once the issue has been narrowed down to an AOS device, the device should be accessed for further troubleshooting via the Command Line Interface (CLI). This is important because the CLI has the most tools to help troubleshoot connectivity issues. If you need assistance logging into the AOS CLI, please see . If you are unable to log in to the CLI, please see the section If the Unit is not Accessible.
Once inside the CLI, attempt to find answers to these types of questions:
Asking and answering the above questions should help confirm whether the connectivity problem actually is in the AOS unit, or exists in a different section of the network past it. Assuming the issues still seem to point to the AOS unit causing the problem, the following steps at minimum should be taken. Note: It is recommended that anyone logging into the unit use a program like PuTTY that can log all session output to a text file:
As you can see, there is a startup-config and a startup-config.bak. Startup-config.bak is a copy of the previous startup-config after it's been saved. In other words, after saving a configuration, startup-config is a current copy of the config, while startup-config.bak is a copy of the config prior to the last time the configuration was saved. If these are not equal, it is important that they are pulled and compared (you can show them on the screen with the show file flash <name> command) so the last change made can be examined to see if it is possibly the issue.
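Once both files have been copied off the unit or captured from the screen, the comparison can be done with any diff tool. As a minimal sketch, the Python function below uses the standard `difflib` module to list only the lines that differ. The function name `config_changes` is hypothetical, and it assumes the two configurations have already been saved as text.

```python
import difflib


def config_changes(previous_text, current_text):
    """Return the lines that differ between two saved configurations.

    previous_text -- contents of startup-config.bak
    current_text  -- contents of startup-config
    Lines starting with "- " exist only in the previous config;
    lines starting with "+ " exist only in the current config.
    """
    diff = difflib.ndiff(previous_text.splitlines(), current_text.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop them.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

An empty result means the two files are identical, so the last save did not change the configuration.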
The below shows a snippet of commands that can be entered into a device quickly to gather all of the applicable information shown above without having to type each command individually. This should be entered in privilege exec mode:
term length 0
show ip policy-sessions
show process cpu
show process queue
exception report generate
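If the session was logged to a text file as recommended earlier, the captured output can be split into one section per command before it is reviewed or sent on. The Python sketch below is illustrative only. The function name `split_session_log` is hypothetical, and it assumes that each command is echoed on a line immediately after a prompt ending in `#` (for example `Router#show process cpu`), which may not match every session log.

```python
import re

# Commands from the batch above, in the order they are pasted.
COMMANDS = [
    "term length 0",
    "show ip policy-sessions",
    "show process cpu",
    "show process queue",
    "exception report generate",
]

# Assumption: a bare prompt looks like "Router#" with nothing after it.
BARE_PROMPT = re.compile(r"^\S+#$")


def split_session_log(log_text, commands=COMMANDS):
    """Return {command: output} for each command found in the log."""
    sections = {}
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        matched = None
        for cmd in commands:
            # Accept "Router#cmd" as well as "Router# cmd".
            if stripped.endswith("#" + cmd) or stripped.endswith("# " + cmd):
                matched = cmd
                break
        if matched is not None:
            current = matched
            sections[current] = []
        elif BARE_PROMPT.match(stripped):
            continue  # skip prompts with no command after them
        elif current is not None:
            sections[current].append(line)
    return {cmd: "\n".join(lines).strip() for cmd, lines in sections.items()}
```

A command that is missing from the returned dictionary was never echoed in the log, which is a quick way to spot a paste that was cut short.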
After running through these steps, if the issue still persists, a reboot can be performed to see if the problem resolves itself. If the reboot resolves it, all the above information should be provided to ADTRAN Technical Support along with a detailed problem description and a network diagram. If the reboot does not resolve it, contact ADTRAN Technical Support with the information above to help continue to troubleshoot the issue.
In certain cases during a network outage, attempts to log into a unit may fail. (This does not include a user lacking proper credentials; if you need credentials to log into the unit, please see another unit administrator.) This could be for several reasons:
In these cases, the unit should be accessed via console using a male to female, straight through DB9 cable. Once logged in via console, the commands shown in the section Information to Gather from the Unit should be gathered to help troubleshoot the unit, or be provided to Tech Support.
If the console is also not responsive, a reboot will most likely restore the unit to working condition, but it will not provide any information needed to help troubleshoot the issue. In this case, if the unit has a NIM, it should be pulled out of the AOS device without powering down the device. This will cause the unit to reboot and write an exception report to flash, which can be provided to Technical Support and can prove vital to finding a resolution. If a NIM is not present, call ADTRAN Technical Support about the issue.
Syslog should also be set up for a unit in this case so that any messages sent before the unit becomes inaccessible are logged and can be examined later. This can be set up using Configuring Syslog Logging in AOS.
For information on problems with AOS units behind third-party modems, please see Problems with Internet Connectivity to an AOS Unit Behind a Third Party Modem.
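When reviewing the collected messages later, it can help to sort them by severity. Standard syslog messages begin with a priority value in angle brackets, where priority = facility × 8 + severity. The Python sketch below decodes only that prefix. The function name `decode_priority` is hypothetical, and the sketch assumes the syslog server stored the raw messages with the prefix intact; it does not assume anything about the rest of the AOS message format.

```python
import re

# Severity names in standard syslog order (0 = most severe).
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]

PRI_PREFIX = re.compile(r"^<(\d{1,3})>")


def decode_priority(message):
    """Return (facility_number, severity_name), or None if no valid prefix."""
    match = PRI_PREFIX.match(message)
    if match is None:
        return None
    priority = int(match.group(1))
    if priority > 191:  # highest valid value: facility 23, severity 7
        return None
    facility, severity = divmod(priority, 8)
    return facility, SEVERITIES[severity]
```

Filtering the stored messages down to `"error"` and more severe levels is a reasonable first pass at the period just before the unit stopped responding.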