davidbusey
Adtran Team

Troubleshooting Network Performance Testing (FCC/PMM testing) in Mosaic Device Manager


The following is a series of common issues you may experience while using the Network Performance Test (NPT) features of Mosaic Device Manager (MDM) followed by the most likely remedy to the problem. Before we get into troubleshooting specific symptoms, we'll begin with a list of critical background reading and an introduction to the tools and general troubleshooting skills that will help to make your Network Performance Test experience successful.

The troubleshooting skills discussed are:

  • Evaluating your bulk operation
  • Using network logging

The specific troubleshooting scenarios discussed below are:

  • NPT Report is completely blank 
  • Times tested in the reports are not the correct 6PM to 11:59:59PM local time
  • Devices only test every other day
  • Some units did not test
  • A device did not report for the full testing window
  • Testing errors are indicated in the reports
  • Large amounts of deferrals 
  • Tests do not indicate true speeds
  • Latency tests periodically show high latency / Latency scores failing
  • Reports didn't generate for a day
  • Report output not formatted correctly
  • Devices not responding to Bulk Operations for Network Performance Test (NPT) 

 

Before consulting this Troubleshooting Guide, make sure you have completed all of the necessary preparations for testing as explained in the core documentation associated with NPT and/or PMM testing.   

General

Many of the problems customers experience during testing have common root causes. Before proceeding, confirm that the following is true:

  • All devices under test have the recommended firmware version installed.  See this article for the latest advisory on firmware versions for your specific CPE hardware. If you are not running the latest version of approved firmware, you will not get correct or complete results. 
  • If using SmartOS 10.8.9.1, consult this article regarding the workaround for the 10.8.9.1 test state issue. 
  • Ensure the correct time zone is set on each CPE device under test.
  • Make sure the appropriate labels are applied to the devices under test, and that all devices are active and have not been decommissioned. 

It is also important to understand the general transmission of data when using NPT so you can see where problems may have occurred. The general flow is:

  1. User creates bulk ops which filter on use of NPT labels.
  2. Bulk Ops pushes test parameters down to the CPEs before the testing period, which runs from 6PM to 11:59:59PM local time.
  3. CPE devices start executing tests at 6PM local time.
  4. As the automated Diagnostic Complete events trigger, the devices send the testing results they have temporarily stored up to the ACS (Mosaic Device Manager).
  5. Once the final tests are done, a final diagnostic complete event is sent to the ACS. 
  6. The user then generates the reports cataloging all the testing data. This must be done each day as the NPT results are only stored in the ACS until the next testing period and will be overwritten by the next day's test cycle. Testing results are only stored in the device until the next NPT test session, or the device is rebooted. 

Narrowing down the issue type

Most NPT/FCC testing issues fall within these specific realms:

  • Setup of the labels and bulk ops
  • Device reachability during the bulk ops
  • Errors during transmission of device results back to Device Manager

You can narrow down your issue to one of these areas by looking at the associated bulk ops you ran for the test, and by using network logging.

Evaluating your Bulk Operation

The bulk operation is used to send the appropriate test parameters to the device via the NPT Start Script. If anything is incorrect in the bulk op, you will not get the desired NPT results. It is imperative to remember that this only sends the parameters down to the CPEs; the units still have to execute the testing and send the results back as outlined above. 

After creating a bulk op, you can view your entries and the status of the bulk op by clicking Bulk Operations in the top horizontal menu bar in MDM, then selecting the bulk op you scheduled. Inside, you will see when the bulk op is scheduled to run (it must run and complete before the 6PM to 11:59:59PM testing window) and a pie chart that shows the success status of the units:

ehudson1_0-1635195688520.png

 

Also, verify that the correct labels to test are present here. Under the labels portion, there is a table of each device and its individual status. A status of Pending means that the device has not been solicited and has not informed yet, so it has not received the test parameters. Complete means it has received the test parameters. Any status of Unreachable or Max number of Solicits Reached means that the device was not reachable. This does not necessarily mean the bulk op will fail for the device: if the device checks in while the bulk op is running, the associated test parameters will be sent to the device. However, if it does not, testing will not commence on the unit. See the section on device reachability for more info. 

ehudson1_1-1635195873883.png

 

You can also look at the script that ran on a specific unit to make sure there are no errors. Select Customer Support from the horizontal menu bar at the top of the screen, then select or search for the associated subscriber. Next, select Expert->Scripts. Notice the NPT - Start Test script at the top. Select this to evaluate the script run:

ehudson1_2-1635195981618.png

 

In the example above, the script errored out because the time entered was improperly formatted (trailing space). 
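
A trailing space like the one in this example is easy to miss by eye. A minimal Python sketch of a pre-flight check you could run on the start and end values before entering them in the bulk op follows. The `YYYY-MM-DDTHH:MM:SS` timestamp layout and the function name `check_npt_times` are assumptions for illustration only; substitute the format that the NPT - Start Test script documentation specifies.

```python
from datetime import datetime

# Hypothetical timestamp layout; substitute the format your NPT - Start Test
# script actually expects.
ASSUMED_FORMAT = "%Y-%m-%dT%H:%M:%S"


def check_npt_times(start: str, end: str, fmt: str = ASSUMED_FORMAT) -> list[str]:
    """Return a list of human-readable problems found in the two time fields."""
    problems = []
    parsed = {}
    for name, value in (("start", start), ("end", end)):
        # Leading or trailing whitespace is reported even if the rest parses.
        if value != value.strip():
            problems.append(f"{name} time has leading or trailing whitespace")
        try:
            parsed[name] = datetime.strptime(value.strip(), fmt)
        except ValueError:
            problems.append(f"{name} time does not match {fmt}")
    # Only compare ordering when both values parsed successfully.
    if len(parsed) == 2 and parsed["start"] >= parsed["end"]:
        problems.append("start time is not before end time")
    return problems


# Example values only.
print(check_npt_times("2021-10-25T18:00:00 ", "2021-10-25T23:59:59"))
print(check_npt_times("2021-10-25T23:59:59", "2021-10-25T18:00:00"))
print(check_npt_times("2021-10-25T18:00:00", "2021-10-25T23:59:59"))
```

An empty list means none of the three common mistakes (stray whitespace, a syntax error, or swapped start and end times) was found.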

 

Using Network Logging

Network logging is essential for troubleshooting problems. It is recommended to set 10 devices in your test pool to Network-level logging. Do not enable logging on more than 10 devices, as doing so can cause stability issues on your MDM instance. To set up network-level logging:

  1. Log into your MDM instance and select or search for the specific device that is a member of your test pool. 
  2. Inside the device page, select on Expert from the left navigation bar.
  3. Near the bottom of the list, select Event Logs
  4. Click to set the logging level to Network:

ehudson1_0-1635032164801.png

 

Network logging includes all the trace logs and SOAP logs showing the ACS communication from the device. When running NPT tests, you will be looking for two specific event codes in the logs during the testing day: Connection Request and Diagnostic Complete. Connection Request events show the bulk op being sent to the device, which includes sending the NPT Start Script, as seen below. You can see them by selecting that event in the Recent Sessions window:

 

ehudson1_2-1635032734897.png

 

ehudson1_1-1635032445252.png

 

Next, select "show trace detail" or "show SOAP detail" to look at the individual script values being sent down to the device. This is very useful to verify that the correct time periods were put in via the start script, or to look for any errors in setting up the bulk ops. If an error is seen, it will show up with a code of "error" in the associated logs. 

Diagnostic Complete is a periodic event from the unit that sends data back up to the ACS (Device Manager), including all current NPT data. Selecting this event will show the individual NPT data being sent up from the devices to the ACS. This enables you to see whether there are any data transfer errors when sending data back up for the NPT reports. 

ehudson1_3-1635032958764.png

 

These logs will be important to have available in your ACS for ADTRAN Product Support reference as well, should you require their assistance.

Commonly Reported Issues

NPT Report is completely blank 

Reports are commonly blank due to: 

  • Times being entered incorrectly during creation of the bulk-op.  
  • The devices were not responsive to the solicit attempts of the bulk-op and did not inform during the bulk-op time frame. 
  • A new NPT test was created and set before the report results were retrieved from the ACS by the administrator.  
  • An error has occurred with the bulk-op or during the test. 

Troubleshooting steps

  • Evaluate the Bulk Op to verify all the settings in it are correct, and the devices received the test.
    • Verify the correct devices were put in the test pool of the bulk op you invoked.
    • Verify you entered the date for the test correctly and in the proper format.
    • If you see devices that didn't receive the test, evaluate them for reachability.
  • Look at network logging on a device and verify that the correct time settings were sent down to the device and received successfully.
  • If using Zulu time on the DSL modems, RetryOnFailure must be set to false in the Globals of your ACS instance. Setting this to true will result in errors.
  • Create your NPT bulk-op 2-3 hours in advance of the actual test. Download reports from the previous night prior to finalizing the bulk-op. Not doing so will overwrite the results from the previous test cycle. For more on this, review this article and the section entitled "Establish a daily routine".
  • Make sure no other administrators created NPT bulk ops for that day (or that bulk ops for other days you created do not have the wrong date in them). Executing new NPT tests will erase results from a previous day. 
  • Make sure you are generating reports for the last day of tests that ran. Results are only stored in MDM until the next NPT test is run. 
  • Run a test on one device and verify the results before attempting the full pool again. 

Times tested in the reports are not the correct 6PM to 11:59:59PM local time

This is commonly due to issues with the timezone setting on the devices, the clocks not being synced, or issues with the time put into the NPT - Start Test script within the bulk operation.

Troubleshooting Steps

  • Verify you put the correct time format in the Bulk Op.
  • Note that you cannot see the time by looking at the bulk ops. You must go to a unit that received the script from the bulk ops and look under Expert->Scripts and select the NPT - Start Test script. 
  • You can also verify by looking at the Event Logs for the device. It will be the Connection Request event log corresponding to the bulk op that was set up. Inside here, if you check Show Trace Detail, the script info sent down to the device is revealed.
    • Common mistakes include getting the start and end time mixed up, a leading or trailing space in the time, or another syntax error.
  • Check the devices to make sure they have the correct local timezone. If they do not, times will be incorrect as all the device timing is done in Zulu (UTC). 
  • Check that an individual device in the test pool has the correct time. 
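
Because device timing is done in Zulu (UTC), it can help to work out what the 6PM to 11:59:59PM local window looks like in UTC when comparing against timestamps in the logs. A minimal Python sketch follows; the function name, date, and UTC offset are example values chosen for illustration, not taken from any particular deployment. A fixed offset is used for simplicity, so remember that the offset changes with daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone


def npt_window_utc(day: date, utc_offset_hours: float) -> tuple[datetime, datetime]:
    """Return the 6PM-11:59:59PM local test window for `day`, expressed in UTC."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    start_local = datetime.combine(day, time(18, 0, 0), tzinfo=local_tz)
    end_local = datetime.combine(day, time(23, 59, 59), tzinfo=local_tz)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


# Example values only: a device whose local time is UTC-5 on 25 October 2021.
start, end = npt_window_utc(date(2021, 10, 25), -5)
print(start.isoformat())  # 2021-10-25T23:00:00+00:00
print(end.isoformat())    # 2021-10-26T04:59:59+00:00
```

Note that for time zones west of UTC the window crosses midnight in UTC, which is a common source of confusion when reading timestamps.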

Devices only test every other day

If your CPEs are running SmartOS Release 10.8.9.1, you must use the workaround for SmartOS Devices. 

Some units did not test

This issue is commonly due to devices not being associated with the correct testing label, devices not being reachable, or a device being unplugged/rebooted before or during the test. 

Troubleshooting steps

  • If your CPEs are running SmartOS Release 10.8.9.1, you must use the workaround for SmartOS Devices. 
  • Always make sure all devices are associated with the correct labels, and that the labels were included in your bulk ops for testing. 
  • Look at the bulk ops and see if the device received the test and has a status of Complete. If the test status is not complete the unit will need to be evaluated:
    • Pending - The device was never solicited and didn't inform during the bulk op interval. Bulk ops should always be configured to run for about 2 hours, starting 2+ hours prior to the beginning of the test window. An example would be a running time of 3:30PM to 5:30PM. A device should only be in this status if the bulk op didn't have time to reach out to every unit. 
    • Unreachable (or any variation) - The device needs to be evaluated for reachability.
    • Error of any type - These should be individually evaluated based on the error message. You can always contact ADTRAN Support for assistance. 
  • Check the Event Logs (boot event) or the device uptime to confirm that the unit has not rebooted. If it reboots before or during the NPT test, you will not get full results. 
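
The status handling described above can be summarized as a small lookup. This Python sketch is meant for organizing your own notes or a list of statuses copied from the bulk op table; it is not an MDM API. The function name and the exact status strings are assumptions based on the statuses named in this article.

```python
def next_step(status: str) -> str:
    """Map a bulk op device status to the follow-up suggested in this guide."""
    s = status.strip().lower()
    if s == "complete":
        return "Device received the test parameters; check for reboots during the test."
    if s == "pending":
        return "Device was never solicited and did not inform; check the bulk op run time."
    if "unreachable" in s or "max number of solicits" in s:
        return "Evaluate the device for reachability."
    # Anything else is treated as an error status.
    return "Evaluate the error message; contact ADTRAN Support if needed."


for status in ("Complete", "Pending", "Unreachable",
               "Max number of Solicits Reached", "Script error"):
    print(f"{status}: {next_step(status)}")
```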

A device did not report for the full testing window

This issue is commonly due to devices being unplugged or rebooted during the test, or some other infrastructure failure during the test.  

Troubleshooting steps

  • Check the reports to see when the device stopped reporting (either via missing results, or an error test status) and see whether the device uptime matches this, indicating a reboot. You can also check the Event Logs for a boot event to confirm whether the unit rebooted.
  • Evaluate upstream from the device to see if there was a network connection issue during this period. Devices will fail to test if DNS fails (cannot contact the test server), or they are unplugged during the window.
  • Check logs on the test server to see if the device contacted the test server successfully during that period. 
  • Device related issues - if you believe there may be a device related issue, you can contact ADTRAN Support for assistance. Make sure to put "FCC Testing" in the subject of the ticket. 

Testing errors are indicated in the reports

Errors in the reports are not necessarily problematic unless they are continuous for a long stretch of time. USAC PMM calculations do not count errors against you. The most common error (status 3) is a device not reporting back results. Since a subscriber could simply unplug or reboot their device, these errors cannot always be avoided. However, if the error occurs for multiple days, it should be investigated and noted in case USAC asks about the errors.

Other errors should be troubleshot based on their error explanation, and potentially reported to ADTRAN Support if they persist.

Troubleshooting steps

  • Evaluate the text in the error. If the report is that the device did not report results, check the device for reachability. You can also evaluate whether it rebooted, or whether the subscriber has potentially unplugged it. Remember to remove any subscriber devices that have been decommissioned from your test pool. 
  • Timing errors - These errors are generally due to the time being entered incorrectly. Make sure you entered the correct time in the bulk op by going to the specific device, selecting Expert->Scripts, and looking for the NPT - Start Test script. 
  • Hostname errors - Verify DNS connectivity. 
  • Other errors - A large number of errors can cause a single test to fail. One or two failed tests are not a major issue. However, if tests continuously fail, contact ADTRAN Support for assistance. Make sure to put "FCC Testing" in the subject of the ticket. 

Large amounts of deferrals 

Deferrals happen when the link is found to be busy when a test attempts to run. In unfiltered reports, you will see every deferral for each test that is attempted. However, in filtered reports you should only see a deferral if the device was not able to be tested throughout the full period. In either case, these do not negatively affect the USAC scoring. 

If you see deferrals for a device every time it tests, verify the link status to see if the device has consistent running traffic that might not make it a suitable test candidate.

Tests do not indicate true speeds

This issue is commonly due to device speed provisioning problems, issues with the speed test server, or problems with jitter.  One poor test will largely not affect your results as USAC averages all of your results for each unit together to get the overall score. You should focus on tests that are consistently bad, or a consistent average that is poor. 

Also note that NPT only supports up to the 100Mbps tiers. While NPT may show higher results, the results will be inconsistent above the 100Mbps tier.

Troubleshooting steps

  • If a specific unit always seems to test poorly, check that it has the correct provisioning.
    • You can try provisioning it slightly higher to see if results are better.
    • You can run tests from the GUI, or the NPT section of the device page to see if the results are different. 
  • If a large number of results are poor, make sure your upstream provisioning for those units can support multiple units testing bandwidth at the same time.
  • Check the speed test server to make sure devices are connecting and reporting the correct bandwidth. 
  • Manually run multiple speed tests directly from the device to see if you get similar results. 
  • Verify you are using the correct calculations from USAC to calculate bandwidth. 
  • Contact ADTRAN Support for assistance. Make sure to put "FCC Testing" in the subject of the ticket. 

Latency tests periodically show high latency / latency scores failing

It is common to see spikes in latency during testing, especially if you are using servers hosted in AWS as routing changes will sometimes cause poor results for a few latency tests. However, the overall latency average of all the devices must be below 100ms. Therefore, if the majority of your results are well within compliance you should not see an issue. If you have prolonged periods of latency, follow the troubleshooting steps.
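
To see how a few spikes affect an overall figure, the following Python sketch uses made-up sample values (not real measurements). It reports the mean and the share of samples above 100 ms. It is only an illustration and does not reproduce the official USAC compliance calculation; the function name and the sample data are hypothetical.

```python
from statistics import mean

THRESHOLD_MS = 100.0


def summarize_latency(samples_ms: list[float]) -> dict[str, float]:
    """Return the mean latency and the fraction of samples above the threshold."""
    over = sum(1 for s in samples_ms if s > THRESHOLD_MS)
    return {
        "mean_ms": mean(samples_ms),
        "fraction_over": over / len(samples_ms),
    }


# Made-up values: mostly 30 ms with two routing-change spikes.
samples = [30.0] * 18 + [250.0, 310.0]
print(summarize_latency(samples))
```

In this invented data set, two large spikes out of twenty samples leave the mean at 55 ms, which is why isolated spikes are usually less of a concern than prolonged periods of high latency.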

Troubleshooting steps

  • Run individual ping tests from different units to try and narrow down the source of the latency. Use traceroute to narrow down where it is coming from. 
  • Look from the latency server side to view that the correct values are being reported back and there are no issues with the server responding to ping. 
  • Try running NPT tests from other individual devices on different network segments to compare latency data. 
  • Verify that the correct statistics are being sent up via the network logs and are being input into the reports.
  • Contact ADTRAN Support for assistance. Make sure to put "FCC Testing" in the subject of the ticket.

Reports didn't generate for a day

This commonly happens when the bulk op specified the wrong day, reports were not generated before the next NPT test was run, or if the bulk op Filter Criteria was set with incorrect labels. 

Troubleshooting steps

  • Verify the device labels were set up correctly in the bulk op.
  • Make sure the bulk op had the correct date.
  • Verify with a specific device in the Event Logs, using the Connection Request event, that the correct date timestamp was sent down. An error in the timestamp specifying the incorrect day will cause the device to not test. Also, verify there were Diagnostic Completes for that testing period as you should see them report up periodically throughout the testing interval. 
  • Make sure you generated the reports with the correct label (test tier) and for the correct date (generally you are creating them for the day before). Note the device results are only stored until the next test is run.

Report output not formatted correctly

When downloading reports, if the report output does not line up with the columns in the report, it is likely your server or subscriber name has a delimiting character (such as a comma) in it. You can either hand-edit the reports to fix the delimiting issue, or change your subscriber names to remedy this.
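
The column shift is easy to reproduce. The sketch below uses Python's standard `csv` module with invented field values to show how an unquoted comma in a subscriber name adds a column, and how quoting the field keeps the row aligned. The three-column layout is an assumption for illustration; actual report columns will differ.

```python
import csv
import io

# Invented example row: a subscriber name that contains a comma.
row = ["Smith, John", "100Mbps", "23.4"]

# Naive joining produces four columns instead of three.
naive_line = ",".join(row)
print(next(csv.reader(io.StringIO(naive_line))))
# ['Smith', ' John', '100Mbps', '23.4']

# csv.writer quotes the field, so it reads back as three columns.
buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue().strip())
# "Smith, John",100Mbps,23.4
print(next(csv.reader(io.StringIO(buf.getvalue()))))
# ['Smith, John', '100Mbps', '23.4']
```

When hand-editing a report, wrapping the affected name in double quotes has the same effect as the quoting shown here, provided the program you open the report with honors quoted fields.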

Devices not responding to Bulk Operations for Network Performance Test (NPT) 

When running a bulk op to pull results, or push NPT testing, you may see some devices that do not respond to solicit requests from Mosaic Device Manager, or do not inform in time for the job to run. 

A device may not respond because: 

  • The device is offline or unplugged.
  • The device is not listening for solicit due to CWMP client being disabled. (A user could potentially log in locally to the CPE and disable it.) 
  • The CWMP client is enabled but is hung or otherwise in a bad state and thus is not responding to solicit requests. A reboot of the device normally resolves this. 
  • The Connection Request URL used by the ACS is somehow outdated or otherwise incorrect. This should be fixed the next time the device informs.  
  • The Connection Request credentials used are incorrect. This is fixed by synchronizing (PUSHing) the ManagementServer app. This sometimes happens if the device has been rebooted but has not informed yet and the ACS solicits (creds are refreshed on reboot). 
  • The Connection Request URL is non-routable from the ACS, or otherwise blocked (firewall, etc).
  • The device is using CGNAT and therefore must use STUN. In this case, you will want to try a 1 hour 45 minutes inform interval during the test period. Always set this back up to 23 hours after testing has completed.

Troubleshooting Steps

  1. Verify the device is still online using other upstream methods (looking at the ONT, for customer traffic, etc.). If the device has been deprovisioned, it should be removed from the pool with USAC and they will assign another random subscriber.
  2. Ask the customer to reboot the device.
  3. Gain access to the device and check to make sure the correct ACS settings are set.
  4. Make sure you have no firewall or other item blocking access to AWS (where MDM is provisioned).
  5. Set the Inform Interval in the TR-069 settings to 1 hour, 45 minutes for the test pool only and only when performing tests. All other devices should be at the standard 23 hour (1380 minutes) inform interval (or greater). In conjunction with this, be sure to push the NPT test parameters between 3:30PM and 5:30PM local time.

    This combination helps to ensure that the CPE has multiple chances to be solicited and inform during that window.

    WARNING: To avoid potentially serious performance degradation of your Device Manager system, it is imperative that you return the devices under test to the standard 23 hour inform interval after the testing cycle is complete.
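
The arithmetic behind this combination can be sketched in a few lines of Python. Assuming a device informs at a fixed interval, any window at least as long as the interval must contain at least one inform. The function name is hypothetical; the window and interval values come from this article.

```python
def guaranteed_informs(window_minutes: int, inform_interval_minutes: int) -> int:
    """Minimum number of periodic informs that fall inside any window of this length."""
    return window_minutes // inform_interval_minutes


BULK_OP_WINDOW = 120          # 3:30PM to 5:30PM
TEST_INTERVAL = 105           # 1 hour, 45 minutes
STANDARD_INTERVAL = 23 * 60   # 1380 minutes

print(guaranteed_informs(BULK_OP_WINDOW, TEST_INTERVAL))      # 1
print(guaranteed_informs(BULK_OP_WINDOW, STANDARD_INTERVAL))  # 0
```

With the 105 minute interval, every online device informs at least once during the two hour bulk op run even if solicits fail. At the standard interval, a device that cannot be solicited may not inform during the run at all.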

 

Contacting ADTRAN Support

If you cannot solve your issue, please contact ADTRAN Support for assistance. Make sure to put "FCC Testing" in the subject of the ticket.

 

 
