We have a carrier-provided and managed NetVanta 3430 on a 10Mb MPLS circuit. Traffic occasionally fills the 10Mb limit, and the tunnel sometimes fails, most consistently during those peak-usage periods. The carrier has been unable to identify the cause in nearly 3 weeks. They are recommending traffic shaping to limit peak traffic as a band-aid.
The carrier did give us console access to the NetVanta. The console shows these errors repeating every 15 seconds when 'debug bgp' is issued:
2015.08.25 16:48:15 BGP.OUT VRF: -DEFAULT- x.x.x.x: TCP error 51 connecting to peer (event, s:idle)
2015.08.25 16:48:15 BGP.EVT VRF: -DEFAULT- x.x.x.x: IDLE->CONNECT
2015.08.25 16:48:16 BGP.EVT VRF: -DEFAULT- x.x.x.x: CONNECT->IDLE
2015.08.25 16:48:16 BGP.OUT VRF: -DEFAULT- x.x.x.x: TCP error 22 connecting to peer (event, s:connect)
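For anyone following along, these are the AOS show commands I would pair with that debug output to correlate the TCP errors with neighbor state (syntax recalled from the Cisco-like AOS CLI; verify against your firmware's command reference):

```
show ip bgp summary      ! neighbor state (Idle/Connect/Established) and flap counts
show ip bgp neighbors    ! per-peer session detail, including last error seen
show ip route bgp        ! whether BGP-learned routes are currently installed
```

If the peer is cycling IDLE->CONNECT->IDLE as in the log, the summary output should show the session never reaching Established during the failure window.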
The carrier claims he does not see these errors. The circuit was going down for 30 seconds about 8 times per day. Now that we have unloaded it and the carrier adjusted some frame(?) parameters, it is staying up but occasionally becoming sluggish.
AOS seems to have a wealth of probes and monitors for diagnosing BGP issues.
We do not have enable access but have asked the carrier to enable console logging. After many tickets and escalations the carrier sent a traffic utilization report that shows only that peak traffic hits the 10Mb limit for very short periods.
What diagnostics will find the misconfiguration that we suspect?
More info: we now have console logging and NetFlow, yet no smoking gun. The tunnel continues to collapse, maybe 8 times today. The provider offered a theory that has some possibility of being true: that the network on the NetVanta CPE side has too many possible hosts and is filling the ARP cache.
Yet I cannot get excited about this because the total number of MAC addresses is far fewer than 512, or even 256. Mainly I want to see the NetVanta report a cache-full event.
Is there a way to interrogate this, or how do we get that reported?
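If it helps, these are the AOS commands I would try for interrogating the ARP cache and catching a table-full event (hedged -- command names are from the Cisco-like AOS CLI and the exact forms may differ on your firmware):

```
show arp             ! current ARP cache entries; count them against the 256/512 limit
debug arp            ! watch requests/replies in real time (run briefly, it is chatty)
show event-history   ! scan the logged events for any table-full or resource messages
```

If the ARP theory were right, you would expect `show arp` taken during a failure window to be near the table limit, not at ~100 entries.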
To add a little info that might help you troubleshoot this issue: a GRE tunnel is stateless. There is no handshake or any other sort of communication back and forth before the GRE tunnel comes up. Technically it cannot flap as you suggest. All a GRE tunnel does is encapsulate IP packets within a GRE header and hand them blindly to the routing table. I would start with a permanent ping to the remote end's outer IP address and see if it fails. If it does, you have a connectivity issue unrelated to the GRE tunnel.
Yes, the pings to the routers stay up end to end. The tunnel, however, goes down for 30-second intervals and WAN traffic ceases. I cannot figure out how to get the Adtran to tell me what is happening. It may be happening at the Adtran itself, or somewhere midstream, but I cannot get the info.
This is all supposed to be provider managed -- by a veeerrry large provider for a huge customer -- a 31-node high-capacity MPLS! The provider seems to be grasping at straws.
To his credit, the provider has opened his managed CPE equipment up to us (the Adtran 3430), but right now he is making his customer change his IP scheme and use NAT to shrink the ARP table. It seems possible that it could be a fix, but I am doubtful -- I should see the Adtran complain that its ARP cache is full first, right?
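One way to make that ping permanent on the NetVanta itself is an ICMP probe tied to a track object, so the router logs the failures with timestamps. A rough sketch assuming AOS's network-monitor feature (the names PEER-PING and WAN-UP are my placeholders, and the exact keywords may vary by AOS version):

```
probe PEER-PING icmp-echo
  destination <remote outer IP>
  period 10
  timeout 1500
  no shutdown
!
track WAN-UP
  test if probe PEER-PING
  no shutdown
```

The track object's state changes should then show up in the event log, which gives you an on-box record of exactly when reachability to the far end was lost.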
Are you saying that this is not even a possibility?
If the ping to the remote outer IP stays up, then the GRE tunnel is up. As I said before, GRE is stateless, so it technically cannot go down; if there is a route for the IP packets to follow, they will be sent regardless of whether they reach the other end.
It sounds to me like you are having a routing issue instead. Your first post is about BGP. Try putting in a static route to a known destination just for testing and see how that goes. Your BGP session might be collapsing when the WAN gets full and all routing goes down with it. Careful traffic shaping/policing will be needed to give BGP priority over any other traffic, ensuring it never gets crowded out.
This is along the lines of what the carrier has been saying. He did introduce some QoS traffic shaping just yesterday, which did not help. As I understand it, the WAN traffic is all within this tunnel and the customer generates no other traffic; his utilization generally stays around 10%, with some short bursts to 80% and beyond. The carrier is chasing the ARP theory right now -- an excerpt:
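As a sketch, the static-route test plus a minimal QoS map that protects BGP's control traffic might look like this in AOS (the ACL/map names and the 8.8.8.8 test destination are my placeholders; check the syntax against your AOS QoS documentation):

```
! static route to a known destination for the ping test
ip route 8.8.8.8 255.255.255.255 <WAN next-hop>
!
! match BGP (TCP port 179) and give it priority on the WAN
ip access-list extended BGP-TRAFFIC
  permit tcp any any eq bgp
!
qos map WAN-MAP 10
  match list BGP-TRAFFIC
  priority 64
!
interface <WAN interface>
  qos-policy out WAN-MAP
```

The point of the priority entry is that even when the bursts hit the 10Mb ceiling, BGP keepalives and updates are serviced first, so the session does not hold-timer out during congestion.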
"The thought was to have a sole /24 IP range on the interface reducing the ARP table entries. Adding the /16 back in as a secondary still gives us the same scenario and issues. We are still seeing the 10 . /16 IP range in the arp table and although we are well below the 256 entries all it will take is one ARP issue to recreate the problem.
That being said is it possible to have NATing implemented and allow us to have just the 192.168.155.0 /24 on the LAN interface? I realize this is a bigger setup even if temporary. The other option is to find the infected PC that is causing the ARP requests off the network."
I will research how to locate this ARP requestor -- but I am dubious
The tunnel can "flap" if keepalive settings are configured on the GRE interface.
If BGP is also trying to handshake, this could be a CPU issue on the router that is causing delays in processing GRE keepalive packets and thus bringing the tunnel down. What does resource utilization look like on the router?
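To see whether keepalives and CPU line up with the drops, something like this during a failure window (AOS-style commands; verify the exact forms on your unit):

```
show interfaces tunnel 1   ! keepalive settings plus up/down transition counters
show processes cpu         ! current CPU load and per-process breakdown
debug int tunnel 1         ! log tunnel state transitions as they happen
```

A snapshot of 5% at an idle moment does not rule out a short spike at the moment of the drop, which is why catching it live matters.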
I think the BGP issue needs to be addressed. Is BGP supposed to be up?
Keepalives are configured -- not sure if there are alternatives to consider. A resource snapshot on the Adtran showed only 5% utilization; not sure how to view a utilization history.
BGP is supposed to be up -- note the curious TCP error messages in the first post above.
The carrier is going full throttle on ARP cache overflow as the problem. I don't believe he has actually seen an overflow, though; there is just a potential with the /16 network, or so he says. I think in truth there are only about 100 MAC addresses.
Solution in play now is to replace the Adtran with a Cisco
As I said before, a simple troubleshooting step would be to place a static route to a known destination and see if the ping stays up while the trouble manifests itself. If it does, then it's a BGP issue. If you still have doubts about the GRE tunnel, do a 'debug int tunnel <number>'. It will clearly show you if there is an issue with it.
Well one option is to adjust the keepalives or disable them completely.
You say ICMP stays up to the far end GRE endpoint, but I assume that is the GRE termination IP and not the IP on the GRE tunnel itself. With keepalives disabled, ping the far end tunnel IP and see if any drops occur.
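Concretely, assuming the tunnel is interface tunnel 1 and the far-end inner address is 172.16.0.2 (both placeholders for your actual numbering), the test would be:

```
interface tunnel 1
  no keepalive       ! disable keepalives so line protocol cannot drop on missed replies
!
ping 172.16.0.2      ! far-end tunnel (inner) IP, not the outer termination IP;
                     ! run it repeatedly or with an extended repeat count across a drop window
```

Pinging the inner address exercises the actual encapsulation path, so drops here with the outer ping still clean would point at GRE handling rather than raw connectivity.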
Where is the GRE tunnel going? My guess is a term point used for other GRE tunnels that don’t have problems, so that likely isn’t the issue, but it should be checked if this is the only tunnel going there.
As for resources, it would be best to check them during a period when the tunnel drops out.