Greetings and salutations once again to all you folks out there who read the things I write! Today's fun topic started just a few weeks ago.
We have a client who started off on Lync 2010, migrated to Lync 2013, and last year upgraded to Skype for Business. They've always had pretty much the same setup: Microsoft UC (whatever version that is at the time), a Sonus 1k SBC, and a SIP trunk delivered over IP. Last year, the SIP provider went through a hardware refresh that took them from Sonus hardware to Taqua, and, at least on the client's side, it was a fairly painless process. We just pointed their hardware at a new set of IPs and life was happy. I tell you these things because I want to impress upon you, dear reader, that this has really been a rock-solid setup with nary a burp in functionality.
Around the beginning of July, our client began having some intermittent audio issues in Skype for Business. Sometimes it was one-way audio, sometimes no-way audio, sometimes it only happened on inbound calls, sometimes only on outbound calls. Unfortunately, there was no consistency in exactly how the problem would present itself, and no reliable way to reproduce it.
Basic troubleshooting ensued: restarting services, digging through event logs, CLS (Centralized Logging Service) logs, logs from the Sonus, Wireshark captures, just about everything we could think of to make sure the issue wasn't on our end before contacting the SIP provider. Having found no smoking gun on our side, we asked the provider to see what they could find out.
The engineer from our client's SIP provider pulled his captures, ran Wireshark, did some voodoo magic and said, "Okay, try it now." We tested quite a few calls, and strangely enough, they all worked great. Inbound, outbound, it didn't matter. After asking the obvious question of "What'd you do?", we were told that the TCP sockets were stacking up and had become unresponsive, so he had to clear them manually and now all was good. Okay … great. Now we know what broke it; how do we figure out what caused it to break? Unfortunately, that had to be escalated.
I’m cool with that.
The following week, I get an FYI email from the client. Guess what, it's on the fritz again. The SIP provider again cleared the TCP sockets, and we requested the issue be escalated to determine root cause. While awaiting word on the escalation, things ran smoothly for the client, and, as is wont to happen when things are working, it was decided to give the provider some time while they looked into it.
It turned out the issue was being caused by a TCP/IP patch that had been applied to the Taqua hardware, which limited the hardware to 10 TCP sockets per trunk. Total. Once that eleventh socket was opened, funny things started to happen. Fortunately, the hardware has the ability to tear down inactive sockets. Unfortunately, the timer that controls this is limited: the most frequently it can tear them down is once an hour. Not very helpful.
Knowing exactly what was causing the issue, I dug in on our side to see what I could find …
Hopping into our Sonus SBC 1000, I decided to take a look at the SIP Server Tables for our connection, and there, hidden in plain sight, was our answer.
By default, the Reuse Timeout on the SIP server entry in that table is set to Forever. I reset it to Limited and left the Timeout Limit at the default of five minutes.
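For context on what that buys you: with the Reuse Timeout at Forever, the SBC holds each TCP connection toward the provider open indefinitely once it's established; at Limited, any connection that sits idle past the Timeout Limit gets torn down, and a fresh one is opened the next time it's needed. The sketch below is purely my own illustration of that idea in Python, not anything from Sonus: remember when each connection last carried SIP traffic, and close the ones that have been quiet longer than the limit.

    import time

    REUSE_TIMEOUT = 5 * 60          # the default Timeout Limit: five minutes of idle time
    last_activity = {}              # connection id -> timestamp of the last SIP message

    def note_activity(conn_id):
        # Call this whenever a SIP message is sent or received on a connection.
        last_activity[conn_id] = time.monotonic()

    def reap_idle_connections(close_connection):
        # Close every connection that has sat idle longer than the reuse timeout.
        now = time.monotonic()
        for conn_id, last_seen in list(last_activity.items()):
            if now - last_seen > REUSE_TIMEOUT:
                close_connection(conn_id)   # placeholder for the actual socket teardown
                del last_activity[conn_id]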
After applying this setting, we were able to watch the inactive TCP sockets get torn down by the Sonus instead of needing the provider to take care of it. This solution has been in place for a little while now and we’ve confirmed with the provider that we have not gotten anywhere near the 10-socket limit.
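If you'd rather sanity-check the socket count yourself instead of taking the provider's word for it, a packet capture off the SBC's outside interface gives you what you need. The snippet below is a rough sketch rather than anything we actually ran: the provider IP is a placeholder, it assumes scapy is installed, and it assumes the SIP signaling rides TCP port 5060. All it does is count the distinct TCP connections toward the trunk that show up in a capture.

    from scapy.all import rdpcap, IP, TCP

    PROVIDER_IP = "203.0.113.10"    # placeholder for your SIP provider's signaling IP
    SIP_PORT = 5060                 # assumes SIP signaling over TCP on the default port

    capture = rdpcap("sbc_trunk.pcap")   # a capture taken off the SBC's outside leg

    connections = set()
    for pkt in capture:
        if IP in pkt and TCP in pkt:
            ip, tcp = pkt[IP], pkt[TCP]
            # Record each unique 4-tuple where the provider side is the SIP port
            if ip.dst == PROVIDER_IP and tcp.dport == SIP_PORT:
                connections.add((ip.src, tcp.sport, ip.dst, tcp.dport))
            elif ip.src == PROVIDER_IP and tcp.sport == SIP_PORT:
                connections.add((ip.dst, tcp.dport, ip.src, tcp.sport))

    print("Distinct TCP connections to the trunk in this capture:", len(connections))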
The moral of the story? I don't know. Escalate quickly? Don't leave it to your SIP provider to tear down your inactive TCP sockets? I just hope this experience can help someone resolve their issue without having to spend hours on the phone with support.
As always, if you have a better way to do this, please drop me an email. I'm always open to new ideas.