The more they overthink the plumbing, the easier it is to stop up the drain.
-Star Trek III
Well as of late we’ve been fighting this aging SQL server at work. It was originally a NT 4.0 server with SQL 6.5 upgraded to 7.0 then 2000 with Windows 2000. It also had to do some work with Oracle, and the dba’s were using MSDTC to make sure their transactions were getting committed into Oracle. Oh and to keep it ‘fresh’ the Oracle client was version 7 as it originally was talking to an Oracle 7 DB.
Well as the years go by, that Oracle 7 DB became Oracle 9i (already obsolete!), and we suddenly hit a transaction wall.
And along the way we virtualized the server to go into our VMWare ESX 2.x cluster, and it’s been since migrated to VMWare ESX 3.5
The server was dropping tens of thousands of these XA???????????????????.trc files, into the \winnt\system32 directory.
Well naturally you’ll eventually hit this wall of how many 8.3 translations you can do before the system CRAWLS. And boy oh boy did we hit that wall. So at first my idea was to delete all of these trc files, and let it live, but that’s not such a ‘great’ idea… As this reeks of a fundamental problem.
So the ‘first’ step in all of this madness was to up the OS to Windows 2003 enterprise (it was 2000 Advanced server before) And see how things were doing. The OS upgrade went smoothly I had slipstreamed sp2 into the update, so I only had some 90+ updates needing to be done once the OS had been upgraded. And for ‘good’ measure I thought I’d take the server from 768MB of ram up to 2GB, and set VMWare to allow 4 cpu’s for the database server. The node it was running on wasn’t doing terribly much so what the hell right?
The server was now performing markibly SLOWER… And yes, still dropping TRC files like there was no tomorrow.
After a bunch of digging around, I found out that in 2003 you have to click a box in the component manager to allow XA (cross architecture) transactions! Well now it wasn’t dropping as many XA trc files, but after watching it for a while, when two went to run at the same time, the SQL server would crash with a hex code saying it was out of memory.
Out of memory? I’d just given it more!?
So I did the ‘logical’ thing and gave the system 5 GB of ram, and enabled the /3GB flag in boot.ini
I moved SQL server up to using 2GB of ram (out of 5, sure why not?). Same error.
Well this sucked, so we tried to update the Oracle client from 7 to 9i. In the process I found I couldn’t un-install the 7 client, nor could the 9i thing just ‘upgrade’ it, 9i kind of installed in parallel. Which led us to our next major fault, after swapping out the new client, removing the Oracle product key from the registry, and re-linking the Oracle servers using a new registration string, now ALL of our transactions against the oracle servers were failing.
Thank goodness for google, as we were able to deduce that the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsfot\MSDTC\MTxOCI was populated with all the old Oracle 7 values….
OracleOciLib -> oci.dll
OracleSqlLib -> orasql9.dll
OracleTraceFilePath -> c:\oratrace
OracleXaLib -> oraclient9.dll
So changing it to reflect Oracle 9i and suddenly our transactions were running! Even two at a time!!
But there was no doubt about it the transactions were slower then hell. We had gone from 1 minute to 11 minutes on one, and 5 minutes to just under an HOUR.
I added MORE memory to only find the SQL server couldn’t see the network card. So I added another one, and it got even SLOWER.
So in a minute of panic, I reduced the ram back to 768MB, took the VM from 4 cpu’s to 2 cpu’s and forced SQL server to use a single processor.
And our timings are now fantastic! That 1 minute process can complete now in 12 seconds!!! The other process finishes in about a minute give or take, but it’s tremendously faster.
From what I can gather, since SQL is so IO bound the more ‘top’ hardware you give it, the harder it pushes the IO stalling itself… Naturally it’s different on a physical machine, but sometimes it’s interesting to see what happens.
And may this be a lesson, just because it can emulate multiple CPU’s doesn’t mean it’ll run parallel things ‘better’…
You can disable 8.3 filename creation on NTFS: http://support.microsoft.com/kb/121007
Yes you can disable the 8.3 name creation, but you see it doesn't fix the fundamental problem that all the transactions were failing.
Band-aids are good fun, but if you don't address the actual problem, then you are just wasting everyones time.
I don’t doubt your findings but this behaviour could also be that the server spans into multiple NUMA nodes and that has slowed down our Oracle instances as well, so we cut down to not cross into the next node, performance improved a lot, although not as much as expected.
It very well could have been as well, but it’s been 8 years, and a few companies later..
I can’t recall if these were Dell or HP, other than that the old original servers certainly were Dell.
Mine may have been a combo of the NUMA domain, along with other constraints behind the scenes.
It always struck me as funny as this MSSQL server took extracts and transforms from Oracle to do custom reporting.
I did notice the date after posting so I’m sorry about that, I’m currently following another post of yours about getting SQL 6.5 to work on Windows 10 as we have some legacy apps we need to keep working. I’ve tried upgrading the DB’s using all the methods available that i know, none allow the ancient application that hooks into it to function properly afterwards. But at a minimum all our security audits are finding the hosting OS to be the main point of failure, so having it working on windows 10 thorugh a hypervisor for the very few users that need it ocassionally will at least cover it from that perspective. Not ideal in production but needs must. Some interesting posts you have going on, keep up the good work is all i’ll say. 🙂
It’s no problem, just a little to late to dig further onto the issue that’s all.
It’d be interesting to look at the apps, especially if you have source, but I know that may be out of the scope of … anything sane. But if you want insane, check out SQL Relay:
It may be possible to catch them on legacy TDS protocol, and proxy them up to something newer.