We are currently reviewing several options in restructruing our server infrastructrue as it relates to high availability for all Tessitura applications. We would like to know how other organizations have successfully addressed fault tolerance within their Tessitura server configurations.
I'll take a crack at this. We have set up a clustered pair of servers for each function (2 SQL Servers, 2 seat/report servers, 2 API Gateways) using MS Clustering. The idea is that each server is backed up by an identical twin server; during normal operations, the "Live" node of the cluster is on one machine and the "Test" node is on the other, but both can run both Live and Test, and indeed would in the event of failover.
I would call it a qualified success; very often the clustering service itself is a source of trouble and requires some expertise. I've found that SQL Server clustering is the most tightly integrated and easiest to setup and maintain; one machine goes down and the other picks it right up. Clustering the seat and web server services takes some time and testing and a good knowledge of networking. We do not have our credit card server clustered.
Ideally we would have separated each pair of the cluster into two physical locations but this was not an option for us, so for disaster recovery we are looking into Tessitura's RADR solution.
Just a question.
Are any of you using or plan to use SQL always on? I'm also looking at availability and would appreciate any advice or learning's that you have.
Dmitrij (now at Sadler's Wells) and I set up an Always On SQL Cluster (MSSQL 2012) at the National Theatre. We were doing it as part of a larger project that was addressing PCI issues - so had separated Test and Live environments, moved clients to Remote Apps, load balanced terminal servers and also for client updates (UAC issues on windows 7), and load balanced the API server for performance and stability.
Had it with 1 Active and 1 Read-only (just so us DBAs could run queries even during on-sales)
It did come in useful as one of the updates that incuded a SP release failed on our main DB on a sunday and I was able to get the second DB up and active in about 15 minutes - admittedly took a couple of days to figure out a few of the steps I had missed (like reapplying the MASTER key encryption) - this also gave us time to look at the otehr DB and get that fixed and back in the cluster before our next on-sale.
After that we also worked with the SA team so that we were patched the secondary in hours and then the next day or so we flipped that to be the primary primary so that the other DB could then be patched. This meant a DBA could be around when this was happening.
Mark
The Science Museum of Minnesota is also HA on the production side. There were some things we needed to do for efficiency. One, was moving the SSRS reporting services to another server to centralize reporting. We also ran into an issue where the SIDS would not syncronize after a failover. We finally found a solution here: http://www.msbicoe.com/post/2012/04/05/Automatically-Transfer-and-Synchronize-SQL-Server-logins-SIDs-Passwords-and-Permissions.aspx for that.
The logins can be an issue, but as Tessitura stores active logins in a table I was able to write a script to re-enable those that were already on the server
declare @table as table ( username varchar(50), usersid varbinary(4000))
insert into @tableexec sp_change_users_login 'Report'
delete @tablewhere username not in ( select username from @table u join impresario.dbo.T_METUSER m on u.username=m.db_userid and m.inactive='N')
declare @login varchar(50)
select @login=( select top 1 username from @table order by username)
while @login is not nullbegin BEGIN try exec sp_change_users_login 'Update_One', @login, @login print 'Updated ' + @login END TRY BEGIN CATCH print 'Updated failed for ' + @login end catch
select @login=( select top 1 username from @table where username>@login order by username )end
Needs to be run on both impresario and impresario_cci
For newer users who didn't have a login on the secondary I just used the security program to create their accounts, but at least that number was a lot easier to manage.
Due to issues like this we didn't switch on automatic failover and left it is Manual.