Tessitura Scalability

Former Member

As Tessitura continues to grow with V12 and web transactions continue to climb every year, we are looking more and more to cloud tools and to more elastic, scalable application technology.

Last year at TLCC2012, we were all quite impressed by the work of the Royal Opera House in collaboration with POP. They have paved a path to a more agile platform, but the work it took to do so was tremendous. For some Tessitura licensees, this may never be necessary. For others, the need is rare. But I believe that a large majority of users regularly experience peaks in demand, through subscriptions or event on-sales, where the tools adopted by the Royal Opera House are not only beneficial but necessary.

Rob Greig, CTO of the Royal Opera House, mentioned a few of the largest bottlenecks they experienced:

Writing Orders to Tessitura
CMS/Event Data
Credit Card Processing

We can generally say that Tessitura is responsible for the first two of these. The problem then becomes that Tessitura, aside from its GUI, is a database, and for some a very large one. So large, in fact, that optimizing for demand becomes impossible because of the database's limits on CPU, memory and disk. Their solution, in short, was sharding.
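
As a rough illustration of the term, sharding routes each record to one of several smaller databases by a stable key, so that no single server carries the whole load. The sketch below is a minimal, hypothetical example: the shard names and the choice of customer number as the routing key are assumptions made for illustration, not details of the ROH or Tessitura implementation.

```python
import hashlib

# Hypothetical shard names; a real deployment would map these to connections.
SHARDS = ["orders_db_0", "orders_db_1", "orders_db_2", "orders_db_3"]


def shard_for(customer_no: int, shards=SHARDS) -> str:
    """Pick a shard for a customer using a stable hash of the customer number.

    A cryptographic hash is used only because it is stable across processes
    and spreads keys evenly; Python's built-in hash() is not stable for
    strings between runs.
    """
    digest = hashlib.sha256(str(customer_no).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    for customer_no in (1001, 1002, 1003):
        print(customer_no, "->", shard_for(customer_no))
```

The same customer always lands on the same shard, which is what lets reads find earlier writes. Changing the number of shards remaps most keys under this simple modulo scheme, which is one reason sharding is costly to retrofit.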

To Tessitura’s credit, a sharded application typically isn’t worth building until it’s really necessary. So I suppose my real question is:

When will the community say it’s time?

Where does the technology currently sit on the roadmap?

We’re currently discussing performance enhancements here at Carnegie Hall, but before I could justify legwork similar to ROH’s, I’d need to know the answers to these questions.

Thanks
James

 

  • Hi James

    I think some of the bottlenecks raised by the ROH are open to discussion, especially writing orders to Tessitura.

    In our latest on-sale we ran with a waiting room to manage the number of users in the purchase path (from the Performance Seating page onwards, where the more dynamic data requires API calls). The limit was raised to allow 1000 people, and we saw a peak of 68 orders written to Tessitura in a minute.

    This was during a 30-minute period in which we processed 1,278 orders.

    I think the one thing that has helped a lot with this is moving the credit card processing of web orders outside of Tessitura, which means Tessitura is processing "Cash" orders and therefore just writing the web orders to t_orders etc.

    In fact, the only bottleneck we saw came from the external credit card site, which was timing out when we tried to complete some of the authorisations. Customers didn't notice this, as we pushed their orders through and completed them manually later.

    Mark
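
The waiting room described in this reply can be sketched as a simple admission gate in front of the purchase path. This is a minimal, in-memory illustration only: the class and method names are hypothetical, and a real site would keep this state in shared storage and expire idle sessions.

```python
from collections import deque


class WaitingRoom:
    """Admit at most `capacity` shoppers into the purchase path at once."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.active = set()    # sessions currently in the purchase path
        self.queue = deque()   # sessions waiting, in arrival order

    def arrive(self, session_id: str) -> bool:
        """Return True if the session is admitted, False if it must wait."""
        if session_id in self.active:
            return True
        # Only skip the queue when nobody is already waiting (keeps it fair).
        if len(self.active) < self.capacity and not self.queue:
            self.active.add(session_id)
            return True
        if session_id not in self.queue:
            self.queue.append(session_id)
        return False

    def leave(self, session_id: str) -> list:
        """Free a slot and return the sessions admitted from the queue."""
        self.active.discard(session_id)
        admitted = []
        while self.queue and len(self.active) < self.capacity:
            next_session = self.queue.popleft()
            self.active.add(next_session)
            admitted.append(next_session)
        return admitted


if __name__ == "__main__":
    room = WaitingRoom(capacity=1000)  # the limit mentioned in the post
    print(room.arrive("session-1"))
```

For scale, 1,278 orders in 30 minutes averages about 42.6 orders per minute, against the reported peak of 68 in a single minute.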

  • Former Member
    In reply to Mark Ridley

    Mark,

    Thanks for the insightful reply.  These are the things that really need to be vetted and decided!

    I agree entirely that all the "issues" are open to discussion. As for our own issues, I'm not entirely sure what our next bottleneck is, but I know there is one.

    In a hypothetical world, many of our current problems are more or less easy to solve. I can pretend that our CMS license didn't limit our number of instances, and that I could zap all static content to the cloud... But we're still faced with 1 API, 1 CC processor and 1 database; hence the title of the thread. We all know (or can learn) how to scale web servers and applications, even databases and APIs, but where do we draw the line between Tessitura's functionality and that which should be built on top of it?

    One issue that isn't negotiable is the database limitations of CPU, memory, and disk I/O. The ROH's model takes most of this data to the cloud, minifies it, replicates it, and controls the important disk writes via an abstraction and throttling layer (this is at least how I recall it). This stuff makes sense. It's well designed for the tools we have and the business we're in. But again, are these tools needed for the masses? Or should they be custom built for a few?

    We've all heard a lot about waiting rooms, and the general consensus I hear is that they work, but are they necessary? Perhaps more importantly, will the customer tolerate them? Or should they? Today, we're driven by an on-demand society with high expectations. If our site flickers for a minute, you'll be hearing about it on Twitter and so will the world. This we can't tolerate. So what do you do?

    James

  • Hi James

    I guess from my experience most of the bottlenecks we tracked down were originally found 5-6 years ago. Once you hit a bottleneck that slowed your website or caused customer complaints, you tended to work around it and then never really wanted to test it again.

    I used to work at ROH, so I know that the deadlocking issue with order writing was due to an adaptation of lp_customer_rank which formatted addresses for us. This happened on a single day; I moved that procedure to run overnight rather than be triggered, and we never saw the issue again. But the memory of the issue has always made order writing seem like a bottleneck and something to be cautious of.

    When the NT ran load tests on staging prior to the launch of the new site, we were seeing Tessitura cope with over 100 orders a minute; again, the limitation on Test was the credit card authoriser API. On our live site we are raising the level of the waiting room more cautiously, as we don't want to hit the limit and can't risk our better customers getting a bad experience. At the moment our bottleneck is not Tessitura but the credit card authoriser.

    Admittedly our peak traffic on these days only lasts an hour or so.

    The thing with the waiting room is that it has always been there to protect our infrastructure. It is almost looking like it is not needed for that now (although I'm not 100% sure I want to risk removing it when high-level members are booking).

    However, we do know that there are a limited number of highly prized seats. So while the waiting room may not be needed to protect the infrastructure, we may still leave it there for a better user experience, i.e. to limit the number of customers actively going for the same seat so they are not playing "Battleships" on SYOS with other customers.

    In the last 5 years Tessitura have released the new version of Seat server, and we have upgraded to SQL Server 2008 R2 and Windows Server 2008 R2.

    We are about to experiment on our test server with running the API servers on Windows Server 2012 (we already run our TS APP on 2012, as it made load balancing those easier).

    We also load production and performance info, store it in our CMS, and manually run scripts to update it. This is done for MOS start times, and we poll availability so that we are not hitting the backend too hard.

    I think it would be good for us all to pool our experiences and see whether the issues we all know about are the same issues, and whether they are still issues.

    Mark
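
One way to get an effect similar to the availability polling mentioned in this reply is a short-lived cache in front of the backend call, so that many page views share one lookup. The sketch below is an assumption-laden illustration: `fetch` stands in for whatever call returns availability for a performance, and the class name, parameter names and 60-second lifetime are hypothetical.

```python
import time


class PolledAvailability:
    """Serve availability from memory, refreshing at most once per TTL."""

    def __init__(self, fetch, ttl_seconds: float, clock=time.monotonic):
        self._fetch = fetch          # hypothetical backend lookup: perf_no -> value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._cache = {}             # perf_no -> (fetched_at, value)

    def get(self, perf_no: int):
        now = self._clock()
        entry = self._cache.get(perf_no)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no backend call
        value = self._fetch(perf_no)
        self._cache[perf_no] = (now, value)
        return value


if __name__ == "__main__":
    def fake_backend(perf_no):
        # Stand-in for a real availability lookup.
        return {"perf_no": perf_no, "seats_left": 120}

    availability = PolledAvailability(fake_backend, ttl_seconds=60)
    print(availability.get(4321))
    print(availability.get(4321))  # served from memory within the TTL
```

The trade-off is staleness: a customer may briefly see seats that have just sold, so the final check at seat selection still has to hit the live system.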

  • This, of course, is a very interesting discussion to me.  Our findings are much the same as you describe—every system has a limiting factor and in our current configuration the limiting factor is the database server.  And while we have seen very high-end hardware perform at speeds higher than we anticipated (up to 27,000 transactions, 100,000 tickets/hour with our test harness), there is still more that we can do.

     As you are aware, we are currently in the process of building out our RESTful services layer, perhaps the most fundamental and important part of our Next Generation architecture. A major part of this effort is moving much of the business logic that is now expressed in database stored procedures into that services layer. The amount of business logic in the database is especially heavy in the transaction save process, and we expect to see performance gains from this change. This change also enables additional scalability, because the services layer is built in such a way that running multiple instances of it is a very inexpensive scaling solution.

     However, this approach does not fully address the problem of scaling the database. The work that Royal Opera House did adopts a standard pattern of serializing the transactions from easily scalable temporary storage into the Tessitura database, an approach that is used by many large transaction systems. (It’s the reason, for example, that your airline confirmation often appears some amount of time after you complete your order.) We have done some initial exploration into what it would take to build this type of scalability into our web cart infrastructure and will continue to explore that. At the moment we think that approach may be a better and lower cost solution for the majority of our members.

     Thanks for raising these issues and I hope to see the conversation continue here.
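
The serializing pattern described in this reply can be sketched as a buffer that accepts orders immediately and a single worker that writes them to the database at a controlled rate. This is a minimal in-process illustration, not the ROH or Tessitura design: the class name, `write_order` callback and rate limit are all hypothetical, and a production version would use durable storage rather than an in-memory queue.

```python
import queue
import threading
import time


class ThrottledOrderWriter:
    """Buffer orders and write them one at a time, at most N per second."""

    def __init__(self, write_order, max_per_second: float):
        self._write_order = write_order        # hypothetical database write
        self._interval = 1.0 / max_per_second  # minimum spacing between writes
        self._buffer = queue.Queue()           # stand-in for temporary storage
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, order: dict) -> None:
        """Accept an order immediately; the write happens later."""
        self._buffer.put(order)

    def wait_until_flushed(self) -> None:
        """Block until every submitted order has been written."""
        self._buffer.join()

    def _drain(self) -> None:
        while True:
            order = self._buffer.get()
            started = time.monotonic()
            try:
                # A real system would retry or park failed orders here.
                self._write_order(order)
            finally:
                self._buffer.task_done()
            elapsed = time.monotonic() - started
            if elapsed < self._interval:
                time.sleep(self._interval - elapsed)


if __name__ == "__main__":
    written = []
    writer = ThrottledOrderWriter(written.append, max_per_second=50)
    for order_no in range(10):
        writer.submit({"order_no": order_no})
    writer.wait_until_flushed()
    print(len(written), "orders written in arrival order")
```

Because a single worker drains the buffer, writes reach the database in arrival order and never concurrently, which trades confirmation latency for predictable database load.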

  • Former Member
    In reply to Chuck Reif

    Thanks for the input, Chuck!

    From what I hear you saying, I gather that this is on the roadmap... somewhere, but not yet scheduled and still in exploration. That sounds great, and gives me some valuable pros and cons to weigh in our current performance enhancement efforts.

    But to continue the conversation:

    Is anyone else taking steps to scale their Sites/Tessitura? How about exploring interim solutions while Chuck and the team continue development?  ROH is just one example and one solution...but the options and paths are many.  How would you go about it?  

    I know there have been many discussions of users going to AWS and other cloud providers.  What has been your experience?  

    Maybe I'm getting off topic now, but how about bigger scaling topics like session management, seat serving / deadlocks, write availability, and caching? Also, load testing and performance testing solutions? We recently started using RedGate for performance profiling and have had excellent results, instantly finding a few performance issues (of the palm-to-forehead kind).

    Thanks again for all the input.

    James

  • We have been looking at going to AWS for instant scalability on demand. In the meantime, however, we have been scaling up using our own manual process of cloning our virtual web servers for a big on-sale. So far we have gone up to a 10-server web cluster for our big on-sales and it has been working great. The way we are scaling up, we could really go to as many servers as we want, but 10 seems to work well and allows us to handle over 1000 concurrent users buying tickets. As long as we have advance notice of the on-sale date (which we usually do), we can spin up our web cluster to full capacity in about 2 hours. The nice thing about moving to AWS would be that we could instantly scale up based on the current load of the servers.

    We have also tested load balancing requests across different API servers, which we have found works well but doesn't really increase performance. The most likely cause is that by the time the API server is overloaded, the database is too. Adding API servers at that point just makes the bottleneck worse.

    Finally, we also use caching through nginx (formerly through Squid) to speed up our page load times considerably. We have determined that since implementing caching, 90%+ of our web requests are served from the cache and not from our web servers. This takes a lot of load off the web servers, leaving valuable resources for critical dynamic pages.

    I should mention that our web cluster is running Linux and utilizing HA, Load Balancing, the OCFS2 shared file system, Apache, MySQL, nginx and memcached.  It is built on the Ruby on Rails development platform.

    For tuning and performance monitoring we make use of Web Performance for load testing our website and NewRelic / AirBrake for monitoring errors and server performance.

    In our experience the Tessitura database has not been the bottleneck for us. This is most likely because we run a clustered SQL Server with 64 GB of RAM connected to a SAN with very high I/O.
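
To put the cache figure in this reply into numbers, the short calculation below estimates how much traffic still reaches the web servers at a given hit ratio. It is a back-of-envelope sketch with hypothetical function names, and it assumes every request costs the web servers the same amount, which understates the cost of dynamic pages.

```python
def origin_share(hit_ratio: float) -> float:
    """Fraction of requests that still reach the web servers."""
    if not 0.0 <= hit_ratio < 1.0:
        raise ValueError("hit_ratio must be in [0, 1)")
    return 1.0 - hit_ratio


def capacity_multiplier(hit_ratio: float) -> float:
    """How many times more total traffic the same web servers can absorb."""
    return 1.0 / origin_share(hit_ratio)


if __name__ == "__main__":
    hit_ratio = 0.90  # the "90%+" figure from the post, taken as exactly 90%
    print(round(origin_share(hit_ratio) * 1000))      # 100 of every 1000 requests
    print(round(capacity_multiplier(hit_ratio), 1))   # 10.0 times the traffic
```

The relationship is steeply non-linear: the last few points of hit ratio matter far more than the first few, which is why lengthening cache lifetimes on mostly static pages can pay off disproportionately.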

  • Former Member
    In reply to Justin Snair

    Hi all

    This is a particularly interesting conversation for me, because I'm the content planner for a WEB discussion session at the San Francisco conference addressing exactly these issues. I now have several names to add to my list of potential gurus :-)

    This is the draft session description:

    IT Solutions for Web Environments: Building a stronger foundation

    All web applications depend on the underlying IT framework, and the web team relies on the IT team for a large amount of its success. Let's work together and discuss solutions for issues such as large on-sales, monitoring, load testing, hosting and site optimization.

    Hopefully this forum discussion will continue, and then we can have a really well-informed discussion in person in July.

    Ken

     

  • Former Member
    In reply to Former Member

    Sounds like a great session, Ken. Looking forward to that one.

    Justin, sounds like you're on the right track. Right now our caching solution is on our virtual load balancer, using Riverbed's Stingray product. With a 10-minute TTL, we're serving 51% from cache. We haven't really experimented with increasing the TTL. We're also looking at a cloud service for dynamic image resizing to take that task off the server's CPU.

    We're running an IIS shop, and have also started using WebPerformance.com for testing. It's nice to hear that you're doing your own scaling successfully!

    I'm always curious to hear what people find for bottlenecks beyond the standard types of issues.