EPiServer CMS, Lucene.NET and a recycling application domain
This blog post is about troubleshooting seemingly random recycles of the application domain. We take a look at the symptoms, the troubleshooting and finally the resolution to the problem. This case involves EPiServer CMS and Lucene.NET, but the same troubleshooting approach can be applied wherever similar symptoms show up.
Introduction & background
EasySearch is an add-on to EPiServer CMS which incorporates the search engine Lucene.NET. EasySearch uses an event-driven architecture to keep the search index up to date for pages in EPiServer CMS.
Before first use, the search index needs to be created. A scheduled job does this by iterating through the page tree from the start page onwards.
The website also implemented a custom page provider to serve content from back-end systems. The content served by the custom page provider was the primary data source that needed to be indexed, containing roughly 20 000 items.
Problem & behaviour
When scheduling and running the EasySearch indexing job, the application domain serving our website would recycle seemingly at random.
There were no entries in the event log, nor any exceptions caught by the scheduled indexing job. The application domain would simply recycle without giving any reason.
The indexing job could run for anywhere between 5 and 10 minutes before the application domain was recycled and the thread executing the indexing job was lost with it.
Debugging the process
The application domain will be forced to recycle if an unhandled exception occurs in a non-request thread. Such an exception also brings down the worker process, and every application domain it hosts goes down with it. This is bad.
How does this relate to EPiServer CMS and scheduled jobs? A non-request thread is a thread that has been initiated without an actual HTTP request. In our case this would be a thread instantiated by the scheduler service.
Other “unhandled” exceptions, those thrown while serving a request, are actually caught by the ASP.NET error handler, which usually displays the yellow screen of death.
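To make the distinction concrete, here is a minimal sketch of how code running on a non-request thread can protect itself. JobGuard, TryRun and RegisterLastChanceLogging are hypothetical names made up for this illustration; they are not part of EPiServer CMS or EasySearch. The only framework feature relied upon is the AppDomain.UnhandledException event.

```csharp
using System;

// Hypothetical helper for illustration only; not an EPiServer or EasySearch API.
public static class JobGuard
{
    // Runs the job body and returns the exception it threw, or null on success.
    // Nothing escapes to the thread, so the worker process is not torn down.
    public static Exception TryRun(Action job)
    {
        try
        {
            job();
            return null;
        }
        catch (Exception ex)
        {
            // A real site would hand this to its logger (log4net, for example).
            return ex;
        }
    }

    // Last-chance hook: records an exception that nobody caught.
    // It cannot prevent the process from terminating, it only leaves a trace.
    public static void RegisterLastChanceLogging(Action<string> log)
    {
        AppDomain.CurrentDomain.UnhandledException +=
            (sender, e) => log("Unhandled exception: " + e.ExceptionObject);
    }
}
```

Wrapping the body of a scheduled job in something like TryRun rules out the "unhandled exception on a background thread" explanation, which is what makes the remaining symptoms in this post interesting.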
Troubleshooting thus began by simply attaching a debugging session to the process and instructing Visual Studio to break whenever an exception occurred.
This way we should be able to catch the exception before our worker process is mercilessly gunned down. No exception was caught, however, and the worker process died with a last cry of “-2”, taking the application domains inside it along.
log4net and ELMAH
ELMAH should catch any unhandled exceptions and log them to persistent storage. log4net can log information about how your code is being executed and the data processed.
I was hoping these tools would reveal bad data as we were traversing the page structure served by the custom page provider.
But after numerous attempts there was no correlation to the data. It seemed to be very random, but it never really is, is it?
Health monitoring for the application pool was turned off. The website was also instructed never to time out any requests.
Instead of scheduling the indexing job, I triggered it manually. I expected it to fail as it always did, but instead it just steamed on, eventually finishing and leaving us with an indexed website. Huh?
The difference between manually triggering the job and having the scheduler service run it is basically access to the HttpContext.Current object, which only exists when there is an actual HTTP request.
Hoping the lack of an HttpContext object was the root of the problem, I implemented null reference checks and try/catch blocks where needed. Still the same behaviour.
HttpRuntime and reflection
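The next step was to ask ASP.NET itself why it was shutting down. HttpRuntime keeps the shutdown reason and stack in private fields, which can be read through reflection, for example when the application ends:
// Requires: using System.Reflection; using System.Web;
// Fetch the singleton HttpRuntime instance from its private static field.
HttpRuntime runtime = (HttpRuntime)typeof(System.Web.HttpRuntime).InvokeMember("_theRuntime", BindingFlags.NonPublic | BindingFlags.Static | BindingFlags.GetField, null, null, null);
// Read the private instance fields holding the shutdown reason and the stack trace that triggered it.
string shutDownMessage = (string)runtime.GetType().InvokeMember("_shutDownMessage", BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, runtime, null);
string shutDownStack = (string)runtime.GetType().InvokeMember("_shutDownStack", BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.GetField, null, runtime, null);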
The sheer amount of writing to the index files, which were located directly beneath the website root, apparently caused ASP.NET to recycle the application domain due to too many file change notifications.
Moving the index folder to a location outside the website root solved the problem. No more recycling of the application domain.
Do not keep files which are frequently written to directly beneath the website folder. This may cause ASP.NET to recycle the application domain.
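A simple guard can catch this kind of misconfiguration early. The sketch below checks whether a folder lies beneath a given root; IndexLocationCheck and IsBeneath are hypothetical names invented for this example and are not part of EasySearch or Lucene.NET. The comparison ignores case on the assumption that the site runs on a Windows file system.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not an EasySearch or Lucene.NET API.
public static class IndexLocationCheck
{
    // Returns true when candidatePath is the site root itself or lies beneath it.
    public static bool IsBeneath(string siteRoot, string candidatePath)
    {
        string root = WithTrailingSeparator(Path.GetFullPath(siteRoot));
        string candidate = WithTrailingSeparator(Path.GetFullPath(candidatePath));

        // Assumption: a case-insensitive file system, as on Windows.
        return candidate.StartsWith(root, StringComparison.OrdinalIgnoreCase);
    }

    // Ensures "C:\site" does not accidentally match "C:\site-index".
    private static string WithTrailingSeparator(string path)
    {
        return path.TrimEnd(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar)
               + Path.DirectorySeparatorChar;
    }
}
```

At startup the site root could be taken from HostingEnvironment.ApplicationPhysicalPath and compared against the configured index folder, logging a warning when IsBeneath returns true. Where the index folder is configured is an assumption here and depends on your setup.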
Thanks for reading!