Case Study: Troubleshooting Index Engine Memory Issues - Part 1

The client is the financial arm of a major automobile manufacturer. Its Livelink-based solution uses the Cardiff scanning solution to scan and classify incoming documents, index the classifications, and archive the documents. The business mostly uses the search engine, which brings up client documents during review, termination, and other events. The existing solution holds about 15 million documents and grows by approximately 1,000 per business day.

The problem IT was seeing was the index engine process dying when the underlying Java Virtual Machine (JVM) exhausted its memory. This is a known and common problem, and it is particularly relevant when the search and index engines are deployed on a Microsoft Windows platform. While a 64-bit JVM exists for Windows, it has not been certified by Open Text for use with Content Server (Livelink), and thus all Windows deployments use a 32-bit JVM. The consequence is that the JVM has a hard memory limit of approximately 1.4GB of RAM; once that limit is hit, the JVM process dies.

This limitation is not generally an issue for most applications; after all, 1.4GB of RAM is a very large footprint! However, for a search index intended to hold 1 million documents, it provides only about 1,400 bytes per document to index all of the relevant data that business users might regularly search for ... that isn't a lot of data by any stretch.
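The per-document figure above is simple division. A minimal sketch in Python, assuming decimal gigabytes and (optimistically) that the entire 1.4GB ceiling is available for index data; the function name is only illustrative:

```python
# Rough per-document memory budget for a single index partition.
# Assumption: 1.4GB is read as 1,400,000,000 bytes and all of it is
# usable for index data, which overstates the real budget.
JVM_LIMIT_BYTES = 1_400_000_000


def bytes_per_document(limit_bytes: int, documents: int) -> int:
    """Return the whole number of bytes available per indexed document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return limit_bytes // documents


budget = bytes_per_document(JVM_LIMIT_BYTES, 1_000_000)
print(f"{budget:,} bytes per document")  # prints: 1,400 bytes per document
```

Packing more objects into the same partition only shrinks that number, which is why a partition filled close to capacity has so little headroom.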

Nonetheless, the current search/index hardware recommendations from Open Text suggest a dedicated CPU per "index partition" (a search and index engine pairing) ... an awesome cost when faced with 10s or 100s of millions of documents ... RAM may be cheap these days, but server-class CPUs are most certainly NOT! Thus it comes as no surprise that this client, like many other Livelink customers before it, attempted to maximize the number of objects within a partition ... filling each nearly to capacity and thus seeing frequent failures as content within those partitions was updated. The problem was that the partitions were all in "update only" mode, and the client was convinced there were no changes being made to existing content.

The Diagnosis -- Step One: find the updates

While Open Text generally makes very good software, they are not immune to bugs, and as a customer of Open Text there is no real reason to believe that the fault of any issue is yours; there is a good possibility of it being a bug in the software. In this case there was an acknowledged "bug" within the index engine that caused fragmentation of the JVM heap memory, which I will, perhaps inaccurately, describe as a typical garbage collection issue. The "bug" isn't of the type that can easily be fixed; it is essentially a design issue that only affects very large sites and only those pushing technological envelopes.

If this were indeed the problem, as Open Text Customer Support indicated it was, then there must be some indication of updates, even in the face of the client's understanding. The reasoning is simple: a bug so egregious that it sent new content to an update-only index would a: rear its ugly head often; and b: not be difficult to find or fix. That left updates to existing content as the only plausible source of the memory growth.

Thus, despite the fact that the client was expecting an architectural diagnosis that suggested a great many new servers and perhaps a redesign, what was required was a more focused business discussion. What was found was that a few dozen objectIDs within their system were being updated many, many times with what appeared to be redundant, useless information. It seemed like a hiccup, or a skipping record (for those with memories of vinyl) ... and it was obviously automatic, in that there were instances of objects being updated as many as 13 times in 10 seconds ... too fast for human interaction.
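For readers who want to hunt for the same pattern, here is a minimal sketch in Python of how such bursts could be flagged. It is not the tooling used on this engagement: it assumes the update events have already been extracted into (object ID, timestamp) pairs, and the function name, thresholds, and sample data are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_update_bursts(events, min_updates=10, window=timedelta(seconds=10)):
    """Return {object_id: peak update count within any window-length span}
    for every object whose peak reaches min_updates.

    events is an iterable of (object_id, datetime) pairs (hypothetical format).
    """
    by_object = defaultdict(list)
    for object_id, stamp in events:
        by_object[object_id].append(stamp)

    bursts = {}
    for object_id, stamps in by_object.items():
        stamps.sort()
        start = 0
        peak = 0
        # Sliding window: advance start until the span fits inside window.
        for end in range(len(stamps)):
            while stamps[end] - stamps[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak >= min_updates:
            bursts[object_id] = peak
    return bursts


# Made-up sample: object 1001 is updated 13 times in 9 seconds,
# object 2002 three times, an hour apart.
base = datetime(2000, 1, 1, 9, 0, 0)
sample = [(1001, base + timedelta(milliseconds=750 * i)) for i in range(13)]
sample += [(2002, base + timedelta(hours=i)) for i in range(3)]

print(find_update_bursts(sample))  # prints: {1001: 13}
```

The threshold of 10 updates per 10 seconds is an arbitrary choice; anything well beyond what a person could do by hand makes a reasonable starting point.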

So the updates were occurring. The smoking gun had been found, but there was still a mystery on our hands ... why oh why was this happening? You'll have to wait until part 2 to find out.