Case Study: Troubleshooting Index Engine Memory Issues - Part 2

In Part One I talked about finding the smoking gun: update activity within existing "update-only" Livelink Search Partitions that was causing JVM heap exhaustion and dying Index Engines. Open Text Customer Support had predicted this while trying to diagnose the continual problems a particular customer was having. The problem we now faced was determining what was causing these updates, because they were not expected by the customer and couldn't easily be explained.

The important twist here is that we didn't find just a few updates; we found thousands of updates all hitting the same couple dozen ObjectIDs. We found many times more updates on these few ObjectIDs than we did new activity, though the expectation was that we would only find new activity. At this stage we could have spent a lot of time trying to determine why these objects were being updated, had we attempted to solve this via technical means. Not really knowing how we could accomplish that, and knowing that the customer did not wish to spend a fortune on diagnosis (particularly when they believed this was an architectural issue they could solve with more and better hardware), we were at a bit of an impasse. The problem needed to be looked at differently, with humans thinking rather than computers calculating.

Our next team meeting was what we needed to put all the pieces together. We had already informed our customer of our findings, and that prompted them to elaborate on the business process key to their application. It was a scanning application using the Cardiff application as a front end to Livelink; in a nutshell, the application creates files within Livelink in a specific place based upon the "Name" of the document as determined when it is electronically scanned. That document is placed in a specific folder, and if a document with the same name already exists in that folder, it is assumed to be the same document and another version is created. This behaviour allows most re-scanning processes to end up creating a new version of an existing document rather than a new document with the same name.

Unfortunately this behaviour collides with another type of default behaviour, that of allowing users of the application to leave fields unfilled. Essentially the application is run in a way that lets users not complete all the fields in the User Interface, and it then falls back to one of a few established "default" names (there were a number of different types of documents being scanned, and the default names corresponded to these document types). So when "a user occasionally forgets to fill in a field, the application continues to work on the queue of documents created by the scanner" ... or something very close to that quote at least. It was at this point in the conversation that the penny dropped and all the somewhat confusing pieces of information formed a completed puzzle.
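
To make the collision concrete, here is a minimal sketch in Python of the filing rule as the customer described it. It is only a simulation, not the Cardiff or Livelink code; the default names, folder name and function name are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical default names, one per document type (not the customer's real ones).
DEFAULT_NAMES = {"invoice": "DEFAULT_INVOICE", "contract": "DEFAULT_CONTRACT"}


def file_scanned_documents(scans):
    """Simulate name-based filing: same folder + same name adds a version.

    `scans` is an iterable of (folder, doc_type, name) tuples, where `name`
    is None or empty when the user left the field blank.
    Returns a dict mapping (folder, name) to its version count.
    """
    versions = defaultdict(int)
    for folder, doc_type, name in scans:
        # A blank name falls back to the default for the document type.
        effective_name = name or DEFAULT_NAMES[doc_type]
        # An existing (folder, name) pair gets a new version, not a new object.
        versions[(folder, effective_name)] += 1
    return dict(versions)


scans = [
    ("inbox", "invoice", "INV-001"),
    ("inbox", "invoice", None),
    ("inbox", "invoice", ""),
    ("inbox", "invoice", None),
    ("inbox", "contract", None),
]
result = file_scanned_documents(scans)
for key, count in sorted(result.items()):
    print(key, count)
```

Five scans go in but only three objects come out: every unnamed invoice piles onto the same default-named object as yet another version, which is exactly the pattern of many updates against a handful of ObjectIDs.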

One oddity that we couldn't quite come to terms with was that when I calculated how much metadata each object should be using in the search index partitions (by the simple math of "total amount of RAM / total number of indexed objects"), I got a radically different number than Chris did when he calculated it from the information he gathered from the database queries. He calculated that each object had something close to 1.5 KB of metadata, but there couldn't have been that much because there wasn't the physical RAM use to support it.
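
The two estimates can be set side by side with a few lines of arithmetic. This is a sketch only: the RAM figure and object count below are made-up illustrative inputs, not the customer's numbers; only the roughly 1.5 KB per object comes from Chris's calculation.

```python
def bytes_per_object_from_ram(total_ram_bytes, indexed_objects):
    """My estimate: RAM in use divided by the number of objects assumed indexed."""
    return total_ram_bytes / indexed_objects


def implied_object_count(total_ram_bytes, bytes_per_object):
    """Turn it around: how many objects would that RAM hold at a given size?"""
    return total_ram_bytes / bytes_per_object


# Illustrative inputs only (hypothetical, not measured values).
total_ram_bytes = 300 * 1024 * 1024   # RAM used by the index partitions
assumed_indexed = 1_000_000           # objects believed to be indexed
db_bytes_per_object = 1.5 * 1024      # ~1.5 KB per object, from the database side

ram_estimate = bytes_per_object_from_ram(total_ram_bytes, assumed_indexed)
implied_count = implied_object_count(total_ram_bytes, db_bytes_per_object)

print(f"RAM-based estimate: {ram_estimate:.0f} bytes per object")
print(f"Objects that would fit at 1.5 KB each: {implied_count:.0f}")
```

If both per-object figures are honest, the only number left to doubt is the object count: the RAM in use can only account for a fraction of the objects everyone assumed were indexed.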

And then there was the oddity of the updates themselves: there weren't supposed to be any, let alone thousands per month. There was no system-level explanation; Livelink wasn't supposed to be updating these files unless there was some access to them, and nothing about the business process was supposed to do that. But the aforementioned default behaviour wasn't happening "once in a while"; it was happening, at least in some months, with a frequency close to 10 times that of successful updates. In other words, a whole lot more information was thought to be in the system than actually was in the system: many tens, perhaps hundreds, of thousands of files thought to be indexed were not.

You just never know where things are going to lead

We got involved because, with the information we had been provided, it sounded like the customer had pushed the limits of Open Text's search engine infrastructure, gotten some conflicting advice from Open Text and an unpaid consultant (who shall remain nameless), and needed help to figure out which parts of OT's advice versus the consultant's advice should be listened to. (BTW: on balance, both were wrong, but Open Text's recommendations came in a well-written document with their reasoning behind it; the consultant's advice came from a phone call.) But that isn't at all what we uncovered.

What we found was evidence that an application thought to have been stable for the previous 18 months or so was in fact not stable at all. We had shown the customer that they had best regroup and rethink the solution, and (I suspect) that some of their monitoring and quality assurance procedures were in doubt. The customer was initially against our weekly meetings due to costs, but we insisted that they held value, and in the end it was these meetings that shone the necessary light on the problem.

I'm not sure whether we could have solved this faster by tackling the problem differently, or indeed whether either I or Chris could have solved it completely alone (well, probably Chris could have in time, not sure I could have) ... it was the anomalies that were striking, but those anomalies only showed up with the two distinct approaches to looking at the problem. Techies tend to look for technical problems, but in this case the real problem was more procedural or process-based than technical. The question "what is getting updated?" was asked and answered as "nothing" many times; once we had this last discussion, the customer (who didn't really understand the significance) admitted to the process as described above but seriously assumed this happened at most once a day, not many times each second.

What's really missing is proper data integrity and sanity checks

The problem here was not really Open Text's fault, notwithstanding their admission to sub-optimal JVM heap garbage collection; it was a mix of some questionable default behaviour and typical end-user behaviour. Even so, Livelink did not make it at all easy to determine the problem. Humans can read this summary and understand right away that something that fails 10 times more often than it succeeds ought to be caught automatically, and that with 14+ million documents of the same type having only 1 version, there is something worth investigating when 50 of them have hundreds of versions.

Now Livelink is a toolbox and it probably isn't reasonable to expect it to be able to flag such things. In a very real sense the failures were successes to Livelink; the transactions it was asked to do were done perfectly. It is only the application/solution that could understand that anything more than 2 versions of an object strongly suggests a problem to be investigated ... but such checks are not being done.
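
The kind of check I have in mind is not complicated. Here is a minimal sketch in Python, assuming the solution can get hold of a version count per ObjectID by whatever means (a database report, an export); the function name, the sample IDs and the counts are all hypothetical.

```python
def flag_suspicious_objects(version_counts, max_expected_versions=2):
    """Return (object_id, versions) pairs above the threshold, worst first.

    `version_counts` maps an ObjectID to its number of versions.
    The default threshold of 2 reflects this solution's expectation that
    anything beyond a single re-scan is worth a look.
    """
    flagged = [
        (object_id, versions)
        for object_id, versions in version_counts.items()
        if versions > max_expected_versions
    ]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample: mostly single-version documents, two runaway objects.
counts = {1001: 1, 1002: 2, 1003: 350, 1004: 1, 1005: 120}
flagged = flag_suspicious_objects(counts)
for object_id, versions in flagged:
    print(f"ObjectID {object_id} has {versions} versions")
```

Run on a schedule, even something this crude would have pointed at the default-named objects long before the Index Engines started dying.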

But Livelink doesn't make it easy either; there is no readily available tool or query that will tell you what objects are actually in an index partition. We kept asking the customer if they were sure they weren't losing information and they kept assuring us they weren't. Yet when confronted with the truth that they were in fact losing information, they claimed to have known about that all along but that it was a rare occurrence (they'd know because a search would fail that should succeed, and the document batch would then be rescanned and resubmitted ... of course that whole batch represents updates that will occur whether or not the original worked). There was / is simply no easy method to report on the contents of search partitions, and thus there was no easy way for the customer to have known things weren't working.

Perhaps that will get built one day and such problems can be caught long before they are a problem.