OSS and Data Migration Scalability – It’s Not Just About CPUs

I proudly discussed my finely crafted SQL statements in a previous post about Celona and data migration. The need to invest quite some time in learning a few more SQL tricks and building logically complex and very large single SQL commands (as opposed to conceptually simpler procedural PL/SQL routines) came about because I was trying to migrate some data into a network inventory (Cramer Dimension 1.0, I believe). Without this approach I found that throughput was terrible, perhaps as slow as a couple of objects a second once the data tables grew past a certain threshold. And no one else could use the shared server, despite its CPU utilization being pretty low at the time. It turned out to be an early lesson in Oracle full table scans. As the name suggests, almost every lookup I was doing (to 'join in' the new object to the rest of the network) resulted in Oracle having to scan through every single row in several tables. It wasn't CPU intensive, but it was thrashing the server's hard disk. The process was 'I/O bound', limited by the performance of storage resources, because all that data had to be read for every single object creation.
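To make the trap concrete, here's a sketch (the PORTS table and its columns are invented for illustration, not taken from the actual Cramer model):

    -- Hypothetical inventory lookup. Without an index on
    -- PORT_NAME, Oracle must read every row in PORTS for
    -- every object being migrated: a full table scan.
    SELECT port_id
      FROM ports
     WHERE port_name = :new_port_name;

    -- An index turns that scan into a cheap lookup, so each
    -- migration step reads a handful of blocks rather than
    -- the whole table:
    CREATE INDEX ports_name_idx ON ports (port_name);

With the index in place, each lookup touches a few index blocks instead of the entire table, which is exactly the difference between I/O bound and comfortably fast.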

I admit fault. It wasn't a problem with Oracle. My PL/SQL code, which I can now accept was naively written, was making Oracle's job harder than it had to be. The second lesson was that if you told Oracle what to do in a concise but precise way using SQL, rather than trying to spell it out in procedural code, you'd often get far better performance.
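The contrast looks something like this. Both versions load the same hypothetical staging table into the inventory; the table names are made up for the example:

    -- Naive procedural style: one row at a time, with a
    -- context switch between PL/SQL and SQL for every row.
    BEGIN
      FOR rec IN (SELECT * FROM staging_ports) LOOP
        INSERT INTO ports (port_id, port_name, card_id)
        VALUES (rec.port_id, rec.port_name, rec.card_id);
      END LOOP;
      COMMIT;
    END;
    /

    -- Set-based style: one concise, precise statement that
    -- lets Oracle pick an efficient bulk execution plan.
    INSERT INTO ports (port_id, port_name, card_id)
    SELECT port_id, port_name, card_id
      FROM staging_ports;
    COMMIT;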

I've seen similar issues many times since, in my own processes (less so, I like to think, as time goes on) and in other people's. The code is programmatically and syntactically correct. There's plenty of CPU power. Lots of RAM. The performance still sucks. This tends to happen a lot in data migration, transformation and load projects, probably because they lead people to craft new code in isolation from the large datasets it will eventually have to cope with.

It does seem to baffle people that it's not about CPU power. That their 3GHz quad-core workstation or 12-CPU server can only process two objects per second would almost be amusing, were it not for the project plan refactoring needed to accommodate a go-live swap-over with 1.5 years of planned downtime.

So what's the problem? "James", you're saying, "tell us which SQL function or Oracle config file we should be looking at".

It's unlikely to be that simple. If you've hit a problem like this, there's a good chance that the overall architecture of the process, whether that's an OSS provisioning process or a data migration process, is fundamentally wrong. If you're very lucky, and didn't hit problems from the outset, you might just have one or two dodgy sub-routines causing the problem. In either case the possible root causes are many and varied.

It would be a long blog post if I detailed even a few architecture 'anti-patterns', database design best practices, or optimal SQL statements. Instead, let me illustrate both the scale of the issue and, therefore, the areas you might want to start looking into. Here are the big three issues that lead to poor scalability…

Lack of concurrency by process design. You have a process that operates sequentially, handling one object at a time. This may perform reasonably well on a single-CPU, single-disk desktop PC, but it will not scale up to use multiple processors or make the most of enterprise-class storage arrays. Consider designing the process so that you can run multiple instances on batches of data, but be very careful of the data integrity issues that must then be dealt with.
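One simple way to carve the work up, assuming a numeric staging key (the names here are illustrative, not a recommendation for every case), is to give each of N parallel sessions its own disjoint slice of the data:

    -- Each migration session is started with a different
    -- :batch_no (0 to :batch_count - 1) and works only on
    -- its own slice of the staging rows, so the slices
    -- never overlap.
    SELECT staging_id, port_name
      FROM staging_ports
     WHERE MOD(staging_id, :batch_count) = :batch_no;

Disjoint slices stop two sessions fighting over the same rows, but cross-slice references (a port in one batch cabled to a port in another, say) are exactly the data integrity issues mentioned above.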

Lack of concurrency by data model design. You have concurrent processes running, but they're becoming serialized because limited, shared database resources are being locked. Any time you explicitly lock data, or implicitly lock it as part of a data modification, you reduce concurrency. Locking is essential for data integrity, but too much locking where it's not needed results in serialized processes. That means one process running at a time, in effect, and therefore only one of your server's CPU cores has any real work to do. Find the bottleneck in the data model and look carefully at how processes lock that resource.
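As one example of a fix, Oracle's SKIP LOCKED clause lets each worker grab a row that no other session holds, rather than queueing behind a shared lock. A rough sketch, again with invented table and column names:

    -- Each worker claims whatever 'PENDING' rows are free
    -- and silently skips rows another session has already
    -- locked, so workers stay busy instead of queueing.
    SELECT staging_id
      FROM staging_ports
     WHERE status = 'PENDING'
       FOR UPDATE SKIP LOCKED;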

Poor execution of database queries. We know the point of relational databases is that they do lots of clever stuff under the hood/bonnet to optimize how data is extracted for any given query. But we have to accept some responsibility for performance and phrase queries in an appropriate, precise and concise way. A single badly written SQL query can take minutes to return results, thrashing the disks and limiting any other work on the server. The same query, rewritten, can return the same result in under a second with minimal I/O. It's all down to the structure of the SQL, and can also be influenced by 'indexes' in the database that help with specific queries. There are about half a dozen common and fairly obvious (if you know what they are) causes of poor SQL performance, in Oracle specifically.
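Here's one of the classics as an example, reusing the hypothetical PORTS table and index from the earlier sketch. Wrapping a function around an indexed column quietly disables the index:

    -- The function on the indexed column stops Oracle using
    -- ports_name_idx, forcing a full table scan:
    SELECT port_id FROM ports
     WHERE UPPER(port_name) = 'GE-0/0/1';

    -- Comparing the indexed column directly (assuming names
    -- are stored in a known case) lets the index do its job:
    SELECT port_id FROM ports
     WHERE port_name = 'ge-0/0/1';

If you genuinely need case-insensitive matching, a function-based index on UPPER(port_name) is the usual answer; the point is that the choice has to be deliberate.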

Any process that has to deal with large amounts of data, which is most OSS and data migration tasks, has to be designed with performance, scalability and concurrency in mind. The most important thing is to know your database intimately. Grab a good book, specific to Oracle, MySQL or SQL Server, whatever you're using. Pick up one on SQL tuning or database architectures and read at least the first three or four chapters. Most authors cover the simple things that make the biggest difference to scalability early on (ever read a Tom Kyte book without being told about bind variables in the first three pages?). Use them. In fact, write key code patterns into your design documents.
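Bind variables are a good example of the kind of pattern worth writing down. A rough PL/SQL sketch, with invented names and values:

    DECLARE
      l_id NUMBER := 42;  -- illustrative value only
    BEGIN
      -- Literal-stuffed dynamic SQL: every distinct value
      -- forces a hard parse and fills the shared pool with
      -- one-use statements.
      EXECUTE IMMEDIATE
        'DELETE FROM staging_ports WHERE staging_id = ' || l_id;

      -- With a bind variable the statement is parsed once
      -- and reused for every value.
      EXECUTE IMMEDIATE
        'DELETE FROM staging_ports WHERE staging_id = :id'
        USING l_id;
    END;
    /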

Beyond technical knowledge of the database, a deep understanding of network inventory is also needed. What are the options for modeling, say, MTOSI-based physical objects? What data needs indexing to answer the sort of questions fired at an inventory? How should processes apply concurrency and data-integrity rules when modifying the network model in the database?

That's something I'll write about another day.
