I still hear quite a lot of discussion about how the nightly ETL window is a problem for many organisations and how they can’t do everything they need to in the time they have available. To me this just reinforces the importance of spending time designing and building high-performance solutions that take advantage of the capabilities of your rdbms rather than relying on server performance to get you out of trouble. Here’s an example of a problem that when properly analysed and understood was resolved with a resultant performance gain that exceeded all expectations.
I recently returned to a problem I spectacularly failed to solve 2 years ago. The problem was how to translate tens of thousands of IP addresses into a geographic latitude and longitude - really, really fast. This problem is related to an earlier post on course and unit searches that were being used to help predict load.
The challenge is to convert each IP address into an IP number (an integer representation of the four octets) and then execute a query to find which of the current 3 million IP address ranges in the world the IP number fits into. The problem is compounded by not having any specific way of indexing and/or searching for a match based on WHERE… BETWEEN logic using Oracle and the need to do this for up to 40,000 rows a day as part of the nightly ETL.
I reconstructed the approach I’d used previously - indexing IP_FROM and IP_TO and doing as much code optimisation as I could manage. The result was 991 seconds to process a sample of 13,069 records - about 16 minutes. I’d also tried partitioning the IP address range table but hadn’t had much joy there either.
So at this point I could have decided that was the best performance we were going to get and just worn the consequences. Maybe I would have re-architected the ETL window with a separate parallel stream just for this particular piece of fairly discrete activity. However, that is the point of this article, we should resist that temptation and find a more optimal solution.
Eventually, using what appears to be an extremely trivial syntactical change, the same 13,069 rows are now processed in 0.2217 seconds. Yes, you read that properly, down from quarter of an hour to less than a second.
To understand how such a massive performance improvement could occur, I need to explain a bit about how Oracle handles the between search. Central to it is the main query - SELECT LATITUDE, LONGITUDE FROM BIG_TABLE WHERE IP_ADDRESS BETWEEN IP_FROM AND IP_TO
Both the IP_FROM and IP_TO columns were in the same index but to resolve the SELECT…BETWEEN query the database needed to return all the rows where IP_ADDRESS is less than IP_FROM and then (through another query) exclude the rows where IP_ADDRESS is greater than IP_TO.
Here’s the clever bit…
Because my index has columns IP_FROM, and IP_TO in ascending order, there will probably be hundreds of thousands of rows where IP_FROM is lower than the number I am looking for and in fact, the row I want will be the last row looked at because at this point my number is bigger than IP_FROM (and less than IP_TO). If I reverse the order of the columns in my index (IP_TO, IP_FROM) and qualify the query with …AND ROWNUM = 1 then the first row the database checks is the one I want and performance goes through the roof.
I won’t pretend I worked this out on my own. I found an excellent article here which gives a fuller explanation of the process. The point is, the optimisation is possible and worth the extra effort. Yes, it took me another couple of days of work and several hours of dropping and rebuilding indexes to test out the theories and find the best approach but along the way I have a better understanding of my rdbms and our ETL continues to be fast and is managed and run without compromise.

