I found a really useful and interesting way of testing large volumes of data this week, thanks to my favourite piece of visualisation software, Tableau. We’ve been pushing big volumes of data into the warehouse, out into the Cognos Metric Store, and then using ETL to bring them back in again for reporting. All in all, a fairly complicated, many-stepped and relatively risky process.
After following the usual development test steps, checking results and peer-reviewing code, we would ordinarily have handed the results over for client testing. The problem in this case is that the results are produced at an isolated unit level, and overall totals and patterns are very hard to spot. While individual results may look absolutely fine 99.9% of the time, the outliers are the ones that, if wrong, might upset the overall result.
So, how can we review hundreds of thousands of data items at a macro level as part of testing?
Well, one way would be to write simple SQL statements that check the validity of the data at different aggregation points – in our case these might be unit, course, discipline, school and faculty – but at best you would still be looking at lots of numbers and relying on personal interpretation to gauge correctness.
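To make that concrete, here is a minimal sketch of what one of those aggregation-point checks might look like, using Python’s built-in sqlite3 as a stand-in for the warehouse. The table and column names (`results`, `unit`, `school`, `passes`, `attempts`) are hypothetical, invented for illustration rather than taken from our actual model:

```python
import sqlite3

# Build a tiny in-memory stand-in for the warehouse, with
# hypothetical columns: unit, school, passes, attempts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (unit TEXT, school TEXT, passes INT, attempts INT)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        ("U100", "Science", 80, 100),
        ("U200", "Science", 45, 50),
        ("U300", "Arts", 70, 90),
    ],
)

# A simple validity check at the school aggregation point:
# the aggregate pass rate should always fall between 0% and 100%.
query = """
SELECT school,
       100.0 * SUM(passes) / SUM(attempts) AS pass_rate
FROM results
GROUP BY school
"""
for school, rate in conn.execute(query):
    print(f"{school}: {rate:.1f}%")
```

The trouble, as noted above, is that you would need one such query (and one set of eyeballs) per aggregation point, and a sea of plausible-looking numbers is easy to misread.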
Instead, how about playing with the data using a visualisation tool? Manipulate the data in real time, dragging and dropping data items and attributes in and out of the visualisation at will until you’re convinced things look as they should. What I found in the above example was that things didn’t look as they should for one particular combination of data: the mean pass rate for a single unit across all teaching periods (semesters) in a single year. It could not validly be anything other than between 0% and 100%, yet one combination clearly showed a value well in excess of 150%.
The important point here is that this wouldn’t have been identified through simply sorting rows of data by a column, or through basic data profiling checks, because the visualisation is a representation of a formula applied to the aggregation of many rows of data (in this case, passes/attempts across units, by one delivery mode, in one school, in one historical year). Yet it clearly stands out in the above visualisation.
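The same idea can be scripted as a macro-level sanity check: apply the formula over every aggregation combination and flag any result that falls outside its valid range. A minimal sketch in plain Python, with hypothetical row data (the inflated `passes` figure mimics the kind of fan-out a multi-step ETL pipeline can introduce, though I can’t say that was the actual cause in our case):

```python
from collections import defaultdict

# Hypothetical rows: (unit, teaching_period, passes, attempts).
# U100's first row has passes inflated past attempts, the sort of
# defect a bad join in a many-stepped ETL process can produce.
rows = [
    ("U100", "S1", 90, 50),   # passes fanned out by a bad join
    ("U100", "S2", 45, 50),
    ("U200", "S1", 30, 60),
]

# Aggregate passes and attempts per unit, then apply the formula
# to the aggregates (not to each row individually).
totals = defaultdict(lambda: [0, 0])
for unit, period, passes, attempts in rows:
    totals[unit][0] += passes
    totals[unit][1] += attempts

# Flag any unit whose aggregate pass rate is outside 0-100%.
flagged = [
    unit
    for unit, (p, a) in totals.items()
    if not 0 <= 100.0 * p / a <= 100
]
print(flagged)
```

Here U100 aggregates to 135 passes over 100 attempts – a 135% pass rate that no row-by-row inspection would reveal, because each individual row can still look plausible on its own.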
Modern BI involves pushing out dimensional models to maybe hundreds or thousands of users, from which literally billions of report combinations are possible. As developers we no longer have the luxury of being able to validate and verify the columns and rows of numbers in specific reports before they are published. Visualisation software is a powerful tool to have at your disposal during pre-release testing of massive models like these. I know I will be using it from now on to enhance confidence and improve the quality of outputs from the data warehouse.