Archive for the ‘Data Warehousing’ category

Tracking projects in the warehouse

November 7th, 2009

We’ve started to more formally track the myriad of projects involving IT at UNE just of late.  I’m not directly involved with the process but think at the last count, there were in excess of 130 currently on the go.  The latest Kimball Design Tip #118 that arrived in the email last week is all about managing project backlogs dimensionally and it got me thinking again…

img_5667

I’ve wondered for a while whether it would be worth constructing a dimensional model and associated metadata and reporting capability to assist with IT (or indeed any) project tracking.  I think I’d come to the conclusion that it would be very difficult to capture the current state of all projects in a consistent and accurate way at a frequency that improves on the traditional monthly ’stock take’.  That, after all is the problem with traditional project tracking - it tends to be performed periodically - typically in the form of a weekly, fortnightly or monthly status report.  The Project Manager scurries around getting people to tell them how much work has been done or is remaining against each of the tasks and then rolls it all up to some percentage complete figure.  This process is repeated across many projects every time the status report deadline is looming.

A problem with this is that the status report deadlines aren’t necessarily synchronised so the organisation never has a point in time picture of the status of all of its projects.  The only way around this problem I know of is to synchronise the dates in which case there is an end of month (or worse) frenzy for a day or two when no one actually does much productive work while the updates are captured.

An alternative might be for the updates to be provided constantly, trickle-fed into some storage medium (like a warehouse perhaps) on a daily basis, ready for reporting on demand at any time. No end of month frenzy and continual currency of all project data.

I realise that the above might not apply to all organisations, perhaps a monthly reporting frequency is fine and perhaps there are other reasons for needing to use this approach.  I also realise that consulting organisations are serving both internal and client reporting needs which aren’t necessarily aligned - perhaps all the more reason for a more flexible dimensional approach?

If we decided to try and implement such a system, the critical component would be the trickle-feeding of the daily updates into the warehouse.  Using Scrum as our development approach really helps in getting status updates on a current sprint and many of the modern tools, including the one we use, store the tasks and updates in a relational database that could be interrogated by a nightly ETL process.  The project backlog would have to exist and be estimated for the entire known project but of course as that flexed, so too would the multi-project reporting.

Aside from being able to report using BI tools, another nice thing about having project reporting in the warehouse would be around conformed dimension reporting where we exploit pre-existing warehouse content.  This enables reports such as Staff Project FTE : Staff Total FTE by Department Ratio which could illustrate commitment or investment in projects or programs of work at any point in the organisational structure.

Worth some more thinking?  We’ll see how readily we can get access to the project updates and the frequency of these in our brave new world of program and project management, but if this data is available then I think it could be a relatively simple schema to build and quite a powerful one to use.

New Higher Ed Online Forum

October 21st, 2009

Just a quick one in case you weren’t aware.  The Higher Education Data Warehousing forum has now progressed from a humble listserv to a website at  http://hedw.org.  Membership is free and open to anyone with an email address with an edu extension - worldwide.

hedw1_0

At the moment the listserv mailing list is running in parallel but as I understand it, the plan is to review the list and perhaps create additional or replacement lists for direct correspondence in the future.

One of the immediate benefits of registering for the new site is that you can see the product and version number of the BI platform used at each of the registered institutions along with details of the ERP system where one exists.  As of right now there are already 28 registered institutions and 122 registered users.

Heading North

August 24th, 2009

I’m really looking forward to a trip into sunny Queensland for two University special interest groups this week.

On Wednesday I’m dropping into QUT for the Load Management SIG which I understand has attendees from 21 Australasian universities.  I haven’t been to one of these but I’m really keen to understand more about how load reporting is handled at various institutions.

Then on Thursday and Friday its the 4th Annual Higher Education Data Warehousing Forum at Griffith’s Nathan Campus (just down the road from QUT).  I’m presenting something on Agile development which should be fun and I see there are also presentations scheduled from QUT, Griffith, Wollongong, UTS, UNSW and Deakin so hopefully we’ll all learn a lot from each other as a result.

Watch out for lots of SIG-inspired posts soon…

Data Testing using Visualisation

August 6th, 2009

I found a really useful and interesting way of testing large volumes of data this week, thanks to my favourite bit of visualisation software, Tableau.  We’ve been pushing big volumes of data into the warehouse, then out into Cognos Metric Store and then using ETL to bring them back in again for reporting.  All in all a fairly complicated, many-stepped and relatively risky process.

After following the usual development test steps, checking results and peer-reviewing code we would ordinarily have handed the results over for client testing.  The problem in this case is that the results are produced at an isolated unit level and that overall totals and patterns are very hard to spot.  While individual cases may look absolutely fine in 99.9% of cases, the outliers are the ones that if wrong, might upset the overall result.

So, how can we review hundreds of thousands of data items at a macro level as part of testing?

Well one way would be to write simple SQL statements that check the validity of data at different aggregation points - in our case these might be unit, course, discipline, school and faculty, but at best you would still be looking at lots of numbers and using personal interpretation to gauge correctness.

picture-1

Instead, how about playing with the data using a visualisation tool?  Manipulate the data in real time, drag and drop data items and attriubtes in and out of the visualisation at will until you’re convinced things look as they should.  What I found in the above example is that things didn’t look as they should in the case of a single combination of data.  This was the mean pass rate for a single unit across all teaching period (semesters) in a single year.  It could not validly be anything other than between 0% and 100% but clearly shows one combination way in excess of 150%.

The important point here is this wouldn’t have been identified through simply sorting rows of data by a column or basic data profiling checks because the visualisation is a representation of a formula applied to the aggregation of many rows of data (in this case passes/attempts across units, by one delivery mode, in once school in one historical year.  Yet it clearly stands out in the above visualisation.

Modern BI involves pushing out dimensional models from which literally billions of report combinations are possible to maybe hundreds or thousands of users.  As developers we no longer have the luxury of being able to validate and verify columns and rows of numbers in specific reports before they are published.  Visualisation software is a powerful tool to have at your disposal during pre-release testing of massive models like these.  I know I will be using it from now on to enhance confidence and improve the quality of outputs from the data warehouse.

Bogan Foreign ETL

July 20th, 2009

Note to persons not familiar with the Australian slang term bogan: substitute chav, pikey, ned, tokkie, redneck, whitetrash and/or no doubt many others.  Wikipedia has a full and proper definition of bogan if you’re still confused.

So what’s this post all about?  Well, sooner or later I reckon you’ll need to put some foreign language data into your warehouse which, from an ETL perspective, presents some interesting challenges, particularly if your own ETL standards weren’t conceived with foreign language data in mind.

How would you feel if confronted with an undocumented database schema in German with one of the main columns going by the name of bogen (sic)?

normal_map-germany-old

With the recent implementation of our evaluation system we hit this challenge head-on.  With no documentation or access to development staff in Germany, we were faced with a database to which we had to first make a case for SQL query access so we could have a look at the schema and see what our options were.

I should say now, that what follows is not necessarily best practice, its just what we ended up doing when faced with a pressing need and little time in which to meet it.  Maybe next time we’d do things differently, we’ll have to wait and see how things go.  There are three main rules we’ve gone with:

  1. Translate all the database tables in the schema into your native ETL language
  2. Leave the columns in the tables in their existing language
  3. Draw an ERD of your source system that shows table and column names in both languages

Why translate the table names?

This helps you understand the basic meaning in those tables.  You’ll find some strange ones and some like bogen that will raise a smile but this should tell you a huge amount about the design and the data that lives within each table, it also helps you immediately target a certain area of the system if you are building, enhancing or troubleshooting.

Why leave database columns in their existing language?

We debated this at some length and settled on the approach of leaving things as they are because of the amount of work involved, but mainly because doing this actually introduces a couple of extra problems.  Firstly you introduce risk by mis-labelling, mapping, or confusing one column from another.  This is so easy to do when dealing with something that isn’t your first language, just think about how many times you’ve done this working in your native tongue.

Secondly, when you’re developing, peer-reviewing, maintaining, enhancing or just generally looking at and trying to understand the data structures, there is now an entirely new translate layer in the process (maybe it should be called ETTL) to consider.  You would need to retain the original source column name and the translated column name.  Your initial extract processes would pull the source data and map them to friendly column names after which all would continue as normal, but what happens when one of your downstream staging queries needs investigation?  At this point you’d have this horrible reverse-translation to do which didn’t appeal to us.

Of course, leaving things in their native format has its own problems too.  It is really hard to differentiate between like terms in a foreign language so, for example, unit_version, unit_offering and unit_code may all seem very similar and easily mistaken terms.

Why generate an ERD in two languages?

This serves a couple of purposes.  Firstly, it is a great reference source for anyone having to do work on the model and having both languages involved can only help.  However, I think the main benefit is gained from the undertaking of the exercise itself.  Having to sit down and study a schema in detail and translate and hand-craft a diagram that represents it is an exercise that will result in a better understanding of the associated table structures and dependencies.  If you’re lucky you will have tools that can do at least half of what you need.

So what is bogen all about?

Well bogen translates to something to do with archery and given that the evaluations system is concerned with measurement, we’ve decided that the field contains target information.  So if we were translating columns we’d probably call that one target_value but calling it bogen or bogan is much more entertaining.

If you have experienced something similar to this I’d be very interested to hear how you approached it and what you have learned from the experience.  Please let me know.  If you’re about to embark on something similar to this then good luck!