Archive for the ‘Data Warehousing’ category

Bogan Foreign ETL

July 20th, 2009

Note to persons not familiar with the Australian slang term bogan: substitute chav, pikey, ned, tokkie, redneck, whitetrash and/or no doubt many others.  Wikipedia has a full and proper definition of bogan if you’re still confused.

So what’s this post all about?  Well, sooner or later I reckon you’ll need to put some foreign language data into your warehouse which, from an ETL perspective, presents some interesting challenges, particularly if your own ETL standards weren’t conceived with foreign language data in mind.

How would you feel if confronted with an undocumented database schema in German with one of the main columns going by the name of bogen (sic)?

normal_map-germany-old

With the recent implementation of our evaluation system we hit this challenge head-on.  With no documentation or access to development staff in Germany, we were faced with a database to which we had to first make a case for SQL query access so we could have a look at the schema and see what our options were.

I should say now, that what follows is not necessarily best practice, its just what we ended up doing when faced with a pressing need and little time in which to meet it.  Maybe next time we’d do things differently, we’ll have to wait and see how things go.  There are three main rules we’ve gone with:

  1. Translate all the database tables in the schema into your native ETL language
  2. Leave the columns in the tables in their existing language
  3. Draw an ERD of your source system that shows table and column names in both languages

Why translate the table names?

This helps you understand the basic meaning in those tables.  You’ll find some strange ones and some like bogen that will raise a smile but this should tell you a huge amount about the design and the data that lives within each table, it also helps you immediately target a certain area of the system if you are building, enhancing or troubleshooting.

Why leave database columns in their existing language?

We debated this at some length and settled on the approach of leaving things as they are because of the amount of work involved, but mainly because doing this actually introduces a couple of extra problems.  Firstly you introduce risk by mis-labelling, mapping, or confusing one column from another.  This is so easy to do when dealing with something that isn’t your first language, just think about how many times you’ve done this working in your native tongue.

Secondly, when you’re developing, peer-reviewing, maintaining, enhancing or just generally looking at and trying to understand the data structures, there is now an entirely new translate layer in the process (maybe it should be called ETTL) to consider.  You would need to retain the original source column name and the translated column name.  Your initial extract processes would pull the source data and map them to friendly column names after which all would continue as normal, but what happens when one of your downstream staging queries needs investigation?  At this point you’d have this horrible reverse-translation to do which didn’t appeal to us.

Of course, leaving things in their native format has its own problems too.  It is really hard to differentiate between like terms in a foreign language so, for example, unit_version, unit_offering and unit_code may all seem very similar and easily mistaken terms.

Why generate an ERD in two languages?

This serves a couple of purposes.  Firstly, it is a great reference source for anyone having to do work on the model and having both languages involved can only help.  However, I think the main benefit is gained from the undertaking of the exercise itself.  Having to sit down and study a schema in detail and translate and hand-craft a diagram that represents it is an exercise that will result in a better understanding of the associated table structures and dependencies.  If you’re lucky you will have tools that can do at least half of what you need.

So what is bogen all about?

Well bogen translates to something to do with archery and given that the evaluations system is concerned with measurement, we’ve decided that the field contains target information.  So if we were translating columns we’d probably call that one target_value but calling it bogen or bogan is much more entertaining.

If you have experienced something similar to this I’d be very interested to hear how you approached it and what you have learned from the experience.  Please let me know.  If you’re about to embark on something similar to this then good luck!

Visualising Data Warehouse Volumes

July 13th, 2009

Envisaging the size of data warehouses can be hard and these snippets are from a much larger infographic on mozy.com which communicates data volume very well.  The whole post is here and is worth a look if you need to put data volume into context.

picture-2

picture-4

Our biggest warehouse table has just over 43 million rows in it.  This is a daily periodic snapshot of enrolment which we’ve been running since July 2007.  With a row length of 82 bytes that equates to a mere 3.2GB.  A relative drop in the petabyte ocean…

picture-1

Tracking ETL Performance

June 19th, 2009

Something that has been sitting on our list of things-to-do-when-we-have-a-moment for a long time is some automated auditing of ETL.  Ralph Kimball talked about the need to build monitoring into the ETL process during his ETL Architecture in Depth course and with our warehouse growing in size and complexity every day, it is no longer something we instinctively know.

James had a bit of time during our last sprint and created a neat Framework Manager package referencing the Cognos audit tables.  We now have a really useful set of automated charts that are produced each night showing the performance of each ETL step within the following three major ETL subsystems.

  • ODS - Operational Data Extraction
  • STG - Dimension and Fact Staging
  • PRS - Dimension and Fact Presentation

The following snippet of one chart from the STG subsystem shows build times in seconds for each table.  The S_x prefix is our naming convention for Staging tables where x is either (F)act, (B)ridge or (D)imension.  The actual charts are much lengthier than this example and are great for identifying bottlenecks and where to do some fine-tuning to improve performance.

picture-5

Our overall ETL jobstream in production typically takes between 1-2 hours depending on workload and environmental factors.

What is interesting is comparing these charts with the Development and User Acceptance warehouse jobstreams.  We’ve spotted instances where we’re missing an index or an index is different between environments purely through looking at the difference in performance.  Its a really useful tool in that respect.

What we plan to do in the future is track performance of individual steps over time to see in the impact of seasonality - such as heavy transactional load when results are published or during peak admission or enrolment times.  This same technique on a more macro timescale will help us in understanding warehouse growth and the impact of that on database performance which we can feed back to our DBA’s and Data Centre to assist capacity planning.

This is the type of thing that I imagine gets forgotten or just not done but if you can find the time, I reckon it will pay huge dividends.  You’ll spot issues in time to do something about them and gain a lot of valuable intelligence about your warehouse.  Of course, no one in the business will sponsor you to do this…

Applying Scrum to BI/DW Part1

June 1st, 2009

This post is inspired by Abdel’s comment here in which he asks for some specifics about how we actually apply Scrum in a BI/DW environment. I’ll answer the questions directly and add a few bits and pieces as they come to me. Of course this is just one perspective and as with most things in life, we’re learning as we go but I am more convinced than ever that Agile is the way to go for BI/DW development.

How do you specify teams?

We have little option here. There are 3 of us in our BI/DW team so we are always involved. We’re doing a lot of work with the Planning department at the moment so they are also in the team most of the time. There are of course times when we need to pull in a boutique developer - someone who has a special skill or expertise we need for a specific purpose - so they become part of the team for perhaps just one or two sprints.

The biggest hurdle we’ve had to overcome is identifying and getting a product owner, someone who not only has the vision but can also make decisions. These people are in high demand and are few and far between in most organisations. If they do their job properly they also need to do a lot of running around and consulting in order to make sure what we’re working on is indeed what we should be working on. We’ve found one now, I just hope we can hang onto them.

How do you choose the sprint process or sub-process?

This is where the product owner comes in although generally in practice we just know what it is we are supposed to be working on. We still have some kind of governance structure in place and I imagine you do too. There will be someone saying ‘you need to get x, y and z delivered’. On this basis you can say we’ll work on x for 2 months and then review what has been achieved. In that 2 months you might do 7 sprints and each one will chip away at the backlog of requirements until either your product owner says they have enough or you have some other higher priority to focus on.

When is it ok to stop your datamart development and move on?

I definitely think it is better and easier to focus on a delivery area for a period of time rather than saying you will deliver a product in its entirety - you’re getting into dangerous territory if you do the latter as you can’t accurately estimate when that will be and as we know, the scope and priority of the product delivery components will change constantly from day one. You specifically asked about delivering a datamart - you might for example, choose to deliver a single fact table with maybe 6 key dimensions. You might have other stories on your backlog that you might not get to. These could be things like:

  • deliver an aggregated fact table or some kind of high-performance summary table
  • migrate historical data into the core fact table
  • add new attributes to some of the core dimensions
  • add new dimensions to the model

This is where priorities and time come into your equation. If you can, you may keep delivering all of the above, or you may stop and put your energies into something entirely different. The approach manages whatever is needed for your organisation. The main thing is to ensure you deliver something of value in each sprint. Dimensional models are by design very extensible so this again lends itself very well to the above.

Sprint Duration

We’ve moved to 10 business day sprints, that works much better for us. A month is just too long and we run out of steam after the second week. With a demo day, we’re now not starting sprints on the same day of the week each time but that seems to be working out ok. Keep the flexibility to try new things, I think that is the most important point here.

Daily Stand-Up Location

In an attempt to improve collaboration we are having our daily stand-ups in the business area that we are doing the development for. Although this is a little bit inconvenient for us, it seems to help, we are working with them on their terms and often it is the best way for them to see and hear what we are doing. It also means they can easily get involved in our daily meetings and I think sends the right signals.

ETL, the Canary in the IT Coal Mine

April 11th, 2009

One of our DBA’s remarked last weekend that the data warehouse, specifically the ETL part, is the canary in the coal mine of IT infrastructure and systems and I think he is absolutely right.

canary_0

We’ve been having a few problems recently with our IT infrastructure during a time of significant change for UNE.  The cutover to our ‘new’ network is happening right now and as a consequence, the configuration of just about every device is being tinkered with to some extent.  Invariably this means that things are going to break.

Historically in an IT organisation, when things break in the middle of the day with hundreds or thousands of people consuming the related services, things get noticed and hopefully fixed pretty quickly.  When they break in the middle of the night, they don’t necessarily get noticed straight away.  When they break in the middle of the night on a weekend, or a public holiday they might not get noticed for a significant elapsed time.

Nowadays we have data warehouses and an associated daily ETL process that runs overnight and touches just about every significant online system that the university owns.  It also hurls vast quantities of data around the network and stores yet more data in the corprate data store.  Then there is the reporting which retrieves data from the corporate data store, manipulates it on yet further servers on a different platform, records activity in another data store on another platform before delivering output to filesystems and users’ mailboxes.  If anything in this complex arrangement fails, it gets noticed by ETL, immediately.  You might not like that and  your IT colleagues might not like it either when they get called up on a Sunday morning because something has failed.

So what does this all mean for for BI/DW?  Well I think it says something quite significant:

If you decide to have a serious BI/DW environment then it will have ramifications you may not have considered.  It will highlight things about your IT organisation that you might not necessarily want highlighted and which will require greater resource to manage.  If you act on what is highlighted however, it will result in a more robust and higher level of service to the broader organisation.

I’m not sure if data warehouses can save lives but canaries certainly can, if you choose to listen to them ;)