Bogan Foreign ETL

July 20th, 2009 by Rob Hale Leave a reply »

Note to persons not familiar with the Australian slang term bogan: substitute chav, pikey, ned, tokkie, redneck, whitetrash and/or no doubt many others.  Wikipedia has a full and proper definition of bogan if you’re still confused.

So what’s this post all about?  Well, sooner or later I reckon you’ll need to put some foreign language data into your warehouse which, from an ETL perspective, presents some interesting challenges, particularly if your own ETL standards weren’t conceived with foreign language data in mind.

How would you feel if confronted with an undocumented database schema in German with one of the main columns going by the name of bogen (sic)?

normal_map-germany-old

With the recent implementation of our evaluation system we hit this challenge head-on.  With no documentation or access to development staff in Germany, we were faced with a database to which we had to first make a case for SQL query access so we could have a look at the schema and see what our options were.

I should say now, that what follows is not necessarily best practice, its just what we ended up doing when faced with a pressing need and little time in which to meet it.  Maybe next time we’d do things differently, we’ll have to wait and see how things go.  There are three main rules we’ve gone with:

  1. Translate all the database tables in the schema into your native ETL language
  2. Leave the columns in the tables in their existing language
  3. Draw an ERD of your source system that shows table and column names in both languages

Why translate the table names?

This helps you understand the basic meaning in those tables.  You’ll find some strange ones and some like bogen that will raise a smile but this should tell you a huge amount about the design and the data that lives within each table, it also helps you immediately target a certain area of the system if you are building, enhancing or troubleshooting.

Why leave database columns in their existing language?

We debated this at some length and settled on the approach of leaving things as they are because of the amount of work involved, but mainly because doing this actually introduces a couple of extra problems.  Firstly you introduce risk by mis-labelling, mapping, or confusing one column from another.  This is so easy to do when dealing with something that isn’t your first language, just think about how many times you’ve done this working in your native tongue.

Secondly, when you’re developing, peer-reviewing, maintaining, enhancing or just generally looking at and trying to understand the data structures, there is now an entirely new translate layer in the process (maybe it should be called ETTL) to consider.  You would need to retain the original source column name and the translated column name.  Your initial extract processes would pull the source data and map them to friendly column names after which all would continue as normal, but what happens when one of your downstream staging queries needs investigation?  At this point you’d have this horrible reverse-translation to do which didn’t appeal to us.

Of course, leaving things in their native format has its own problems too.  It is really hard to differentiate between like terms in a foreign language so, for example, unit_version, unit_offering and unit_code may all seem very similar and easily mistaken terms.

Why generate an ERD in two languages?

This serves a couple of purposes.  Firstly, it is a great reference source for anyone having to do work on the model and having both languages involved can only help.  However, I think the main benefit is gained from the undertaking of the exercise itself.  Having to sit down and study a schema in detail and translate and hand-craft a diagram that represents it is an exercise that will result in a better understanding of the associated table structures and dependencies.  If you’re lucky you will have tools that can do at least half of what you need.

So what is bogen all about?

Well bogen translates to something to do with archery and given that the evaluations system is concerned with measurement, we’ve decided that the field contains target information.  So if we were translating columns we’d probably call that one target_value but calling it bogen or bogan is much more entertaining.

If you have experienced something similar to this I’d be very interested to hear how you approached it and what you have learned from the experience.  Please let me know.  If you’re about to embark on something similar to this then good luck!

Advertisement

Leave a Reply