Woah, Nellie.NET!

From my simple mind to… mine…

Thoughts on ETL tools, "domain-driven" data warehousing, and the state of data

leave a comment »

Several of my colleagues and I were discussing Ayende’s use of Filehelpers for his brand new ETL tool. Filehelpers is a way to access data and write back data to a file. You might want to load data off of a file if the information comes to you in a file format on a regular basis. In the leads business, a lot of providers are in the dark ages, often hand composing files in Excel and sending leads over to you in whatever format. Don’t even THINK about talking SOA with these guys!

SSIS sucks as an ETL tool; it is difficult to setup/deploy, delivers cryptic error messages, and in general adds a lot of fluff to what oftentimes should be a simple data extraction process. I think Ayende is justified in his approach of creating a tool. I’d like to venture to say that there are very few people qualified to do so.

ETL == Extract, Translate, Load – is a common buzzword used in data warehousing for an intermediary step used to cleanse, modify and load data from multiple sources into a common database. Often this is a staging database, but sometimes directly into the warehouse itself. Imagine trying to decipher I.B.M. from one system, IBM from another, and International Business Machines from yet another. They all have to mean the same thing in the final warehouse for the data to be useful. It can be a tough problem to solve in larger organizations (or in small lead businesses where nobody is consistent).

In my mind, the end result IS domain driven, but in the context of data warehouses. Users of the data are speaking a ubiquitous language – they HAVE to be, since the warehouse itself should be a well thought out (and summarized) representation of the business domain, in data form. Any differences from outside systems are translated into the ubiquitous language of the warehouse. At this point, the data can be used or moved into data marts for specialized use.

One final opinion: I see many databases that are NOT data warehouses or data marts, but claim to be because you can report off of them. They are often a hodgepodge of reporting data and techniques that would only allow an organization to be reactive. The data is not properly summarized or ubiquitous. By this I mean that I believe in three states that companies are in with respect to their data:

  1. Lagging – these organization don’t know what’s happening until the shit hits the fan. Their data is spread out through the organization and their execs can barely describe the state of the company beyond the financial data.
  2. Reactive – Some effort to organize the data is there. But the only way the company execs can make the data useful is to run reports, interpret the data subjectively and so on. They react after this analysis. Good, but not great.
  3. Enabled – Data is constantly mined by well thought out systems, dashboards are in place for execs, alerts are constantly sent to execs with updates. The data is truly working for the company, but in an automated fashion from all disparate systems in the org.

Nearly all the books at the top of this list are good resources: http://www.amazon.com/s/ref=nb_ss_gw/102-6071084-2797726?initialSearch=1&url=search-alias%3Daps&field-keywords=data+warehouse&Go.x=0&Go.y=0&Go=Go


Written by Nelson

August 18, 2007 at 2:48 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: