Handling Missing Keys (Network Steve Forum)

Handling Missing Keys

When loading a Fact Table, I typically use a Derived Column task to replace any null values in the data flow with a standard value representing "Unknown". In my case, any null value for a key becomes -1 and each of my Dimensions have a -1 record. I'm currently working on someone else's packages and don't really want to have to add derived columns to all of the data flows. Is there some reason I couldn't change the underlying tables to default to -1 assuming that the data type supports it? Is there some disadvantage of this other than being restricted to only one replacement value?

August 4th, 2011 5:43pm

Dave, If you run into Jeff Rush during the course of your work, ask him about this. He has done this and can explain in person.... I can't remember if Geoff ever got to see this working or not. The best solution is to create Inferred Dimension rows. 1. Create a stored proc - takes input of the Source ID, returns Dimension row ID as an output parameter. If SourceID is found in the Dimension table, return its current DimensionID. If it's not found, insert a new record somehow marked as Inferred (create a bit column), and return its DimensionID. Next time this runs, it will find that new row and not insert duplicate Dimension rows. (Trust me - had this happen and it was ugly.) 2. Use a Lookup transform to see if the row exists in your dimension. If it does, continue on. If not, call this proc, set the Output param to a variable, UNION ALL the results with your current data flow. 3. Create some job to go back and fetch that inferred member data at some point. You may also want a way to flag a record as completely gone from the source system so you don't keep on attempting to fetch its data, but hopefully this will be a rare enough occurrence that you should be fine. You can use -1 as some unknown value, but if you want accurate data later without trying to requery the source for your Fact tables, an Inferred Member row is going to give you much better results. We used -1 as Unknown for a while until we realized how much re-processing we'd have to really do to catch the correct values later. Putting them inside an Inferred Member row and updating the Dimension later proved much easier and less costly. Shoot me a PM if you want more details. I can see about mocking up a quick demo of this for you if you need it.

Free Windows Admin Tool Kit Click here and download it now

August 4th, 2011 6:16pm

Could have read this wrong. If it's truly unknown and always maps to a -1, go for it. If it's a value that doesn't happen to correspond to a current Dim row, use Inferred Members. We have a multi-tenant structure, so each customer gets their own "Unknown" value. It works, but is a bit trickier. If you do this, put a derived column at the very end of your data flow so you don't have to propagate it through the whole package. That's _not_ a fun task, especially pre-Denali. The only exception I might make to this is if there's some value in storing a NULL - for instance you want to represent exactly what's in the source system and it could store NULL, -1, empty string, or whatever. On a Fact table, it may not be as valuable to transform it to -1. As always, it will depend on what you're doing. :)

August 4th, 2011 6:22pm

This topic is archived. No further replies will be accepted.