SSIS: Inserting unknown members from fact table
Looking up surrogate keys in a dimension table and adding them to your data flow is easy when every key in your fact table has a match in the dimension table. However, I am puzzled by how to manage the data flow when the lookup finds no match for a specific key in the fact table AND I then want to insert this unknown key as an unknown/inferred member in the dimension table. The problem is further complicated by the fact that once I have inserted the unknown member in the dimension table and it has been assigned a surrogate key there, I want to add this surrogate key to my fact table - just as if there had been a match in the lookup in the first place.

I realize that I will most likely have to start by somehow managing the error output from the lookup component and continue from there. But how?

Any help is greatly appreciated.
August 18th, 2005 2:51pm
Have you tried the Slowly Changing Dimension Transformation? It has native support for SCDs, including inferred members, so it should be able to help. It is very fast to use compared to manually building data flows to do the same thing, and you can always tweak the resulting data flow as well.
August 18th, 2005 3:25pm
The way I see it, the SCD wizard is for loading dimensions only. My problem arises when I am loading fact tables and I want to assign surrogate keys for each of the business keys in the fact table. I assign the surrogate keys by doing lookup transformations. However, when I encounter an "unknown" business key in the fact table (i.e. there is no match in the dimension table), I want to:

1. Insert the unknown business key in the dimension table (and thereby automatically assign a surrogate key to it, because the surrogate key field in my dimension table is an identity column).
2. Add the newly generated surrogate key to my data flow.

How would the SCD wizard help me with this?
August 18th, 2005 3:38pm
This is the way I do it.

1. When I create my dimensions, I have a SQL Task that adds an "unknown" entry to my dimension - usually with a surrogate key value of 1. I then load the rest of the records using an SCD Transform.
2. In the package I use to look up surrogate keys, I create a variable - such as "UnknownDate" - and assign it a value of 1 (or whatever the key value is; I use a variable so I can just change it if my unknown key changes in the dim).
3. I do a lookup for the business key and return the surrogate key. If it succeeds, I go to a Union All; if it fails, I point it to a Row Count.
4. I put a variable in the Row Count, something like rcNoQaDate, so I can log the "no match" records.
5. I then map the Row Count to a Derived Column component, create the surrogate key field in the data flow, and assign it the user variable "UnknownDate".
6. I map the Derived Column to my Union All, which matches up the surrogate key I created in the Derived Column with the one found in the lookup transform.

Hope this makes sense. In some of my packages I also log the business keys for the rows that don't match so I can research them for debugging. I just add a Multicast after the Derived Column and go to an OLE DB Destination.

-Evan Black
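A set-based sketch of the same fallback logic may make the idea clearer (table and column names here are hypothetical; the actual packages do this in the data flow, not in T-SQL):

```sql
-- Equivalent of the lookup + derived-column fallback in one statement:
-- rows with no dimension match get the pre-seeded "unknown" key (1).
SELECT f.OrderID,
       f.Amount,
       COALESCE(d.DateKey, 1) AS DateKey   -- 1 = the "unknown" member
FROM   StagingFact AS f
LEFT JOIN DimDate AS d
       ON d.DateBusinessKey = f.OrderDate;
```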
August 18th, 2005 8:06pm
Hmmm... Interesting. However - once again - I must stress my need to insert the unknown business keys from the fact records as new/inferred members in the dimension table and THEN use the assigned surrogate keys in the fact table. This cannot be done with the described approach, since all unknown business keys in the fact table will be mapped to a single "Unknown" member in the dimension table. More often than not (in my experience) this just does not cut it! The "unknown" business keys in the fact table often have a meaning for the business user, which is why each HAS to be exposed as a separate entity.

Any other ideas? Thanks! :-)
August 18th, 2005 9:12pm
Hi

The approach suggested by Evan Black is the one I generally follow. I insert a "dummy" record, with a surrogate key of 1, in all dimensions. Whenever a fact lookup fails, I associate the fact with this "dummy" member. If you have an append/modify fact, then the fact records linked to the dummy member(s) will hopefully get cured the next time you run the refresh, because subsequent dimension refreshes will pick up those new members and the same fact lookup will then find an existing dimension member.

You might want to read the article by Ralph Kimball on early-arriving facts. It explains three possible approaches in all: http://www.rkimball.com/html/designtipsPDF/KimballDT57EarlyArriving.pdf

Regards
Milinda_V
August 18th, 2005 11:45pm
Hello again,

Actually, what I want to do is described by Kimball as option 3 in the mentioned paper: I want to insert the unknown business key from the fact record into the dimension table with "dummy" attributes. You see, I cannot be certain that I will ever get more information on the unknown business key in subsequent dimension loads. Therefore, simply exposing it in the dimension table with the dummy attributes is enough for the business user.

I would still like an example of how to solve my problem in SSIS. After all, the way I understand inferred member support in the SCD, it actually assumes that unknown business keys from the fact table are being inserted into the dimension table and marked with a boolean value (inferred/not inferred). This way, the inferred members can be recognized and updated in subsequent dimension loads. For this to work, some method for inserting unknown business keys from the fact table and assigning surrogate keys to them in the data flow has to exist...?!
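To illustrate the boolean marker idea, a dimension supporting inferred members might look roughly like this (table and column names are my own assumptions, not from any of the referenced material):

```sql
-- Sketch of a dimension layout supporting inferred members: the flag lets a
-- later dimension load recognize and update the placeholder rows.
CREATE TABLE DimCustomer (
    CustomerKey    int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerBusKey nvarchar(20) NOT NULL,          -- business key from the fact
    CustomerName   nvarchar(50) NULL,              -- dummy until real data arrives
    IsInferred     bit NOT NULL DEFAULT 1
);

-- A subsequent dimension load overwrites inferred rows with real attributes:
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.IsInferred   = 0
FROM   DimCustomer AS d
JOIN   StagingCustomer AS s ON s.CustomerBusKey = d.CustomerBusKey
WHERE  d.IsInferred = 1;
```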
August 19th, 2005 9:03am
My approach (I'll write a blog post about this when I find some time) is just a little bit different... I ALWAYS give the "unknown" member the ID 0, and I define 0 as the default value for each foreign key in every fact table. That way, whenever you don't find a matching dimension record - or simply "forget" to set the key - you still have a "working" model...
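A minimal sketch of this convention (table and column names are hypothetical):

```sql
-- Seed the unknown member with key 0 explicitly, then let every fact
-- foreign key default to 0 when no match is found or the key is not set.
SET IDENTITY_INSERT DimProduct ON;
INSERT INTO DimProduct (ProductKey, ProductName) VALUES (0, N'Unknown');
SET IDENTITY_INSERT DimProduct OFF;

CREATE TABLE FactSales (
    ProductKey int   NOT NULL DEFAULT 0,  -- falls back to the unknown member
    Amount     money NOT NULL
);
```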
August 19th, 2005 12:06pm
OK... There does not seem to be a solution to my problem - at least not an easy one! Marco Russo has actually been blogging a bit about something similar, and he has some good points. His conclusion, however, is that doing what I want to do in SSIS is quite cumbersome. See his blog: http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx

I really like SSIS, but since it is a tool that is very much aimed at BI development, I wonder why there has not been more focus on solving this particular problem!? Are my needs that atypical?
August 19th, 2005 4:00pm
Michael, have a look at my blog: http://sqljunkies.com/WebLog/tpagel/archive/2005/08/19/16495.aspx Perhaps this is the solution you're looking for...
August 19th, 2005 7:19pm
Thomas... Thank you very much for your contribution. It really looks like you have found a working solution, which - in itself - is great. However, I can only begin to imagine the amount of work involved in repeating your example for every key in the fact table - not to mention every key in EVERY fact table. Anyway - we have a way to do it. Great!
August 20th, 2005 12:47am
Michael,

I'm still wondering why you need this so much... I've been in the BI business for some years and have done quite a number of projects... This scenario wasn't new to me, but it's still somewhat "special". The usual case is that you have your dimension and match it to your facts. It's not very common (at least for me) to build up dimensions from the facts... At my customers, dimensions come from other systems (i.e. an ERP system). When dimension records are missing, that indicates a problem on the source system's side, e.g. customers were deleted who still had orders. Then it's not a good idea to automatically "recreate" the customer; instead, push the record to an "error" or "unknown" bucket to be inspected by a business administrator. These problems should be fixed on the source system side, not in the warehouse. But that's my way of handling it; your scenario might be different... I agree that building up ETL processes like the one I designed in my blog for each key of each fact table would be much too much work. However, I can't offer you a simpler solution...
August 20th, 2005 12:08pm
Thomas,

I do agree with you, and you are right: poor data integrity in the source system is the only reason (combined with early-arriving facts) why I need this approach. However, I have a case where I have to do a POC based on extracts from transaction tables only in the source system. The transaction tables share dimensions, and as such I need to create "common" dimension tables instead of just building the dimensions in AS right on top of the transaction tables.

Besides, I still assume that when Microsoft designed the support for inferred members in the SCD wizard, they must have had this scenario in mind.

...and another thing: simply placing all unknown business keys in one "unknown" bucket in the dimension table will make it difficult/impossible for the business users to correct the poor integrity in the source systems, since without a key to identify them they would not know which products/customers/etc. they had to correct.
August 21st, 2005 10:13am
I'm looking for a solution to import data from a flat file into a normalized data model. To explain it a little more simply, think about the following: the data source is a CSV file with FirstName, LastName and Category. Sample data could be:

Dirk; Bauer; sailing
Peter; Bauer; fishing
Marc; Bauer; reading

In my data model I have defined the two tables "Person" and "Category":

Table "Person"
----------------
[PersonID] [int] IDENTITY(1,1) NOT NULL
[CategoryID] [int] NOT NULL
[FirstName] [nvarchar](50)
[LastName] [nvarchar](50)

Table "Category"
----------------
[CategoryID] [int] IDENTITY(1,1) NOT NULL
[CategoryName] [nvarchar](50)

Now I would like to read my first row from the source and look up the CategoryID value for "sailing". As my data tables are empty right now, the lookup is not able to find a value for "sailing". So I would like to insert a new row in the table "Category" for the value "sailing" and receive the new CategoryID, so that I can insert my values into the table "Person" INCLUDING the new CategoryID.

I think this is a normal way of reading data from a source and performing some lookups. In my "real world" scenario I have to look up about 20 foreign keys before I'm able to insert the row read from the flat file source. I really can't believe that this is a "special" case, and I also can't believe that there is no easy and simple way to solve this with SSIS. OK, the solution from Thomas works, but it is a very complex solution for this small problem. So, any help would be appreciated...

Thanks,
Dirk
August 29th, 2005 2:56pm
Dirk,

You can do the import in two steps (data flows); then it's not so hard to do... The downside is that you have to read the data twice...

The first data flow imports the categories: you take your source, aggregate on the category, and save it in the Category table. The second data flow reads the persons, assigns the CategoryID by a lookup on the table you filled in the first step, and saves the rows in the Person table.

For me it's a quite uncommon scenario to have "denormalized" fact tables, or fact tables that carry dimension data, too. It's not easy to fill the dimension tables and the fact table from one source in one data flow when you also want to assign surrogate keys in it...
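Using the Person/Category schema from the earlier post, the two steps correspond roughly to the following T-SQL (the StagingPerson table, holding the raw CSV rows, is an assumption of this sketch):

```sql
-- Step 1: load the distinct categories (the "aggregate" in the first data flow).
INSERT INTO Category (CategoryName)
SELECT DISTINCT src.Category
FROM   StagingPerson AS src
WHERE  NOT EXISTS (SELECT 1 FROM Category AS c
                   WHERE  c.CategoryName = src.Category);

-- Step 2: load the persons, picking up the surrogate key by lookup.
INSERT INTO Person (CategoryID, FirstName, LastName)
SELECT c.CategoryID, src.FirstName, src.LastName
FROM   StagingPerson AS src
JOIN   Category AS c ON c.CategoryName = src.Category;
```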
August 30th, 2005 9:36am
There is a very elegant method proposed by Joy Mundy and Warren Thornthwaite in "The Microsoft Data Warehouse Toolkit", although I have yet to get it to work 100%. Basically, if the Lookup fails:

(1) Redirect the rows to an OLE DB Command which inserts only the natural (or business) key into the dimension table (in whatever inferred member format you have decided upon, i.e. everything but the natural key and surrogate key is populated with defaults, or an inferred member flag is populated).
(2) Join a second Lookup to the outflow of the OLE DB Command.
(3) Finally, Union All the two Lookups back together in the data flow.

There are some setup tricks: on both Lookups, choose Edit > Advanced tab > Enable Memory Restriction. This keeps the SSIS package from caching the Lookup values. However, when I use this approach, I get multiple new entries in the dimension table if the missing dimension key occurs in several records. If anyone knows how to fix this multiple-entry issue, that would be great.
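One way to sidestep the duplicate-entry problem, regardless of caching, is to make the OLE DB Command's statement idempotent. A sketch (table and column names are hypothetical, not from the book):

```sql
-- Insert the inferred member only if the business key is still missing,
-- so repeated executions for the same key cannot create duplicates.
IF NOT EXISTS (SELECT 1 FROM DimProduct
               WHERE  ProductBusKey = @ProductBusKey)
    INSERT INTO DimProduct (ProductBusKey, IsInferred)
    VALUES (@ProductBusKey, 1);
```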
April 20th, 2006 10:31pm
Have you disabled caching entirely? Clear "Enable Caching" under Enable Memory Restriction. Getting a second dimension entry for the same key means the lookup is not detecting the rows that the command previously inserted during that session, so it must still be caching in the lookup.
April 21st, 2006 2:18pm
The method you suggest will work, even though it is quite inefficient, since every row passed into the lookup component results in a round trip to the database. A more efficient method that I have found useful is the one used for inferred members in Project REAL (http://www.microsoft.com/technet/prodtechnol/sql/2005/realetldp.mspx). This method involves a script component in which a hash table is created for the inferred members. A stored procedure in the database handles the insertion of the inferred member into the dimension table, and this is only done once for each unknown business key. Every subsequent encounter of an already inserted unknown business key is handled using the values in the hash table. This method allows for full caching in the initial lookup component.

The method from Project REAL, efficient as it is, is quite time-consuming and cumbersome to implement. As has been discussed before in this thread, I think we still need a component in the data flow that can handle both the lookup and the insertion of unknown records - as well as return the relevant surrogate keys for the inferred members. Thomas Pagel wrote a wish list for a "dynamic lookup transform", which would do some of this (http://sqljunkies.com/WebLog/tpagel/archive/2006/01/05/17767.aspx).
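The database side of this pattern might look roughly like the following (procedure, table and column names here are my own assumptions, not taken from the Project REAL code; the script component would call this once per previously unseen business key and cache the returned key in its hash table):

```sql
-- Insert the inferred member if needed and always return its surrogate key.
CREATE PROCEDURE dbo.GetInferredMemberKey
    @BusKey nvarchar(20)
AS
BEGIN
    SET NOCOUNT ON;

    IF NOT EXISTS (SELECT 1 FROM DimCustomer
                   WHERE  CustomerBusKey = @BusKey)
        INSERT INTO DimCustomer (CustomerBusKey, IsInferred)
        VALUES (@BusKey, 1);

    SELECT CustomerKey
    FROM   DimCustomer
    WHERE  CustomerBusKey = @BusKey;
END;
```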
April 21st, 2006 7:27pm
I did notice that the processing was pretty slow, I guess that is why. I will look up the solution you offered. Thanks for the tip! This is definitely something Microsoft needs to address.
April 21st, 2006 10:35pm
Both of you guys are correct. However, you are missing a key loop-back design in the warehouse: you should create an error-log table to record each early-arriving fact record that has a default unknown value assigned to it. This way, you can create exception reports for the business users to help fix the source data, because they will be able to tell which dimension key is missing. The error-log table would have the same structure as the fact table (in the staging area, of course) plus some audit columns like DateFound, ErrorType (Early Arriving Fact), etc.

Thanks
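A possible shape for such a table (the fact columns shown are placeholders; only DateFound and ErrorType come from the post above):

```sql
-- Error-log table in the staging area: same grain as the fact table,
-- plus audit columns for the exception reports.
CREATE TABLE StagingFactError (
    -- ...the same columns as the fact table, e.g.:
    ProductBusKey nvarchar(20) NOT NULL,
    Amount        money        NOT NULL,
    -- audit columns:
    DateFound     datetime     NOT NULL DEFAULT GETDATE(),
    ErrorType     varchar(50)  NOT NULL   -- e.g. 'Early Arriving Fact'
);
```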
May 6th, 2006 10:28am
Sorry for the bump from the stone ages... I was just fighting with inferred members and found a much more elegant way to accomplish this in SSIS 2008. This post from the SQL CAT team should be the solution anyone using 2008 is looking for: http://blogs.msdn.com/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx

In fact, I would suggest that someone update the Project REAL documentation to point to this solution for SSIS 2008. It's not fully integrated, but it is pretty close if you ask me. Hope this helps folks.
October 2nd, 2009 3:17am
Hello all,

On a topic similar to this one, I have found a blog post describing "Creating Inferred Dimension Members with SSIS": http://www.sqlservercentral.com/articles/Data+Warehouse/71233/ Hope it will help us understand more about inferred dimension members.

Thanks,
Anil Maharjan
December 29th, 2010 12:08am