Help with correctly reading CSV nonprinting control characters: VS 2008 got it right, but VS 2010 trips over extra random ones.
Last week I upgraded to SQL 2012 and I'm working through upgrading some SSIS packages created in VS 2008. There is an existing 2008 package which uploads a new CSV file into SQL Server every night. This VS2008 package works fine but when I convert it to the new VS2010/SQL Server 2012 format it blows up as follows: The original CSV file uses no text delimiters, uses Char 9 (Tab) for a field delimiter and uses Char 13 (Carr. Return) for its end of record marker. The CSV file has about 40 fields and about 50k records/rows/lines. Many of the fields are text fields and they frequently contain all kinds of nonprinting characters including the Tab and CR used to delimit the fields and records. Somehow the old SSIS package was always able to correctly identify the field and row boundries in the file; it was never fooled. When the CSV files are imported into Excel 2010, it also can correctly identify the field and row boundries. I don't know how they do it without being tripped up by the extra Tabs and CR found in some of the text fields, but they do it right. On the other hand the new 2012 SSIS package (produced by the SSIS Upgrade Wiz) never works correctly: it always trips up when it runs into an extra unescaped Tab or CR in the middle of a text field, treating it as though it were a field or row delimiter. QUESTIONS: how do Excel and SSIS 2008 avoid misinterpreting the random Tab and CR characters and what can I do to the new SSIS 2012 package to help it stop tripping on those same random Tab and CR characters?HomeCookN
July 2nd, 2012 10:05pm

I think the answer is in the fact that to establish its metadata, by default, SSIS 2008 samples only a subset of rows and Excel checks only 8 top rows, so if a file happened to not to have these CRs or Char 9s there at the package design stage then the rest of file(s) was treated as such, ignoring the aforementioned chars. In SSIS 2012 you must use a script to replace the "... extra unescaped Tab or CR in the middle of a text field". I posted this technique in my blog: http://geekswithblogs.net/Compudicted/archive/2011/09/19/ssis-how-to-remove-occasional-quotes-and-replace-the-column.aspxArthur My Blog
Free Windows Admin Tool Kit Click here and download it now
July 2nd, 2012 11:12pm

I think the answer is in the fact that to establish its metadata, by default, SSIS 2008 samples only a subset of rows and Excel checks only 8 top rows, so if a file happened to not to have these CRs or Char 9s there at the package design stage then the rest of file(s) was treated as such, ignoring the aforementioned chars. In SSIS 2012 you must use a script to replace the "... extra unescaped Tab or CR in the middle of a text field". I posted this technique in my blog: http://geekswithblogs.net/Compudicted/archive/2011/09/19/ssis-how-to-remove-occasional-quotes-and-replace-the-column.aspxArthur My Blog
July 2nd, 2012 11:18pm

Thanks a lot for your suggestion, Arthur, and I'm sorry it's taken me so long to reply. I have taken a look at your blog post showing your recommended technique. Unfortunately, my case is not quite the same situation. In my case, I do not want to remove or replace ALL the CR characters in the file. I want to keep all the CRs that legitimately indicate the end of a record/row. I want to first identify, and then lose or escape the CR characters that occur WITHIN the body of a text field so for me the first challenge is to determine if each CR is 'legitimate,' i.e., actually at the end of a complete row of date. If it's a 'bad' CR, i.e., in the middle of a field, then I need to escape it so that it no longer breaks the partial row. I'm working on how to do this via Reg Edit. HomeCookN
Free Windows Admin Tool Kit Click here and download it now
July 9th, 2012 12:10pm

You just replace the column delimiter, keeping the original "control" chars, so I trust my approach may actually work for your scenario.Arthur My Blog
July 9th, 2012 12:13pm

Typical Row would be like this: Col 1 text: xyz TAB Col2 text: abc TAB Col 3 text: mno TAB Col 4 text: pqr CR and it works fine because there are no CR characters in any of the columns. But an atypical Row with an imbedded CR (within column #2s text in this example): Col 1 text: xyz TAB Col2 text: abCRc TAB Col 3 text: mno TAB Col 4 text: pqr CR will blow up because the imbedded CR will turn it into 2 separate rows which will look like this: Col 1 text: xyz TAB Col2 text: abCR c TAB Col 3 text: mno TAB Col 4 text: pqr CR. I dont believe I need to do anything with the column delimiters as you suggested, Author: I only need to test for and identify extraneous CR characters which occur in the middle of a row. Anyway, I really appreciate your time and help and I am going to mark your reply as Answered. Thank you.HomeCookN
Free Windows Admin Tool Kit Click here and download it now
July 9th, 2012 1:30pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics