danthedataman.com

March 23, 2009

More: export XML from Unidata (and import into SQL Server)

Filed under: ETL — Dan @ 8:46 pm
Tags: , , , , ,

Had a comment posted on my export XML from Unidata blog entry requesting more info on our ETL process, so this blog entry is an attempt to answer those questions.  These answers, of course, are based on the current state of our ETL process.

How many files are you exporting to SQL Server?
We are exporting 54 files to SQL Server.  Included in that are 22 files which are part of 5 file suites (5 fiscal years of BUD.WORK (BWK.ACYR) and GL Summary (GLS.ACYR) files; 3 financial aid years of TA, SA, SL, and LA files).  The remaining 32 files are not part of file suites.

What are the performance specs of this extract (how long does it take you to export STUDENT.ACAD.CRED?).
The XML extract for the 50+ files as mentioned above seems to take approximately 40 minutes, give or take about 5 minutes (actually last night looks like it took 34 1/2 minutes).
So the STUDENT.ACAD.CRED XML extract, for example, looks like it’s taking about 10 minutes, that’s for just over 300,000 records, and makes about a 300 MB xml file.  In comparison, PERSON, which is almost 200,000 records, takes about 4 1/2 minutes to make a 100 MB file.  STUDENT.TERMS, about 65,000 records, takes almost 10 minutes to generate a 12 MB file.  Number of fields and number of records plays a small part in the time to dump the data to XML, but the big performance difference-maker is computed-column complexity.  We just added STTR.CUM.GPA to our STUDENT.TERMS extract, and that’s why there’s such a drastic difference in performance between that file’s extract, and some of the other files.

How are you handling multi-valued fields and especially associations?  I would expect that the multi-value fields (LISTS) would break out as an “_LS” dependent (Child) table while the associations would also become a dependent table both using the ID of the Parent table and an Ordinal position.  Using your XML approach how did you make this happen?  Also, are you using SQL Server 2005 with DTS or SSIS?
Since XML is hierarchical, each Unidata file exports to a single XML file, with the multi-values part of that XML file’s data structure.  I am using SQL Server Integration Services (SSIS) to import the XML data into SQL Server 2005.  One of the Data Flow Sources in SSIS is an XML Source, which is what I used.  So this XML Source automatically separates the XML file into separate tables for each multi-value.  Specifically, it creates one table for the single-value data, and then it creates 2 tables for each multi-value dataset.  Perhaps the SSIS package could be built to output the data differently, but for the most part I just took the defaults.  So for example, based on the fields we have included in our extract from the PERSON file, here are some of the tables that are generated in SQL Server: PERSON, PERSON_PARENTS_MS, PERSON_PARENTS_MV, PERSON_PEOPLE_EMAIL-MS, and PERSON_PEOPLE_EMAIL-MV.  PERSON is generated by default, and contains any single value data, and the SSIS-generated Id which links to the multi-value tables.  The PERSON_PEOPLE_EMAIL tables were generated from the PEOPLE.EMAIL association, and include all the fields from that association that were included in our XML extract.  The PERSON_PARENTS tables come from the field PARENTS, which is a Q-pointer.  SSIS breaks each multi-value dataset into 2 tables: the …MV table contains the SSIS-generated ID from the PERSON table (i.e., PERSON_Id), and the SSIS-generated ID corresponding to the multi-value data table (i.e., PEOPLE_EMAIL-MV_Id).  The …MS table contains the data for the multi-value dataset, along with the corresponding key (i.e., PEOPLE_EMAIL-MV_Id) to link it back to it’s originating PERSON record.

Some more things I learned in the process of implementing this with SSIS in SQL Server 2005:

  • The XML mapping file on the Linux server that I referred to in my original blog entry on this is necessary to strip dots out of Unidata/Colleague field and file names, so that SSIS/SQL Server can process the data structures.  Also, in some cases, I renamed the XML record itself, so that the SSIS-generated ID name would not conflict with the pre-existing Unidata/Colleague key field name.  (The example in the original blog entry is a case of this: I changed the record name to COUNTIES_rec, because there was an existing field already called COUNTIES_ID.)
  • Something that I was not able to handle with the mapping file, but that needed to be taken care of before the XML could be processed by SSIS, was dots in multi-value field/association names.  I couldn’t get the mapping file to handle those, so ended up running the XML file, after it was generated, through the Linux sed command for each multi-value/association with a dot.  (Thanks to my supervisor Curt for pointing me in that direction.)
  • There were a few Unidata (Colleague) files which contained character(s) in the data which SSIS did not like.  I blogged about that, along with what I did about it, here.
  • One last thing I’ll mention is that XML Data Flow Source in SSIS gives you an opportunity to generate the .xsd file (XML schema definition) from the XML file that you provide (and I used that capability).  There were occasions, though, when I would later have to modify the data type in an .xsd from what SSIS chose.

Well thank you Greg for the questions, and I’m sorry for the delay getting answers out.  Hope this helps you and perhaps some others, too.

Advertisements

August 20, 2008

get back to blogging with fix for invalid XML

Filed under: SSIS — Dan @ 4:11 pm
Tags: , ,

Whew!  I can’t believe it’s been a month and a half since I last blogged.  And I hadn’t even posted any kind of data blog yet on WordPress, which is ironic given my blog url…… 

So here’s a little something to remedy both of those situations.

Trying to import an XML file with SQL Server 2005 Integration Services (SSIS), I encountered the following error: [XML Source [1]] Error: The component “XML Source” (1) was unable to process the XML data. ‘’, hexadecimal value 0x13, is an invalid character. Line 9634, position 40. 

There is no doubt a better, more deterministic way to solve this, and I may get to that eventually, but for now, here are some resources that enabled me to work around it and keep moving on my project.

Became aware of invalid XML characters, so the above error I received made more sense to me, from here: http://kbalertz.com/325694/Parsing-Containing-Invalid-Character-Raise-ArgumentException.aspx

http://www.asciitable.com/ — gives Hex & Octal ASCII values, which I needed, as well as Decimal value and character description.

Found my solution: http://www.unix.com/shell-programming-scripting/36352-replace-junk-chars-sed.html.  Since SSIS did not even seem able to handle the invalid XML, I had to deal with it before it got to SSIS.  Next logical place was on the Linux server where the XML files were being generated, where I did have some similar logic to resolve some similar issues.  I had already, using iconv, translated the file to UTF-8, but that obviously wasn’t helping.  Was using sed to replace dots with underscores, but did not find any documentation readily which talked about using ASCII code with sed.  The link at the top of this paragraph made me aware of tr, which I was able to use successfully to remove the invalid XML character.  Here is the command I used: tr -d ‘\23’ < INPUTFILE.xml > OUTPUTFILE.xml.  tr stands for ‘translate,’ it appears; the -d option deletes the indicated string from input.  I used http://asciitable.com (mentioned above) to convert the Hex 13 I had into the Octal 23 required for the tr command.

Well, that’s it for now.  Hope this is helpful for someone.

-Dan

Create a free website or blog at WordPress.com.