Had a comment posted on my export XML from Unidata blog entry requesting more info on our ETL process, so this blog entry is an attempt to answer those questions. These answers, of course, are based on the current state of our ETL process.
How many files are you exporting to SQL Server?
We are exporting 54 files to SQL Server. Included in that are 22 files which are part of 5 file suites (5 fiscal years of BUD.WORK (BWK.ACYR) and GL Summary (GLS.ACYR) files; 3 financial aid years of TA, SA, SL, and LA files). The remaining 32 files are not part of file suites.
What are the performance specs of this extract (how long does it take you to export STUDENT.ACAD.CRED?).
The XML extract for the 50+ files as mentioned above seems to take approximately 40 minutes, give or take about 5 minutes (actually last night looks like it took 34 1/2 minutes).
So the STUDENT.ACAD.CRED XML extract, for example, looks like it’s taking about 10 minutes, that’s for just over 300,000 records, and makes about a 300 MB xml file. In comparison, PERSON, which is almost 200,000 records, takes about 4 1/2 minutes to make a 100 MB file. STUDENT.TERMS, about 65,000 records, takes almost 10 minutes to generate a 12 MB file. Number of fields and number of records plays a small part in the time to dump the data to XML, but the big performance difference-maker is computed-column complexity. We just added STTR.CUM.GPA to our STUDENT.TERMS extract, and that’s why there’s such a drastic difference in performance between that file’s extract, and some of the other files.
How are you handling multi-valued fields and especially associations? I would expect that the multi-value fields (LISTS) would break out as an “_LS” dependent (Child) table while the associations would also become a dependent table both using the ID of the Parent table and an Ordinal position. Using your XML approach how did you make this happen? Also, are you using SQL Server 2005 with DTS or SSIS?
Since XML is hierarchical, each Unidata file exports to a single XML file, with the multi-values part of that XML file’s data structure. I am using SQL Server Integration Services (SSIS) to import the XML data into SQL Server 2005. One of the Data Flow Sources in SSIS is an XML Source, which is what I used. So this XML Source automatically separates the XML file into separate tables for each multi-value. Specifically, it creates one table for the single-value data, and then it creates 2 tables for each multi-value dataset. Perhaps the SSIS package could be built to output the data differently, but for the most part I just took the defaults. So for example, based on the fields we have included in our extract from the PERSON file, here are some of the tables that are generated in SQL Server: PERSON, PERSON_PARENTS_MS, PERSON_PARENTS_MV, PERSON_PEOPLE_EMAIL-MS, and PERSON_PEOPLE_EMAIL-MV. PERSON is generated by default, and contains any single value data, and the SSIS-generated Id which links to the multi-value tables. The PERSON_PEOPLE_EMAIL tables were generated from the PEOPLE.EMAIL association, and include all the fields from that association that were included in our XML extract. The PERSON_PARENTS tables come from the field PARENTS, which is a Q-pointer. SSIS breaks each multi-value dataset into 2 tables: the …MV table contains the SSIS-generated ID from the PERSON table (i.e., PERSON_Id), and the SSIS-generated ID corresponding to the multi-value data table (i.e., PEOPLE_EMAIL-MV_Id). The …MS table contains the data for the multi-value dataset, along with the corresponding key (i.e., PEOPLE_EMAIL-MV_Id) to link it back to it’s originating PERSON record.
Some more things I learned in the process of implementing this with SSIS in SQL Server 2005:
- The XML mapping file on the Linux server that I referred to in my original blog entry on this is necessary to strip dots out of Unidata/Colleague field and file names, so that SSIS/SQL Server can process the data structures. Also, in some cases, I renamed the XML record itself, so that the SSIS-generated ID name would not conflict with the pre-existing Unidata/Colleague key field name. (The example in the original blog entry is a case of this: I changed the record name to COUNTIES_rec, because there was an existing field already called COUNTIES_ID.)
- Something that I was not able to handle with the mapping file, but that needed to be taken care of before the XML could be processed by SSIS, was dots in multi-value field/association names. I couldn’t get the mapping file to handle those, so ended up running the XML file, after it was generated, through the Linux sed command for each multi-value/association with a dot. (Thanks to my supervisor Curt for pointing me in that direction.)
- There were a few Unidata (Colleague) files which contained character(s) in the data which SSIS did not like. I blogged about that, along with what I did about it, here.
- One last thing I’ll mention is that XML Data Flow Source in SSIS gives you an opportunity to generate the .xsd file (XML schema definition) from the XML file that you provide (and I used that capability). There were occasions, though, when I would later have to modify the data type in an .xsd from what SSIS chose.
Well thank you Greg for the questions, and I’m sorry for the delay getting answers out. Hope this helps you and perhaps some others, too.
Dan,
This is very interesting. If you get a chance could you please go over the workflow that you use to acutally build your table in SQL Server using SSIS. I am a little uncertain as to where one should start since while there is a XML Data Flow Source there is no XML Connection Manager. Are you using the XML Task Control Flow? If so how?
Greg
Comment by Greg Coelho — June 23, 2009 @ 3:30 pm |