danthedataman.com

March 23, 2009

More: export XML from Unidata (and import into SQL Server)

Filed under: ETL — Dan @ 8:46 pm
Tags: , , , , ,

Had a comment posted on my export XML from Unidata blog entry requesting more info on our ETL process, so this blog entry is an attempt to answer those questions.  These answers, of course, are based on the current state of our ETL process.

How many files are you exporting to SQL Server?
We are exporting 54 files to SQL Server.  Included in that are 22 files which are part of 5 file suites (5 fiscal years of BUD.WORK (BWK.ACYR) and GL Summary (GLS.ACYR) files; 3 financial aid years of TA, SA, SL, and LA files).  The remaining 32 files are not part of file suites.

What are the performance specs of this extract (how long does it take you to export STUDENT.ACAD.CRED?).
The XML extract for the 50+ files as mentioned above seems to take approximately 40 minutes, give or take about 5 minutes (actually last night looks like it took 34 1/2 minutes).
So the STUDENT.ACAD.CRED XML extract, for example, looks like it’s taking about 10 minutes, that’s for just over 300,000 records, and makes about a 300 MB xml file.  In comparison, PERSON, which is almost 200,000 records, takes about 4 1/2 minutes to make a 100 MB file.  STUDENT.TERMS, about 65,000 records, takes almost 10 minutes to generate a 12 MB file.  Number of fields and number of records plays a small part in the time to dump the data to XML, but the big performance difference-maker is computed-column complexity.  We just added STTR.CUM.GPA to our STUDENT.TERMS extract, and that’s why there’s such a drastic difference in performance between that file’s extract, and some of the other files.

How are you handling multi-valued fields and especially associations?  I would expect that the multi-value fields (LISTS) would break out as an “_LS” dependent (Child) table while the associations would also become a dependent table both using the ID of the Parent table and an Ordinal position.  Using your XML approach how did you make this happen?  Also, are you using SQL Server 2005 with DTS or SSIS?
Since XML is hierarchical, each Unidata file exports to a single XML file, with the multi-values part of that XML file’s data structure.  I am using SQL Server Integration Services (SSIS) to import the XML data into SQL Server 2005.  One of the Data Flow Sources in SSIS is an XML Source, which is what I used.  So this XML Source automatically separates the XML file into separate tables for each multi-value.  Specifically, it creates one table for the single-value data, and then it creates 2 tables for each multi-value dataset.  Perhaps the SSIS package could be built to output the data differently, but for the most part I just took the defaults.  So for example, based on the fields we have included in our extract from the PERSON file, here are some of the tables that are generated in SQL Server: PERSON, PERSON_PARENTS_MS, PERSON_PARENTS_MV, PERSON_PEOPLE_EMAIL-MS, and PERSON_PEOPLE_EMAIL-MV.  PERSON is generated by default, and contains any single value data, and the SSIS-generated Id which links to the multi-value tables.  The PERSON_PEOPLE_EMAIL tables were generated from the PEOPLE.EMAIL association, and include all the fields from that association that were included in our XML extract.  The PERSON_PARENTS tables come from the field PARENTS, which is a Q-pointer.  SSIS breaks each multi-value dataset into 2 tables: the …MV table contains the SSIS-generated ID from the PERSON table (i.e., PERSON_Id), and the SSIS-generated ID corresponding to the multi-value data table (i.e., PEOPLE_EMAIL-MV_Id).  The …MS table contains the data for the multi-value dataset, along with the corresponding key (i.e., PEOPLE_EMAIL-MV_Id) to link it back to it’s originating PERSON record.

Some more things I learned in the process of implementing this with SSIS in SQL Server 2005:

  • The XML mapping file on the Linux server that I referred to in my original blog entry on this is necessary to strip dots out of Unidata/Colleague field and file names, so that SSIS/SQL Server can process the data structures.  Also, in some cases, I renamed the XML record itself, so that the SSIS-generated ID name would not conflict with the pre-existing Unidata/Colleague key field name.  (The example in the original blog entry is a case of this: I changed the record name to COUNTIES_rec, because there was an existing field already called COUNTIES_ID.)
  • Something that I was not able to handle with the mapping file, but that needed to be taken care of before the XML could be processed by SSIS, was dots in multi-value field/association names.  I couldn’t get the mapping file to handle those, so ended up running the XML file, after it was generated, through the Linux sed command for each multi-value/association with a dot.  (Thanks to my supervisor Curt for pointing me in that direction.)
  • There were a few Unidata (Colleague) files which contained character(s) in the data which SSIS did not like.  I blogged about that, along with what I did about it, here.
  • One last thing I’ll mention is that XML Data Flow Source in SSIS gives you an opportunity to generate the .xsd file (XML schema definition) from the XML file that you provide (and I used that capability).  There were occasions, though, when I would later have to modify the data type in an .xsd from what SSIS chose.

Well thank you Greg for the questions, and I’m sorry for the delay getting answers out.  Hope this helps you and perhaps some others, too.

November 13, 2008

Clock running slow in Linux guest on VMWare Server: one solution

Filed under: VMWare — Dan @ 3:19 pm
Tags: , , ,

Correct clock time on this Linux VM (Virtual Machine) is important for us, as the purpose of this Linux machine is to house the Extract part of our primary ETL (Extract/Transform/Load) process, so if the clock is not right, the extracted XML files will not be ready for the next ETL step.  Learned today to pay close attention when attempting to re-install/upgrade VMWare Tools, which is the core of our clock sync solution, as an incorrect re-install attempt can disable it.  With that in mind, here’s the process I went through two years ago, when initially setting up this Linux VM as part of our ETL, and part of which I ran through again today to correct the failed VMWare Tools install.

  1. Checked for updates to VMWare Server
  2. Turned off virtual machine and backed up by copying virtual machine files
  3. Added kernel parameters to boot loader: to all lines in /boot/grub/grub.conf that started with “kernel”, added “clock=pit nosmp noapic nolapic” (per http://kb.vmware.com/KanisaPlatform/Publishing/329/1420_f.SAL_Public.html)
  4. Reset clock in Linux guest
  5. Installed & configured VMWare Tools in Linux guest (from http://www.vmware.com/support/ws4/doc/new_guest_tools_ws.html)

VMware Tools for Linux Guests

  1. Power on the virtual machine.

2.      After the guest operating system has started, prepare your virtual machine to install VMware Tools.

Choose File > Install VMware Tools.

The remaining steps take place inside the virtual machine.

3.      Be sure the guest operating system is running in text mode. You cannot install VMware Tools while X is running.

4.      As root (su –), mount the VMware Tools virtual CD-ROM image, change to a working directory (for example, /tmp), uncompress the installer, then unmount the CD-ROM image.

Note: You do not use an actual CD-ROM to install VMware Tools, nor do you need to download the CD-ROM image or burn a physical CD-ROM of this image file. The VMware Workstation software contains an ISO image that looks like a CD-ROM to your guest operating system. This image contains all the files needed to install VMware Tools in your guest operating system.

Note: Some Linux distributions use different device names or organize the /dev directory differently. If your CD-ROM drive is not /dev/cdrom, modify the following commands to reflect the conventions used by your distribution.

mount /dev/cdrom /mnt
cd /tmp
tar zxf /mnt/vmware-linux-tools.tar.gz
umount /mnt

5.      Run the VMware Tools installer.

cd vmware-tools-distrib
./vmware-install.pl

6.      Log out of the root account.

exit

7.      Start X and your graphical environment.

8.      In an X terminal, launch the VMware Tools background application.

vmware-toolbox &

Note: You may run VMware Tools as root or as a normal user. To shrink virtual disks, you must run VMware Tools as root (su –).

    6.  After launching VMWare Tools application (vmware-toolbox &), on the Options tab, check the Time synchronization … box.

Had to remind myself how to start Linux in text mode, and found that here: http://www.webmasterforums.com/software-distro-specific/1760-starting-linux-text-mode-rh9.html: “edit /etc/inittab….the line that says id:5:initdefault, make it 3 instead
reboot…”

 

Also noticed that the file vmware-linux-tools.tar.gz from the ISO image actually had a different name, too, in this case.

 

Our Specs:

Host machine: Windows Server 2003

VMWare Server 1.0.6

Linux guest: RHEL 4

 

September 17, 2008

Data Term: Snapshot date

Filed under: definition — Dan @ 11:32 am
Tags: , ,

Ok, so you need a ‘snapshot’ of your data or database.  You’re institution has decided on a Snapshot Date.  Sooo….do you just take a copy of your data any time during the business day of the snapshot date, or when?  This, of course, depends on your definition of Snapshot, which I’m not going to focus on, but will just say it refers to a ‘set of data’ (i.e., transactional database, reporting database, data warehouse, spreadsheet, ….. or whatever).

So here’s the definition: Snapshot Date – data as-of end of business day.  It would be safe to say, technically, the Snapshot could be taken anytime between end of business day on the Snapshot Date, and beginning of business day on the next business day.

So practically speaking, we have a weeknightly ETL process that ‘refreshes’ our Operational Reporting database (OpsDB) from our transactional system.  To take a snapshot of OpsDB, any time during the business day following the Snapshot Date a copy can be made of it and be considered a valid snapshot-date Snapshot (of OpsDB).

August 28, 2008

export XML from Unidata

Filed under: Unidata — Dan @ 8:53 am
Tags: , , ,

Thought it might be worthwhile to post here also something I recently posted in a Datatel forum, about how to export XML from Unidata.  This is part of our institutions ETL (Extract/Transform/Load) process, in which we transfer data from our transactional system (Datatel’s Colleague) to an operational reporting database (in Microsoft SQL Server 2005).

Here are the basics on how to export XML from Unidata (7.1):

1. Change to your production Colleague folder, i.e., cd /datatel/coll18/live/apphome

2. Create the _XML_ folder inside your production Colleague folder, i.e., /usr/ud71/bin/udt CREATE.FILE DIR _XML_

3. Create XML map, if desired (i.e., if you want to replace dots with underscores in Colleague field names). This will need to be in _XML_ folder when you dump to XML. Here is a sample file:
<?XML version=”1.0″?>
<U2xml-mapping xmlns:U2xml=”http://www.ibm.com/U2-xml“>
<U2xml:mapping file=”COUNTIES”
record=”COUNTIES_rec”
/>
<U2xml:mapping file=”COUNTIES”
field=”COUNTIES.ID”
map-to=”COUNTIES_ID”
/>
<U2xml:mapping file=”COUNTIES”
field=”CNTY.DESC”
map-to=”CNTY_DESC”
/>
</U2xml-mapping>

4. Run the Unidata command to LIST to XML, i.e,
/usr/ud71/bin/udt LIST COUNTIES COUNTIES.ID CNTY.DESC TO COUNTIES TOXML XMLMAPPING COUNTIES.txt
XMLMAPPING uses the map COUNTIES.txt, in this case. TO COUNTIES TOXML creates a file COUNTIES.xml in the _XML_ folder you created above.

I found the documentation for this on IBMs website, Unidata manual called ‘Using Uniquery’, chapter 9, ‘Creating XML Documents’

Blog at WordPress.com.