Q&A: Working With the High Volume Data Warehouse in Centerprise

The first in our Centerprise Best Practices Webinar Series discusses the features of Centerprise that make it the ideal integration solution for the high volume data warehouse. Topics include data quality (profiling, quality measurements, and validation), translating data to star schema (maintaining foreign key relationships and cardinality with slowly changing dimensions), and performance, including querying data with in-database joins and caching. We’ve posted the Q&A below, which delves into some interesting topics.

You can view the webinar video, as well as all our demo and tutorial videos, at Astera TV.

Q: Can data profiling stand alone?

A: Yes, absolutely. That’s exactly what is done in the order analysis example shown below. If you look at the targets, I’m not actually writing to a data warehouse or moving any data; all I’m doing is creating reports. So yes, you can use data profiling as a standalone piece. In this case I’m simply gathering information about this particular schema.

Q: Can we group a bunch of data quality rules and use them in multiple flows?

A: Yes. As shown below, you can create multiple rules and turn them into a reusable component by simply dragging and dropping it onto the project. You can see that this data check component then becomes a greyed-out box, which means it is now used by reference. Any other flow can now use this data check by reference. It is a very good practice to make something reusable any time you can. You’ll thank yourself in the future.

Q: Explain how the Persistent Lookup Cache improves performance

A: In the example below, the dimension product lookup is something that is going to be used over and over again, so you don’t want to load that dimension table every time. A dimension table like this may have hundreds of thousands of rows, so if every fact table load also has to load all the data for a lookup, much of your processing and data transfer will be consumed by lookups alone. Instead, Centerprise lets you use the Persistent Lookup Cache. Centerprise has a built-in database where it stores this information on disk, so you don’t have to worry about memory usage. Every time a lookup uses this table, it consults the cache rather than making a trip to the database. Going to the database is inherently expensive, and making repeated trips to a very large table and retrieving all records is extremely expensive and can actually bring your process to a halt. I recommend using the Persistent Cache any time you have the option to do so.
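Centerprise’s internal implementation isn’t shown in the webinar, but the idea behind a persistent lookup cache can be sketched in a few lines of Python. The `PersistentLookupCache` class and the `fetch_from_db` callback below are hypothetical illustrations, not the actual Centerprise API: the first lookup of a key makes a database trip, and every later lookup of that key is served from a disk-backed SQLite store.

```python
import sqlite3


class PersistentLookupCache:
    """Disk-backed lookup cache (illustrative sketch, not Centerprise's code).

    path can be a file on disk so the cache survives between runs;
    fetch_from_db stands in for the expensive trip to the source database.
    """

    def __init__(self, path, fetch_from_db):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )
        self.fetch_from_db = fetch_from_db
        self.db_trips = 0  # counts how often we actually hit the database

    def lookup(self, key):
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]  # cache hit: no database trip
        self.db_trips += 1
        value = self.fetch_from_db(key)
        self.conn.execute("INSERT INTO cache VALUES (?, ?)", (key, value))
        self.conn.commit()
        return value
```

With a dimension of hundreds of thousands of rows, the difference between one database trip per distinct key and one per record is exactly the saving the answer describes.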

Q: What are the most frequent causes of performance issues in loading data warehouses that users have experienced with Centerprise?

A: The number one cause of performance issues with Centerprise, as with any data integration program, is data volume: having too many lookups, especially many lookups chained together. In the example below you can see that every key in the fact table needs some sort of lookup, and if you have, for instance, 10 lookups right before the fact table load, each one of those lookups has to complete before a record can be inserted into the fact table. So having a lot of inefficient lookups will slow down the dataflow considerably.

A second issue that can impair performance is the initial query. The way to solve this is to parameterize these queries, which can be done in a couple of different ways. First, you can use variables that are controlled from the outside. For example, if you have a workflow that triggers all your dataflows, you can set that workflow to pull records for a limited time span, say a week. This will cut down significantly the amount of data going between the source database and Centerprise.
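A minimal sketch of that first approach, assuming a `modified_date` column and a helper named `build_source_query` (both invented for illustration, not Centerprise parameters): the controlling workflow supplies the time window, and the source query is built from it.

```python
from datetime import date, timedelta


def build_source_query(table, start, end):
    """Build a parameterized extract query; the window is supplied
    from outside, e.g. by the workflow that triggers the dataflow."""
    sql = (
        f"SELECT * FROM {table} "
        "WHERE modified_date >= ? AND modified_date < ?"
    )
    return sql, (start.isoformat(), end.isoformat())


# a one-week window set by the controlling workflow
end = date(2024, 1, 8)
sql, params = build_source_query("orders", end - timedelta(days=7), end)
```

Only a week of rows crosses the wire instead of the whole table, which is the saving described above.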

A third option is very similar to using variables, but instead uses incremental load based on audit fields. If you have a field that is guaranteed to be modified every time a record changes, you can use that modified-date audit field as shown in the example below, and Centerprise will store that information in a file.

Then each subsequent time the dataflow runs, it consults that file and effectively applies the same filter you would have defined in your “where” clause, but does so automatically. The upside is that you don’t have to keep track of the variables; the downside is that you now have an incremental file per object you are loading from. This brings up the point that you may even want to make the sources shared actions, so you don’t have to keep redefining them and their audit fields.
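The audit-field pattern can be sketched like this (the function and field names are hypothetical; Centerprise handles this internally): read the last watermark from a per-object file, keep only rows modified since then, and advance the watermark.

```python
import json
import os
import tempfile


def incremental_extract(records, watermark_file):
    """Return only records newer than the stored watermark, then advance it."""
    last = ""  # empty string sorts before any ISO date: first run takes everything
    if os.path.exists(watermark_file):
        with open(watermark_file) as f:
            last = json.load(f)["modified_date"]
    fresh = [r for r in records if r["modified_date"] > last]
    if fresh:
        with open(watermark_file, "w") as f:
            json.dump(
                {"modified_date": max(r["modified_date"] for r in fresh)}, f
            )
    return fresh


# one watermark file per source object, e.g. Orders
watermark = os.path.join(tempfile.mkdtemp(), "orders.watermark")
```

The first run pulls all rows; a rerun with no new modifications pulls nothing, which is exactly the behavior of the automatic “where” filter described above.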

Q: Does Use Constraint-Based Write automatically figure out the sequence of writing?

A: Yes, it does. It doesn’t matter how many tables you are writing to, as long as they are in the same database. When you choose Use Constraint-Based Write, Centerprise knows the order in which to write: it knows it has to write the customer first and then the sales order, and it takes care of the order of operations for you.
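Under the hood, this kind of ordering amounts to a topological sort over foreign-key dependencies. A sketch of the idea (not Centerprise’s actual code, and the table names are taken from the answer above) using Python’s standard `graphlib`:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each table maps to the set of tables it references via foreign keys.
# A referenced (parent) table must be written before its children.
fk_refs = {
    "SalesOrder": {"Customer", "Product"},
    "Customer": set(),
    "Product": set(),
}

# static_order() emits parents before the tables that reference them
write_order = list(TopologicalSorter(fk_refs).static_order())
```

Here `Customer` and `Product` always come out before `SalesOrder`, matching the customer-before-sales-order behavior described in the answer.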

Q: How does Diff Processor performance compare to upsert?

A: Diff Processor is much faster than upsert. Upsert fires off another query for each record to see whether the information already exists, while Diff Processor sends records to the target system in batches. They are written to a temporary table and joined, so the comparison happens on the database side rather than on the Centerprise side: large chunks are compared at once instead of running a separate query per record to decide whether an insert or an update is needed. Basically, upsert works one record at a time and Diff Processor compares in batches. We’ve found it to be orders of magnitude faster.
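The batch comparison can be sketched with SQLite standing in for the target database (the table and column names are invented for illustration): the incoming batch lands in a temporary staging table, and a single join classifies every row as an insert or an update in one pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "keep")])

# an incoming batch: one changed row, one unchanged row, one new row
batch = [(1, "new"), (2, "keep"), (3, "added")]

# land the whole batch in a temp table, then classify with set-based joins
conn.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", batch)

inserts = conn.execute(
    "SELECT s.id, s.name FROM staging s "
    "LEFT JOIN target t ON s.id = t.id WHERE t.id IS NULL"
).fetchall()
updates = conn.execute(
    "SELECT s.id, s.name FROM staging s "
    "JOIN target t ON s.id = t.id WHERE s.name <> t.name"
).fetchall()

conn.executemany("INSERT INTO target VALUES (?, ?)", inserts)
conn.executemany(
    "UPDATE target SET name = ? WHERE id = ?", [(n, i) for i, n in updates]
)
```

Two set-based queries replace one existence check per record, which is why the batch approach wins by orders of magnitude on large loads.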

Q: Do you support fast load for Teradata?

A: Yes, both FastLoad and MultiLoad. FastLoad is automatically used when you are writing into an empty table.