Dataflows – The Cornerstone of Data Integration Part 1

Part 1 – Introduction to Centerprise Dataflows

Because dataflows are the cornerstone of data integration, we’ve organized our documentation on Centerprise dataflows into a blog series. Centerprise Dataflow Designer, with its visual interface, drag-and-drop capabilities, instant preview, and full complement of sources, targets, and transformations, is the perfect way for users to create and maintain effective and efficient dataflows. This eight-part blog series will cover everything you need to know to get the most out of your Centerprise integration projects. To see the complete outline of this series and what’s in store over the coming weeks, click here.

Modularity and Reusability

A dataflow contains a set of transformations that are executed in a user-defined sequence. Usually, data is read from one or more data sources, goes through a series of transformations, and the transformed data is then written to one or more destinations. Modularity enhances the maintainability of your dataflows by making them easier to read and understand. It also promotes reusability by isolating frequently used logic into individual components that can be leveraged as “black boxes” by other flows. Centerprise supports multiple types of reusable components, including subflows, shared actions, shared connections, and detached transformations. Visit our blog on Centerprise Best Practices: Modularity and Reusability for additional details.

Centerprise dataflows provide seamless integration between data sources and destinations, helping users integrate applications within the enterprise as well as with outside customers, vendors, and other business partners. In a Centerprise dataflow, any number of sources and destinations can be mixed and matched on a single visual dataflow diagram, enabling transformations, validations, and routing to be specified as the data moves down the pipeline.

With Centerprise dataflows, data can be merged from multiple disparate sources, data can be split from a single source into multiple destinations, and a series of transformations ranging from relatively simple to highly complex can be performed. Centerprise’s built-in transformations include field-level transformations such as expressions, lookups, and functions, as well as record set-level transformations such as sort, join, union, merge, filter, route, normalize, denormalize, and many others. Centerprise also provides transformations that enable users to apply data quality rules, ensuring that the data meets specified criteria. In addition, users can route the flow of data using custom decision logic suited to a particular scenario.
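
To make the idea concrete, the short Python sketch below is purely illustrative (it is not Centerprise functionality or an API): it shows what a data quality rule combined with routing logic conceptually does to a stream of records, with each record checked against a rule and then sent down one of several paths.

# Illustrative sketch only: a data quality rule plus custom routing logic,
# not Centerprise's actual implementation or API.

records = [
    {"id": 1, "email": "alice@example.com", "amount": 120.0},
    {"id": 2, "email": "not-an-email",      "amount": -5.0},
    {"id": 3, "email": "bob@example.com",   "amount": 60.0},
]

def passes_quality_rule(record):
    # A simple rule: email must contain "@" and amount must be non-negative.
    return "@" in record["email"] and record["amount"] >= 0

# Custom decision logic routes valid records by amount; invalid records go to an error path.
high_value, low_value, errors = [], [], []
for record in records:
    if not passes_quality_rule(record):
        errors.append(record)
    elif record["amount"] >= 100:
        high_value.append(record)
    else:
        low_value.append(record)

print(len(high_value), len(low_value), len(errors))  # 1 1 1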

Subflows

For a set of complex transformations that are used repeatedly, subflows can be created, enabling users to build modular integration projects. A subflow is a dataflow that can be used inside another dataflow, and any number of subflows can be called to run inside a single dataflow. A subflow hides its underlying logic, allowing the main dataflow to treat it as a black box. This simplifies and streamlines the design of integration jobs, increases reusability, and results in an easier-to-understand overall diagram. Over time, as the logic inside the subflow changes, the subflow can be updated, and the update is automatically reflected in every dataflow that uses it.

[Image: Example Centerprise subflow]
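
By way of analogy only (Centerprise subflows are assembled visually in the Dataflow Designer, not written in code), a subflow behaves much like a reusable function: the main dataflow calls it as a black box, and updating it updates every flow that uses it. A minimal Python sketch of the idea:

# Analogy only: a subflow behaves like a reusable function that the
# main dataflow treats as a black box.

def cleanse_customer(record):
    # "Subflow": internal logic can change without affecting callers.
    record = dict(record)
    record["name"] = record["name"].strip().title()
    record["country"] = record.get("country", "US").upper()
    return record

def main_dataflow(source_records):
    # "Main dataflow": reads a source, calls the subflow, writes a destination.
    return [cleanse_customer(r) for r in source_records]

source = [{"name": "  ada lovelace ", "country": "uk"}, {"name": "alan turing"}]
print(main_dataflow(source))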

 

Advanced Logging

The advanced logging functionality in Centerprise provides detailed visibility into the data at each step in the dataflow. A special ‘data quality mode’ is available to help capture error messages and related status information as records move through the dataflow pipeline. The data quality statistics can be written to any destination, so that both the individual data records and the aggregate data profile are available for review and analysis.
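
Conceptually, record-level error capture looks something like the Python sketch below. It is an illustration of the idea rather than the Centerprise feature itself: each record carries its own error messages through the flow, and an aggregate profile is computed at the end and written alongside the data to a destination (here, simply a JSON file).

# Illustrative sketch of record-level error capture and an aggregate profile;
# not Centerprise's actual logging implementation.

import json

records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -7},
    {"id": 3, "age": None},
]

processed = []
for record in records:
    errors = []
    if record["age"] is None:
        errors.append("age is missing")
    elif record["age"] < 0:
        errors.append("age must be non-negative")
    processed.append({**record, "errors": errors})

# Aggregate data profile derived from the per-record statuses.
profile = {
    "total_records": len(processed),
    "error_records": sum(1 for r in processed if r["errors"]),
}

# Both the individual records and the profile can be written to any destination.
with open("quality_report.json", "w") as f:
    json.dump({"records": processed, "profile": profile}, f, indent=2)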

Parameterization

Centerprise dataflows can run on local or remote servers. To support smooth development-to-production deployment, Centerprise provides extensive parameterization. This capability enables you to change database connection information, file paths, authentication information, and other values at runtime without modifying the underlying documents.
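
As a rough illustration of the idea in plain Python (this is not Centerprise itself, and it uses Python’s $NAME placeholder syntax rather than the Centerprise parameter notation), runtime parameterization amounts to resolving placeholders such as file paths and connection strings from values supplied at execution time instead of baking them into the flow definition:

# Illustrative sketch: resolving runtime parameters such as file paths and
# connection strings without editing the flow definition itself.

from string import Template

# The flow definition references parameters instead of hard-coded values.
flow_definition = {
    "source_path": "$DATA_DIR/customers.csv",
    "target_connection": "Server=$DB_HOST;Database=$DB_NAME;",
}

# Values supplied at runtime (for example, different per environment).
runtime_params = {"DATA_DIR": "/data/prod", "DB_HOST": "prod-sql-01", "DB_NAME": "Sales"}

resolved = {
    key: Template(value).substitute(runtime_params)
    for key, value in flow_definition.items()
}
print(resolved["source_path"])        # /data/prod/customers.csv
print(resolved["target_connection"])  # Server=prod-sql-01;Database=Sales;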

Dataflow Designer

A new dataflow can be created from scratch with just a few clicks using the Centerprise graphical Dataflow Designer. The Dataflow Designer enables users to drag and drop objects onto the dataflow, copy or move them between dataflows, change properties, create maps, and save objects for reuse in a different dataflow, among many other things, all with unlimited undo and redo of previous actions.

Objects can be added to a dataflow in several ways, including dragging and dropping files directly from any Explorer window, dragging and dropping tables or views from the built-in Data Source Browser, or adding an object directly from the Flow Toolbox.

Flow Toolbox

The objects on the Flow Toolbox are organized into expandable categories. The following main categories are available:

Sources

Data sources are starting points for any dataflow. Data is read from the data source and may optionally pass through succeeding transformations before being written to a destination. Data sources cannot succeed any object other than parameters, context, or singleton objects.

You can designate any data source as a singleton. Singleton sources are useful for reading values from configuration files or databases, which are then supplied as parameters to the other dataflow objects. When a data source is marked as a singleton, Centerprise reads only the first record from the data source and makes it available to maps and parameters throughout the entire life of the dataflow. This makes singleton objects useful for providing configuration and environment information to the current dataflow.
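
A rough Python analogy for a singleton source (not Centerprise behavior verbatim): read just the first record of a configuration source once, then make its fields available to the rest of the flow.

# Analogy only: a "singleton" source reads a single record once and exposes
# its fields as parameters for the rest of the flow.

import csv, io

config_csv = io.StringIO("region,batch_size\nEMEA,500\nAPAC,200\n")

# Only the first record is read and kept for the life of the flow.
singleton = next(csv.DictReader(config_csv))

def process(records, region, batch_size):
    return [r for r in records if r["region"] == region][: int(batch_size)]

data = [{"region": "EMEA", "id": 1}, {"region": "APAC", "id": 2}, {"region": "EMEA", "id": 3}]
print(process(data, singleton["region"], singleton["batch_size"]))  # EMEA records only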

Destinations

You will normally use destination objects on your dataflow to write to a database, file, or web service. A destination object must follow a source object either directly or indirectly via a chain of transformations. A destination object does not necessarily have to be the ending object on your dataflow, as another destination object, a subflow, or a log/profiler object may succeed it.

Transformations

A transformation object processes and changes the data traveling from an upstream object. Transformation objects can be used to convert, combine, filter, route, join, split, merge, look up, or otherwise process the incoming data. Centerprise transformations are of two types: single record transformations and set transformations.

Single Record Transformations

Single record transformations are used to derive or look up new values by using element values in preceding transformations or sources. The results of single record transformations can be viewed as appending more values to the preceding layout. A name parse function, for instance, takes a full name and breaks it into individual name components. These components can be mapped to succeeding transformations, or written to a destination. Examples of single record transformations include expressions, functions, and lookups.
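
The name parse example translates naturally into a per-record function. The short Python sketch below is illustrative only (it is not a Centerprise transformation); it shows how a single record transformation derives new fields and appends them to each incoming record:

# Illustrative single record transformation: derive new fields from each
# record independently, appending them to the existing layout.

def parse_name(record):
    parts = record["full_name"].split()
    return {**record, "first_name": parts[0], "last_name": parts[-1]}

records = [{"full_name": "Grace Hopper"}, {"full_name": "Edsger W. Dijkstra"}]
print([parse_name(r) for r in records])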

Set Transformations

Set transformations work on the record set as a whole and can combine, route, filter, and otherwise manipulate it. Set transformations can change the order and content of records in the input stream. Examples of set transformations include join, filter, route, sort, union, and more.
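
In the same spirit, the Python sketch below (again, illustrative only) filters, sorts, and joins collections of records rather than acting on one record at a time:

# Illustrative set transformations: filter, sort, and join act on the whole
# record set and can change record order and count.

orders = [
    {"order_id": 10, "cust_id": 1, "total": 250},
    {"order_id": 11, "cust_id": 2, "total": 40},
    {"order_id": 12, "cust_id": 1, "total": 90},
]
customers = {1: "Acme Corp", 2: "Globex"}

filtered = [o for o in orders if o["total"] >= 50]                       # filter
ordered = sorted(filtered, key=lambda o: o["total"])                     # sort
joined = [{**o, "customer": customers[o["cust_id"]]} for o in ordered]   # join (lookup-style)
print(joined)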

Blocking Transformations

Blocking transformations accumulate some or all records before processing them. The Sort transformation is one example: it waits for the end of input before sorting and releasing records. Other blocking transformations include join, aggregate, and denormalize.
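
A blocking transformation can be pictured as a pipeline stage that must consume its whole input before it can emit anything. The generator-based Python sketch below (an illustration, not Centerprise internals) contrasts a streaming filter with a blocking sort:

# Illustrative contrast between a streaming (non-blocking) stage and a
# blocking stage that must accumulate all input before releasing records.

def streaming_filter(records):
    # Emits each qualifying record as soon as it arrives.
    for r in records:
        if r["amount"] > 0:
            yield r

def blocking_sort(records):
    # Must see the end of input before it can release the first record.
    buffered = list(records)
    yield from sorted(buffered, key=lambda r: r["amount"])

source = [{"amount": 30}, {"amount": -5}, {"amount": 10}]
print(list(blocking_sort(streaming_filter(source))))  # [{'amount': 10}, {'amount': 30}]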

Resources

The Resources category provides access to context and parameter objects, as well as shared database connections. These objects are useful for parameterizing a dataflow. They pass in values from outside the dataflow or use values from the job context, such as Server Name or Scheduled Job Id. The values coming from context and parameter objects, as well as the fields in singleton sources, can be accessed directly in many places throughout the dataflow using the parameter replacement notation $(<parameter_name>).

Shared Connection objects provide the ability to use a single connection and, optionally, a single transaction for multiple destinations. This enables users to write to multiple destinations in the same transaction and roll back the entire transaction if necessary.
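
The shared-connection idea maps onto the familiar pattern of writing to several tables inside a single database transaction. The sqlite-based Python sketch below is an analogy rather than the Centerprise Shared Connection object itself: two “destinations” share one connection, and a failure in either write rolls back both.

# Analogy only: a shared connection lets several destinations write inside one
# transaction, so a failure rolls back all of the writes together.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE audit (order_id INTEGER, note TEXT)")

try:
    with conn:  # one transaction shared by both "destinations"
        conn.execute("INSERT INTO orders VALUES (?, ?)", (1, 99.5))
        conn.execute("INSERT INTO audit VALUES (?, ?)", (1, "loaded"))
except sqlite3.Error:
    # Any failure inside the block rolls back both inserts.
    pass

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1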