Creating a Report Model with ReportMiner 6.2 – Part 1

ReportMiner liberates business data trapped in printed documents (reports, bank statements, invoices, and the like) stored in popular formats such as PDF, PRN, TXT, XLS, and XLSX. Once extracted, that data can be integrated into company databases and used for operations and business intelligence applications. ReportMiner 6.2’s easy-to-use environment lets users view the document being mined and define the logic for extracting data blocks and the fields within those blocks.

Extracting data from a printed document is called data mining or report mining. To do it, you first create a report model that defines the report’s structure, and then export the extracted data to your destination of choice. You can also use your report as a source object in a dataflow, where you can take advantage of ReportMiner’s advanced transformation and conversion features.

A report model normally has a data region and fields belonging to this region. Depending on the structure of the data, you can create a separate Header and Footer, and append regions with their own fields. ReportMiner supports true hierarchical data extraction: a data region can have child data regions, the child regions can have their own children, and so on. This week we’ll learn about extracting Header data, and next week we’ll learn the details of how to create the fields that make up the Header.
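The hierarchy described above can be pictured as a small tree of regions. The following Python sketch is only an illustration of that idea, not ReportMiner's internal format or API: the class names (ReportField, Region, ReportModel) and the sample region and field names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReportField:
    """A named field extracted from a region (hypothetical type)."""
    name: str


@dataclass
class Region:
    """A region with its own fields and, optionally, child regions."""
    name: str
    fields: List[ReportField] = field(default_factory=list)
    children: List["Region"] = field(default_factory=list)

    def depth(self) -> int:
        """Count the nesting levels at and below this region."""
        return 1 + max((child.depth() for child in self.children), default=0)


@dataclass
class ReportModel:
    """A data region plus optional Header and Footer regions."""
    data: Region
    header: Optional[Region] = None
    footer: Optional[Region] = None


# Hypothetical model for an orders report; all names are illustrative.
order_items = Region("OrderItems", fields=[ReportField("Item"), ReportField("Quantity")])
orders = Region("Orders", fields=[ReportField("OrderNumber")], children=[order_items])
header = Region("Header", fields=[ReportField("CompanyName"), ReportField("OrderDate")])
model = ReportModel(data=orders, header=header)

print(model.data.depth())  # 2: Orders, with OrderItems nested beneath it
```

Because each Region holds a list of child Regions, the same structure can represent nesting of any depth, which is the property the paragraph above describes.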

To create a new report model, go to File -> New and select Report Model (Figure 1).

Figure 1

ReportMiner supports extracting unstructured data from text, EDI, Excel, PRN, and PDF files. All file types fall under the content type Report except for Excel, which has its own content type (Figure 2).
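The content type rule above amounts to a simple lookup. Here is a minimal Python sketch of it, assuming that file types are recognized by extension; the function name content_type_for and the extension lists are assumptions made for illustration (the ".edi" extension in particular is a guess), and ReportMiner itself asks you to pick the content type in the dialog.

```python
from pathlib import Path

# Assumption: extensions chosen for illustration; ".edi" in particular is a guess.
REPORT_EXTENSIONS = {".txt", ".prn", ".pdf", ".edi"}
EXCEL_EXTENSIONS = {".xls", ".xlsx"}


def content_type_for(filename: str) -> str:
    """Return "Excel" for Excel files and "Report" for every other supported type."""
    ext = Path(filename).suffix.lower()
    if ext in EXCEL_EXTENSIONS:
        return "Excel"
    if ext in REPORT_EXTENSIONS:
        return "Report"
    raise ValueError(f"Unsupported file type: {ext or filename}")


print(content_type_for("Orders.pdf"))   # Report
print(content_type_for("Orders.XLSX"))  # Excel
```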

Figure 2

Select the data file to use as a sample; we will build our report model from the data in this file. The available reading options depend on the content type of your data. For example, if you have a PDF file, you can select the scaling factor, font, tab size, and passwords.

We selected a sample data file for Orders. The selected file is loaded into the Report Definition Editor (Figure 3).

Figure 3

Note: You can also load a different data file into the Report Definition Editor at a later time. Click the file icon on the toolbar and navigate to the file you want to load.

Let’s take a look at this report. At the top of our sample is general order information, such as Company Name, Order Date and Time, Customer Name, Account Number, and others. Following it is the detailed order information, such as the order items making up the order.

Extracting Header Data

Our sample report has two logical regions, the Header region and the Data region. Unlike some other common reports, this report has no Footer region.

The Header is at the very top of the report, spanning three lines starting at the line with the order date (Figure 4).

Figure 4

So the first step in creating our report model will be to define the Header for our report.

In the Report Definition Editor, select the top three lines. This is the area that covers the Header. Right-click your selection and choose one of the options in the context menu (Figure 5).

Figure 5

Since we are creating the Header, select Add Page Header Region.

The Report Browser on the left-hand side now shows a new node called Header (Figure 6).

Figure 6

Now, let’s take a closer look at the Header. The Header in our sample always starts with a date, which appears on the very first line of the Header, in the very first character position. We can use the date as an identifying pattern for the Header.

Let’s enter the pattern using the wildcard characters that denote digits, as shown in Figure 7.

Figure 7

Any time this pattern occurs inside the file, ReportMiner will treat it as the starting point of the Header.

Notice that the Report Definition Editor now highlights the Header in purple. The Header spans three lines, as shown by the purple block in the editor. The height of the Header or of any other region (that is, the number of lines the region spans) is controlled by the Line Count input below the Report Toolbar.
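To make the combination of identifying pattern and Line Count concrete, here is a minimal Python sketch of the idea. It is not how ReportMiner is implemented. It assumes "#" as a stand-in for the digit wildcard (the real wildcard characters are the ones shown in Figure 7), an invented date layout, and made-up sample lines; the function names are hypothetical as well.

```python
import re

# Assumption: "#" stands in for the digit wildcard shown in Figure 7.
DIGIT_WILDCARD = "#"


def wildcard_to_regex(pattern):
    """Translate a wildcard pattern into a regex; every other character is literal."""
    parts = [r"\d" if ch == DIGIT_WILDCARD else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts))


def extract_regions(lines, pattern, line_count):
    """Return one block of line_count lines for each line that starts with the pattern."""
    regex = wildcard_to_regex(pattern)
    regions = []
    for index, line in enumerate(lines):
        # match() anchors at the first character position of the line.
        if regex.match(line):
            regions.append(lines[index:index + line_count])
    return regions


# Invented sample text; the real Orders file may be laid out differently.
sample_lines = [
    "01/15/2013 10:42   EXAMPLE COMPANY",
    "Customer: Jane Doe",
    "Account:  12345",
    "    Widget        2     9.99",
    "    Shipped on 01/20/2013",
    "02/03/2013 08:05   EXAMPLE COMPANY",
    "Customer: John Roe",
    "Account:  67890",
]

headers = extract_regions(sample_lines, "##/##/####", line_count=3)
print(len(headers))  # 2
```

The indented "Shipped on" line contains a date but does not start a Header, because the pattern is only tested at the first character position. Changing line_count changes the height of every captured region, which mirrors the role of the Line Count input.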

The next step is to create the fields making up the Header. We’ll show you how to do that in next week’s post.

