Navigating The Data “In Natura”

In the year 2000, I was hired to work as a consultant in a State Public Agency in Brazil with the mission of developing products for the Examination Area.

At that time, a large software company had been developing for more than two years, a Data Warehouse to serve the Examination Department, but no results seemed visible within the time limits of the project.

When analyzing the generated documents, we verified that this was merely a documentation of intentions following the standards of the company’s methodology and that it reflected a specification made by the users, lacking, therefore, greater technical details necessary to achieve the project.

Given the urgency of results, we then took another approach considering the following premises regarding the data required for the project:

  • The project in progress had implemented new cadaster and collection systems and was in the process of implementing others with the use of modern methodologies and tools that guaranteed good data quality;

  • The collection information provided by the banks was also reliable and of good quality;

  • The other information anticipated to participate in the system, such as that of the SERPRO’s COMEXT System (Federal Data Processing Service), was totally reliable;

  • Other information provided by the taxpayers was processed through applications downloaded from their sites, that had been recently developed by the project in execution;

  • The Writ of Violation system was fully digitalized and recently implemented by the same project.

As for the necessary requirements for servicing the area we considered that:

  • A relevant part of the needs of the Examination Area was in obtaining a group of taxpayers that would respond to some criteria which could vary according to the examination program being executed;

  • This program varies periodically, and is difficult to systematize since it will depend on events, priorities or new internal or external information added to the Database;

  • Users require the greatest possible flexibility, in order not to depend on the Information Technology Area in the preparation of the parameters and cases that may be selected which respond to the criteria defined by the examination programs;

  • The users need flexibility for allowing the analysis and adjustment of the number of cases to their management capacity;

  • The system would be used by a very restricted number of users.

We then proposed the following solution:

  • The system should be repetitive, without the need for IT intervention;

  • The system should minimize the operational procedures for generating the necessary data, i.e., the information used should be, to the extent possible, already existing in the database;

  • The system would have a layer on the Data Model so that the names of the objects of the database were the names known by the business area and not the standards implemented by the IT area;

  • The system should use the same data structures existing in the operational databases (replicated or not);

  • Define the Theme concept as a set of columns of a table that may be of interest to the business. Thus, a table could be broken down into several Themes.

  • Definition of some attributes regarding the columns of the tables that confer a specific function in the data recovery. For example: Define whether the column is an Element type, if it will be used as a Filter, Dimension, Metrics or a link to another table. A column could have several attributes;

  • Since the resulting object is always a group of taxpayers that satisfy some condition, we proposed to define that the result is always a series of Elements, which can be: CPF (Individual Taxpayer Identifier), CNPJ (Corporate Entity Identifier), RENAVAM (Motor Vehicle Identifier), IE – State Registration, or any other;

  • Use of the group theory in the operations involving Union, Intersection and Subtraction of the selected elements (of the same type) producing other resulting groups for the determination of the group of interest;

The system works with successive refinements: A research consists of Elements, as a result of selections of attributes of a Theme. The inclusion of new attributes generates a new group  which is included in the initial group, and so on. At any time, the research can be saved, as it is already the final result, or for use in other research, either for retrieval or presentation of data.

At this point some readers must be concerned with the proposal for accessing operational data. The idea is to use the same structures; therefore, in the case under analysis, the same operational data were used, due to the limited number of users who access the system and mainly by an implementation of the system known as Base Set. The Base Set is a set used as a reference, starting, that is, any result set always belongs or is contained in the Base Set.

Entering a Theme such as Collection, Electronic Invoices, among others, without a Base Set can be time-consuming for a system that should be repetitive. Therefore, if the Base Set has a few thousand Elements and the database is well configurated, there are no major problems.


We consider the existence of the Cadaster Theme, the Taxpayers´ Economic Activities Theme, the Collection Theme and the Imports Theme:

The objective is to obtain the Taxpayers identified by the CNPJ from a certain range of capital stock, a set of Economic Activities, who originated more than a certain amount and paid less than a tax amount in a given reference period;

  • From the Cadaster Theme, the user retrieves the set of taxpayers (CNPJ) with capital stock in the desired range. From the Economic Activities Theme, the user selects the taxpayers (CNPJ) that exercise the economic activities of interest. Note that the refinement is automatic;

  • The user may save the search result to be used as the Base Set of interest, and the CNPJ of the taxpayers for further analysis. This Base Set is the most efficient way to investigate other information, which are much larger files than those of the Cadaster and Taxpayer Economic Activities themes, or to use it in other searches;

  • As the system works with successive refinements, even if there is no search saved, the system will use the set obtained in (1) for continuing the research, but with a lower efficiency.

  • Starting from the Base Set and the Collection Theme, the user recovers the set of taxpayers (CNPJ) with payments of one or several Revenue Codes lower than a certain value in the given reference period. From the Imports Theme, the user retrieves the set of taxpayers (CNPJ) that resulted in values above a value of the period. At this moment, the system used the Base Set for both Themes, besides having automatically done the refinement.

In the 4 themes we have the desired result, a list of CNPJ that correspond to the example. The user could perform any of the other operations in the sets obtaining the sets as needed, or follow another navigation strategy. In addition, if the number of elements does not meet the requirements, the user may adjust the capital stock or use other economic activity codes or other parameters to include or eliminate taxpayers from the selected set.

Once the appropriate set of elements is obtained, except for a research résumé, history, this set can be used at any time to extract data from any themes configurated in the system, or serve as a control for the generation of cases and issuance of service orders for the examination teams, or serve as sets to participate in other operations with new sets.

This system became available in its initial version in 6 months and the result was so satisfactory that it came to be used by the Office of the Secretary of Revenue to extract managerial information from the collection area, among others.

In the initial version implemented this system was known as “PLAFIS – Examination Planning – Management Module”. Later, it was known as “JONAS – Just Online Navigation Analysis and Selection System”.

The Story Repeats Itself

Later, in 2014, I was hired for a consultancy project in another State Public Agency in Brazil. At that time, the Administration had a great expectation in a project, which had been under development for two years, called “DW” which consisted of the structuring of a reliable database established in a server as part of the operating environment. The data was transported daily to this environment through a system based on ACL (Audit Command Language) which carried out the recovery procedures in “Batches” (Lots).

Although our project gave some support to the DW project, at the end of another three years the DW project was declared non-viable. In other words, after 5 years, the expectation of having a database to retrieve essential information for the Administration was frustrated.

Thinking of a transition solution until we had the new system and a DW that met the Administration’s expectations, I recovered the JONAS system described above that was on platforms no longer supported (Windows XP and Delphi 5), which was installed in a virtual machine, which could be an alternative in the process of finding a solution that would meet the information needs of the administration.

The JONAS was configurated over the Operational Database that served the Administration and did not require any adjustment for the presentation of the prototype as a proposal of an alternative path. Unfortunately, our project did not have deadlines or resources for the development of solution in the current platforms, which ended up occurring later, with the construction of JONAS 2.0 using other resources.

The described solution is very adequate for the systematic or temporary recovery of information, while awaiting or not for other solutions. Its information capacity is something impressive. Let us see the following example as an exercise:

Assuming there are 20 Themes, each with 5 recoverable information attributes (a dimension, for example, Municipality, is unique regardless of the possible values) we can make the following estimate:

In total 20 x 5 = 100 attributes. From the combinatory analysis:

C 100,1 = 100! / (1! * 99!) = 100
C 100,2 = 100! / (2! * 98!) = 100 * 99 / 2 = 4.950
C 100,3 = 100! / (3! * 97!) = 100 * 99 * 98 / 6 = 161.700

That is: There are 166,750 combination options with up to 3 attributes, if there were more attributes …

Obviously, many of the combinations may not make sense, but they are available for the experts. We can say that: In the depths of the unprocessed data there is a wealth of information waiting to be highlighted.


View of the interface with Search History

View of the Information Analysis, selecting the Economic Activity Dimension

902 total views, 1 views today

Leave a Reply

Your email address will not be published.

CIAT Subscriptions

Browse through the site without restrictions. Consult and download the contents.

Subscribe to our electronic newsletters:

  • Blog
  • Academic offer (Only in spanish)
  • Newsletter
  • Publications
  • News alert

Activate subscription

CIAT Members

Representatives, Correspondent and Authorized staff (TA)