April 20, 2021

Collibra in the wild

I was curious to discover how Collibra is used in the wild, and to see in what kind of organizations it makes sense to implement a data governance platform. I reached out to my network to find out whether I could get a look under the hood of this Belgian data management platform.

The Chief Data Officer at a Belgian retail bank was kind enough to set this up, and a couple of weeks later I got the opportunity to follow a guided tour through their environment.

Overall I was impressed by the sheer magnitude of the data governance program at the bank and by the capacity and flexibility of Collibra to handle it smoothly. The speed at which regulatory requirements like BCBS 239 and GDPR came into force pushed banks to put data governance systems in place very quickly.

This explains why they were very early adopters of the data governance platform. Improvements are certainly possible in the way Collibra is implemented at this Belgian retail bank. Possible next steps could be to further integrate business terms and logical data models, and to align the different technologies so that the data catalogs and lineage can be automated.

Content

What does a data platform do?

Collibra at a Belgian retail bank

Data dictionary

Data lineage

Data quality

What does a data platform do?

If I had to start from a blank sheet and describe what I would expect a data management platform to do, I would come up with two things: data catalog and data lineage. A look at the Collibra website quickly teaches me that I missed the data quality component of such a platform.

On the Collibra website we find the following products: Data Governance, Data Catalog, Data Privacy, Data Lineage and Data Quality. For me a data catalog is the starting point for a data dictionary, of which data privacy is an attribute. But I don’t want to get lost in a semantic discussion around these terms, so I’ll try to keep the language as crisp as possible (as Jo taught us at Deloitte) and assume a data platform should do three things: data dictionary, data lineage and data quality.

Functionality of Collibra

Collibra at a Belgian retail bank

Collibra is oriented towards the users in the business, while other data governance tools like Microsoft Purview or Informatica Axon are built more for technical IT specialists. The starting page for Collibra at this bank reflects this approach by presenting itself as a hub where topics such as Data inventory, Data sharing, Data quality, Data usage, and Enterprise Data Model can be explored. Some of those topics nonetheless seem aimed at more technical users.

Home page of the data management platform

Data dictionary

The data dictionary overview contains a list of applications, the tribe they reside in, their asset manager, a link to the data dictionary and a description. The integration of Enterprise Data Models from SAP PowerDesigner into Collibra is underway. If the link between the conceptual model of business terms and the logical model in PowerDesigner is well documented, Collibra will be able to build the data catalog automatically. For now this is already the case for Business Objects reports, where the names of the fields correspond with the names of the data elements in the business.

A possible use case: a business intelligence developer who wants to use certain data in a report can shop for that data in the data catalog and request access from the referential, who will make sure that the data is used for a legitimate interest and that a period of use is defined.

Another use case is to set up data governance as a gate that has to be passed when developing a new application, just like security and architecture are gates that have to be passed. For the data governance gate, the requirements would be that the privacy and ethics of the data usage are guaranteed and well documented.

Data dictionary (blurred out for confidentiality reasons)

Data lineage

In the data inventory you can find the data lineage, meaning the way the different data elements flow through the different systems. In the example you can see in which applications the email address is used. This allows for the kind of traceability that is mandatory for banks. You can also identify the master source for the different business terms and technically it would be possible to implement master data management in Collibra by implementing workflows.

Data lineage (blurred out for confidentiality reasons)

The data lineage in Collibra at the bank is not generated automatically from an analysis of the data transfer processes, as I was expecting, but was collected in a declarative way, that is, by interviewing the people responsible about the usage of the different data elements. Technically it is possible to write a parser for the SQL scripts to extract the data lineage information from the databases and ETL tools, but this is not yet in place at this Belgian retail bank.
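To give an idea of what such a parser could look like, here is a minimal sketch in Python. It is not how Collibra or the bank does it: the function name, the regular expressions and the table names are all made up for illustration, and it only understands simple INSERT INTO ... SELECT ... FROM/JOIN statements. Real ETL code would need a proper SQL parser.

```python
import re

# Hypothetical patterns: they only cover plain "INSERT INTO target ... SELECT ... FROM/JOIN source".
INSERT_RE = re.compile(r"\binsert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql_script: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) pairs found in the script."""
    edges = []
    for statement in sql_script.split(";"):
        target = INSERT_RE.search(statement)
        if target is None:
            continue  # not a data-moving statement this sketch understands
        for source in SOURCE_RE.findall(statement):
            edges.append((source.lower(), target.group(1).lower()))
    return edges


# Made-up example script: an email address flowing from a CRM to a data mart.
script = """
INSERT INTO dwh.customer_contact (customer_id, email)
SELECT c.id, e.email
FROM crm.customer c
JOIN crm.email_address e ON e.customer_id = c.id;

INSERT INTO mart.newsletter_audience (email)
SELECT email FROM dwh.customer_contact;
"""

for source, target in extract_lineage(script):
    print(f"{source} -> {target}")
```

The result is table-level lineage only. Following a single field such as the email address from column to column would require resolving aliases and expressions, which is where dedicated lineage tooling earns its keep.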

Data quality

The data quality rules are developed on request and the result is displayed in Collibra. It is interesting to see the different data quality dimensions that are tested in the different rules: timeliness, completeness, validity, uniqueness, consistency, accuracy. Data quality is very important in view of the regulatory environment for a bank.
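To make these dimensions a bit more tangible, here is a small Python sketch of how four of them could be scored for an email field. The sample records, field names, the email pattern and the one-year threshold are assumptions of mine, not rules I saw at the bank. Consistency and accuracy are left out because they need a second source to compare against.

```python
import re
from datetime import date

# Hypothetical sample records; the field names are illustrative only.
customers = [
    {"id": 1, "email": "an@example.com", "updated": date(2021, 4, 1)},
    {"id": 2, "email": "", "updated": date(2021, 4, 2)},
    {"id": 3, "email": "not-an-email", "updated": date(2019, 1, 15)},
    {"id": 4, "email": "an@example.com", "updated": date(2021, 3, 30)},
]

# Deliberately loose email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def ratio(part: int, total: int) -> float:
    """Share of rows passing a rule; an empty set counts as fully passing."""
    return part / total if total else 1.0


def completeness(rows, field):
    """Share of rows where the field is filled in."""
    return ratio(sum(1 for r in rows if r[field]), len(rows))


def validity(rows, field, pattern):
    """Share of filled-in values that match the expected format."""
    filled = [r[field] for r in rows if r[field]]
    return ratio(sum(1 for v in filled if pattern.match(v)), len(filled))


def uniqueness(rows, field):
    """Share of filled-in values that are distinct."""
    filled = [r[field] for r in rows if r[field]]
    return ratio(len(set(filled)), len(filled))


def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within the allowed number of days."""
    fresh = sum(1 for r in rows if (as_of - r[field]).days <= max_age_days)
    return ratio(fresh, len(rows))


print(f"completeness: {completeness(customers, 'email'):.0%}")
print(f"validity:     {validity(customers, 'email', EMAIL_RE):.0%}")
print(f"uniqueness:   {uniqueness(customers, 'email'):.0%}")
print(f"timeliness:   {timeliness(customers, 'updated', date(2021, 4, 20), 365):.0%}")
```

Each rule returns a share between 0 and 1, which is the kind of score a dashboard could display per rule and per dimension.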

Machine learning can be used for data classification. The algorithm suggests a data type for a certain data element, for example an email address, together with a percentage of certainty that the field is actually of this type.
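I have not seen how Collibra implements this, so the Python sketch below is only a stand-in to show the shape of the result. It uses hand-written patterns instead of machine learning, and the "certainty" is simply the share of sampled values that match a pattern. The pattern list and the function name are my own assumptions.

```python
import re

# Illustrative patterns; a trained model would replace these hand-written rules.
PATTERNS = {
    "email address": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "date (YYYY-MM-DD)": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "Belgian IBAN": re.compile(r"^BE\d{14}$"),
}


def suggest_type(values):
    """Return (suggested_type, share_of_matching_values) for a column sample."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None, 0.0
    best_type, best_score = None, 0.0
    for name, pattern in PATTERNS.items():
        score = sum(1 for v in non_empty if pattern.match(v)) / len(non_empty)
        if score > best_score:
            best_type, best_score = name, score
    return best_type, best_score


# Made-up sample of values from one column.
sample = ["an@example.com", "bob@example.org", "n/a", "carol@example.be"]
data_type, certainty = suggest_type(sample)
print(f"Suggested type: {data_type} ({certainty:.0%} of sampled values match)")
```

For this sample the suggestion is "email address" with 75%, because one of the four values is a placeholder. A data steward would then accept or reject the suggestion.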

Data quality rules

Conclusion

Keep in mind that this is a very specific, early instance of Collibra that doesn’t have all the latest features at its disposal. To get a better view of the newest additions and extensions, I recommend having a look at Collibra University. One of the reasons I wanted to better understand what Collibra does is to see whether it would be interesting to implement a data governance platform at small or medium-sized enterprises. Taking the effort in work and budget into account, I only see a use case for organizations of a certain size in sectors with very stringent data regulations.