To create sound data quality rules, data engineers and departments often have to take their own data apart first. Far too rarely does the data match the expected format, distribution, dependencies or fill level.
This is where data profiling comes into play. With this methodology, you find out exactly
- how the data is structured,
- which rules have to be implemented and
- which other requirements have to be taken into account.
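To make the idea concrete, here is a minimal sketch of what a profile captures for each column: fill level, observed types, number of distinct values and numeric range. It uses only the Python standard library. The function names `profile_column` and `profile_rows` and the sample records are hypothetical illustrations, not part of HEDDA.IO or of the profiling framework it uses.

```python
from collections import Counter


def profile_column(values):
    """Summarise one column: fill level, observed types, distinct count, numeric range."""
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    # Booleans are excluded from the numeric range on purpose.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "count": len(values),
        "fill_level": len(present) / len(values) if values else 0.0,
        "types": dict(Counter(type(v).__name__ for v in present)),
        "distinct": len(set(present)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }


def profile_rows(rows):
    """Profile every column that appears in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical sample records, invented for this illustration.
rows = [
    {"customer_id": 1, "country": "DE", "age": 34},
    {"customer_id": 2, "country": "", "age": 51},
    {"customer_id": 3, "country": "DE", "age": None},
    {"customer_id": 4, "country": "FR", "age": 28},
]

profile = profile_rows(rows)
for column, stats in profile.items():
    print(column, stats)
```

Even this small profile shows that `country` and `age` are only partly filled, which is the kind of finding that later turns into a data quality rule.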
With data profiling, users receive meaningful analyses of their data and can use them to create uniform rules that work, even if the original data was unknown to them beforehand. For this purpose, HEDDA.IO has integrated the open-source framework Pandas Data Profiling, which supports extensive and complex analyses.
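As a rough illustration of how profiling results can become rules, the sketch below derives a "required" flag and a value range from the observed values of one column, then checks new values against them. The helpers `derive_rules` and `violations`, the threshold parameter and the sample ages are all assumptions made for this example. They do not reflect how HEDDA.IO defines or stores its rules.

```python
def derive_rules(values, min_fill_level=1.0):
    """Derive simple candidate rules from the observed values of one column."""
    present = [v for v in values if v is not None]
    fill_level = len(present) / len(values) if values else 0.0
    # Mark the column as required only if it was filled often enough.
    rules = {"required": fill_level >= min_fill_level}
    # Propose a range rule only for purely numeric columns.
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        rules["min"] = min(present)
        rules["max"] = max(present)
    return rules


def violations(value, rules):
    """Return a list of rule violations for a single new value."""
    if value is None:
        return ["missing value"] if rules["required"] else []
    problems = []
    if "min" in rules and value < rules["min"]:
        problems.append("below observed minimum")
    if "max" in rules and value > rules["max"]:
        problems.append("above observed maximum")
    return problems


# Hypothetical observed values for an "age" column.
observed_ages = [34, 51, 28, 45]
age_rules = derive_rules(observed_ages)

for candidate in (40, 17, None):
    print(candidate, violations(candidate, age_rules))
```

Rules derived this way are only candidates. An observed minimum or maximum describes the sample, so a domain expert would normally review each rule before enforcing it.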
We have extended the integration so that it can be used from PySpark, from .NET Interactive Notebooks and via our Azure Functions. After each execution, all statistics are available to the user in the corresponding dialogue in the web UI as well as in the notebook widget.
HEDDA.IO is a new, attractive alternative for all data scientists and data engineers who want to optimize their work environment and sustainably improve the quality of their results.