The last two posts in this blog discussed concepts related to the Common Data Model (CDM) and described an approachable, code-driven process for designing a CDM for a real-world scenario. With today’s post, I want to discuss the tooling currently available to bring code-based CDMs (like the one we’ve built) to life.

If you haven’t had the chance to read the articles mentioned above, I strongly encourage you to do so, as they ground the concepts we’ll discuss further ahead.

As you should be aware at this point, the biggest selling point of the Common Data Model is that it lets multiple applications interact with a unified data structure, avoiding, among other things, data duplication. I went through the benefits in the first article of this series (link above). But where does it land in practical terms? Is there a service called “Common Data Model” in Azure? How do I send data to that structure? Can I generate an actual database (SQL, MySQL, or whatever) from a given CDM?

The answers to those questions are the subject of this final article of the series.

Hosting the CDM’s files

As you might remember, the CDM we developed in the second article was hosted locally, on our work machine. Now, in order to give it life in the cloud and enable it to feed applications in a highly scalable way, we’ve got to bring it into Azure. That’s the very first step.

The perfect place for CDM’s files in Microsoft’s cloud platform is Azure Data Lake Storage Gen2 (ADLS).

Figure 1. The relationship between Microsoft’s cloud and services with Azure Data Lake Storage Gen2

The reasons for that are crystal clear, aren’t they?

  • First, ADLS sits on top of the Azure Storage infrastructure, which gives you built-in benefits like high availability, “infinite” storage capacity, dynamic data replication, different layers of security, and so forth.
  • ADLS enables hierarchical namespaces, which allows the collection of objects/files within an account to be organized into a hierarchy of directories and nested subdirectories in the same way that the file system on your computer is organized. With a hierarchical namespace enabled, a storage account becomes capable of providing the scalability and cost-effectiveness of object storage, with file system semantics that are familiar to analytics engines and frameworks.
  • Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. Compared to the flat namespace on Blob storage, the hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.
  • ADLS integrates seamlessly with a variety of services in Microsoft’s cloud: Azure Data Factory, Logic Apps, Power Automate, and Power BI are some examples.

That’s why, as a first step towards bringing our CDM to life in the cloud, we’re going to create a Data Lake Storage Gen2 account. It can be done through the Azure Portal, PowerShell, the Azure CLI, an ARM template, or even by calling Azure’s APIs directly. I’ll do it through the Azure Portal for the sake of simplicity.
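If you prefer scripting to clicking, the same account can be created programmatically. Below is a minimal Python sketch using the azure-mgmt-storage SDK; the subscription ID, resource group, account name, and region are placeholder assumptions rather than values taken from this series.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<your-subscription-id>"   # assumption: your Azure subscription
resource_group = "rg-aucdm"                  # assumption: an existing resource group
account_name = "aucdmstorage"                # assumption: must be globally unique

client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# "is_hns_enabled=True" is what turns a regular StorageV2 account into ADLS Gen2
poller = client.storage_accounts.begin_create(
    resource_group,
    account_name,
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
        "is_hns_enabled": True,
    },
)
account = poller.result()
print(account.primary_endpoints.dfs)  # the ADLS (dfs) endpoint of the new account
```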

Azure’s documentation does a great job describing how to create the Storage Account (which ADLS actually sits on top of) and enable ADLS in this tutorial, so I won’t duplicate it here. At the end of that process, you should see a blade in the Azure Portal pretty similar to the one presented in Figure 2.

Figure 2. Creating a storage account with Data Lake Gen2 enabled

Next up, we can upload the CDM’s files and directories into the Data Lake’s file system. We could do that through the Azure Portal, the Azure CLI, PowerShell, Azure Data Factory (imagine your CDM being delivered to production by an automated pipeline, for instance), AzCopy, or even Azure Storage Explorer. To centralize everything in one place, I’m using the Azure Portal itself.
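As a programmatic alternative (handy if the CDM lands in ADLS via an automated pipeline), here is a minimal Python sketch using the azure-storage-file-datalake SDK. The account URL and local folder path are assumptions, and the “cdm” file system is expected to already exist — we create that container through the portal in the next steps.

```python
from pathlib import Path
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

account_url = "https://<your-account>.dfs.core.windows.net"  # assumption: your ADLS endpoint
local_cdm_root = Path("./cdm")                                # assumption: local CDM folder

service = DataLakeServiceClient(account_url, credential=DefaultAzureCredential())
file_system = service.get_file_system_client("cdm")  # the container created below

# Walk the local CDM folder and upload every file, preserving the directory layout
for path in local_cdm_root.rglob("*"):
    if path.is_file():
        remote_path = path.relative_to(local_cdm_root).as_posix()
        with path.open("rb") as data:
            file_system.get_file_client(remote_path).upload_data(data, overwrite=True)
        print(f"uploaded {remote_path}")
```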

To do so, first, from the storage account main blade, I’ve clicked on “Containers” under “Data Lake Storage” (Figure 3).

Figure 3. Getting access to ADLS containers area

Then, I created a new container called “cdm”. It will be responsible for hosting my CDM’s structure files later on. Figure 4 showcases the newly created container.

Figure 4. CDM’s container

Once inside the newly created container, I was finally able to upload all the files and directories that are part of the Americas University Common Data Model previously developed, as presented in Figure 5.

Figure 5. Common Data Model sitting in ADLS’ container

Next, we have to add a few role-based access control (RBAC) permissions, both on the storage account and on the “cdm” container itself. Those permissions will be required later on to authorize a given corporate user to perform operations on top of the data. The permissions that have to be set are described in the table below.

Resource level      Role assignment
Storage Account     Owner
Container “cdm”     Storage Reader and Data Access
Container “cdm”     Storage Blob Data Owner
Table 1. Role assignments to be added

If, for some reason, you need to grant that access to a given application or service programmatically, you must create a service principal first and then add the same role assignments to that service principal (a code sketch follows after the portal walkthrough below).

To do so, from within the storage account overview blade, click “Access Control (IAM)” at the top of the left-hand menu. Once there, click the button labeled “Add role assignments”. A new floating window will be shown. First, select the role “Owner”, then type in the corporate credential or service principal you want to grant permission to, select the result of the search, and finally, click “Save”. The role assignment is now complete.

Figure 6 below showcases that process.

Figure 6. Adding a new role assignment to the Storage Account

Now, switch to the context of the “cdm” container (the blade that lists the CDM’s files). Once there, click “Access Control (IAM)” at the top of the left-hand menu. Then, repeat the process of adding a new role assignment described in Figure 6 for each of the remaining roles (refer to Table 1).
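For the programmatic path mentioned earlier (granting the roles to a service principal), a minimal sketch with the azure-mgmt-authorization SDK could look like the following; the scope string, resource names, and principal object ID are assumptions. Repeat it for each role/scope pair from Table 1 by changing the role name and scope.

```python
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<your-subscription-id>"        # assumption
principal_id = "<service-principal-object-id>"    # assumption: object id of the SP (or user)
scope = (                                         # assumption: the "cdm" container's scope
    f"/subscriptions/{subscription_id}/resourceGroups/rg-aucdm"
    "/providers/Microsoft.Storage/storageAccounts/aucdmstorage"
    "/blobServices/default/containers/cdm"
)

client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Look the role definition up by its display name instead of hard-coding the role GUID
role_def = next(
    client.role_definitions.list(scope, filter="roleName eq 'Storage Blob Data Owner'")
)

client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # role assignment names are GUIDs
    RoleAssignmentCreateParameters(
        role_definition_id=role_def.id,
        principal_id=principal_id,
    ),
)
```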

Interacting with Common Data Model

Great! We now have our CDM ready to go. But, at this point, you might be wondering “Cool, but how can I interact with the Data Model? How can I record data? How can I update data?”

Well, as we learned from the value proposition throughout these articles, the great thing about CDM is that it acts as a consolidated database for different applications. Thus far, though, the only thing we have is a bunch of files and directories (now sitting on storage infrastructure) that represent a set of entities and relationships, but with no interaction at all.

That’s actually the beauty of the whole thing. Once you have a database schema structured 100% on top of metadata (which can be deployed by automated solutions, for instance), you can interact with that “database” in so many different ways, manual file uploads included.

Luckily for us, Microsoft has been working on different automation and integration paths so that both individuals and applications can interact with the data. This is what we are going to explore from now on.

Power BI Data Producers

In order to manage our Common Data Model through a UX rather than via code, Power BI Dataflows come to the immediate rescue. Among dozens of very useful features, they allow us to reference our CDM manifest file from ADLS in a dataflow and, from there, bring it to life as something visually manageable.

The approach Power BI takes to manage data in CDM is to leverage a concept called a “data producer”. It is nothing but a service or application, such as Dynamics 365 or Power BI dataflows, that creates data in Common Data Model folders in Data Lake Storage Gen2. The data producer is responsible for creating the folder, the model.json file, and the associated data files. Because the data producer adds relevant metadata, each consumer can more easily leverage the data that’s produced (please see Figure 7).

Figure 7. Files structure generated by Power BI’s data producer in CDM
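To make that “relevant metadata” tangible: the model.json a producer writes is plain JSON describing the entities and their attributes, so any consumer can inspect it. Here is a minimal, hedged Python sketch that lists what a producer declared; the file path is an assumption.

```python
import json
from pathlib import Path

# Assumption: a data producer has already written a model.json into the CDM folder,
# with the usual top-level "name" and "entities" collection.
model = json.loads(Path("clickstream/model.json").read_text())

print(f"model: {model['name']}")
for entity in model.get("entities", []):
    attributes = [attr["name"] for attr in entity.get("attributes", [])]
    print(f"  entity: {entity['name']} -> attributes: {attributes}")
```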

Let’s get into the configuration of connecting the Data Lake Storage where our CDM sits to Power BI (via a dataflow) so we can interact with it from there.

Connecting CDM to Power BI dataflow

First, we have to create a new workspace in Power BI. Workspaces are logical containers in the service that allow teams to collaborate on dataflows, apps, reports, and more. As dataflows exist only inside workspaces, we have to create one first.

Microsoft does a very good job documenting the process of creating a new workspace, so I won’t duplicate it here. You can find those instructions by following this link. Once you do that, you should see something pretty similar to what is shown in Figure 8.

Figure 8. Workspace named “demo-aucdm” just created
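If you’d rather script the workspace creation (useful for repeatable environments), the Power BI REST API exposes a “Create Group” endpoint. A minimal sketch follows, assuming you already hold an Azure AD access token acquired for the Power BI API scope.

```python
import requests

access_token = "<aad-access-token>"  # assumption: token acquired for the Power BI API scope

response = requests.post(
    "https://api.powerbi.com/v1.0/myorg/groups",
    params={"workspaceV2": "True"},          # create a new (V2) workspace
    headers={"Authorization": f"Bearer {access_token}"},
    json={"name": "demo-aucdm"},
)
response.raise_for_status()
print(response.json()["id"])  # id of the newly created workspace
```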

Now, we’re ready to start configuring the workspace. The very first thing to do is to enable the premium capacity feature for the workspace itself. This will enable us to manipulate data in our CDM from within the workspace.

To get there, from within the newly created Power BI workspace, click “Settings” in the top menu of the screen. A modal will appear on the right side of the screen. There, click the “Premium” tab and turn on premium capacity. Choose whether you want to go for a small or large dataset and then click “Save”. Your screen should look like the one presented in Figure 9.

Figure 9. Enabling premium capacity for the workspace

Next up, we need to connect the ADLS instance where our CDM files are sitting to the workspace. This connection will allow the dataflow to read those files later on.

To do that, still in the “Settings” modal we just saw, click “Azure connections (preview)”. Once there, please follow the instructions listed in this tutorial. Once you’re done, you should see a screen pretty similar to the one presented in Figure 10.

Figure 10. Azure Data Lake Storage connected to “demo-aucdm”

Great! At this point, we’ve got everything we need to set up a new dataflow that will bring our CDM into the scene. We’ll do that next!

Creating Power BI Dataflow for CDM

Power BI dataflows support a variety of methods for working with data, as you can see in Figure 11.

Figure 11. Different possibilities for working with data in a given Dataflow

From bringing in data structures from existing data sources (SQL Database, MySQL, Oracle, and the list goes on) to attaching Common Data Models, dataflows provide a very flexible way of working with data from within Power BI.

As you might have noticed, there are two different options for working with Common Data Models: manually mapping entities from a given manifest file (first option in Figure 11), or attaching an entire CDM structure directly from ADLS (last option in Figure 11).

We could go either way. Because I want the ability to manually select which entities in my Common Data Model I want to manipulate (rather than loading everything), I’ll go for the first option.

Let’s create a new dataflow under the workspace we previously set up (in my case, named “demo-aucdm”). Once in the workspace’s panel, click “New” and then “Dataflow”, as showcased in Figure 12.

Figure 12. Picking the “Dataflow” option

As a result, you will see a screen with the options presented in Figure 11. As mentioned earlier, we’ll go for the first option, i.e., “Define new entities”. Go ahead and click “Add new entities”, as highlighted in Figure 13.

Figure 13. Add new entities

Next, among the source options presented, select “Azure Data Lake Storage Gen2” (since our CDM files sit there), as featured in Figure 14.

Figure 14. Select Azure Data Lake Storage Gen2

Now, we have to configure the connection to ADLS itself. Please see the list below to learn how to get your connection set up.

  • URL: Here you should provide your ADLS (not Blob Storage) URL. It will look like this: https://{your-url}.dfs.core.windows.net/. At the end, you must append the path to the directory that holds the CDM’s manifest file. My final URL looks like this: “https://{your-url}.dfs.core.windows.net/cdm/clickstream/“.
  • Data view: Select “CDM folder view”.
  • On-premises data gateway: Leave “(none)” selected.
  • Authentication kind: Select “Organizational account”.
  • Then, click “Sign in”, provide your corporate credentials, wait for the authentication to complete, and click “Next”. At the end, your configuration screen should look like Figure 15.

Figure 15. Final step of configuration

Next, Power BI shows an explorer-like UI. As you can see, the directory which holds the manifest file is interpreted as a database (in my case, “clickstream”) and, once expanded, lists all the entities under that database, the columns of each entity with their types, and so on, as shown in Figure 16.

Be sure to select the entities you want to interact with. In the Americas University scenario, I have “Comments” only.

Figure 16. CDM being read by Power BI’s Dataflow

Almost there! Before we start interacting with data, we must wrap up this process. Please click “Transform Data” and then “Save & Close” at the bottom of the screen. Power BI will also ask you for a title for your dataflow through a modal window. Provide one and click “Save”.

Done. Here is our CDM brought to life in terms of “data management” and ready for us to interact with. Next, let’s perform some operations on top of our CDM structure and see how they reflect back into ADLS.

Exporting data from MySQL and Cosmos DB into an intermediate database

If you remember from the other two articles in the series, we have two data sources set up that collect comments: one from Twitter (a Cosmos DB database in Azure) and one from Americas University’s blog (Azure Database for MySQL). So, as we are looking for data centralization in CDM, we’ve got to find a way to bring that data into the “Comments” entity.

Because we’re focusing on Power BI dataflows for this interaction, and the service doesn’t currently offer native connectors for Cosmos DB and Azure Database for MySQL, we’ve got to rely on an intermediate step before importing the data.

For this fictitious scenario, I’m going to create an intermediate database on Azure SQL Database, as there is a built-in connector for SQL databases in dataflows.

It’s important to mention that this process serves the purposes of this demo only. In production, we could have an automated pipeline that would drive the data copy and perform the transformations beforehand.

I won’t go through the process of copying data from a given source to a destination environment, as this is already widely covered in Azure Data Factory (ADF)‘s documentation. A very informative tutorial teaching how to get there can be found here.
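ADF is what I actually used here, but if you just want a quick, throwaway copy of the blog comments for a test, a small Python sketch could do the same job. The server names, credentials, and table/column names below are purely illustrative assumptions, not the actual schema from the series.

```python
import pymysql
import pyodbc

# Assumptions: server names, credentials, and the table/column layout
# ("comments" in MySQL, "comments_exported" in Azure SQL) are illustrative only.
mysql_conn = pymysql.connect(
    host="<blog-mysql>.mysql.database.azure.com",
    user="<user>", password="<password>", database="aublog",
)
sql_conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<sql-server>.database.windows.net;DATABASE=aucdmcomments;"
    "UID=<user>;PWD=<password>"
)

# Read the comments from MySQL and bulk-insert them into the intermediate Azure SQL table
src = mysql_conn.cursor()
src.execute("SELECT id, author, body, created_at FROM comments")
rows = src.fetchall()

dst = sql_conn.cursor()
dst.executemany(
    "INSERT INTO comments_exported (id, author, body, created_at) VALUES (?, ?, ?, ?)",
    rows,
)
sql_conn.commit()
print(f"copied {len(rows)} comments")
```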

Figure 17 presents the data now sitting in my “aucdmcomments” database. This data was moved from both Azure MySQL and Cosmos DB into Azure SQL via an ADF “Copy” operation. Figure 18 shows the success of that operation.

Figure 17. Data copied into Azure SQL via Data Factory
Figure 18. Data successfully transferred

Inserting data into “Comments” entity via Dataflow

Now we’re ready to finally bring our data into the CDM. Let’s get straight to it!

Head back to the workspace’s main blade. There, you should see the dataflow you just created (in my case, I named it “aucdm-dataflow”), as illustrated in Figure 19.

Figure 19. Listing dataflow under workspace “demo-aucdm”

By clicking the dataflow’s name, you’ll get to the list of entities available under that dataflow, as presented in Figure 20.

Figure 20. Listing entities available in CDM

To bring data from external data sources into the “Comments” entity in CDM, go ahead and click the first of the four icons next to the entity’s name (right side), the one with a pen in front of a table. It will take you to a UI through which you’ll be able to manipulate data, as showcased in Figure 21.

Figure 21. Screen for data management provided by dataflow

Let’s start by bringing in the data from MySQL. To do that, from the entity’s management panel view, click “Get data” in the upper-left menu, as highlighted in Figure 22.

Figure 22. Starting the process of getting data from external sources

This action will bring up a big floating modal showing dozens of connectors to various data sources. Click the “Azure” tab and then pick the option called “Azure SQL Database”, as highlighted in Figure 23.

Figure 23. Selecting Azure SQL Database as data source

Next, we’ve got to fill out the form with the connection details for the Azure SQL database we’re targeting. At the end of that configuration, my screen looked like the one presented in Figure 24.

Figure 24. Connecting to intermediate Azure SQL database

Next, the wizard gives me the opportunity to select the tables I want to bring data from. In our case, there is only one table, “comments_exported”, so I’ll go for it and then click “Create” at the bottom of the floating modal (please refer to Figure 25).

Figure 25. Selecting the data to be moved into CDM’s entity

Done. The data is now imported into the CDM.

This post wraps up this short series about Common Data Models and possible ways to implement them. There is a lot more that could be done with this, such as transforming data via Azure Databricks or generating an actual database from the CDM’s schema.

Also, CDM can be leveraged as a reference by Microsoft Dataverse in order to bring existing entities and data into the context of Power Apps and Dynamics 365.


Hope it helps. Enjoy!

