Recently, I was invited by a customer (Americas University) to join a discussion about how to build a content tracker solution to address a specific need they have. A couple of weeks later, we came up with the solution I'm going to walk you through in this post.

Scenario

Over the past few years, Americas University has been acquiring several other educational institutions, absorbing their students, physical campuses, and the digital content they produce. To better manage the creation and updating of the digital pedagogical content that students consume through the online learning platform (LMS), a "Content Management Department" (CMD) was created.

CMD's main duty is to make sure students have the content they need available, up to date, online, and on demand, while also guaranteeing that the right resources are applied to real needs: duplication must be avoided, old content should be updated properly, and so on. To get there, the department needs an accurate view of the existing content, which it doesn't have today. Americas University currently relies on a single spreadsheet to track content, and it has proven ineffective, since plenty of content lives in static files (PDF, PPTX, HTML, videos, and such) spread across the institutions and not properly tracked.

The question raised by the customer was: "How do we keep tracking that content, considering the diversity of data sources and the growing number of institutions coming in?" That is where the proposal below came from.

The proposed solution

The way the scenario was presented, it looked like a typical fit for Azure Cognitive Search (ACS) in combination with other Azure services: essentially, CMD should be able to search across different data sources (whether a regular database, static files, videos, and such), as described below:

  • Azure SQL Database: One of the data sources CMD wants to use is the existing tracking spreadsheet. Because it is essentially a relational table whose structure doesn't change frequently, we're going to use SQL Database to hold that data.
  • Azure Blob Storage: In addition, CMD wants to reach the static files already produced. That's where Blob Storage comes into play: it will host the static files that contain useful content for online training, serving as critical mass for the search as well.
  • Video Indexer API (VI): Americas University is bringing its entire set of videos from third-party platforms into Azure Video Indexer through Media Services. Because of that, we automatically gain the ability to call VI's APIs to search content within the insights produced by the indexer.
  • Azure App Service: We're going to use Azure App Service to host the web application that handles the communication between Americas University's end users and the Azure services.

The proposed architecture for this solution can be seen in Figure 1 below.

Figure 1. Architecture defined for the proposed solution

Delivering the solution

Everything starts with the data source configuration. In this case, the very first step was adjusting the dataset sitting in the spreadsheet mentioned earlier so it would fit into an Azure SQL Database. Then we moved the static files into Blob Storage. Finally, we configured the Video Indexer API and got the videos properly processed within the account.

As a logical next step, we configure Azure Cognitive Search to pull data from those sources through the service's indexers. We'll get there soon.

Finally, after developing the web application that serves as the public interface between the Azure services and Americas University's end users, we publish it to Azure App Service to make it publicly available.

Let's dive in.

Importing data into Azure SQL Database

As mentioned earlier, CMD provided a huge spreadsheet (around 105K rows) in which some manual content tracking is already happening. They currently use it as the primary data source to map the content files against important metadata, such as pedagogical areas, tags, subjects, and categories.

Unfortunately, CMD doesn't allow me to show the data structure here, but it has a large number of columns tying important pedagogical metrics to the actual content files. After some adjustments at the data type level, it was fully imported into Azure SQL Database for later use by Azure Cognitive Search.

The process I used to get data from the spreadsheet into Azure SQL Database is well covered in Microsoft's documentation, so I'm not duplicating it here. You can follow the step-by-step process described in this post to get there. Also, if you don't know how to create and configure a SQL Database in Azure, this document will get you there quickly.
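If you prefer to script that import instead of using a wizard, a minimal sketch with SqlBulkCopy could look like the one below. It assumes the spreadsheet was first exported to CSV; the file name, table name, and connection string are hypothetical (not CMD's actual structure), and the CSV parsing is deliberately naive.

// Minimal sketch: bulk-loading a CSV export of the tracking spreadsheet into Azure SQL Database.
// File, table, and connection string names are hypothetical; the CSV parsing is naive (no quoted
// fields) and assumes the destination columns accept the values as provided.
using System;
using System.Data;
using System.IO;
using Microsoft.Data.SqlClient;

class SpreadsheetImporter
{
    static void Main()
    {
        var csvPath = "content-tracking.csv";
        var connectionString = Environment.GetEnvironmentVariable("AZURE_SQL_CONNECTION_STRING");

        var lines = File.ReadAllLines(csvPath);

        // Header row becomes the column set; remaining rows become data rows.
        var table = new DataTable();
        foreach (var column in lines[0].Split(','))
            table.Columns.Add(column.Trim());

        foreach (var line in lines[1..])
            table.Rows.Add(line.Split(','));

        // Push everything into a pre-created table in the Azure SQL Database.
        using var bulkCopy = new SqlBulkCopy(connectionString)
        {
            DestinationTableName = "dbo.ContentTracking"
        };
        bulkCopy.WriteToServer(table);
    }
}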

Figure 2 shows the data sitting in my Azure SQL Database instance.

Figure 2. Spreadsheet data sitting in an Azure SQL Database

Moving static files into Azure Storage

Nowadays, there are several ways to move files from a given source into Azure Blob Storage. The right approach depends, basically, on three variables tied to the scenario: data size, available network bandwidth, and frequency of data movement.

Let's say you need to move a large amount of data (terabytes or even petabytes) into Azure Storage but don't have much bandwidth available. In that case, the approach to pick would be an offline transfer through Azure Data Box, for instance.

On the other hand, if you need to move a small set of data and the available bandwidth is "just OK", you might want to move the data with either AzCopy or Azure Storage Explorer.

For complete guidance on best practices for moving data into Azure Storage, I strongly recommend reading this post. It walks through different scenarios and recommends the ideal approach for each one.

For the solution proposed here, because at this point it is a Proof of Concept (POC), we moved only a small part of the static content Americas University pointed us to into its storage account. Good network bandwidth was available and it was a matter of moving just a couple of MB of data, so we picked Azure Storage Explorer as the primary tool for that purpose.

The process to move data from on-premises to Blob Storage via Azure Storage Explorer is well described in this post, so I'm not repeating it here either, to avoid duplicating content.
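Storage Explorer was enough for this POC, but if you ever want to script the upload (to automate it later, for instance), a minimal sketch with the Azure.Storage.Blobs SDK could look like the one below; the connection string, container name, and local folder are placeholders.

// Minimal sketch: uploading a local folder of static files (PDF, PPTX, HTML, ...) to Blob Storage.
// Connection string, container name, and folder path are placeholders.
using System;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class StaticContentUploader
{
    static async Task Main()
    {
        var connectionString = Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING");
        var container = new BlobContainerClient(connectionString, "static-content");
        await container.CreateIfNotExistsAsync();

        foreach (var filePath in Directory.EnumerateFiles(@"C:\cmd-static-files", "*", SearchOption.AllDirectories))
        {
            var blobName = Path.GetFileName(filePath);
            using var stream = File.OpenRead(filePath);
            await container.UploadBlobAsync(blobName, stream); // throws if the blob already exists
            Console.WriteLine($"Uploaded {blobName}");
        }
    }
}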

Figure 3 below presents a small portion of the static files sitting in Azure Storage.

Figure 3. Azure Storage holding some static files to serve the search

Configuring Video Indexer for videos processing

Video Indexer is an Azure service that gives companies an automated way to extract insights from videos by leveraging specialized artificial intelligence models. One of the great benefits of the tool is its ability to connect to Media Services, an Azure service designed to process and stream videos on demand.

With these two working together, companies can create a fully automated video processing flow covering routines such as receiving videos, processing them, and automatically extracting insights from them. Then you can do whatever you want with those insights; in our case, they will be used to provide feedback on the content being tracked.
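Just to illustrate the "receive and process" step of such a flow, a video that is reachable through a URL can be submitted to Video Indexer with a single HTTP call. The sketch below is a simplification; the location, account id, access token, and video URL are placeholders.

// Minimal sketch: submitting a video to Video Indexer for encoding and insight extraction.
// Location, account id, access token, and video URL are placeholders.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class VideoIndexerUploadSketch
{
    static async Task Main()
    {
        var location = "trial";                 // or the Azure region of a paid account
        var accountId = "<account-id>";
        var accessToken = "<access-token>";     // obtained from the Video Indexer authorization API
        var videoUrl = Uri.EscapeDataString("https://example.com/sample-lecture.mp4");

        var uploadUri = $"https://api.videoindexer.ai/{location}/Accounts/{accountId}/Videos" +
                        $"?accessToken={accessToken}&name=sample-lecture&videoUrl={videoUrl}&privacy=Private";

        // Video Indexer starts encoding/indexing asynchronously and returns the new video's metadata.
        using var client = new HttpClient();
        var response = await client.PostAsync(uploadUri, content: null);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}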

A couple of weeks ago, I wrote a post (here on this website) that discusses how to create an entire automated pipeline for video processing. It describes exactly what we did in that regard for Americas University, so please go read it.

Figure 4 presents a small set of videos encoded and processed by the automated pipeline mentioned earlier.

Figure 4. Videos processed by Video Indexer sitting in Azure

Azure Cognitive Search

Documentation describes Azure Cognitive Search as “the only cloud search service with built-in AI capabilities that enrich all types of information to easily identify and explore relevant content at scale. Formerly known as Azure Search, it uses the same integrated Microsoft natural language stack that Bing and Office have used for more than a decade, and AI services across vision, language, and speech”.

Basically, by leveraging the service you can connect this enterprise-grade search engine to different data sources and incorporate the search results into your applications, either by using the language-specific SDKs (currently available for C#, Python, Java, and Node.js) or by consuming the service's REST APIs directly.

It is important to mention at this point that Azure Cognitive Search is a search engine based on Apache Lucene, and that it brings some structural concepts you need to understand first if you want to perform searches properly. I won't go deep into these concepts, since that is not the core goal of this post (here I'm focusing on the solution as a whole), but I strongly encourage you to explore them through the service's documentation, available here.

When integrating the service, your code or an external tool invokes the data ingestion (indexing) module to create and load an index. Optionally, you can add cognitive skills to apply AI processing during indexing; doing so adds new information and structures useful for search and other scenarios. Please see the illustration presented in Figure 5.

Figure 5. High-level usage flow for Azure Cognitive Search

Some additional key concepts you need to understand to follow along with Azure Cognitive Search:

  • Indexer: An indexer is a crawler that extracts searchable data and metadata from an external Azure data source and populates an index based on field-to-field mappings between the index and your data source. This approach is sometimes referred to as a ‘pull model’ because the service pulls data in without you having to write any code that adds data to an index. You can use an indexer as the sole means for data ingestion, or use a combination of techniques that include the use of an indexer for loading just some of the fields in your index.
  • Data Source: An indexer obtains its data source connection from a data source object. The data source definition provides a connection string and, possibly, credentials.
  • Index: An index is the primary means of organizing and searching documents in Azure Cognitive Search, similar to how a table organizes records in a database. Each index has a collection of documents that all conform to the index schema (field names, data types, and attributes), but indexes also specify additional constructs (suggesters, scoring profiles, and CORS configuration) that define other search behaviors.

That's all we need for now. Let's get into the search configuration before writing some code. For didactic purposes, I'm doing the configuration through the Azure portal, but it is worth remembering that you could also do it through the REST APIs, PowerShell, or the Azure CLI.
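For reference, that same configuration can also be scripted. A minimal sketch with the Microsoft.Azure.Search .NET SDK might look like the one below; the service, key, data source, indexer, and index names are assumptions, and the target index is assumed to already exist.

// Minimal sketch: registering a SQL data source and an indexer against an ACS service from code.
// Service name, keys, and object names are assumptions; the target index is assumed to exist.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

class AcsConfigSketch
{
    static async Task Main()
    {
        var serviceClient = new SearchServiceClient("my-acs-service",
            new SearchCredentials(Environment.GetEnvironmentVariable("ACS_ADMIN_KEY")));

        // 1. Data source: points the service at the SQL table holding the spreadsheet data
        //    (arguments: name, connection string, table or view name).
        var sqlDataSource = DataSource.AzureSql(
            "azuresql-datasource",
            Environment.GetEnvironmentVariable("AZURE_SQL_CONNECTION_STRING"),
            "ContentTracking");
        await serviceClient.DataSources.CreateOrUpdateAsync(sqlDataSource);

        // 2. Indexer: crawls the data source and loads the target index
        //    (arguments: name, data source name, target index name).
        var sqlIndexer = new Indexer("azuresql-indexer", "azuresql-datasource", "azuresql-index");
        await serviceClient.Indexers.CreateOrUpdateAsync(sqlIndexer);

        // 3. Run it once on demand; a schedule could be attached instead.
        await serviceClient.Indexers.RunAsync("azuresql-indexer");
    }
}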

Creating a new Azure Cognitive Search service instance

First things first, right? The very first thing to do before we can search is to create a new instance of the service in Azure. This creation process is very simple and is well documented here, which is why I'm not duplicating it in this post.

After going through the process described in the paragraph above, I was able to see my new ACS instance up and running, as displayed in Figure 6.

Figure 6. A new ACS instance up and running

Ingesting data into ACS from Azure SQL Database

Now we need to start pulling data from the data sources we previously configured, starting with the SQL Database.

In the Azure portal, within the ACS overview blade, click "Import data" at the top middle of the screen. This option is highlighted in Figure 6 above. It takes you to a screen that asks you to select the data source through a dropdown; select "Azure SQL Database". You should then see the screen presented in Figure 7. Note that I have already provided the connection string and credentials for my database, successfully tested the connection, and selected the table that holds the data I'm bringing in.

Figure 7. Connecting our existing SQL Database to ACS

After clicking "Next", I deliberately skipped the "Cognitive Skills" tab, since I'm going to use the data alone as the critical mass for my searches. I did so by clicking "Next" once again, which took me to the next tab, "Customize target index".

Here (please refer to Figure 8) is where we configure the index being automatically created under the hood. Basically, I'm telling the search engine which fields of which types it should consider for the search, and which of them are Retrievable, Filterable, Sortable, Facetable, and/or Searchable. Because every single column is critical to the customer in terms of search, I have to guarantee Americas University's applications the ability to retrieve, filter, and sort on every column. Also, don't forget to pick a column to act as the primary key of the index.

Figure 8. Adjusting data types to serve the search
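Those portal choices map directly to field attributes in the .NET SDK, which is handy if you ever recreate the index from code. A hypothetical model is sketched below; the real CMD columns are confidential, so these names are purely illustrative.

// Hypothetical index model: the actual CMD columns are confidential, so names are illustrative only.
// FieldBuilder.BuildForType<ContentItem>() can turn this class into the index schema.
using System.ComponentModel.DataAnnotations;
using Microsoft.Azure.Search;

public class ContentItem
{
    [Key]
    public string ContentId { get; set; }            // the column chosen as the index's primary key

    [IsSearchable, IsFilterable, IsSortable]
    public string Title { get; set; }

    [IsSearchable, IsFilterable, IsFacetable]
    public string PedagogicalArea { get; set; }

    [IsFilterable, IsSortable]
    public string Category { get; set; }
}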

Now, if everything went well, after a couple of minutes (the wait really depends on the size of the dataset you're bringing in) you should see a new index created under the "Indexes" tab of the ACS overview blade. Figure 9 shows how that screen should look.

Figure 9. Report of data ingestion into the new index (azuresql-index)

Now we're ready to perform some searches on top of this data. I'll do a really basic search here, since my goal is only to make sure everything is working properly.

To do this, I'll head back to the overview blade and click "Search explorer" at the top middle of the screen. It shows me the screen presented in Figure 10. Note that I have already performed a simple search, for the term "to be verb". The results, as you can see, are appended to the big text area under the search form.

Figure 10. Performing an initial search

Ingesting data into ACS from Blob Storage

The next step is ingesting data from the second data source we put together: the Blob Storage account where a small portion of CMD's files is sitting. The same way we did before, we first set up the data source itself, then create a new indexer that consumes data from it, and finally, after the indexer is created, we test the ingestion result with a basic search.

The steps to get this configuration done are exactly the same ones we went through in the section above. Figures 11 and 12 show the final configuration for both the data source and the index, and a scripted alternative is sketched just after them.

Figure 11. Blob storage’s data source configuration
Figure 12. Index properly configured to read static file data
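If you'd rather script this second data source and indexer instead of going through the "Import data" wizard again, the same SDK has a blob-specific factory. A minimal sketch is below; the service, container, and index names are assumptions, and the blob index is assumed to already exist.

// Minimal sketch: registering the blob container as a second ACS data source plus its indexer.
// Service, container, and index names are assumptions; the blob index is assumed to exist.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

class BlobIngestionSketch
{
    static async Task Main()
    {
        var serviceClient = new SearchServiceClient("my-acs-service",
            new SearchCredentials(Environment.GetEnvironmentVariable("ACS_ADMIN_KEY")));

        // Blob-backed data source pointing at the container holding the static files
        // (arguments: name, storage connection string, container name).
        var blobDataSource = DataSource.AzureBlobStorage(
            "blob-datasource",
            Environment.GetEnvironmentVariable("AZURE_STORAGE_CONNECTION_STRING"),
            "static-content");
        await serviceClient.DataSources.CreateOrUpdateAsync(blobDataSource);

        // Indexer that cracks the documents (PDF, PPTX, HTML, ...) and loads the blob index.
        await serviceClient.Indexers.CreateOrUpdateAsync(
            new Indexer("blob-indexer", "blob-datasource", "azureblob-index"));
        await serviceClient.Indexers.RunAsync("blob-indexer");
    }
}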

Cool. Now that we have everything settled in terms of configuration, it is time to move forward and try a search on top of the documents we just cracked and ingested. Figure 13 shows the results of a simple query for "artificial intelligence", attesting that everything is working properly in that regard.

Figure 13. Results returned for the query search “artificial intelligence”

Interfacing via web application with Azure Cognitive Search

If you're a developer, we've finally landed at the cool part. From now on, we're going to write some code to consume search results from our ACS instance. This is what Americas University's end users will see and how they will interact with our backend.

The application is fully operational and available on GitHub through this repository, so I won't go through every single detail of the app; you can do that by yourself later on. Rather, I'll show only the key parts of the integration with the ACS APIs. Feel free to collaborate, make it better, and use it in your presentations. It is important to mention, though, that this code wasn't created to serve production environments. Its only goal here is to showcase the use of a programming language (in this case, C#) to pull data from ACS indexes.

The application itself is pretty simple. It is built with ASP.NET Core MVC (3.1) on the backend and Razor views on the frontend, and it uses both Bootstrap and jQuery to support the resources we're putting together.

Also, we are taking advantage of the Azure Cognitive Search SDK for C#. We could call the ACS REST APIs directly to get the work done, but the SDK encapsulates the entire communication for us, and because it is pretty handy, I'm picking it. Please refer to Figure 14 to see the application's frontend.

Figure 14. Application that communicates with the ACS service

As you can see in Figure 14, we have two different search options: "Search 1: Metadata-based" and "Search 2: Video indexer-based". While the first option performs the search against the ACS APIs through the SDK, the second one leverages the built-in search engine available within the Video Indexer service. By doing so, we bring two different search experiences together in one single place. The two approaches can be seen in Figure 15.

Figure 15. General navigation flow for content searcher

Search 1: Metadata-based

By "metadata-based" I mean searching the content indexed by Azure Cognitive Search through the ingestion process we saw earlier in this post. It is important to make that distinction because we have a second search model, based on Video Indexer directly, to explore in the upcoming section. The web app's search flow for this model can be seen in the green part of Figure 15 in the section above.

Code Snippet 1 below shows the piece of code that picks up the search query entered by the end user, builds the ACS API call (bringing along all the needed parameters), and pushes it to the actual API. The search process then takes place and, when it is done, returns the list of documents resulting from it.

Code Snippet 1. Asynchronously calling the ACS API through the SDK
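The actual snippet lives in the GitHub repository; as a reference, a minimal sketch of what such an action can look like with the Microsoft.Azure.Search SDK is below. The service and index names, the selected fields, and the simplified SearchData model are assumptions, and ContentItem is the hypothetical model sketched earlier.

// Minimal sketch of the metadata-based search action; names, fields, and the simplified
// SearchData model are assumptions, and error handling is omitted.
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

public class SearchData
{
    public string SearchText { get; set; }
    public DocumentSearchResult<ContentItem> ResultList { get; set; }
}

public class HomeController : Controller
{
    private static SearchIndexClient _indexClient;

    private void InitSearch()
    {
        // In the real app, the service name, query key, and index name come from appsettings.json.
        _indexClient = new SearchIndexClient(
            "my-acs-service", "azuresql-index", new SearchCredentials("<query-api-key>"));
    }

    [HttpPost]
    public async Task<IActionResult> Index(SearchData model)
    {
        InitSearch();

        // Search parameters shape the query (illustrative values only).
        var parameters = new SearchParameters
        {
            SearchMode = SearchMode.All,
            Select = new[] { "Title", "PedagogicalArea", "Category" },
            Top = 50
        };

        // The SDK pushes the query to the ACS REST API and returns the matching documents.
        model.ResultList = await _indexClient.Documents.SearchAsync<ContentItem>(model.SearchText, parameters);

        return View("Index", model);
    }
}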

Important considerations about the piece of code above:

  • First, I call the "InitSearch" method, which initializes internal variables, pulls data from the appsettings.json file, and does some additional work on our behalf. You can see exactly what this method does by examining the GitHub repository.
  • Then, we define some search parameters, which act as filters for the search. The example highlights three fictional parameters, but you can use as many as you want, or none.
  • Then, we call the ACS API to perform the actual search and gather the results into a DocumentSearchResult list, previously defined in our Models directory (look for the SearchData class over there).
  • Finally, if everything went well with our HTTP request, we return the populated model to the calling view so it can be rendered for the end user.

The final result can be seen in Figure 16.

Figure 16. Web App searching for a specific term via UI

Search 2: Video indexer-based

As mentioned before in this post, when it comes to video search, our solution is not going to rely directly on ACS as the search engine. Rather, we're going to communicate with Azure Video Indexer's search API to navigate the insights extracted from the pool of videos already sitting in the service.

It is important to mention that we took this approach to take advantage of the process of moving videos into Azure that Americas University had already started in a different project. Azure Cognitive Search can also look into standalone videos stored in a given Blob Storage account, the same way it deals with other static files. So, if you don't have a Video Indexer strategy in place, you can go with ACS as the search solution for this purpose as well.

Disclaimers made, let's get into the solution itself. A good start would be looking back at the navigation flow (this time focusing on the red part) presented in Figure 15.

Code Snippet 2 below shows the action method through which the call to the Video Indexer API is made. The same way we did with the ACS APIs, the method receives the search query, builds the request with the needed parameters, and then pushes it against the Video Indexer service's Search API.

Code Snippet 2. Asynchronously calling the VI Search API
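Again, the full implementation is in the repository; a minimal sketch of an action along those lines is below. The configuration keys under "VideoIndexer", the view name, and the simplified GetVideoIndexerAccessToken helper are assumptions; the Search and AccessToken endpoints are the public Video Indexer REST APIs.

// Minimal sketch of the Video Indexer-based search action; configuration keys, view name,
// and the simplified token helper are assumptions, and error handling is omitted.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Configuration;

public class VideoSearchController : Controller
{
    private static readonly HttpClient _client = new HttpClient();
    private readonly IConfiguration _configuration;

    public VideoSearchController(IConfiguration configuration) => _configuration = configuration;

    [HttpPost]
    public async Task<IActionResult> Search(string query)
    {
        // 1. Get a short-lived access token from the Video Indexer authorization API.
        var accessToken = await GetVideoIndexerAccessToken();

        // 2. Read the API base URI and account details out of appsettings.json.
        var apiUri = _configuration["VideoIndexer:ApiUri"];           // e.g. https://api.videoindexer.ai
        var location = _configuration["VideoIndexer:Location"];
        var accountId = _configuration["VideoIndexer:AccountId"];

        // 3. Build the request: the mandatory header plus the query string with the search term.
        var requestUri = $"{apiUri}/{location}/Accounts/{accountId}/Videos/Search" +
                         $"?query={Uri.EscapeDataString(query)}&accessToken={accessToken}";
        using var request = new HttpRequestMessage(HttpMethod.Get, requestUri);
        request.Headers.Add("Ocp-Apim-Subscription-Key", _configuration["VideoIndexer:SubscriptionKey"]);

        // 4. Call the service and hand the raw JSON to a typed view (the real app deserializes it first).
        var response = await _client.SendAsync(request);
        var json = await response.Content.ReadAsStringAsync();
        return View("VideoResults", (object)json);
    }

    private async Task<string> GetVideoIndexerAccessToken()
    {
        // Simplified helper: asks the authorization endpoint for a read-only token.
        var location = _configuration["VideoIndexer:Location"];
        var accountId = _configuration["VideoIndexer:AccountId"];

        using var request = new HttpRequestMessage(HttpMethod.Get,
            $"https://api.videoindexer.ai/Auth/{location}/Accounts/{accountId}/AccessToken?allowEdit=false");
        request.Headers.Add("Ocp-Apim-Subscription-Key", _configuration["VideoIndexer:SubscriptionKey"]);

        var response = await _client.SendAsync(request);
        return (await response.Content.ReadAsStringAsync()).Trim('"'); // token comes back as a JSON string
    }
}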

A brief explanation of what is happening here:

  • First, I call a helper method (GetVideoIndexerAccessToken) which goes all the way to the Video Indexer API and gets a valid access token. You can see that method's definition in the project's repository.
  • Next, I have to gather two mandatory pieces of information: the API's URI and the value for the "Ocp-Apim-Subscription-Key" property. You can grab the key by navigating to the API portal's "Keys" section. As you can see, I'm reading both values from appsettings.json.
  • Then, I start building the request itself. First, I add the two required values to my call's headers; then, I build the query string with the parameters I want.
  • Finally, I perform the call to the service. When the response comes back, I return it to a typed view waiting for that information.

The final result can be seen in Figure 17.

Figure 17. Returning data from VI API

Wrapping up

The goal of this post was to showcase how we can create a rich search experience by leveraging built-in Azure services, SDKs, and simple API calls. By combining Azure Cognitive Search and the Video Indexer APIs, we were able to build something useful for Americas University's Content Management Department, adding agility at low cost.

Hopefully, this project gives you an idea of how to get started integrating Azure search capabilities with your existing applications.

