Enriching AI and NLP in Azure Cognitive Search

July 14, 2021

Large, unstructured datasets like the JFK Files, which contain over 34,000 pages of documents about the CIA investigation of the 1963 JFK assassination, include typewritten and handwritten notes, photos and diagrams, and other unstructured data that standard search solutions can’t parse.

AI enrichment in Azure Cognitive Search can extract and enhance searchable, indexable text from images, blobs, and other unstructured data sources like the JFK Files by using pre-trained machine learning skillsets from the Cognitive Services Computer Vision and Text Analytics APIs. You can also create and attach custom skills to add special processing for domain-specific data like CIA Cryptonyms. Azure Cognitive Search can then index and search the enriched content.

The Azure Cognitive Search skills in this example solution fall into the following categories:

Image-processing built-in skills include optical character recognition (OCR), print extraction, and image analysis, which covers object and face detection, tag and caption generation, and celebrity and landmark identification. These skills create text representations of image content, which are searchable using the query capabilities of Azure Cognitive Search. Document cracking is the process of extracting or creating text content from non-text sources.
Natural language processing built-in skills like entity recognition, language detection, key phrase extraction, and text recognition map unstructured text to searchable and filterable fields in an index.
Custom skills capture domain-specific data. These skills are built with the custom skills interface.
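
As a rough illustration of how the three categories above can sit side by side, the following Python sketch builds a skillset definition as plain JSON, with one skill per category. The `@odata.type` values are the documented skill type names. The skillset name, the custom skill URI, and the `ocrText` and `cryptonyms` target names are assumptions made for this example and aren't taken from the JFK Files project.

```python
import json


def build_skillset(name: str, custom_skill_uri: str) -> dict:
    """Return a skillset definition with one skill from each category."""
    return {
        "name": name,
        "description": "Illustrative skillset: image, language, and custom skills",
        "skills": [
            {
                # Image processing: OCR over images produced by document cracking.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "inputs": [
                    {"name": "image", "source": "/document/normalized_images/*"}
                ],
                "outputs": [{"name": "text", "targetName": "ocrText"}],
            },
            {
                # Natural language processing: key phrases from the cracked text.
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
            },
            {
                # Custom skill: calls a web endpoint (hypothetical URI).
                "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
                "uri": custom_skill_uri,
                "httpMethod": "POST",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                # "cryptonyms" is an assumed output name for this sketch.
                "outputs": [{"name": "cryptonyms", "targetName": "cryptonyms"}],
            },
        ],
    }


if __name__ == "__main__":
    skillset = build_skillset(
        "jfk-skillset", "https://example.invalid/api/cryptonyms"
    )
    print(json.dumps(skillset, indent=2))
```

The resulting JSON is the kind of body you would send when creating a skillset through the REST API. A fuller pipeline would typically merge the OCR text back into the document text before running the language skills.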

This example solution uses Azure Cognitive Search AI enrichment to extract meaning from the original complex, unstructured JFK Files dataset. You can work through the project, watch the process in action in an online video, or explore the JFK Files with an online demo.

Potential use cases

Increase the value and utility of unstructured text and image content in search and data science apps.
Use custom skills to integrate open-source, third-party, or first-party code into indexing pipelines.
Make scanned JPG, PNG, or bitmap documents full-text searchable.
Produce better outcomes than standard PDF text extraction for PDFs with combined image and text.
Create new information from inherently meaningful raw content or context that’s hidden in larger unstructured or semi-structured documents.

Architecture converting unstructured data to structured data

This diagram illustrates the process of passing unstructured data through the Cognitive Search skills pipeline to produce structured, indexable data.

Blob storage provides unstructured document and image data to Cognitive Search.
Cognitive Search applies pre-built cognitive skillsets to the data, including OCR, text and handwriting recognition, image analysis, entity recognition, and full-text search.
The Cognitive Search extensibility mechanism uses an Azure Function to apply the CIA Cryptonyms custom skill to the data.
The pre-built and custom skillsets deliver structured knowledge that Azure Cognitive Search can index.
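
The flow above can be sketched as two more JSON definitions: a data source that points at the blob container, and an indexer that connects the data source, the skillset, and the target index. This is a minimal sketch rather than the project's actual configuration. The property names follow the Cognitive Search REST API, while every resource name and the `cryptonyms` field are placeholders assumed for illustration.

```python
def build_data_source(name: str, connection_string: str, container: str) -> dict:
    """Describe the blob container that feeds documents to the indexer."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def build_indexer(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Wire a data source, a skillset, and a target index together."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Crack both text and metadata out of each blob.
                "dataToExtract": "contentAndMetadata",
                # Emit normalized images so image skills such as OCR can run.
                "imageAction": "generateNormalizedImages",
            }
        },
        # Map enriched output into an index field (assumed field name).
        "outputFieldMappings": [
            {
                "sourceFieldName": "/document/cryptonyms",
                "targetFieldName": "cryptonyms",
            }
        ],
    }


if __name__ == "__main__":
    # All names and the connection string below are placeholders.
    source = build_data_source("jfk-blobs", "<connection-string>", "jfkfiles")
    indexer = build_indexer("jfk-indexer", source["name"], "jfk-index", "jfk-skillset")
    print(indexer["dataSourceName"], "->", indexer["targetIndexName"])
```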

Components

Azure Cognitive Search works with other Azure components to provide this solution.

Azure Blob Storage

Azure Blob Storage is REST-based object storage for data that you can access from anywhere in the world via HTTPS. You can use Blob storage to expose data publicly to the world, or to store application data privately. Blob storage is ideal for large amounts of unstructured data like text or graphics.

Azure Cognitive Search

Cognitive Search indexes the content and powers the user experience. You use Cognitive Search capabilities to apply pre-built cognitive skills to the content, and use the extensibility mechanism to add custom skills.

The Computer Vision API uses text recognition APIs to extract and recognize text information from images. Read uses the latest recognition models, and is optimized for large, text-heavy documents and noisy images. OCR isn’t optimized for large documents, but supports more languages. The current example solution uses OCR to produce data in the hOCR format.
The Text Analytics API extracts text information from unstructured documents by using capabilities like Named Entity Recognition (NER), key phrase extraction, and full-text search.
Custom skills extend Cognitive Search to apply specific enrichment transformations to content. The current example solution creates a custom skill to apply CIA Cryptonyms, which decode uppercase code names in CIA documents. For example, the CIA assigned the cryptonym GPFLOOR to Lee Harvey Oswald, so the custom CIA Cryptonym skill links any JFK files containing that cryptonym with Oswald.
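
To make the custom skill idea concrete, here is a small Python sketch of a cryptonym lookup that follows the custom skill request and response shape: a `values` array of records, each with a `recordId` and a `data` object. It is not the sample project's implementation. The lookup table holds only the one cryptonym mentioned above, and the `text` input name, the `cryptonyms` output name, and the four-letter minimum for candidate words are assumptions made for this example.

```python
import re

# Only the cryptonym named in this article; a real list would be much longer.
CRYPTONYMS = {"GPFLOOR": "Lee Harvey Oswald"}

# Candidate code names: runs of uppercase letters (assumed minimum length 4).
UPPERCASE_WORD = re.compile(r"\b[A-Z]{4,}\b")


def run_cryptonym_skill(request: dict) -> dict:
    """Process a custom skill request and return the matching response."""
    results = []
    for record in request.get("values", []):
        text = record.get("data", {}).get("text") or ""
        # Keep each known cryptonym once, in a stable order.
        found = sorted(
            {word for word in UPPERCASE_WORD.findall(text) if word in CRYPTONYMS}
        )
        results.append(
            {
                # The recordId must be echoed back so results can be matched.
                "recordId": record["recordId"],
                "data": {
                    "cryptonyms": [
                        {"value": word, "description": CRYPTONYMS[word]}
                        for word in found
                    ]
                },
                "errors": [],
                "warnings": [],
            }
        )
    return {"values": results}


if __name__ == "__main__":
    sample = {
        "values": [
            {"recordId": "1", "data": {"text": "Cable refers to GPFLOOR."}}
        ]
    }
    print(run_cryptonym_skill(sample))
```

In a deployment, a function like this would sit behind an HTTP endpoint, with the request body parsed from JSON and the returned dictionary serialized back to JSON.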

Azure Functions

Azure Functions is a serverless compute service that lets you run small pieces of event-triggered code without having to explicitly provision or manage infrastructure. This example solution uses an Azure Function method to apply the CIA Cryptonyms list to the JFK Files as a custom skill.

Azure App Service

This example solution also builds a standalone web app in Azure App Service for testing, demonstrating, searching the index, and exploring connections in the enriched and indexed documents.
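
A web app like this one ultimately sends search requests to the index. The Python sketch below builds such a request against the Search Documents REST endpoint without sending it. The service name, index name, key, and facet field are placeholders assumed for illustration, and the API version shown is one generally available version rather than necessarily the one the sample uses.

```python
import json
from urllib.request import Request

API_VERSION = "2020-06-30"


def build_search_request(
    service: str, index: str, api_key: str, text: str, facets=()
) -> Request:
    """Build a POST request for a full-text query against a search index."""
    url = (
        f"https://{service}.search.windows.net"
        f"/indexes/{index}/docs/search?api-version={API_VERSION}"
    )
    body = {"search": text, "count": True, "top": 10}
    if facets:
        # Facets return counts per value of a facetable field.
        body["facets"] = list(facets)
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder names; "cryptonyms" is an assumed facetable field.
    request = build_search_request(
        "my-service", "jfk-index", "<query-key>", "oswald", facets=["cryptonyms"]
    )
    print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. A public-facing app should use a query key rather than an admin key.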

Considerations

The code project and demo showcase a particular Cognitive Search use case. This example solution isn’t intended to be a framework or scalable architecture for all scenarios, but to provide a general guideline and example.
OCR results vary greatly depending on scan and image quality. The Computer Vision Read API uses the latest recognition models, but has less language support than OCR.
Some scanned and native PDF formats may not parse correctly in Cognitive Search.
The JFK Files sample project and demo create a public website and publicly readable storage container for extracted images, so don’t use this solution with non-public data.

This article has been published from the source link without modifications to the text. Only the headline has been changed.

Sarthi A

