Amazon adds new “native” scanning capability to Amazon Comprehend

The platform can now scan documents in native MS Word and Adobe PDF formats.

Amazon Web Services announced this week that it has added new features to its Amazon Comprehend service. Specifically, Comprehend can now extract custom details from documents in their native format.

The new functionality can extract details such as personally identifiable information (PII). It can also do entity extraction, document classification and sentiment analysis. AWS said the new features will help users find insights within unstructured documents such as email, dense paragraphs of text, or social media feeds.

Anant Patel, the Product Lead for Comprehend, and Andrea Morton-Youmans, an AWS Product Marketing Manager, introduced the features in a blog post. “Starting today, you can use custom entity recognition in Amazon Comprehend on more document types without the need to convert files to plain text,” they wrote.

Extracting “entities” from dense text, bullet lists and more

Amazon Comprehend can now process varying document layouts such as dense text and lists or bullets in PDF and Word while extracting entities (specific words) from documents. Historically, users could only use Amazon Comprehend on plain text documents, which required them to flatten the documents into machine-readable text.

With these new features, users can now use natural language processing (NLP) to extract custom entities from PDF, Word, and plain text documents using the same API. This means that there is less document preprocessing required.
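As a rough illustration of what "the same API" could look like in practice, the Python sketch below builds a request for Comprehend's asynchronous entity detection job through boto3. It is a minimal sketch under stated assumptions, not AWS's reference implementation: the helper names build_entity_job_request and start_entity_job are hypothetical, and the S3 locations and ARNs must be supplied by the caller. The DocumentReaderConfig values reflect our understanding of how the API handles PDF and Word input and should be checked against the current AWS documentation.

```python
from typing import Any, Dict


def build_entity_job_request(
    input_s3_uri: str,
    output_s3_uri: str,
    role_arn: str,
    recognizer_arn: str,
    job_name: str = "custom-entities-native-docs",  # hypothetical default name
) -> Dict[str, Any]:
    """Build keyword arguments for Comprehend's StartEntitiesDetectionJob call."""
    return {
        "JobName": job_name,
        # ARN of a custom entity recognizer that has already been trained.
        "EntityRecognizerArn": recognizer_arn,
        "LanguageCode": "en",
        # IAM role that lets Comprehend read the input and write the output.
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            # Treat each PDF, Word, or text file as one document.
            "InputFormat": "ONE_DOC_PER_FILE",
            # Assumption: these settings tell the service how to read
            # documents that are not plain text.
            "DocumentReaderConfig": {
                "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
                "DocumentReadMode": "SERVICE_DEFAULT",
            },
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_entity_job(request: Dict[str, Any]) -> str:
    """Submit the job and return its ID. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS set up

    client = boto3.client("comprehend")
    response = client.start_entities_detection_job(**request)
    return response["JobId"]
```

A caller would pass the result of build_entity_job_request to start_entity_job. Under this sketch's assumptions, the same request shape covers a folder of PDF, Word, and plain text files, which is the reduction in preprocessing that AWS describes.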

The feature can help with document processing workflows in business verticals such as insurance, mortgage, and finance, where users can now employ machine learning to extract custom entities using a single model and API call.

“For example, you can process automotive or health insurance claims and extract entities such as claim amount, co-pay amount, or primary and dependent names,” they wrote. “You can also apply this solution to mortgages to extract an applicant name, co-signer, down payment amount, or other financial documents.”

Financial services firms can process documents such as SEC filings and extract specific entities such as proxy proposals, earnings reports, or board of director names, they added.