Background
Enterprises use intelligent document analysis (IDA) to examine documents (such as policies, contracts, and legal agreements) for specific terms and to identify documents that may pose a risk to the business. IDA can also identify a document's type (such as legal, finance, or marketing) so that the document can be categorized and routed to the appropriate department.
Paper-based documents still account for 46% of all records, which represents a substantial cost to public sector organizations. An average government agency receives and manually routes approximately 3.5 million documents annually.† Reading each letter or document before routing it takes seven to ten minutes. This manual process is time-consuming and costly.
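To put the figures above in perspective, the short calculation below converts them into annual staff time. It is only a back-of-the-envelope sketch: it assumes every one of the 3.5 million documents takes between seven and ten minutes to read and route, and the variable names are illustrative.

```python
# Rough estimate of annual manual routing effort, using the figures quoted above.
DOCS_PER_YEAR = 3_500_000            # documents received and routed per year
MINUTES_LOW, MINUTES_HIGH = 7, 10    # minutes spent reading each document

hours_low = DOCS_PER_YEAR * MINUTES_LOW / 60
hours_high = DOCS_PER_YEAR * MINUTES_HIGH / 60

print(f"{hours_low:,.0f} to {hours_high:,.0f} staff-hours per year")
# prints "408,333 to 583,333 staff-hours per year"
```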
The majority of documents managed by intelligent document processing (IDP) solutions are structured or semi-structured, leaving a significant portion of unstructured documents unmanaged. AI can make automated processing and categorizing of documents—structured, semi-structured, and unstructured—more cost-effective.
Solution
Term frequency-inverse document frequency (TF-IDF) was used to quantify the importance or relevance of terms (string representations) in the documents. A support vector classification (SVC) model was trained to categorize the documents. The publicly available dataset‡ used for training contained about 200K topic-related documents obtained from HuffPost*. The dataset text was cleaned using stop word removal, stemming, and tokenization. The trained supervised model classifies each document, based on its headline, into one of 42 predetermined categories, such as entertainment or politics.
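The approach described above can be sketched with stock scikit-learn as follows. This is a minimal illustration, not the pipeline used in the study: the six headlines and two labels are made up for the example, the linear kernel is an assumption (the source does not state the kernel), and stemming is omitted because scikit-learn has no built-in stemmer. TfidfVectorizer handles tokenization, lowercasing, and English stop word removal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data; the study used about 200K headlines in 42 categories.
headlines = [
    "Senate passes budget bill after long debate",
    "President signs new election law",
    "Governor vetoes tax bill in senate showdown",
    "Actor wins award for new film",
    "Singer announces film role and album tour",
    "Studio releases trailer for award winning film",
]
labels = ["POLITICS"] * 3 + ["ENTERTAINMENT"] * 3

# TF-IDF turns each headline into a weighted term vector;
# SVC then learns a decision boundary between the categories.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    SVC(kernel="linear"),  # kernel choice is an assumption
)
model.fit(headlines, labels)

# Classify unseen headlines into one of the known categories.
predictions = model.predict(["Senate debates election bill", "New film wins award"])
print(list(predictions))
```

Wrapping both steps in one pipeline keeps the vocabulary learned during training tied to the classifier, so new headlines are vectorized the same way at inference time.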
The data ingest and text processing were optimized using Intel® Distribution of Modin* and ran 46% faster than with stock Modin. Training and inferencing of the SVC model were optimized using Intel® Extension for Scikit-learn*. The optimizations improved training time by 96% and inferencing time by 60%. The model reviewed and sorted the documents with an accuracy of 65%. Intel Distribution of Modin and Intel Extension for Scikit-learn are part of Intel’s end-to-end AI software portfolio of tools and framework optimizations that are powered by oneAPI.
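Both optimizations are designed as drop-in replacements, so enabling them typically takes only a few lines. The sketch below shows one way to do this, assuming the `sklearnex` and `modin` packages are installed; it falls back to stock scikit-learn and pandas when they are not. The tiny numeric dataset is hypothetical and only demonstrates that the rest of the code stays unchanged.

```python
# Optional accelerators: fall back to the stock libraries if they are missing.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()  # call before importing scikit-learn estimators
    patched = True
except ImportError:
    patched = False

try:
    import modin.pandas as pd  # drop-in replacement for the pandas API
except ImportError:
    import pandas as pd

from sklearn.svm import SVC  # uses the accelerated SVC when patched

# Hypothetical, clearly separable toy data.
df = pd.DataFrame({
    "x1": [0.0, 0.1, 0.9, 1.0],
    "x2": [0.0, 0.2, 0.8, 1.0],
    "label": [0, 0, 1, 1],
})

features = df[["x1", "x2"]].to_numpy()
clf = SVC(kernel="linear").fit(features, df["label"].to_numpy())
preds = clf.predict([[0.05, 0.05], [0.95, 0.95]])
print(patched, list(preds))
```

Because the patch swaps implementations behind the usual scikit-learn imports, the model code itself does not need to change between the stock and optimized runs.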
Technology
Optimized with Intel oneAPI for Better Performance
- Data processing with TF-IDF and Intel Distribution of Modin
- SVC with Intel Extension for Scikit-learn
- Amazon EC2* M6i with 3rd generation Intel® Xeon® Scalable processor
Benefits
Data scientists can build a better IDP solution that addresses semi-structured and unstructured documents. The time saved in training and inference allows data scientists to put more AI models into production.
Government organizations can automate the processing and categorization of more incoming semi-structured and unstructured documents and realize cost savings.
Benefits include:
- Less time needed to build the machine learning pipeline, with a set of instructions that covers data ingest, model development, and deployment
- Compute savings from faster data preprocessing, model training, and inferencing time using oneAPI optimizations from Intel
- Optimized performance using your compute of choice (such as CPU, GPU, or FPGA) with oneAPI interoperability across hardware architectures
References
† IDC Survey Spotlight: What Types of Documents Are Organizations Managing with Intelligent Document Processing (IDP) Solutions, April 2021 (Available by paid subscription only.)
‡ News Category Dataset, Kaggle, Inc. Licensed under Creative Commons 1.0 Universal (CC0 1.0) Public Domain Dedication