| Data Type | Best Extractor Method | Pitfall to Avoid |
|---------------------------|----------------------------|-------------------------|
| Tables (HTML, Excel) | Data Scraping / Selectors | Dynamic row IDs |
| PDF Invoices | OCR + Regex / Anchor-based | Multi-page layouts |
| Emails (body/attachments) | IMAP / Outlook extractors | Encoding mismatches |
| Legacy App Screens | Screen Scraping (FullText) | Overlapping UI elements |
| JSON / XML APIs | Deserialize JSON / XPath | Missing namespaces |

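
The "Missing namespaces" pitfall in the last row of the table is easy to reproduce outside any RPA platform. The sketch below uses Python's standard `xml.etree.ElementTree` rather than a vendor XPath activity, and the invoice payload and its namespace URI are made up for illustration. It shows how a lookup that ignores the namespace finds nothing, while the same lookup with a prefix mapping succeeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical invoice payload; the namespace URI is invented for this example.
XML_DOC = """<inv:Invoice xmlns:inv="http://example.com/invoice">
  <inv:Total currency="USD">1250.00</inv:Total>
</inv:Invoice>"""

root = ET.fromstring(XML_DOC)

# Without a namespace mapping, the path does not match the qualified tag,
# so the lookup quietly returns None instead of raising an error.
missing = root.find("Total")

# Mapping the prefix to its URI lets the same path resolve.
NS = {"inv": "http://example.com/invoice"}
total = root.find("inv:Total", NS)

print(missing)     # None
print(total.text)  # 1250.00
```

Because the failed lookup returns `None` rather than raising, it is worth checking the result explicitly before reading `.text`.
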
Using AI models (like UiPath's CV or ABBYY), the robot "sees" the UI similarly to a human. It identifies UI elements as "buttons," "text fields," or "tables" even within images or virtualized environments (Citrix).
As of 2025, the RPA extractor is undergoing a massive shift thanks to Large Language Models (LLMs) and GPT-style architectures.
Traditional Extractor: "I will look for the word 'Total' and extract the number following it."

Generative Extractor (LLM): "Here is a messy invoice. Please return a JSON object with the total. By the way, I understand that 'Sum Due,' 'Amount Payable,' and 'Balance' all mean 'Total.'"
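
To make the contrast concrete, here is a minimal sketch of the traditional, label-anchored approach. The function name `extract_after_label`, the amount pattern, and the sample invoice strings are all assumptions made for illustration, not part of any RPA product. It shows why a rule that only knows the word "Total" misses an invoice that says "Amount Payable" until someone adds the synonym by hand.

```python
import re
from typing import Optional, Sequence

# Assumed amount format for this sketch: optional "$", digits/commas, two decimals.
AMOUNT = r"\$?\s*([\d,]+\.\d{2})"


def extract_after_label(text: str, labels: Sequence[str]) -> Optional[str]:
    """Return the first amount that follows any of the given labels."""
    for label in labels:
        # \b keeps "Total" from matching inside "Subtotal".
        pattern = r"\b" + re.escape(label) + r"\b\s*:?\s*" + AMOUNT
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return match.group(1)
    return None


invoice_a = "Subtotal: $90.00\nTotal: $100.00"
invoice_b = "Amount Payable: $100.00"

# The single-keyword rule works on the first layout and fails on the second.
print(extract_after_label(invoice_a, ["Total"]))  # 100.00
print(extract_after_label(invoice_b, ["Total"]))  # None

# The rule only recovers once every synonym is listed explicitly.
synonyms = ["Total", "Sum Due", "Amount Payable", "Balance"]
print(extract_after_label(invoice_b, synonyms))   # 100.00
```

The synonym list is exactly the maintenance burden that the generative approach claims to remove: the LLM is expected to recognize the equivalent labels without being told.
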
Platforms like UiPath Autopilot and Microsoft Copilot are integrating LLMs directly into the extraction process. This means your RPA extractor will no longer need to be "trained" on 500 sample documents. You can simply prompt it: "Extract the ship-to address and the PO number from this email chain."
The most basic and fastest method. The bot uses predefined patterns (e.g., `\d{3}-\d{2}-\d{4}` for US Social Security numbers) to find data.
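
As a rough illustration of pattern-based extraction, the sketch below applies the Social Security number pattern with Python's built-in `re` module. The helper name `find_ssn_like` and the sample text are hypothetical, and the pattern only checks the shape of the string (3-2-4 digits), not whether the number is actually valid.

```python
import re
from typing import List

# Three digits, two digits, four digits, separated by hyphens.
# Word boundaries stop the pattern from matching inside longer digit runs.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> List[str]:
    """Return every substring shaped like a US Social Security number."""
    return SSN_PATTERN.findall(text)


# Made-up sample text; the phone number has a 3-3-4 shape and is not matched.
sample = "Employee 1042, SSN 123-45-6789, phone 555-123-4567."
print(find_ssn_like(sample))  # ['123-45-6789']
```

Compiling the pattern once and reusing it keeps the per-document cost low, which is part of why this method is so fast.
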