What Parsli Extracts from an Email — Field Schema Reference
A field-by-field reference for the data Parsli surfaces from incoming emails. Below: the default JSON output, confidence-score baselines per field type, and the custom fields most teams add on top of the defaults.
What You Can Extract
Define your schema with any combination of these fields — or add your own custom fields.
sender_name / sender_email
Display name and RFC 5322 address from the From: header. Deterministic: read directly from the message header, not inferred.
received_at
ISO 8601 UTC timestamp from the Received: header chain. Falls back to Date: header if Received: is missing.
subject
Decoded subject line, MIME-encoded values normalized to UTF-8.
body_intent
LLM-inferred classification: invoice, order, lead, notification, rate_quote, support_reply, or custom enum from your schema.
body_amounts[]
Array of monetary values labeled with their adjacent text — total, tax, fuel surcharge, etc. Currency normalized to ISO 4217.
body_dates[]
Array of dates with role labels — pickup, delivery, due_date, expiration. Output as ISO 8601 even when source is informal.
signature.name / signature.title / signature.phone
Sender signature parsed into structured fields. Includes phone normalization to E.164 when country is detectable.
attachments[]
Per-attachment object with filename, MIME type, page count, and the full extracted-fields JSON from running the attachment through Parsli's parser.
thread_id / reply_chain
RFC 2822 Message-ID + an ordered array of prior messages in the thread for deduplication and context.
raw_html / raw_plain
Both body renderings of the message. The AI parses `raw_plain` by default; when it has to fall back to the HTML body, tags are stripped before extraction.
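The `received_at` fallback described above (Received: first, then Date:) can be sketched with Python's standard library. This is an illustration, not Parsli's implementation: the function name `received_at`, the choice of the topmost Received: header, and the handling of timestamps without a zone are all assumptions.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import Optional


def received_at(raw_message: str) -> Optional[str]:
    """Return an ISO 8601 UTC timestamp, preferring Received: over Date:."""
    msg = message_from_string(raw_message)
    # Each hop prepends its own Received: header, so the first one is the
    # most recent hop. Its timestamp follows the last semicolon.
    received = msg.get_all("Received") or []
    candidates = [h.rsplit(";", 1)[-1].strip() for h in received if ";" in h]
    if msg["Date"]:
        candidates.append(msg["Date"])  # fallback when Received: is missing
    for text in candidates:
        try:
            parsed = parsedate_to_datetime(text)
        except (TypeError, ValueError):
            continue  # unparseable date text: try the next candidate
        if parsed.tzinfo is None:
            # Assumption: a timestamp without a usable zone is treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return None


SAMPLE = (
    "Received: from mail.example.org by mx.example.com; "
    "Thu, 23 Apr 2026 09:32:00 -0500\n"
    "Date: Thu, 23 Apr 2026 09:31:40 -0500\n"
    "Subject: Rate confirmation #RC-7821\n"
    "\n"
    "Body\n"
)
print(received_at(SAMPLE))  # 2026-04-23T14:32:00Z
```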
Heads up: This page covers email content extraction — pulling structured fields out of message bodies and attachments. If you arrived looking for an email scraper (collecting email addresses from web pages or files), see the patterns section below for cross-references; address harvesting is a different operation with different output.
The default email extraction schema
Every email Parsli ingests goes through the same extraction pipeline and returns a consistent JSON envelope. Below is the full default schema. Every field listed is present in the output unless the email contains no source data for it. Custom fields you add via the schema builder are appended under `custom`.
| Field | Type | Example | Confidence baseline |
|---|---|---|---|
| sender.name | string | "Sarah Brennan" | 0.95+ (header) |
| sender.email | string | "ops@acme-logistics.com" | 0.99 (deterministic) |
| received_at | datetime (ISO 8601) | "2026-04-23T14:32:00Z" | 0.99 (header) |
| subject | string | "Rate confirmation #RC-7821" | 0.98 (header) |
| body.intent | enum | "rate_confirmation" | 0.85–0.95 (LLM) |
| body.amounts[] | array<money> | [{ value: 1240.00, currency: "USD", label: "Linehaul" }] | 0.85–0.95 |
| body.dates[] | array<date> | [{ value: "2026-04-25", label: "Pickup" }] | 0.90–0.97 |
| signature.name | string | "Sarah Brennan" | 0.75–0.92 |
| signature.phone | phone (E.164) | "+13125550143" | 0.80–0.95 |
| attachments[].filename | string | "invoice_7821.pdf" | 0.99 (header) |
| attachments[].extracted_fields | object | { shipper: "Acme", weight_lbs: 18420 } | Inherits attachment doctype baseline |
| thread.thread_id | string | "<rc7821@acme-logistics.com>" | 0.99 (RFC 2822 Message-ID) |
| custom.<field> | user-defined | See custom field examples below | Field-dependent |
JSON output
Sample output for a rate-confirmation email with one PDF attachment. Every Parsli email extraction returns an envelope of this shape.
{
  "doc_id": "doc_a8f29d4e",
  "received_at": "2026-04-23T14:32:00Z",
  "sender": {
    "name": "Sarah Brennan",
    "email": "ops@acme-logistics.com"
  },
  "subject": "Rate confirmation #RC-7821",
  "body": {
    "intent": "rate_confirmation",
    "intent_confidence": 0.94,
    "amounts": [
      { "label": "Linehaul rate", "value": 1240.00, "currency": "USD", "confidence": 0.97 },
      { "label": "Fuel surcharge", "value": 186.00, "currency": "USD", "confidence": 0.91 }
    ],
    "dates": [
      { "label": "Pickup", "value": "2026-04-25", "confidence": 0.98 },
      { "label": "Delivery", "value": "2026-04-27", "confidence": 0.96 }
    ]
  },
  "signature": {
    "name": "Sarah Brennan",
    "title": "Dispatch Coordinator",
    "phone": "+13125550143",
    "email": "ops@acme-logistics.com"
  },
  "attachments": [
    {
      "filename": "BOL_7821.pdf",
      "mime": "application/pdf",
      "page_count": 2,
      "extraction_mode": "digital",
      "extracted_fields": {
        "shipper": "Acme Logistics",
        "consignee": "Midwest Distribution",
        "weight_lbs": 18420,
        "pro_number": "ACL-99182734"
      }
    }
  ],
  "thread": {
    "thread_id": "<rc7821@acme-logistics.com>",
    "reply_count": 0
  }
}

How confidence scoring works
Every field Parsli returns ships with a confidence score between 0 and 1. The score reflects how certain the extraction pipeline is that the value is correct — not how 'good' the email is. Use the score to gate downstream automation: route fields below your threshold to a human-review queue or fail the parse and ask the sender to resend.
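As a rough sketch of that gating step, the snippet below splits the `body.amounts` and `body.dates` entries of an envelope into auto-accepted values and values for human review. The function name `split_by_confidence` and the 0.92 threshold are illustrative assumptions; the envelope fragment reuses values from the sample output.

```python
from typing import Any


def split_by_confidence(envelope: dict[str, Any], threshold: float = 0.92):
    """Split body amounts and dates into accepted and needs-review lists."""
    accepted, review = [], []
    body = envelope.get("body", {})
    for kind in ("amounts", "dates"):
        for item in body.get(kind, []):
            entry = (kind, item["label"], item["value"])
            # A missing score is treated as 0.0, so it always goes to review.
            if item.get("confidence", 0.0) >= threshold:
                accepted.append(entry)
            else:
                review.append(entry)
    return accepted, review


envelope = {
    "body": {
        "amounts": [
            {"label": "Linehaul rate", "value": 1240.00, "currency": "USD", "confidence": 0.97},
            {"label": "Fuel surcharge", "value": 186.00, "currency": "USD", "confidence": 0.91},
        ],
        "dates": [
            {"label": "Pickup", "value": "2026-04-25", "confidence": 0.98},
            {"label": "Delivery", "value": "2026-04-27", "confidence": 0.96},
        ],
    }
}

accepted, review = split_by_confidence(envelope)
print(review)  # [('amounts', 'Fuel surcharge', 186.0)]
```

Whether a low-confidence field should go to review or fail the whole parse is a workflow decision; the same split works for either.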
Header fields (sender, subject, received_at)
0.95+ in the normal case. These values come from RFC 5322 / 2822 headers and are deterministic.
<0.95 only when headers are malformed or missing (rare; typically forwarded-by-an-MTA cases).
Body amounts and dates
0.92+ when the value is adjacent to a clear label (e.g. 'Total: $1,240.00') and uses standard formatting.
<0.80 when labels are ambiguous, currency is not specified, or the email contains multiple candidate values without disambiguating context.
body.intent classification
0.90+ when the email matches one of the trained intent classes (invoice, order, lead, rate_confirmation, etc.).
<0.75 when the email is mixed-purpose or in a domain the classifier hasn't seen — at this confidence the field is included but flagged.
Signature parsing
0.85+ when the sign-off is conventional (name, title, contact block).
<0.70 for emails with no signature, signatures embedded in inline images, or names that collide with body content.
Attachment field extraction
Inherits the doctype baseline. Digital PDFs typically score 0.95+; image-only PDFs typically 0.85–0.95 depending on scan quality.
<0.80 for low-resolution scans, faxed BOLs with thermal-print fade, or unusual layouts.
Adding custom fields to the email schema
Most teams add 3–10 custom fields on top of the defaults. Custom fields appear under `custom.<your_field_name>` in the JSON output and get the same confidence treatment as built-in fields. Custom fields are defined in the dashboard schema builder (no code) or by sending a JSON Schema to the parser config endpoint.
order_number: string (regex: `[A-Z]{2,}-\d{4,8}`)
Pull a structured order ID out of the body or subject line. The regex acts as a hint, not a hard filter: the LLM still finds the right value if formatting drifts.
po_reference: string
Locate a purchase-order number anywhere in the body or any attachment. Useful for AP workflows that match invoices to POs.
shipping_carrier: enum (UPS, FedEx, USPS, DHL, Other)
Constrain the LLM to a known set of carriers; useful when downstream systems require a strict enum and not a free-text guess.
ticket_id: string with cross-reference
Extract a ticket ID and validate it against an existing row in your CRM or helpdesk via webhook before ingestion. If the validation fails, the parse is held in review.
urgency: enum (low, normal, high), LLM-inferred
Classify body sentiment to triage incoming requests. Combine with intent classification to route urgent leads or VIP-customer support directly to a Slack channel.
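A few of the custom fields above can be written down using standard JSON Schema keywords (`type`, `pattern`, `enum`). The exact document shape the parser config endpoint accepts is not specified on this page, so treat the structure below as an assumption; `check_custom` is a hypothetical helper for sanity-checking the `custom` object downstream.

```python
import re

# Assumed shape: a plain JSON Schema object describing the custom fields.
CUSTOM_FIELDS_SCHEMA = {
    "type": "object",
    "properties": {
        "order_number": {"type": "string", "pattern": r"[A-Z]{2,}-\d{4,8}"},
        "po_reference": {"type": "string"},
        "shipping_carrier": {
            "type": "string",
            "enum": ["UPS", "FedEx", "USPS", "DHL", "Other"],
        },
        "urgency": {"type": "string", "enum": ["low", "normal", "high"]},
    },
}


def check_custom(custom: dict) -> list[str]:
    """Return the names of fields whose values violate the schema above."""
    problems = []
    for name, value in custom.items():
        spec = CUSTOM_FIELDS_SCHEMA["properties"].get(name)
        if spec is None:
            continue  # field not described in this sketch
        if "enum" in spec and value not in spec["enum"]:
            problems.append(name)
        # JSON Schema patterns are unanchored, which matches re.search.
        elif "pattern" in spec and not re.search(spec["pattern"], value):
            problems.append(name)
    return problems


print(check_custom({"order_number": "PO-12345", "shipping_carrier": "UPS"}))  # []
print(check_custom({"order_number": "12345", "shipping_carrier": "Royal Mail"}))
# ['order_number', 'shipping_carrier']
```

Because the regex is only a hint to the extractor, a local check like this is one way to enforce the format strictly when a downstream system needs it.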
Parser modes for email processing
The same field schema runs in four operating modes. Most accounts mix two or three depending on the workload — push for inbox automation, batch for backlogs, sandbox for tuning. Pricing is identical across modes; you pay per page extracted regardless of how the message arrived.
Real-time push processing
Gmail push notifications and SMTP delivery fire the parser within ~30 seconds of a message landing. The default mode for inbox-management workloads where freshness matters.
Scheduled batch processing
Cron-driven sweeps over a Gmail label, an IMAP folder, or a backlog of .eml files. Useful for end-of-week reconciliation or one-time migrations from a legacy email parser.
API-only processing
POST a raw email or .eml file to the REST endpoint, get the JSON envelope back. No inbox connection required — used by automated email-processing pipelines that already have message bytes in hand.
Free-tier sandbox processing
30 pages per month, no credit card, every feature on. Used by teams comparing AI extraction to a pattern-matching email parser before committing — same JSON output as paid tiers, identical confidence scoring.
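For the API-only mode, the sketch below builds an .eml payload with Python's standard library and prepares (but does not send) a POST request. The endpoint URL, the bearer-token header, and the `message/rfc822` content type are placeholders and assumptions, since this page does not document the REST endpoint; substitute the values from your account's API documentation.

```python
from email.message import EmailMessage
from urllib.request import Request

# Hypothetical placeholder, not a real Parsli URL.
PARSE_ENDPOINT = "https://example.invalid/parse"


def build_eml() -> bytes:
    """Serialize a small test message to raw .eml bytes."""
    msg = EmailMessage()
    msg["From"] = "Sarah Brennan <ops@acme-logistics.com>"
    msg["To"] = "inbox@example.com"
    msg["Subject"] = "Rate confirmation #RC-7821"
    msg.set_content("Linehaul rate: $1,240.00\nPickup: 2026-04-25\n")
    return msg.as_bytes()


def build_request(eml: bytes, api_key: str) -> Request:
    """Prepare the POST request; nothing is sent until urlopen is called."""
    return Request(
        PARSE_ENDPOINT,
        data=eml,
        method="POST",
        headers={
            "Content-Type": "message/rfc822",      # assumed content type
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


request = build_request(build_eml(), api_key="YOUR_API_KEY")
# urllib.request.urlopen(request) would send it; per the description above,
# the response body is the JSON envelope.
```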
Other email-extraction patterns
Adjacent intents users sometimes land on this page looking for. Each row points to the right page or guide for the specific job — they're listed here for disambiguation, not because they're handled by the email-content parser above.
Extract email addresses from PDFs
Different operation — text-mining a PDF for any email-shaped string. Use the PDF parser with a custom string field set to an email regex.
Extract email addresses from Excel / CSV
Pull a column or scan free-text cells for email addresses. The spreadsheet parser handles this; treat the email column as a typed string field.
Email address extraction from websites (scraping)
Out of scope for Parsli: this is web-scraping intent, not email parsing. We recommend a dedicated scraper for that workflow; Parsli fits downstream of it, once the addresses are in your CRM.
Extract emails from Gmail
If you mean 'extract data out of emails in my Gmail account', that's exactly the workflow above — connect Gmail OAuth, define a schema, run the parser.
Power Automate / Outlook 365 text extraction
Power Automate's built-in 'extract text from email' action covers basic regex patterns. Parsli runs as the AI step inside a Power Automate flow when patterns aren't enough — connect via webhook.
Supported Formats
- Gmail (native OAuth, read-only scope)
- Microsoft 365 / Outlook (forwarding + webhook)
- Forwarded emails to your unique parsli.co inbox address
- MIME multipart/alternative, multipart/mixed, multipart/related
- Attachments: PDF, PNG, JPG, TIFF, .docx, .xlsx, .csv
Frequently Asked Questions
Does Parsli parse the HTML body or the plain-text body?
By default Parsli parses `raw_plain` — the rendered plain-text version of the email. For multipart/alternative messages where the plain-text part is missing or malformed, the HTML is stripped of tags and parsed as text. Both `raw_html` and `raw_plain` are returned in the JSON output so you can audit which version drove the extraction.
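The plain-text-first, HTML-fallback behaviour can be approximated locally with the standard `email` and `html.parser` modules. This is a sketch under assumptions, not Parsli's pipeline: `extraction_text` and `strip_tags` are hypothetical names, and the tag stripping is deliberately naive (it does not skip `<script>` or `<style>` content).

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and ignore all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html: str) -> str:
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Join text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def extraction_text(raw: bytes) -> tuple[str, str]:
    """Return (source, text), where source names the body that was used."""
    msg = message_from_bytes(raw, policy=policy.default)
    plain = msg.get_body(preferencelist=("plain",))
    if plain is not None and plain.get_content().strip():
        return "raw_plain", plain.get_content()
    html = msg.get_body(preferencelist=("html",))
    if html is not None:
        return "raw_html", strip_tags(html.get_content())
    return "none", ""


# An HTML-only message forces the fallback path.
html_only = EmailMessage()
html_only.set_content("<p>Total: <b>$1,240.00</b></p>", subtype="html")
source, text = extraction_text(html_only.as_bytes())
print(source, "|", text)  # raw_html | Total: $1,240.00
```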
How does Parsli handle email threads?
By default the parser extracts from the latest message in the thread and includes the prior messages as a `reply_chain` array for context. You can switch the schema to extract from the full thread (useful for support emails where the relevant data is buried in the customer's first message) — set `thread_mode: 'full'` on the parser config.
What happens with quoted/forwarded portions of the body?
Quoted lines (lines beginning with `>`) and `--- Forwarded message ---` blocks are detected and excluded from the primary extraction unless the schema explicitly requests them. Forwarded attachments are still processed in full.
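A minimal approximation of that exclusion, assuming a line-based heuristic: drop lines that start with `>` and stop at the first forwarded-message marker. The regex and the function name `primary_text` are illustrative; real clients vary the marker text and the number of dashes, and Parsli's own detection may differ.

```python
import re

# Matches markers such as "--- Forwarded message ---" with two or more dashes.
FORWARD_MARKER = re.compile(
    r"^-{2,}\s*Forwarded message\s*-{2,}\s*$", re.IGNORECASE
)


def primary_text(body: str) -> str:
    """Return the body without quoted lines or forwarded blocks."""
    kept = []
    for line in body.splitlines():
        if FORWARD_MARKER.match(line.strip()):
            break  # everything after the marker belongs to the forward
        if line.lstrip().startswith(">"):
            continue  # quoted reply line
        kept.append(line)
    return "\n".join(kept).strip()


body = (
    "Confirmed, see rate below.\n"
    "\n"
    "> Can you confirm the rate?\n"
    "> Thanks\n"
    "\n"
    "--- Forwarded message ---\n"
    "From: someone@example.com\n"
    "Total: $99\n"
)
print(primary_text(body))  # Confirmed, see rate below.
```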
Are email signatures parsed into structured fields?
Yes. The signature block is detected via positional heuristics + the LLM, then parsed into `signature.name`, `signature.title`, `signature.email`, `signature.phone`, and `signature.company`. Phone numbers are normalized to E.164 when a country code is present or inferable from `sender_email`.
What encodings and character sets are supported?
MIME-decoded headers and bodies are normalized to UTF-8 before extraction. Source encodings supported: UTF-8, ISO-8859-1 through -16, Windows-1250–1258, Shift_JIS, GB18030, EUC-KR. Right-to-left scripts (Arabic, Hebrew) are preserved through extraction and JSON output.
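The normalization step can be illustrated with Python's standard library: MIME encoded-word headers are decoded with `email.header`, and body bytes are transcoded from their declared charset. The helper names `decode_subject` and `to_utf8` are hypothetical; this sketches the general technique rather than Parsli's internals.

```python
from email.header import decode_header, make_header


def decode_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string."""
    return str(make_header(decode_header(raw_subject)))


def to_utf8(payload: bytes, charset: str) -> bytes:
    """Transcode body bytes from the declared source charset to UTF-8."""
    return payload.decode(charset).encode("utf-8")


print(decode_subject("=?ISO-8859-1?Q?Caf=E9_invoice?="))  # Café invoice

legacy = "Café".encode("windows-1252")
print(to_utf8(legacy, "windows-1252") == "Café".encode("utf-8"))  # True
```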
How are attachments that are image scans (not digital PDFs) handled?
Image-only PDFs and direct image attachments are routed through the OCR pipeline first, then the OCR output is fed into the same field-extraction step as a digital PDF. The `attachments[].extraction_mode` field in the output records which path was used (`digital`, `ocr`, or `mixed` for hybrid PDFs).