Document Type

What Parsli Extracts from an Email — Field Schema Reference

A field-by-field reference for the data Parsli surfaces from incoming emails. Below: the default JSON output, confidence-score baselines per field type, and the custom fields most teams add on top of the defaults.

What You Can Extract

Define your schema with any combination of these fields — or add your own custom fields.

sender_name / sender_email

Display name and RFC 5322 address from the From: header. Deterministic — pulled from the envelope, not inferred.

received_at

ISO 8601 UTC timestamp from the Received: header chain. Falls back to Date: header if Received: is missing.

subject

Decoded subject line, MIME-encoded values normalized to UTF-8.

body_intent

LLM-inferred classification: invoice, order, lead, notification, rate_quote, support_reply, or custom enum from your schema.

body_amounts[]

Array of monetary values labeled with their adjacent text — total, tax, fuel surcharge, etc. Currency normalized to ISO 4217.

body_dates[]

Array of dates with role labels — pickup, delivery, due_date, expiration. Output as ISO 8601 even when source is informal.

signature.name / signature.title / signature.phone

Sender signature parsed into structured fields. Includes phone normalization to E.164 when country is detectable.

attachments[]

Per-attachment object with filename, MIME type, page count, and the full extracted-fields JSON from running the attachment through Parsli's parser.

thread_id / reply_chain

RFC 2822 Message-ID + an ordered array of prior messages in the thread for deduplication and context.

raw_html / raw_plain

Both rendered-text bodies (the AI parses from `raw_plain` by default; HTML tags are stripped before extraction).

Heads up: This page covers email content extraction — pulling structured fields out of message bodies and attachments. If you arrived looking for an email scraper (collecting email addresses from web pages or files), see the patterns section below for cross-references; address harvesting is a different operation with different output.

The default email extraction schema

Every email Parsli ingests goes through the same extraction pipeline and returns a consistent JSON envelope. Below is the full default schema — every field listed is present in the output unless its source is missing from the source email. Custom fields you add via the schema builder are appended under `custom`.

FieldTypeExampleConfidence baseline
sender.namestring"Sarah Brennan"0.95+ (header)
sender.emailemail"ops@acme-logistics.com"0.99 (deterministic)
received_atdatetime (ISO 8601)"2026-04-23T14:32:00Z"0.99 (header)
subjectstring"Rate confirmation #RC-7821"0.98 (header)
body.intentenum"rate_confirmation"0.85–0.95 (LLM)
body.amounts[]array<money>[{ value: 1240.00, currency: "USD", label: "Linehaul" }]0.85–0.95
body.dates[]array<date>[{ value: "2026-04-25", label: "Pickup" }]0.90–0.97
signature.namestring"Sarah Brennan"0.75–0.92
signature.phonephone (E.164)"+13125550143"0.80–0.95
attachments[].filenamestring"invoice_7821.pdf"0.99 (header)
attachments[].extracted_fieldsobject{ shipper: "Acme", weight_lbs: 18420 }Inherits attachment doctype baseline
thread.thread_idstring"<rc7821@acme-logistics.com>"0.99 (RFC 2822 Message-ID)
custom.<field>user-definedSee custom field examples belowField-dependent

JSON output

Sample output for a rate-confirmation email with one PDF attachment. Every Parsli email extraction returns an envelope of this shape.

{
  "doc_id": "doc_a8f29d4e",
  "received_at": "2026-04-23T14:32:00Z",
  "sender": {
    "name": "Sarah Brennan",
    "email": "ops@acme-logistics.com"
  },
  "subject": "Rate confirmation #RC-7821",
  "body": {
    "intent": "rate_confirmation",
    "intent_confidence": 0.94,
    "amounts": [
      { "label": "Linehaul rate", "value": 1240.00, "currency": "USD", "confidence": 0.97 },
      { "label": "Fuel surcharge", "value":  186.00, "currency": "USD", "confidence": 0.91 }
    ],
    "dates": [
      { "label": "Pickup",   "value": "2026-04-25", "confidence": 0.98 },
      { "label": "Delivery", "value": "2026-04-27", "confidence": 0.96 }
    ]
  },
  "signature": {
    "name":  "Sarah Brennan",
    "title": "Dispatch Coordinator",
    "phone": "+13125550143",
    "email": "ops@acme-logistics.com"
  },
  "attachments": [
    {
      "filename": "BOL_7821.pdf",
      "mime": "application/pdf",
      "page_count": 2,
      "extraction_mode": "digital",
      "extracted_fields": {
        "shipper":   "Acme Logistics",
        "consignee": "Midwest Distribution",
        "weight_lbs": 18420,
        "pro_number": "ACL-99182734"
      }
    }
  ],
  "thread": {
    "thread_id": "<rc7821@acme-logistics.com>",
    "reply_count": 0
  }
}

How confidence scoring works

Every field Parsli returns ships with a confidence score between 0 and 1. The score reflects how certain the extraction pipeline is that the value is correct — not how 'good' the email is. Use the score to gate downstream automation: route fields below your threshold to a human-review queue or fail the parse and ask the sender to resend.

Header fields (sender, subject, received_at)

High confidence

Always 0.95+ — these come from RFC 5322 / 2822 headers and are deterministic.

Low confidence

<0.95 only when headers are malformed or missing (rare; typically forwarded-by-an-MTA cases).

Body amounts and dates

High confidence

0.92+ when the value is adjacent to a clear label (e.g. 'Total: $1,240.00') and uses standard formatting.

Low confidence

<0.80 when labels are ambiguous, currency is not specified, or the email contains multiple candidate values without disambiguating context.

body.intent classification

High confidence

0.90+ when the email matches one of the trained intent classes (invoice, order, lead, rate_confirmation, etc.).

Low confidence

<0.75 when the email is mixed-purpose or in a domain the classifier hasn't seen — at this confidence the field is included but flagged.

Signature parsing

High confidence

0.85+ when the sign-off is conventional (name, title, contact block).

Low confidence

<0.70 for emails with no signature, signatures embedded in inline images, or names that collide with body content.

Attachment field extraction

High confidence

Inherits the doctype baseline. Digital PDFs typically score 0.95+; image-only PDFs typically 0.85–0.95 depending on scan quality.

Low confidence

<0.80 for low-resolution scans, faxed BOLs with thermal-print fade, or unusual layouts.

Adding custom fields to the email schema

Most teams add 3–10 custom fields on top of the defaults. Custom fields appear under `custom.<your_field_name>` in the JSON output and get the same confidence treatment as built-in fields. Custom fields are defined in the dashboard schema builder (no code) or by sending a JSON Schema to the parser config endpoint.

order_numberstring (regex: `[A-Z]{2,}-\d{4,8}`)

Pull a structured order ID out of the body or subject line. The regex acts as a hint, not a hard filter — the LLM still finds the right value if formatting drifts.

po_referencestring

Locate a purchase-order number anywhere in the body or any attachment. Useful for AP workflows that match invoices to POs.

shipping_carrierenum (UPS, FedEx, USPS, DHL, Other)

Constrain the LLM to a known set of carriers; useful when downstream systems require a strict enum and not a free-text guess.

ticket_idstring with cross-reference

Extract a ticket ID and validate it against an existing row in your CRM or Helpdesk via webhook before ingestion. If the validation fails, the parse is held in review.

urgencyenum (low, normal, high) — LLM-inferred

Classify body sentiment to triage incoming requests. Combine with intent classification to route urgent leads or VIP-customer support directly to a Slack channel.

Parser modes for email processing

The same field schema runs in four operating modes. Most accounts mix two or three depending on the workload — push for inbox automation, batch for backlogs, sandbox for tuning. Pricing is identical across modes; you pay per page extracted regardless of how the message arrived.

Real-time push processing

Gmail push notifications and SMTP delivery fire the parser within ~30 seconds of a message landing. The default mode for inbox-management workloads where freshness matters.

Scheduled batch processing

Cron-driven sweeps over a Gmail label, an IMAP folder, or a backlog of .eml files. Useful for week-end reconciliation or one-time migrations from a legacy email parser.

API-only processing

POST a raw email or .eml file to the REST endpoint, get the JSON envelope back. No inbox connection required — used by automated email-processing pipelines that already have message bytes in hand.

Free-tier sandbox processing

30 pages per month, no credit card, every feature on. Used by teams comparing AI extraction to a pattern-matching email parser before committing — same JSON output as paid tiers, identical confidence scoring.

Supported Formats

  • Gmail (native OAuth, read-only scope)
  • Microsoft 365 / Outlook (forwarding + webhook)
  • Forwarded emails to your unique parsli.co inbox address
  • MIME multipart/alternative, multipart/mixed, multipart/related
  • Attachments: PDF, PNG, JPG, TIFF, .docx, .xlsx, .csv

Frequently Asked Questions

Does Parsli parse the HTML body or the plain-text body?

By default Parsli parses `raw_plain` — the rendered plain-text version of the email. For multipart/alternative messages where the plain-text part is missing or malformed, the HTML is stripped of tags and parsed as text. Both `raw_html` and `raw_plain` are returned in the JSON output so you can audit which version drove the extraction.

How does Parsli handle email threads?

By default the parser extracts from the latest message in the thread and includes the prior messages as a `reply_chain` array for context. You can switch the schema to extract from the full thread (useful for support emails where the relevant data is buried in the customer's first message) — set `thread_mode: 'full'` on the parser config.

What happens with quoted/forwarded portions of the body?

Quoted lines (lines beginning with `>`) and `--- Forwarded message ---` blocks are detected and excluded from the primary extraction unless the schema explicitly requests them. Forwarded attachments are still processed in full.

Are email signatures parsed into structured fields?

Yes. The signature block is detected via positional heuristics + the LLM, then parsed into `signature.name`, `signature.title`, `signature.email`, `signature.phone`, and `signature.company`. Phone numbers are normalized to E.164 when a country code is present or inferable from `sender_email`.

What encodings and character sets are supported?

MIME-decoded headers and bodies are normalized to UTF-8 before extraction. Source encodings supported: UTF-8, ISO-8859-1 through -16, Windows-1250–1258, Shift_JIS, GB18030, EUC-KR. Right-to-left scripts (Arabic, Hebrew) are preserved through extraction and JSON output.

How are attachments that are image scans (not digital PDFs) handled?

Image-only PDFs and direct image attachments are routed through the OCR pipeline first, then the OCR output is fed into the same field-extraction step as a digital PDF. The `attachments[].extraction_mode` field in the output records which path was used (`digital`, `ocr`, or `mixed` for hybrid PDFs).

Start Extracting Data from Emails

Set up in minutes. No credit card required.

Get Started Free