Get OCR word bounding boxes

GET /vault/{id}/objects/{objectId}/ocr-words

{
  "objectId": "obj_abc123",
  "pageCount": 5,
  "totalWords": 2500,
  "pages": [
    {
      "page": 1,
      "words": [
        {
          "text": "The",
          "bbox": [
            0.12,
            0.71,
            0.15,
            0.75
          ],
          "confidence": 0.98,
          "wordIndex": 0
        },
        {
          "text": "witness",
          "bbox": [
            0.16,
            0.71,
            0.28,
            0.75
          ],
          "confidence": 0.99,
          "wordIndex": 1
        }
      ]
    }
  ],
  "createdAt": "2024-01-15T10:30:00Z"
}

Authorizations

Authorization (string, header, required)

API key starting with sk_case_

Path Parameters

id (string, required)

The vault ID

objectId (string, required)

The object ID

Query Parameters

page (integer)

Filter to a specific page number (1-indexed). If omitted, returns all pages.

wordStart (integer)

Filter to words starting at this index (inclusive). Useful for retrieving words for a specific chunk.

wordEnd (integer)

Filter to words ending at this index (inclusive). Useful for retrieving words for a specific chunk.
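
The path and query parameters above can be combined into a request URL as in the following minimal sketch. The base URL `https://api.example.com` and the helper name `build_ocr_words_url` are assumptions made for illustration and do not come from this reference. The `Authorization` header is not shown because its exact value format is not specified here.

```python
from urllib.parse import quote, urlencode

# ASSUMPTION: hypothetical base URL; substitute the real API host.
BASE_URL = "https://api.example.com"


def build_ocr_words_url(vault_id, object_id, page=None,
                        word_start=None, word_end=None, base_url=BASE_URL):
    """Build the GET URL for the ocr-words endpoint with optional filters."""
    path = (
        f"/vault/{quote(vault_id, safe='')}"
        f"/objects/{quote(object_id, safe='')}/ocr-words"
    )
    # Only send the filters the caller actually set (0 is a valid word index).
    params = {"page": page, "wordStart": word_start, "wordEnd": word_end}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}{path}" + (f"?{query}" if query else "")


# Words 0 through 1 (inclusive) on page 1 of a hypothetical vault "vault_1".
print(build_ocr_words_url("vault_1", "obj_abc123",
                          page=1, word_start=0, word_end=1))
```

Omitting all three filters yields the bare endpoint URL, which according to the parameter descriptions returns every page.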

Response

Successfully retrieved OCR word data

objectId (string)

The object ID

pageCount (integer)

Total number of pages in the document

totalWords (integer)

Total number of words extracted from the document

pages (object[])

Per-page word data with bounding boxes

createdAt (string<date-time>)

When the OCR data was extracted
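
As a rough illustration of consuming this response, the sketch below walks the `pages` and `words` arrays from the example payload and flags low-confidence words. The helper names are hypothetical. The bounding box conversion assumes `bbox` is `[x0, y0, x1, y1]` normalized to the page size, which the example values suggest but this reference does not state.

```python
# Trimmed copy of the example response from this page.
sample = {
    "objectId": "obj_abc123",
    "pages": [
        {
            "page": 1,
            "words": [
                {"text": "The", "bbox": [0.12, 0.71, 0.15, 0.75],
                 "confidence": 0.98, "wordIndex": 0},
                {"text": "witness", "bbox": [0.16, 0.71, 0.28, 0.75],
                 "confidence": 0.99, "wordIndex": 1},
            ],
        }
    ],
}


def low_confidence_words(response, threshold):
    """Return (page, wordIndex, text) for words with confidence below threshold."""
    return [
        (page["page"], word["wordIndex"], word["text"])
        for page in response["pages"]
        for word in page["words"]
        if word["confidence"] < threshold
    ]


def bbox_to_pixels(bbox, width, height):
    """Scale a bbox to pixel coordinates.

    ASSUMPTION: bbox is [x0, y0, x1, y1] in 0..1 units of the page size.
    """
    x0, y0, x1, y1 = bbox
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


print(low_confidence_words(sample, 0.985))
print(bbox_to_pixels(sample["pages"][0]["words"][0]["bbox"], 1000, 1000))
```

If the coordinate convention turns out to differ (for example `[x, y, width, height]`), only `bbox_to_pixels` needs to change.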

Get object pages

Retrieves the raw text of a processed vault object, split by page. The object must have completed ingestion before pages can be retrieved. For PDFs, this requires the OCR pipeline to have finished writing the per-page sidecar, so freshly uploaded PDFs return 400 with the current `ingestionStatus` until processing completes.

For PDFs, this returns the per-page OCR text. For plain text files (txt, md, source code, court reporter transcripts), the text is split using right-aligned page-number markers when present (preserving the original document numbering, including continuations such as Volume 2 starting at page 234), falling back to form-feed (\f) page-break characters, and finally to a single page if neither signal is present.

Use the optional `start` and `end` query parameters to fetch a specific inclusive page range. Pages with no text are omitted.
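
The form-feed fallback described above could look roughly like the following sketch. It is not the service's implementation. It covers only the last two fallbacks (form-feed splitting, then a single page) and leaves out page-number marker detection. The function name and the choice to number pages by position are assumptions.

```python
def split_pages_by_form_feed(text):
    """Split text on form-feed characters, dropping pages with no text.

    ASSUMPTION: pages are numbered by their position in the file, so a
    blank page still consumes a page number even though it is omitted.
    """
    pages = []
    # With no "\f" present, split() returns the whole text as one chunk,
    # which gives the single-page fallback for free.
    for number, chunk in enumerate(text.split("\f"), start=1):
        if chunk.strip():
            pages.append({"page": number, "text": chunk})
    return pages


print(split_pages_by_form_feed("first page\f\fthird page"))
print(split_pages_by_form_feed("no page breaks here"))
```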
