API Reference

Top-level package

SciTeX Scholar – scientific paper search, enrichment, and management.

Quick Start:

from scitex_scholar import Scholar, Paper, Papers

scholar = Scholar()
papers = scholar.search("deep learning")
papers.save("results.bib")

Installation:

pip install scitex-scholar

This module uses PEP 562 lazy __getattr__ so import scitex_scholar stays under 500ms cold-start. Submodules are imported on first attribute access only.
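The lazy-loading idea can be illustrated with a small, self-contained sketch. It is not the package's actual __init__.py: the package name demo_pkg and the name-to-module mapping are hypothetical, and standard-library modules stand in for real submodules.

```python
import importlib
import types

# Hypothetical mapping of public names to the modules that define them.
# Standard-library modules stand in for real submodules here.
_LAZY_ATTRS = {
    "OrderedDict": "collections",
    "dumps": "json",
}


def _lazy_getattr(name):
    """Module-level __getattr__ (PEP 562): import on first access."""
    if name not in _LAZY_ATTRS:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")
    module = importlib.import_module(_LAZY_ATTRS[name])
    value = getattr(module, name)
    # Cache on the package so __getattr__ is not consulted again.
    setattr(demo_pkg, name, value)
    return value


# Build a stand-in package object; in a real package the function
# would simply be named __getattr__ inside __init__.py.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr
```

Accessing demo_pkg.dumps triggers the import once; later accesses hit the cached attribute directly, which is what keeps the initial import cheap.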

class scitex_scholar.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]

Bases: EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin

Main interface for SciTeX Scholar - scientific literature management made simple.

By default, papers are automatically enriched with:

  • Journal impact factors from impact_factor package (2024 JCR data)

  • Citation counts from Semantic Scholar (via DOI/title matching)

Examples

Basic search with automatic enrichment:

scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have impact_factor and citation_count populated
papers.save("my_pac.bib")

Disable automatic enrichment if needed:

config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)

Search a specific source:

papers = scholar.search("transformer models", sources='arxiv')

Advanced workflow:

papers = (
    scholar.search("transformer models", year_min=2020)
           .filter(min_citations=50)
           .sort_by("journal_impact_factor")
)
# save() returns None, so call it outside the chained assignment
papers.save("transformers.bib")

Local library:

scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")

property name

Class name for logging.

__init__(config=None, project=None, project_description=None, browser_mode=None)[source]

Initialize Scholar with configuration.

Parameters:
  • config (Union[ScholarConfig, str, Path, None]) –

    One of:

    • ScholarConfig instance

    • Path to YAML config file (str or Path)

    • None (uses ScholarConfig.load() to find config)

  • project (Optional[str]) – Default project name for operations.

  • project_description (Optional[str]) – Optional description for the project.

  • browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').

class scitex_scholar.Paper(*args, **kwargs)[source]

Bases: BaseModel

Complete paper with metadata and container.

model_dump(**kwargs)[source]

Custom serialization to ensure all nested models use aliases.

Return type:

Dict[str, Any]

classmethod from_dict(data)[source]

Create from dictionary (for loading from JSON).

Uses Pydantic’s model_validate, which handles:

  • Type validation

  • Type coercion (e.g., “2024” -> 2024)

  • Field aliases (e.g., “2025” -> y2025)

Return type:

Paper

to_dict()[source]

Convert to dictionary for JSON serialization.

Alias for model_dump() for backward compatibility.

Return type:

Dict[str, Any]

detect_open_access(use_unpaywall=False, update_metadata=True)[source]

Detect open access status for this paper.

Uses identifiers (DOI, arXiv ID, PMCID) and known OA sources to determine if the paper is freely available.

Parameters:
  • use_unpaywall (bool) – If True, query Unpaywall API for uncertain cases

  • update_metadata (bool) – If True, update self.metadata.access with results

Return type:

OAResult

Returns:

OAResult with detection results

property is_open_access: bool

Check if paper is open access (quick check without API calls).

class scitex_scholar.Papers(papers=None, project=None, config=None)[source]

Bases: object

A simple collection of Paper objects.

This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.

Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.

__init__(papers=None, project=None, config=None)[source]

Initialize Papers collection.

Parameters:
__len__()[source]

Number of papers in collection.

Return type:

int

__iter__()[source]

Iterate over papers.

Return type:

Iterator[Paper]

__getitem__(index)[source]

Get paper(s) by index or slice.

Parameters:

index (Union[int, slice]) – Integer index or slice

Return type:

Union[Paper, Papers]

Returns:

Single Paper if integer index, Papers collection if slice
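The int-versus-slice behaviour described above follows a common container pattern. The sketch below uses a hypothetical PaperList class holding plain strings; it only illustrates the dispatch and is not the library's implementation.

```python
from typing import Union


class PaperList:
    """Hypothetical stand-in for a collection; items can be any objects."""

    def __init__(self, items=None):
        self._items = list(items or [])

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index: Union[int, slice]):
        if isinstance(index, slice):
            # A slice yields a new collection of the same type.
            return PaperList(self._items[index])
        # An integer yields the single stored item.
        return self._items[index]


papers = PaperList(["paper-a", "paper-b", "paper-c"])
first = papers[0]        # single item
first_two = papers[:2]   # new PaperList with two items
```

Returning the collection type from a slice means chained calls such as filtering or sorting remain available on the result.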

__repr__()[source]

String representation.

Return type:

str

__str__()[source]

Human-readable string.

Return type:

str

__dir__()[source]

Custom dir for better discoverability.

Return type:

List[str]

property papers: List[Paper]

Get the underlying papers list.

append(paper)[source]

Add a paper to the collection.

Parameters:

paper (Paper) – Paper to add

Return type:

None

extend(papers)[source]

Add multiple papers to the collection.

Parameters:

papers (Union[List[Paper], Papers]) – List of papers or another Papers collection

Return type:

None

to_list()[source]

Get papers as a list.

Return type:

List[Paper]

Returns:

List of Paper objects

filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]

Filter papers by condition or criteria.

Parameters:
  • condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.

  • year_min (Optional[int]) – Minimum year.

  • year_max (Optional[int]) – Maximum year.

  • has_doi (Optional[bool]) – Filter papers with/without DOI.

  • has_abstract (Optional[bool]) – Filter papers with/without abstract.

  • has_pdf (Optional[bool]) – Filter papers with/without PDF URL.

  • min_citations (Optional[int]) – Minimum citation count.

  • max_citations (Optional[int]) – Maximum citation count.

  • min_impact_factor (Optional[float]) – Minimum journal impact factor.

  • max_impact_factor (Optional[float]) – Maximum journal impact factor.

  • journal (Optional[str]) – Journal name (partial match).

  • author (Optional[str]) – Author name (partial match).

  • keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).

  • publisher (Optional[str]) – Publisher name (partial match).

  • **kwargs – Additional keyword arguments for backward compatibility.

Returns:

New Papers collection with filtered papers.

Return type:

Papers

Examples

Filter using a lambda condition:

high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)

Filter using built-in parameters:

high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)

Combine multiple parameters:

filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)

Chain filters for AND logic:

elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)

sort_by(*criteria, reverse=False, **kwargs)[source]

Sort papers by criteria.

Parameters:
  • *criteria – Field names (as strings) or lambda functions to sort by.

  • reverse (bool) – Sort in descending order (default: False).

  • **kwargs – Additional options.

Returns:

New sorted Papers collection.

Return type:

Papers

Notes

Available Paper fields for sorting:

  • title – Paper title

  • year – Publication year

  • citation_count – Number of citations

  • journal_impact_factor – Journal impact factor

  • journal – Journal name

  • publisher – Publisher name

  • doi – Digital Object Identifier

  • created_at – When record was created

  • updated_at – When record was last updated

Examples

Sort by a single field:

by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)

Sort by multiple fields (primary, secondary, etc.):

by_year_then_citations = papers.sort_by('year', 'citation_count')

Sort using a lambda function:

by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)

Sort by a computed value:

by_citation_per_year = papers.sort_by(
    # max(..., 1) avoids division by zero for papers published in 2024
    lambda p: (p.citation_count or 0) / max(2024 - p.year, 1) if p.year else 0,
    reverse=True,
)

classmethod from_bibtex(bibtex_input)[source]

Load papers from BibTeX.

DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.

Parameters:

bibtex_input (Union[str, Path]) – Path to BibTeX file or BibTeX string

Return type:

Papers

Returns:

Papers collection

classmethod _from_bibtex_file(file_path)[source]

Load papers from BibTeX file.

Parameters:

file_path (Union[str, Path]) – Path to BibTeX file

Return type:

Papers

Returns:

Papers collection

classmethod _from_bibtex_text(bibtex_content)[source]

Load papers from BibTeX text.

Parameters:

bibtex_content (str) – BibTeX content as string

Return type:

Papers

Returns:

Papers collection

static _bibtex_entry_to_paper(entry)[source]

Convert BibTeX entry to Paper object.

Parameters:

entry (Dict[str, Any]) – BibTeX entry dictionary

Return type:

Paper

Returns:

Paper object

save(output_path, format='auto', **kwargs)[source]

Save papers to file.

DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.

Parameters:
  • output_path (Union[str, Path]) – Path to save file

  • format (Optional[str]) – Output format (auto, bibtex, json, csv)

  • **kwargs – Additional options

Return type:

None

to_dict()[source]

Convert to dictionary.

DEPRECATED: Use papers_utils.papers_to_dict() for new code.

Return type:

List[Dict[str, Any]]

Returns:

Dictionary representation

to_dataframe()[source]

Convert to pandas DataFrame.

DEPRECATED: Use papers_utils.papers_to_dataframe() for new code.

Return type:

Any

Returns:

DataFrame with papers data

summary()[source]

Get summary statistics.

DEPRECATED: Use papers_utils.papers_statistics() for new code.

Return type:

Dict[str, Any]

Returns:

Dictionary with statistics

class scitex_scholar.ScholarConfig(config_path=None, scholar_dir=None)[source]

Bases: object

__init__(config_path=None, scholar_dir=None)[source]

Initialize ScholarConfig.

Parameters:
  • config_path (Union[Path, str, None]) – Path to custom config YAML file

  • scholar_dir (Union[Path, str, None]) – Direct path to the scholar directory (e.g., /data/users/alice/.scitex). This bypasses the SCITEX_DIR env var for thread-safe multi-user usage. Use this in Django/multi-user environments to avoid race conditions.

__getattr__(name)[source]

Delegate all get_* methods to path_manager.

__dir__()[source]

Include path_manager’s get_* methods in dir() output.

resolve(key, direct_val=None, default=None, type=<class 'str'>, mask=None)[source]

Resolve configuration value with precedence: direct → config → env → default
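The precedence chain can be sketched as a small standalone function. This is an illustration under stated assumptions, not the method's real body: resolve_value, its (value, source) return shape, and the DEMO_ environment prefix are hypothetical.

```python
import os
from typing import Any, Callable, Mapping, Optional


def resolve_value(
    key: str,
    direct_val: Any = None,
    config: Optional[Mapping[str, Any]] = None,
    env: Optional[Mapping[str, str]] = None,
    default: Any = None,
    cast: Callable[[Any], Any] = str,
    env_prefix: str = "DEMO_",  # hypothetical prefix, not the library's
):
    """Return (value, source) using direct -> config -> env -> default."""
    config = config or {}
    env = os.environ if env is None else env

    if direct_val is not None:
        return cast(direct_val), "direct"
    if config.get(key) is not None:
        return cast(config[key]), "config"
    env_key = env_prefix + key.upper()
    if env_key in env:
        return cast(env[env_key]), "env"
    # Nothing supplied anywhere: fall back to the default unchanged.
    return default, "default"
```

Returning the source alongside the value mirrors the idea of a resolution log: a caller can report where each setting came from.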

get(key)[source]

Get value from config dict only

print()[source]

Print how each config was resolved

clear_log()[source]

Clear resolution log

load_yaml(path)[source]
Return type:

dict

classmethod load(path=None)[source]
property paths

Access to path manager for organized directory structure

class scitex_scholar.ScholarAuthManager(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]

Bases: object

Manages multiple authentication providers.

This class coordinates between different authentication methods (OpenAthens, Lean Library, etc.) and provides a unified interface.

__init__(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]

Initialize the authentication manager.

Parameters:
  • email_openathens (Optional[str]) – User’s institutional email for OpenAthens authentication

  • email_ezproxy (Optional[str]) – User’s institutional email for EZProxy authentication

  • email_shibboleth (Optional[str]) – User’s institutional email for Shibboleth authentication

  • config (Optional[ScholarConfig]) – ScholarConfig instance (creates new if None)

async ensure_authenticate_async(provider_name=None, verify_live=True, **kwargs)[source]
Return type:

bool

async is_authenticate_async(verify_live=True)[source]

Check whether any provider is currently authenticated.

Return type:

bool

async authenticate_async(provider_name=None, **kwargs)[source]

Authenticate with specified or active provider.

Return type:

dict

async get_auth_headers_async()[source]

Get authentication headers from active provider.

Return type:

Dict[str, str]

async get_auth_options()[source]
Return type:

dict

async get_auth_cookies_async(essential_only=True)[source]

Get authentication cookies from active provider.

Return type:

List[Dict[str, Any]]

_register_provider(name, provider)[source]

Register an authentication provider with email context.

Return type:

None

set_active_provider(name)[source]

Set the active authentication provider.

Return type:

None

get_active_provider()[source]

Get the currently active provider.

Return type:

Optional[BaseAuthenticator]

async logout_async()[source]

Log out from all providers.

Return type:

None

list_providers()[source]

List all registered providers.

Return type:

List[str]

class scitex_scholar.ScholarBrowserManager(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]

Bases: BrowserMixin

Manages a local browser instance with stealth enhancements and invisible mode.

__init__(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]

Initialize ScholarBrowserManager with invisible browser capabilities.

Parameters:
  • auth_manager – Authentication manager instance

  • config (ScholarConfig) – Scholar configuration instance

async get_authenticated_browser_and_context_async(**context_options)[source]

Get browser context with authentication cookies and extensions loaded.

Return type:

tuple[Browser, BrowserContext]

async _new_context_async(browser, **context_options)[source]

Creates a new browser context with stealth options and invisible mode applied.

Return type:

BrowserContext

_verify_xvfb_running(_recursed=False)[source]

Verify Xvfb virtual display is running; auto-start if absent.

async _load_auth_cookies_to_persistent_context_async()[source]

Load authentication cookies into the persistent browser context.

async take_screenshot_async(page, path, timeout_sec=30.0, timeout_after_sec=30.0, full_page=False)[source]

Take screenshot without viewport changes.

async start_periodic_screenshots_async(page, output_dir, prefix='periodic', interval_seconds=1, duration_seconds=10, verbose=False)[source]

Start taking periodic screenshots in the background.

Parameters:
  • page – The page to screenshot

  • prefix (str) – Prefix for screenshot filenames

  • interval_seconds (int) – Seconds between screenshots

  • duration_seconds (int) – Total duration to take screenshots (0 = infinite)

  • verbose (bool) – Whether to log each screenshot

Returns:

asyncio.Task that can be cancelled to stop screenshots
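The pattern of a background task that runs until it is cancelled can be sketched with asyncio alone. The helper run_periodically and its callback are hypothetical; no browser or screenshot call is involved, and the real method may differ in how it handles cancellation.

```python
import asyncio


async def run_periodically(action, interval_seconds, duration_seconds=0):
    """Call action(i) every interval until duration elapses (0 = no limit)."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    count = 0
    try:
        while duration_seconds == 0 or loop.time() - start < duration_seconds:
            action(count)
            count += 1
            await asyncio.sleep(interval_seconds)
    except asyncio.CancelledError:
        # Cancellation is the normal way to stop an open-ended run.
        pass
    return count


async def demo():
    seen = []
    # Start in the background; the caller keeps the Task handle.
    task = asyncio.create_task(run_periodically(seen.append, 0.01))
    await asyncio.sleep(0.05)
    task.cancel()            # request a stop
    total = await task       # wait for the loop to wind down
    return seen, total


seen, total = asyncio.run(demo())
```

Keeping the Task handle is what lets the caller stop an open-ended (duration 0) run cleanly.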

async stop_periodic_screenshots_async(task)[source]

Stop periodic screenshots task.

async close()[source]

Close browser while preserving authentication and extension data.

class scitex_scholar.ScholarURLFinder(context, config=None)[source]

Bases: object

Find PDF URLs from web pages.

Simple, focused responsibility:

  • Input: Page or URL string

  • Output: List of PDF URLs

Authentication/DOI resolution should be handled BEFORE calling this.

PAGE_LOAD_TIMEOUT = 30000
__init__(context, config=None)[source]
async find_pdf_urls(page_or_url, base_url=None)[source]

Find PDF URLs from page or URL string.

Parameters:
  • page_or_url (Union[Page, str]) – Playwright Page object or URL string

  • base_url (Optional[str]) – Optional base URL for the page

Returns:

List of PDF URL dicts, e.g. [{"url": "…", "source": "zotero_translator"}]

Return type:

List[Dict]

async _find_pdf_urls_with_strategies(page, base_url=None)[source]

Try strategies in priority order.

Return type:

List[Dict]

_extract_doi(url)[source]

Extract DOI from string if present.

Parameters:

url (str) – URL string or DOI

Return type:

Optional[str]

Returns:

DOI string if found, None otherwise

Examples

>>> _extract_doi("10.1038/nature12345")
'10.1038/nature12345'
>>> _extract_doi("doi:10.1038/nature12345")
'10.1038/nature12345'
>>> _extract_doi("https://example.com") is None
True

async _find_from_url_string(url)[source]

Find PDFs from URL string or DOI.

Return type:

List[Dict]

async _find_from_page(page, base_url=None)[source]

Find PDFs from existing page.

Return type:

List[Dict]

_managed_page()[source]

Context manager for page lifecycle.

_as_pdf_dicts(urls, source)[source]

Convert URL strings to dict format with source.

Return type:

List[Dict]

class scitex_scholar.CitationGraphBuilder(db_path=None, api_url=None)[source]

Bases: object

Build citation network graphs for academic papers.

Auto-detects backend via crossref_local.Config (DB → HTTP).

Example (auto-detect):
>>> builder = CitationGraphBuilder()
>>> graph = builder.build("10.1038/s41586-020-2008-3", top_n=20)
Example (explicit SQLite):
>>> builder = CitationGraphBuilder(db_path="/path/to/crossref.db")
Example (explicit HTTP):
>>> builder = CitationGraphBuilder(api_url="http://localhost:31291")
__init__(db_path=None, api_url=None)[source]

Initialize builder with database path, HTTP API URL, or auto-detect.

When no arguments are given, delegates to crossref_local.Config for auto-detection:

  1. CROSSREF_LOCAL_MODE env var (explicit “db” or “http”)

  2. CROSSREF_LOCAL_API_URL env var → HTTP mode

  3. Local DB file existence → DB mode

  4. Fallback to HTTP mode

Parameters:
  • db_path (str) – Path to CrossRef SQLite database (local mode)

  • api_url (str) – URL of crossref-local HTTP API (HTTP mode)
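The four-step detection order can be written out as a plain function. This is a sketch of the documented order only: detect_backend is a hypothetical name, the environment is passed in as a mapping for clarity, and the real crossref_local.Config logic may check additional conditions.

```python
import os
from typing import Mapping, Optional


def detect_backend(env: Mapping[str, str], db_path: Optional[str] = None) -> str:
    """Return "db" or "http" following the four documented steps."""
    mode = env.get("CROSSREF_LOCAL_MODE", "").strip().lower()
    if mode in ("db", "http"):
        return mode          # 1. an explicit mode wins
    if env.get("CROSSREF_LOCAL_API_URL"):
        return "http"        # 2. an API URL implies HTTP mode
    if db_path and os.path.isfile(db_path):
        return "db"          # 3. an existing local DB file implies DB mode
    return "http"            # 4. fallback
```

For example, detect_backend(os.environ, "/path/to/crossref.db") would pick DB mode only when neither environment variable is set and the file exists.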

_auto_detect()[source]

Auto-detect backend via crossref_local.Config.

build(seed_doi, top_n=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network around a seed paper.

Parameters:
  • seed_doi (str) – DOI of the seed paper

  • top_n (int) – Number of most similar papers to include

  • weight_coupling (float) – Weight for bibliographic coupling

  • weight_cocitation (float) – Weight for co-citation

  • weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph object with nodes and edges
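How the three weights might interact can be sketched as a weighted sum. The reference does not state the scoring formula, so the linear combination below is an assumption made purely for illustration; rank_related and its count dictionaries are hypothetical, as are the DOIs in the usage note.

```python
from collections import defaultdict


def rank_related(coupling, cocitation, direct,
                 weight_coupling=2.0, weight_cocitation=2.0,
                 weight_direct=1.0, top_n=20):
    """Rank candidate DOIs by a weighted sum of three link counts.

    Each argument maps candidate DOI -> count relative to the seed paper.
    The linear combination is an assumption, not the documented formula.
    """
    scores = defaultdict(float)
    for counts, weight in (
        (coupling, weight_coupling),
        (cocitation, weight_cocitation),
        (direct, weight_direct),
    ):
        for doi, count in counts.items():
            scores[doi] += weight * count
    # Highest score first; ties broken by DOI for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Under this assumption, a candidate sharing three references with the seed ({"10.1000/a": 3} in coupling) scores 6.0 with the default weights, the same as one sharing a single reference and co-cited twice.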

_create_paper_node(doi, similarity_score)[source]

Create a PaperNode with metadata from database.

Parameters:
  • doi (str) – DOI of the paper

  • similarity_score (float) – Calculated similarity score

Return type:

PaperNode

Returns:

PaperNode object

_build_citation_edges(dois)[source]

Build citation edges between papers in the network.

Parameters:

dois (List[str]) – List of DOIs in the network

Return type:

List[CitationEdge]

Returns:

List of CitationEdge objects

build_from_dois(dois, num_related_per_doi=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network from multiple seed DOIs.

Combines similarity scores from all seeds to find papers related to the entire set, producing a richer connected graph.

Parameters:
  • dois (List[str]) – List of seed DOIs

  • num_related_per_doi (int) – Number of related papers to discover per DOI

  • weight_coupling (float) – Weight for bibliographic coupling

  • weight_cocitation (float) – Weight for co-citation

  • weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph with all seeds + related papers + edges

build_from_query(query, num_related_per_doi=20, search_limit=10, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]

Build citation network from a text query.

Searches local databases, extracts DOIs from results, then delegates to build_from_dois().

Parameters:
  • query (str) – Search query (e.g. “hippocampal sharp wave ripples”)

  • num_related_per_doi (int) – Related papers per seed DOI

  • search_limit (int) – Max papers to fetch from search

  • weight_coupling (float) – Weight for bibliographic coupling

  • weight_cocitation (float) – Weight for co-citation

  • weight_direct (float) – Weight for direct citations

Return type:

CitationGraph

Returns:

CitationGraph with search-discovered seeds + related papers

export_json(graph, output_path)[source]

Export graph to JSON file for visualization.

Parameters:
  • graph (CitationGraph) – CitationGraph to export

  • output_path (str) – Path to output JSON file

get_paper_summary(doi)[source]

Get summary information for a paper.

Parameters:

doi (str) – DOI of the paper

Return type:

Optional[dict]

Returns:

Dictionary with paper summary

scitex_scholar.plot_citation_graph(graph, backend='auto', output=None, **kwargs)[source]

Visualize a citation graph with pluggable backends.

Parameters:
  • graph (CitationGraph or networkx.DiGraph) – Citation network to visualize. CitationGraph is auto-converted via to_networkx().

  • backend (str) – Rendering backend: ‘auto’, ‘figrecipe’, ‘scitex.plt’, ‘matplotlib’, or ‘pyvis’. Default ‘auto’ picks the best available.

  • output (str, optional) – Output file path. Required for ‘pyvis’ backend (HTML). For static backends, saves the figure to this path.

  • **kwargs – Backend-specific keyword arguments (layout, seed, figsize, etc.).

Returns:

Backend-specific result. Static backends return {'fig', 'ax', 'pos', 'backend'}. Pyvis returns {'output', 'backend'}.

Return type:

dict

scitex_scholar.to_bibtex(paper)[source]

Format a standard paper dict as a BibTeX entry.

Return type:

str

scitex_scholar.to_ris(paper)[source]

Format a standard paper dict as a RIS entry.

Return type:

str

scitex_scholar.to_endnote(paper)[source]

Format a standard paper dict as an EndNote entry.

Return type:

str

scitex_scholar.to_text_citation(paper, style='apa', doc_type='article')[source]

Format a paper dict as a text citation in the given style.

Parameters:
  • paper (dict) – Standard paper dict.

  • style (str) – One of apa, mla, chicago, vancouver.

  • doc_type (str) – One of article, dataset.

Returns:

Formatted citation string.

Return type:

str

scitex_scholar.papers_to_format(papers, fmt)[source]

Format a list of paper dicts to the given format string.

Return type:

str

scitex_scholar.generate_cite_key(paper)[source]

Generate a BibTeX citation key from a paper dict.

Return type:

str

scitex_scholar.make_citation_key(last_name, year=None)[source]

Generate a citation key from author last name and year.

Parameters:
  • last_name (str) – Author last name (special chars stripped).

  • year – Publication year (optional).

Return type:

str

Returns:

Citation key string, e.g. smith2024.
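A key of the smith2024 form can be produced with a few lines of standard-library code. The sketch below is one plausible reading: the function name make_key is hypothetical, and the accent folding and lowercasing are assumptions, since the reference only states that special characters are stripped.

```python
import re
import unicodedata


def make_key(last_name, year=None):
    """Build a key such as 'smith2024' from a last name and optional year."""
    # Fold accented letters to their ASCII base form (assumption).
    folded = (
        unicodedata.normalize("NFKD", last_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep letters only, then lowercase.
    stem = re.sub(r"[^A-Za-z]", "", folded).lower()
    return f"{stem}{year}" if year is not None else stem
```

With these rules, make_key("O'Brien", 2019) gives obrien2019 and a missing year yields the bare name.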

scitex_scholar.from_connected_papers(paper_id, *, cp_api_key=None, s2_api_key=None, output_format='citation_graph', dry_run=False)[source]

Import a Connected Papers graph into scitex.

Parameters:
  • paper_id (str) – Semantic Scholar paper ID (40-char SHA) for the seed paper.

  • cp_api_key (str, optional) – Connected Papers API key.

  • s2_api_key (str, optional) – Semantic Scholar API key for DOI resolution.

  • output_format (str) – “citation_graph” returns CitationGraph, “papers” returns Papers.

  • dry_run (bool) – If True, fetch and report stats without creating objects.

Returns:

{success: True, graph/papers, stats, warnings} or {success: False, error: str}.

Return type:

dict

scitex_scholar.to_connected_papers(graph, *, output=None)[source]

Export a CitationGraph as BibTeX/JSON for Connected Papers.

Parameters:
  • graph (CitationGraph) – Citation graph to export.

  • output (str or Path, optional) – Output directory. Defaults to current directory.

Returns:

{success, bibtex_path, json_path, paper_count} or {success: False, error}.

Return type:

dict

scitex_scholar.apply_filters(papers, filters=None, parsed_operators=None)[source]

Filter a list of paper dicts by various criteria.

Parameters:
  • papers (List[Dict[str, Any]]) – List of paper dicts. Each dict should contain the keys described in the module docstring; missing keys are treated as empty / zero values.

  • filters (Optional[Dict[str, Any]]) –

    Dict of filter criteria extracted from a search form or URL parameters. Supported keys:

    • year_from, year_to – year range (int)

    • min_citations, max_citations – citation range (int)

    • min_impact_factor – minimum IF (float)

    • max_impact_factor – maximum IF (float)

    • authors – list of author name strings (legacy)

    • journal – journal name substring (legacy, str)

    • open_access – bool

    • doc_type – "review" | "preprint" | other

    • language – language string ("english" passes)

  • parsed_operators (Optional[Dict[str, Any]]) –

    Dict produced by SearchQueryParser.from_shell_syntax() or the equivalent parse_query_operators() function from scitex-cloud. Supported keys:

    • title_includes, title_excludes – list[str]

    • author_includes, author_excludes – list[str]

    • journal_includes, journal_excludes – list[str]

    • year_min, year_max – int

    • citations_min, citations_max – int

    • impact_factor_min, impact_factor_max – float

Returns:

Filtered list of paper dicts (same objects, not copies).

Return type:

list of dict
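The filtering contract (missing keys treated as empty or zero, same objects returned) can be sketched for a few of the supported keys. This is not the library function: filter_paper_dicts is a hypothetical name, and the paper-dict keys year, citation_count, and journal are assumed, because the module docstring that defines them is not reproduced here.

```python
def filter_paper_dicts(papers, filters=None):
    """Keep papers matching every supplied criterion (AND logic)."""
    filters = filters or {}
    year_from = filters.get("year_from")
    year_to = filters.get("year_to")
    min_citations = filters.get("min_citations")
    journal = (filters.get("journal") or "").lower()

    kept = []
    for paper in papers:
        # Missing or None values count as zero / empty.
        year = paper.get("year") or 0
        citations = paper.get("citation_count") or 0
        name = (paper.get("journal") or "").lower()

        if year_from is not None and year < year_from:
            continue
        if year_to is not None and year > year_to:
            continue
        if min_citations is not None and citations < min_citations:
            continue
        if journal and journal not in name:
            continue
        kept.append(paper)  # the same dict object, not a copy
    return kept
```

Because the originals are returned rather than copies, mutating a dict in the result also changes it in the input list.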

scitex_scholar.clean_abstract(text)

Strip HTML/JATS XML tags from a CrossRef-style abstract.

Return type:

str
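Tag stripping of this kind is often done with a regular expression plus entity unescaping. The sketch below is a minimal version under that assumption; strip_markup is a hypothetical name and the library's own rules may differ.

```python
import html
import re

_TAG = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove HTML/JATS tags, unescape entities, and tidy whitespace."""
    # Replace tags with a space so adjacent elements do not fuse together.
    no_tags = _TAG.sub(" ", text or "")
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()
```

Replacing tags with a space keeps a heading and the following paragraph apart, at the cost of splitting inline markup such as subscripts inside a word.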

Semantic Highlighter

Semantic PDF highlighter for academic papers.

Overlays rhetorical-role highlights (claim / method / limitation / supportive / contradictive) onto a copy of a PDF without modifying its underlying text. Highlights are standard PDF annotation objects compatible with any viewer.

class scitex_scholar.pdf_highlight.Block(id, page, bbox, text, category=None, confidence=0.0)[source]

Bases: object

A unit of classification — either a paragraph or a sentence.

bbox is always the paragraph-level clip rectangle. For sentence units this is used only as the search region when locating the sentence’s glyphs on the page at annotation time.

id: int
page: int
bbox: tuple[float, float, float, float]
text: str
category: str | None = None
confidence: float = 0.0
__init__(id, page, bbox, text, category=None, confidence=0.0)
class scitex_scholar.pdf_highlight.HighlightResult(input_path, output_path, blocks, pages, annotations_added)[source]

Bases: object

input_path: Path
output_path: Path | None
blocks: list[Block]
pages: int
annotations_added: int
counts()[source]
Return type:

dict[str, int]

__init__(input_path, output_path, blocks, pages, annotations_added)
scitex_scholar.pdf_highlight.apply_classifications(blocks, classifications)[source]

Assign offline-produced labels to already-extracted blocks.

Each entry must contain at least id and category; confidence is optional (defaults to 0.0). Categories outside CATEGORIES are silently dropped.

Returns the number of blocks that received a label.

Return type:

int
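The labelling rules above can be sketched with a small dataclass. The Unit class and assign_labels function are hypothetical stand-ins, and the CATEGORIES tuple is assumed from the five roles named in the module overview; the real constant may use different spellings.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed from the five roles listed in the overview.
CATEGORIES = ("claim", "method", "limitation", "supportive", "contradictive")


@dataclass
class Unit:
    id: int
    text: str
    category: Optional[str] = None
    confidence: float = 0.0


def assign_labels(units, classifications):
    """Write labels onto units by id; return how many units were labelled."""
    by_id = {unit.id: unit for unit in units}
    labelled_ids = set()
    for entry in classifications:
        unit = by_id.get(entry["id"])
        category = entry["category"]
        if unit is None or category not in CATEGORIES:
            continue  # unknown ids and unknown categories are dropped silently
        unit.category = category
        unit.confidence = float(entry.get("confidence", 0.0))
        labelled_ids.add(unit.id)
    return len(labelled_ids)
```

This separation lets classification happen offline: blocks are extracted once, labels are produced elsewhere, and the two are joined by id.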

scitex_scholar.pdf_highlight.extract_blocks(pdf_path, min_chars=40, *, sentence_level=True)[source]

Open a PDF and return (document, units-of-classification).

sentence_level=True (default) yields one unit per sentence, which gives much tighter highlights — avoids painting a whole paragraph green when only its last two sentences state the claim. sentence_level=False yields one unit per paragraph.

Units shorter than min_chars are dropped (filters page numbers, running headers, short captions, and sentence fragments).

Return type:

tuple[Document, list[Block]]

scitex_scholar.pdf_highlight.highlight_pdf(pdf_path, output_path=None, *, model='claude-haiku-4-5-20251001', use_stub=False, dry_run=False, max_blocks=0, batch_size=25, min_chars=40, sentence_level=True, add_legend=True, min_confidence=0.0, concurrency=4, on_info=None, on_warning=None)[source]

Annotate a PDF with rhetorical-role highlights.

Parameters:
  • pdf_path (str | PathLike) – Input PDF path.

  • output_path (Union[str, PathLike, None]) – Output PDF. Defaults to <input>.highlighted.pdf.

  • model (str) – Anthropic model ID used by the LLM classifier.

  • use_stub (bool) – If True, classify with a keyword heuristic (no API call).

  • dry_run (bool) – If True, classify but do not write the output PDF.

  • max_blocks (int) – If >0, truncate to the first N extracted units.

  • batch_size (int) – Classifier batch size (units per API call).

  • min_chars (int) – Minimum text length for an extracted unit.

  • sentence_level (bool) – If True (default), classify and highlight at sentence granularity. If False, use paragraph-level (less precise but ~5× cheaper on long papers).

  • add_legend (bool) – If True, prepend a colour legend + signature page.

Return type:

HighlightResult

Returns:

HighlightResult with the classified units and annotation count.

scitex_scholar.pdf_highlight.save_with_highlights(doc, blocks, output_path, *, add_legend=True, signature=None, model_label=None, source_name=None, min_confidence=0.0, on_info=None)[source]

Write doc with highlight annotations for all labelled blocks.

When add_legend=True (default) a colour legend + signature page is prepended so readers can see which colour means what. min_confidence suppresses highlights below that confidence.

on_info (optional callable) receives progress messages.

The save deliberately uses garbage=0, deflate=False. The earlier garbage=3, deflate=True recompressed every stream of the source PDF, which on a large (20 MB+) image-heavy paper ran for minutes entirely inside pymupdf’s C code — and because CPython only delivers KeyboardInterrupt between bytecode ops, that made the run both slow and uninterruptible (Ctrl-C queued but never fired). Appending the annotation objects without recompression is near-instant and keeps the C calls short enough to stay responsive to signals.

Returns the number of highlight annotations added (not counting the legend page).

Return type:

int

Colour Scheme

5-category rhetorical colour scheme for the semantic highlighter.

Block Extraction

Block extraction — paragraph-level layout + sentence-level splitting.

class scitex_scholar.pdf_highlight._blocks.Block(id, page, bbox, text, category=None, confidence=0.0)[source]

A unit of classification — either a paragraph or a sentence.

bbox is always the paragraph-level clip rectangle. For sentence units this is used only as the search region when locating the sentence’s glyphs on the page at annotation time.

id: int
page: int
bbox: tuple[float, float, float, float]
text: str
category: str | None = None
confidence: float = 0.0
__init__(id, page, bbox, text, category=None, confidence=0.0)
scitex_scholar.pdf_highlight._blocks._split_sentences(text)[source]

Naive academic-aware sentence splitter.

Splits on sentence-ending punctuation followed by whitespace and a capital/digit/opening quote, then re-joins splits that follow common abbreviations (Fig., e.g., et al., single-initial J.).

Return type:

list[str]
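The split-then-rejoin approach can be sketched in a few lines. The abbreviation list and both regular expressions below are assumptions chosen to match the description; the private function's actual patterns are not shown in this reference.

```python
import re

# Sentence end, whitespace, then a capital, digit, or opening quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\u201c])')
# A fragment ending in a known abbreviation or a single initial.
_ABBREV = re.compile(r"(?:\b(?:Fig|Figs|Eq|e\.g|i\.e|al|vs)\.|\b[A-Z]\.)$")


def split_sentences(text):
    """Split text into sentences, re-joining splits after abbreviations."""
    parts = _BOUNDARY.split(text.strip())
    sentences = []
    for part in parts:
        if sentences and _ABBREV.search(sentences[-1]):
            # The previous piece ended in an abbreviation: undo the split.
            sentences[-1] = sentences[-1] + " " + part
        else:
            sentences.append(part)
    return [s for s in sentences if s]
```

A splitter like this remains naive: a sentence that genuinely ends in a single capital letter would be merged with the next one.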

scitex_scholar.pdf_highlight._blocks.extract_blocks(pdf_path, min_chars=40, *, sentence_level=True)[source]

Open a PDF and return (document, units-of-classification).

sentence_level=True (default) yields one unit per sentence, which gives much tighter highlights — avoids painting a whole paragraph green when only its last two sentences state the claim. sentence_level=False yields one unit per paragraph.

Units shorter than min_chars are dropped (filters page numbers, running headers, short captions, and sentence fragments).

Return type:

tuple[Document, list[Block]]

Classifier

LLM and offline classifiers for the semantic highlighter.

scitex_scholar.pdf_highlight._classifier._available_models(client)[source]

Best-effort list of model IDs the account can call. Empty on failure.

Return type:

list[str]

scitex_scholar.pdf_highlight._classifier._model_not_found_error(model, client)[source]

Build a helpful error that hints at the available model IDs.

Return type:

RuntimeError

scitex_scholar.pdf_highlight._classifier._retry_wait_seconds(exc, attempt, base, cap)[source]

Seconds to wait before the next retry.

Prefers the server’s Retry-After header; otherwise uses exponential backoff (base * 2**attempt, capped) with full jitter so concurrent callers don’t retry in lockstep.

Return type:

float
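A minimal sketch of this wait policy, under stated assumptions: the real function receives the exception object, whereas this hypothetical version takes the Retry-After value directly, and whether the library caps a server-supplied delay is not documented here.

```python
import random


def retry_wait_seconds_sketch(retry_after, attempt, base, cap):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the raw Retry-After header value, or None if absent
    (extracting it from the exception is left out; that part is an assumption).
    """
    if retry_after is not None:
        try:
            # Honour the server's hint when it is a plain number of seconds.
            return max(0.0, float(retry_after))
        except (TypeError, ValueError):
            pass  # e.g. an HTTP-date value; fall back to backoff below.
    # Exponential backoff, capped, with full jitter: uniform in [0, ceiling].
    ceiling = min(cap, base * 2 ** attempt)
    return random.uniform(0.0, ceiling)
```

Full jitter draws the whole wait from the backoff window rather than adding a small random offset, which spreads concurrent callers across the window.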

scitex_scholar.pdf_highlight._classifier._classify_one_batch(client, anthropic, model, batch, retryable, max_retries, backoff_base, backoff_cap, info)[source]

Call the API for one batch, retrying transient errors with backoff.

Returns the message on success, or None if the batch still fails with a retryable error after max_retries. Raises for non-retryable errors (bad model, other 4xx) so the whole run aborts on those.

Return type:

Optional[Any]
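The retry contract above (return the message, return None once retries are exhausted, let non-retryable errors propagate) can be sketched as below. TransientError, call_with_retries, send, and wait_for are hypothetical names for illustration; the sketch also assumes max_retries counts retries after the first attempt.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable API error (429, 5xx, connection)."""


def call_with_retries(send, batch, max_retries, wait_for, sleep=time.sleep):
    """Call send(batch); retry transient failures, give up with None.

    Any exception other than TransientError is not caught, so it aborts the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return send(batch)
        except TransientError:
            if attempt == max_retries:
                return None  # still failing after the last allowed retry
            sleep(wait_for(attempt))
    return None
```

Passing sleep as a parameter keeps the loop testable without real delays.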

scitex_scholar.pdf_highlight._classifier._apply_predictions(batch, raw)[source]

Parse the model’s JSON reply and write categories onto batch.

Return type:

None
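The reply format is not specified in this reference. As a rough sketch only, assume the model returns a JSON array of objects with id, category, and confidence keys; the Unit class below is a hypothetical stand-in for Block, and the function name is likewise made up.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for Block, reduced to the fields used here."""
    id: int
    text: str
    category: Optional[str] = None
    confidence: float = 0.0


def apply_predictions_sketch(batch, raw):
    """Write categories onto `batch` in place; ignore anything malformed."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return  # unparseable reply: leave the batch unclassified
    if not isinstance(items, list):
        return
    by_id = {unit.id: unit for unit in batch}
    for item in items:
        if not isinstance(item, dict) or not isinstance(item.get("id"), int):
            continue
        unit = by_id.get(item["id"])
        if unit is None:
            continue  # the model referenced an id that is not in this batch
        unit.category = item.get("category")
        try:
            unit.confidence = float(item.get("confidence", 0.0))
        except (TypeError, ValueError):
            unit.confidence = 0.0
```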

scitex_scholar.pdf_highlight._classifier.classify_llm(blocks, model, batch_size=25, on_warning=None, on_info=None, max_retries=8, backoff_base=2.0, backoff_cap=60.0, concurrency=4)[source]

Classify blocks in-place by calling the Anthropic Messages API.

Batches are sent concurrently (up to concurrency in flight) to cut wall-clock time, while each batch independently retries rate-limit (429) and transient server/connection errors with exponential backoff that honors any Retry-After header. A batch that stays unrecoverable is skipped (its units stay unclassified) so the run still produces a partial result. Per-batch progress is reported via on_info.

Return type:

None
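The batching and bounded concurrency described above might look like the following sketch. It assumes a thread pool, which this reference does not confirm; classify_concurrently and classify_batch are hypothetical names, with classify_batch returning True on success and False for a batch that was skipped.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_concurrently(blocks, classify_batch, batch_size=25, concurrency=4, on_info=None):
    """Run classify_batch over fixed-size batches, at most `concurrency` at a time.

    Returns the number of skipped batches; their units are left unclassified.
    """
    batches = [blocks[i:i + batch_size] for i in range(0, len(blocks), batch_size)]
    skipped = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in submission order; an exception raised by
        # classify_batch surfaces here and aborts the whole run.
        for index, ok in enumerate(pool.map(classify_batch, batches), start=1):
            if not ok:
                skipped += 1
            if on_info is not None:
                status = "done" if ok else "skipped"
                on_info(f"batch {index}/{len(batches)} {status}")
    return skipped
```

Because each batch mutates only its own units, the workers need no locking in this sketch.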

scitex_scholar.pdf_highlight._classifier.classify_stub(blocks)[source]

Offline keyword heuristic. No API calls. Useful for smoke tests.

Return type:

None

Annotator

PDF annotation — tight per-sentence highlights + legend/signature page.

scitex_scholar.pdf_highlight._annotator._chunk_quads_for_sentence(page, rect, sentence)[source]

Locate a sentence piecewise via short word-window probes.

A whole-sentence search_for usually fails when the sentence wraps across lines (the on-page text has line breaks / hyphenation that the whitespace-normalised sentence string does not). We instead walk the sentence in ~60-char word windows; each window almost always lives on a single line, so search_for matches it and returns a tight quad. Concatenating the windows’ quads highlights only the sentence’s glyphs — never the surrounding paragraph.

Return type:

list[Quad]
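The windowing step can be illustrated on its own, without a PDF. The sketch below only shows how a sentence might be cut into word-aligned windows of at most about 60 characters; the function name and the greedy packing rule are assumptions, and the per-window search is left out.

```python
def word_windows(sentence: str, max_chars: int = 60) -> list[str]:
    """Greedily pack whole words into windows of at most max_chars characters.

    A single word longer than max_chars gets a window of its own.
    """
    windows: list[str] = []
    current: list[str] = []
    length = 0
    for word in sentence.split():
        extra = len(word) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            windows.append(" ".join(current))
            current, length = [word], len(word)
        else:
            current.append(word)
            length += extra
    if current:
        windows.append(" ".join(current))
    return windows
```

In PyMuPDF, Page.search_for accepts clip and quads arguments, so each window could then be probed with something like page.search_for(window, clip=rect, quads=True) and the resulting quads concatenated.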

scitex_scholar.pdf_highlight._annotator._search_quads_for_sentence(page, rect, sentence)[source]

Locate a sentence’s glyphs, tightest-match first.

  1. Whole-sentence probes (fast path for single-line sentences).

  2. Word-window chunks (handles sentences that wrap across lines).

Returns an empty list when the sentence cannot be located at all; the caller decides whether to skip it. We deliberately do NOT fall back to the paragraph’s line boxes — that paints the entire paragraph and is the source of the over-large block highlights.

Return type:

list[Quad]

scitex_scholar.pdf_highlight._annotator.apply_highlights(doc, blocks, *, min_confidence=0.0, on_info=None)[source]

Overlay one highlight annotation per classified block. Returns count.

min_confidence skips any classified block whose confidence is below the threshold, so a reader can thin out low-certainty highlights.

on_info (optional callable) receives periodic progress messages while the per-sentence text search runs — this phase is CPU-bound and otherwise silent, so without it a long PDF looks hung.

Return type:

int

scitex_scholar.pdf_highlight._annotator._corner_rect(page, corner, w=210, h=112)[source]

Return a rect anchored to corner of page (“lr”, “ll”, “lc”).

Return type:

Rect
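As an illustration of the corner anchoring, the sketch below computes the rectangle as a plain (x0, y0, x1, y1) tuple. It assumes PyMuPDF-style page coordinates (origin at the top-left, y growing downward) and an 18-point margin; the margin value and the function name are hypothetical.

```python
def corner_rect_sketch(page_w, page_h, corner, w=210, h=112, margin=18):
    """Return (x0, y0, x1, y1) for a w-by-h box in a lower corner of the page."""
    if corner == "lr":    # lower-right
        x0 = page_w - margin - w
    elif corner == "ll":  # lower-left
        x0 = margin
    elif corner == "lc":  # lower-centre
        x0 = (page_w - w) / 2
    else:
        raise ValueError(f"unknown corner: {corner!r}")
    y1 = page_h - margin  # bottom edge sits one margin above the page bottom
    return (x0, y1 - h, x0 + w, y1)
```

For an A4 page of 595 by 842 points, corner_rect_sketch(595, 842, "lr") gives (367, 712, 577, 824).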

scitex_scholar.pdf_highlight._annotator._draw_legend_overlay(page, rect, *, signature, model_label, source_name)[source]

Paint a small opaque legend panel into rect on page.

Opaque white background so the panel remains readable even if it overlays text underneath. Kept intentionally small — the information density is high and the goal is unobtrusive reference.

Return type:

None

scitex_scholar.pdf_highlight._annotator.add_legend(doc, *, signature, model_label, source_name, corner='lr')[source]

Stamp a compact legend overlay in a corner of the last page.

Default corner is lower-right (“lr”); valid alternatives are lower-left (“ll”) and lower-centre (“lc”). No new pages are added — the overlay sits on top of any existing content (opaque background).

Return type:

None

scitex_scholar.pdf_highlight._annotator.add_legend_page(doc, *, signature, model_label, source_name, corner='lr')

Stamp a compact legend overlay in a corner of the last page.

Default corner is lower-right (“lr”); valid alternatives are lower-left (“ll”) and lower-centre (“lc”). No new pages are added — the overlay sits on top of any existing content (opaque background).

Return type:

None

MCP Tool Spec

MCP tool spec for the semantic PDF highlighter.

Provides a JSON-schema tool definition and a handler function that any MCP server implementation can import and register. The handler itself is synchronous; wrap with asyncio.to_thread inside the server.

scitex_scholar.pdf_highlight._mcp.run_tool(arguments)[source]

Execute the highlighter and return a summary suitable for an MCP reply.

The returned dict has output_path, pages, annotations_added, and counts (per-category). Raises FileNotFoundError or RuntimeError as the highlighter does; MCP servers should translate these into tool-level errors.

Return type:

dict[str, Any]
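A server-side wrapper along the lines suggested above might look like this sketch. run_tool_stub stands in for the real synchronous handler so the example runs without the package; the pdf_path argument key and the shape of the error reply are assumptions, not part of the documented tool spec.

```python
import asyncio
import time


def run_tool_stub(arguments):
    """Hypothetical stand-in for the synchronous handler; returns the documented keys."""
    time.sleep(0.01)  # simulate blocking work
    return {
        "output_path": arguments["pdf_path"] + ".highlighted.pdf",
        "pages": 0,
        "annotations_added": 0,
        "counts": {},
    }


async def handle_call(arguments):
    """Run the blocking handler in a worker thread so the event loop stays free."""
    try:
        return await asyncio.to_thread(run_tool_stub, arguments)
    except (FileNotFoundError, RuntimeError) as exc:
        # Translate highlighter failures into a tool-level error (shape assumed).
        return {"isError": True, "message": str(exc)}
```

asyncio.to_thread requires Python 3.9 or later; on older versions loop.run_in_executor serves the same purpose.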