API Reference
Top-level package
SciTeX Scholar – scientific paper search, enrichment, and management.
- Quick Start:
from scitex_scholar import Scholar, Paper, Papers
scholar = Scholar()
papers = scholar.search("deep learning")
papers.save("results.bib")
- Installation:
pip install scitex-scholar
This module uses PEP 562 lazy __getattr__ so import scitex_scholar stays under 500ms cold-start. Submodules are imported on first attribute access only.
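The lazy-loading pattern described above can be sketched as follows. This is a minimal illustration of PEP 562, not the package's actual __init__ code: the `lazy_demo` module and its name-to-module map are hypothetical, and stdlib modules stand in for the real submodules.

```python
import importlib
import types


def make_lazy_module(name, lazy_map):
    """Build a module whose attributes are imported on first access (PEP 562)."""
    module = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails.
        if attr not in lazy_map:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        value = getattr(importlib.import_module(lazy_map[attr]), attr)
        setattr(module, attr, value)  # cache so later lookups skip __getattr__
        return value

    module.__getattr__ = __getattr__
    return module


# Hypothetical map: attribute name -> module that defines it.
lazy_demo = make_lazy_module("lazy_demo", {"sqrt": "math", "dumps": "json"})
```

Nothing is imported until `lazy_demo.sqrt` is first touched; after that the value is cached in the module namespace, so the hook is not consulted again for that name.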
- class scitex_scholar.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]
Bases:
EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin
Main interface for SciTeX Scholar - scientific literature management made simple.
By default, papers are automatically enriched with:
Journal impact factors from impact_factor package (2024 JCR data)
Citation counts from Semantic Scholar (via DOI/title matching)
Examples
Basic search with automatic enrichment:
scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have impact_factor and citation_count populated
papers.save("my_pac.bib")
Disable automatic enrichment if needed:
config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)
Search a specific source:
papers = scholar.search("transformer models", sources='arxiv')
Advanced workflow:
papers = (
    scholar.search("transformer models", year_min=2020)
    .filter(min_citations=50)
    .sort_by("impact_factor")
    .save("transformers.bib")
)
Local library:
scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")
- property name
Class name for logging.
- __init__(config=None, project=None, project_description=None, browser_mode=None)[source]
Initialize Scholar with configuration.
- Parameters:
config (Union[ScholarConfig, str, Path, None]) – One of:
ScholarConfig instance
Path to YAML config file (str or Path)
None (uses ScholarConfig.load() to find config)
project (Optional[str]) – Default project name for operations.
project_description (Optional[str]) – Optional description for the project.
browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').
- class scitex_scholar.Paper(*args, **kwargs)[source]
Bases:
BaseModel
Complete paper with metadata and container.
- model_dump(**kwargs)[source]
Custom serialization to ensure all nested models use aliases.
- classmethod from_dict(data)[source]
Create from dictionary (for loading from JSON).
Uses Pydantic's model_validate, which handles:
- Type validation
- Type coercion (e.g., "2024" -> 2024)
- Field aliases (e.g., "2025" -> y2025)
- Return type:
- to_dict()[source]
Convert to dictionary for JSON serialization.
Alias for model_dump() for backward compatibility.
- detect_open_access(use_unpaywall=False, update_metadata=True)[source]
Detect open access status for this paper.
Uses identifiers (DOI, arXiv ID, PMCID) and known OA sources to determine if the paper is freely available.
- property is_open_access: bool
Check if paper is open access (quick check without API calls).
- class scitex_scholar.Papers(papers=None, project=None, config=None)[source]
Bases:
object
A simple collection of Paper objects.
This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.
Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.
- __init__(papers=None, project=None, config=None)[source]
Initialize Papers collection.
- __getitem__(index)[source]
Get paper(s) by index or slice.
- append(paper)[source]
Add a paper to the collection.
- extend(papers)[source]
Add multiple papers to the collection.
- filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]
Filter papers by condition or criteria.
- Parameters:
condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.
has_abstract (Optional[bool]) – Filter papers with/without abstract.
has_pdf (Optional[bool]) – Filter papers with/without PDF URL.
min_impact_factor (Optional[float]) – Minimum journal impact factor.
max_impact_factor (Optional[float]) – Maximum journal impact factor.
keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).
**kwargs – Additional keyword arguments for backward compatibility.
- Returns:
New Papers collection with filtered papers.
- Return type:
Examples
Filter using a lambda condition:
high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)
Filter using built-in parameters:
high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)
Combine multiple parameters:
filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)
Chain filters for AND logic:
elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)
- sort_by(*criteria, reverse=False, **kwargs)[source]
Sort papers by criteria.
- Parameters:
*criteria – Field names (as strings) or lambda functions to sort by.
reverse (bool) – Sort in descending order (default: False).
**kwargs – Additional options.
- Returns:
New sorted Papers collection.
- Return type:
Notes
Available Paper fields for sorting:
title – Paper title
year – Publication year
citation_count – Number of citations
journal_impact_factor – Journal impact factor
journal – Journal name
publisher – Publisher name
doi – Digital Object Identifier
created_at – When record was created
updated_at – When record was last updated
Examples
Sort by a single field:
by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)
Sort by multiple fields (primary, secondary, etc.):
by_year_then_citations = papers.sort_by('year', 'citation_count')
Sort using a lambda function:
by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)
Sort by a computed value:
by_citation_per_year = papers.sort_by(
    # max(1, ...) avoids division by zero for papers published in 2024
    lambda p: (p.citation_count or 0) / max(1, 2024 - p.year) if p.year else 0,
    reverse=True,
)
- classmethod from_bibtex(bibtex_input)[source]
Load papers from BibTeX.
DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.
- classmethod _from_bibtex_file(file_path)[source]
Load papers from BibTeX file.
- classmethod _from_bibtex_text(bibtex_content)[source]
Load papers from BibTeX text.
- static _bibtex_entry_to_paper(entry)[source]
Convert BibTeX entry to Paper object.
- save(output_path, format='auto', **kwargs)[source]
Save papers to file.
DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.
- to_dict()[source]
Convert to dictionary.
DEPRECATED: Use papers_utils.papers_to_dict() for new code.
- to_dataframe()[source]
Convert to pandas DataFrame.
DEPRECATED: Use papers_utils.papers_to_dataframe() for new code.
- Return type:
- Returns:
DataFrame with papers data
- class scitex_scholar.ScholarConfig(config_path=None, scholar_dir=None)[source]
Bases:
object
- __init__(config_path=None, scholar_dir=None)[source]
Initialize ScholarConfig.
- Parameters:
config_path (Union[Path, str, None]) – Path to custom config YAML file
scholar_dir (Union[Path, str, None]) – Direct path to scholar directory (e.g., /data/users/alice/.scitex). This bypasses the SCITEX_DIR env var for thread-safe multi-user usage. Use this in Django/multi-user environments to avoid race conditions.
- __getattr__(name)[source]
Delegate all get_* methods to path_manager.
- __dir__()[source]
Include path_manager's get_* methods in dir() output.
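The delegation described for __getattr__ and __dir__ can be sketched like this. The classes and the two `get_*` method names below are hypothetical stand-ins, not the real ScholarConfig or path manager; only the forwarding pattern is the point.

```python
class PathManagerSketch:
    """Stand-in path manager with two hypothetical get_* methods."""

    def get_library_dir(self):
        return "/tmp/scholar/library"

    def get_cache_dir(self):
        return "/tmp/scholar/cache"


class ConfigSketch:
    def __init__(self):
        self.path_manager = PathManagerSketch()

    def __getattr__(self, name):
        # Reached only when normal lookup fails; forward get_* names.
        if name.startswith("get_"):
            return getattr(self.path_manager, name)
        raise AttributeError(
            f"{type(self).__name__!r} object has no attribute {name!r}"
        )

    def __dir__(self):
        # Advertise the delegated methods so tab completion can find them.
        delegated = [n for n in dir(self.path_manager) if n.startswith("get_")]
        return sorted(set(super().__dir__()) | set(delegated))


cfg = ConfigSketch()
```

Because `__getattr__` runs only after ordinary lookup fails, methods defined directly on the config class are never shadowed by the path manager.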
- resolve(key, direct_val=None, default=None, type=<class 'str'>, mask=None)[source]
Resolve configuration value with precedence: direct → config → env → default
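The precedence chain above (direct → config → env → default) can be illustrated with a small stand-alone function. This is a sketch under stated assumptions: the `SCITEX_SCHOLAR_` environment prefix, the `env` parameter, and the returned `(value, source)` pair are all hypothetical and not taken from the library.

```python
import os


def resolve(key, direct_val=None, config=None, env=None, default=None, cast=str):
    """Return (value, source) using the order direct -> config -> env -> default."""
    config = config or {}
    env = os.environ if env is None else env
    env_key = "SCITEX_SCHOLAR_" + key.upper()  # assumed naming scheme
    if direct_val is not None:
        source, value = "direct", direct_val
    elif key in config:
        source, value = "config", config[key]
    elif env_key in env:
        source, value = "env", env[env_key]
    else:
        source, value = "default", default
    if value is not None and cast is not None:
        value = cast(value)  # e.g. turn the env string "60" into 60
    return value, source
```

Returning the source alongside the value is one way to support a report of how each setting was resolved, which is what print() is documented to show.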
- get(key)[source]
Get value from config dict only
- print()[source]
Print how each config was resolved
- clear_log()[source]
Clear resolution log
- classmethod load(path=None)[source]
- property paths
Access to path manager for organized directory structure
- class scitex_scholar.ScholarAuthManager(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]
Bases:
object
Manages multiple authentication providers.
This class coordinates between different authentication methods (OpenAthens, Lean Library, etc.) and provides a unified interface.
- __init__(email_openathens=None, email_ezproxy=None, email_shibboleth=None, config=None)[source]
Initialize the authentication manager.
- Parameters:
email_openathens (Optional[str]) – User's institutional email for OpenAthens authentication
email_ezproxy (Optional[str]) – User's institutional email for EZProxy authentication
email_shibboleth (Optional[str]) – User's institutional email for Shibboleth authentication
config (Optional[ScholarConfig]) – ScholarConfig instance (creates new if None)
- async ensure_authenticate_async(provider_name=None, verify_live=True, **kwargs)[source]
- Return type:
- async is_authenticate_async(verify_live=True)[source]
Check whether the user is authenticated with any provider.
- Return type:
- async authenticate_async(provider_name=None, **kwargs)[source]
Authenticate with specified or active provider.
- Return type:
- async get_auth_headers_async()[source]
Get authentication headers from active provider.
- async get_auth_cookies_async(essential_only=True)[source]
Get authentication cookies from active provider.
- _register_provider(name, provider)[source]
Register an authentication provider with email context.
- Return type:
- class scitex_scholar.ScholarBrowserManager(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]
Bases:
BrowserMixin
Manages a local browser instance with stealth enhancements and invisible mode.
- __init__(browser_mode=None, auth_manager=None, chrome_profile_name=None, config=None)[source]
Initialize ScholarBrowserManager with invisible browser capabilities.
- Parameters:
auth_manager – Authentication manager instance
config (ScholarConfig) – Scholar configuration instance
- async get_authenticated_browser_and_context_async(**context_options)[source]
Get browser context with authentication cookies and extensions loaded.
- Return type:
tuple[Browser,BrowserContext]
- async _new_context_async(browser, **context_options)[source]
Creates a new browser context with stealth options and invisible mode applied.
- Return type:
BrowserContext
- _verify_xvfb_running(_recursed=False)[source]
Verify Xvfb virtual display is running; auto-start if absent.
- async _load_auth_cookies_to_persistent_context_async()[source]
Load authentication cookies into the persistent browser context.
- async take_screenshot_async(page, path, timeout_sec=30.0, timeout_after_sec=30.0, full_page=False)[source]
Take screenshot without viewport changes.
- async start_periodic_screenshots_async(page, output_dir, prefix='periodic', interval_seconds=1, duration_seconds=10, verbose=False)[source]
Start taking periodic screenshots in the background.
- Parameters:
- Returns:
asyncio.Task that can be cancelled to stop screenshots
- async stop_periodic_screenshots_async(task)[source]
Stop periodic screenshots task.
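The start/stop pairing above follows the usual asyncio background-task pattern: start returns a Task, and stopping means cancelling it and awaiting the cancellation. The sketch below shows that pattern with a counter in place of real screenshots; `periodic` and `main` are hypothetical names and no browser is involved.

```python
import asyncio


async def periodic(action, interval_seconds, duration_seconds):
    """Call action() every interval_seconds until duration_seconds elapse."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + duration_seconds
    while loop.time() < deadline:
        action()
        await asyncio.sleep(interval_seconds)


async def main():
    ticks = []
    # Start the background work; the returned Task is the handle used to stop it.
    task = asyncio.create_task(periodic(lambda: ticks.append(1), 0.01, 10))
    await asyncio.sleep(0.05)
    task.cancel()  # request the stop
    try:
        await task  # wait until the cancellation has actually been processed
    except asyncio.CancelledError:
        pass
    return len(ticks), task.cancelled()


count, cancelled = asyncio.run(main())
```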
- async close()[source]
Close browser while preserving authentication and extension data.
- class scitex_scholar.ScholarURLFinder(context, config=None)[source]
Bases:
object
Find PDF URLs from web pages.
Simple, focused responsibility:
- Input: Page or URL string
- Output: List of PDF URLs
Authentication/DOI resolution should be handled BEFORE calling this.
- PAGE_LOAD_TIMEOUT = 30000
- __init__(context, config=None)[source]
- async find_pdf_urls(page_or_url, base_url=None)[source]
Find PDF URLs from page or URL string.
- async _find_pdf_urls_with_strategies(page, base_url=None)[source]
Try strategies in priority order.
- _extract_doi(url)[source]
Extract DOI from string if present.
- Parameters:
url (str) – URL string or DOI
- Return type:
- Returns:
DOI string if found, None otherwise
Examples
>>> _extract_doi("10.1038/nature12345")
"10.1038/nature12345"
>>> _extract_doi("doi:10.1038/nature12345")
"10.1038/nature12345"
>>> _extract_doi("https://example.com")
None
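One way to get the behaviour shown in these examples is a regular-expression search for the DOI prefix. The pattern below is an assumption for illustration (a "10." prefix, a 4 to 9 digit registrant code, a slash, then a suffix) and is not the library's implementation; `extract_doi` is a hypothetical stand-alone name.

```python
import re

# "10." + registrant code + "/" + suffix up to whitespace, quote, or angle bracket.
_DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def extract_doi(text):
    """Return the first DOI-looking substring of text, or None."""
    match = _DOI_RE.search(text)
    if match is None:
        return None
    # Trailing punctuation usually belongs to the surrounding sentence.
    return match.group(0).rstrip(".,;)")
```

Using a search rather than a full match is what lets the same function accept a bare DOI, a `doi:` prefixed string, or a resolver URL.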
- async _find_from_page(page, base_url=None)[source]
Find PDFs from existing page.
- _managed_page()[source]
Context manager for page lifecycle.
- class scitex_scholar.CitationGraphBuilder(db_path=None, api_url=None)[source]
Bases:
object
Build citation network graphs for academic papers.
Auto-detects backend via crossref_local.Config (DB → HTTP).
- Example (auto-detect):
>>> builder = CitationGraphBuilder()
>>> graph = builder.build("10.1038/s41586-020-2008-3", top_n=20)
- Example (explicit SQLite):
>>> builder = CitationGraphBuilder(db_path="/path/to/crossref.db")
- Example (explicit HTTP):
>>> builder = CitationGraphBuilder(api_url="http://localhost:31291")
- __init__(db_path=None, api_url=None)[source]
Initialize builder with database path, HTTP API URL, or auto-detect.
When no args are given, delegates to crossref_local.Config for auto-detection:
1. CROSSREF_LOCAL_MODE env var (explicit "db" or "http")
2. CROSSREF_LOCAL_API_URL env var → HTTP mode
3. Local DB file existence → DB mode
4. Fallback to HTTP mode
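The four detection steps can be read as a simple decision function. The sketch below mirrors the listed order only; `detect_mode` is a hypothetical helper, not the code in crossref_local.Config, and it takes the environment and candidate DB path as explicit arguments so the order is easy to follow.

```python
from pathlib import Path


def detect_mode(env, db_path):
    """Return "db" or "http" following the documented precedence."""
    mode = env.get("CROSSREF_LOCAL_MODE", "").lower()
    if mode in ("db", "http"):
        return mode  # 1. explicit mode wins
    if env.get("CROSSREF_LOCAL_API_URL"):
        return "http"  # 2. an API URL implies HTTP mode
    if db_path is not None and Path(db_path).is_file():
        return "db"  # 3. a local database file exists
    return "http"  # 4. fallback
```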
- _auto_detect()[source]
Auto-detect backend via crossref_local.Config.
- build(seed_doi, top_n=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network around a seed paper.
- Parameters:
- Return type:
- Returns:
CitationGraph object with nodes and edges
- _create_paper_node(doi, similarity_score)[source]
Create a PaperNode with metadata from database.
- _build_citation_edges(dois)[source]
Build citation edges between papers in the network.
- Parameters:
- Return type:
- Returns:
List of CitationEdge objects
- build_from_dois(dois, num_related_per_doi=20, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network from multiple seed DOIs.
Combines similarity scores from all seeds to find papers related to the entire set, producing a richer connected graph.
- Parameters:
- Return type:
- Returns:
CitationGraph with all seeds + related papers + edges
- build_from_query(query, num_related_per_doi=20, search_limit=10, weight_coupling=2.0, weight_cocitation=2.0, weight_direct=1.0)[source]
Build citation network from a text query.
Searches local databases, extracts DOIs from results, then delegates to build_from_dois().
- Parameters:
query (str) – Search query (e.g. "hippocampal sharp wave ripples")
num_related_per_doi (int) – Related papers per seed DOI
search_limit (int) – Max papers to fetch from search
weight_coupling (float) – Weight for bibliographic coupling
weight_cocitation (float) – Weight for co-citation
weight_direct (float) – Weight for direct citations
- Return type:
- Returns:
CitationGraph with search-discovered seeds + related papers
- export_json(graph, output_path)[source]
Export graph to JSON file for visualization.
- Parameters:
graph (CitationGraph) – CitationGraph to export
output_path (str) – Path to output JSON file
- scitex_scholar.plot_citation_graph(graph, backend='auto', output=None, **kwargs)[source]
Visualize a citation graph with pluggable backends.
- Parameters:
graph (CitationGraph or networkx.DiGraph) – Citation network to visualize. CitationGraph is auto-converted via to_networkx().
backend (str) – Rendering backend: 'auto', 'figrecipe', 'scitex.plt', 'matplotlib', or 'pyvis'. Default 'auto' picks the best available.
output (str, optional) – Output file path. Required for ‘pyvis’ backend (HTML). For static backends, saves the figure to this path.
**kwargs – Backend-specific keyword arguments (layout, seed, figsize, etc.).
- Returns:
Backend-specific result. Static backends return {'fig', 'ax', 'pos', 'backend'}. Pyvis returns {'output', 'backend'}.
- Return type:
- scitex_scholar.to_bibtex(paper)[source]
Format a standard paper dict as a BibTeX entry.
- Return type:
- scitex_scholar.to_endnote(paper)[source]
Format a standard paper dict as an EndNote entry.
- Return type:
- scitex_scholar.to_text_citation(paper, style='apa', doc_type='article')[source]
Format a paper dict as a text citation in the given style.
- scitex_scholar.papers_to_format(papers, fmt)[source]
Format a list of paper dicts to the given format string.
- Return type:
- scitex_scholar.generate_cite_key(paper)[source]
Generate a BibTeX citation key from a paper dict.
- Return type:
- scitex_scholar.make_citation_key(last_name, year=None)[source]
Generate a citation key from author last name and year.
- scitex_scholar.from_connected_papers(paper_id, *, cp_api_key=None, s2_api_key=None, output_format='citation_graph', dry_run=False)[source]
Import a Connected Papers graph into scitex.
- Parameters:
paper_id (str) – Semantic Scholar paper ID (40-char SHA) for the seed paper.
cp_api_key (str, optional) – Connected Papers API key.
s2_api_key (str, optional) – Semantic Scholar API key for DOI resolution.
output_format (str) – “citation_graph” returns CitationGraph, “papers” returns Papers.
dry_run (bool) – If True, fetch and report stats without creating objects.
- Returns:
{success: True, graph/papers, stats, warnings} or {success: False, error: str}.
- Return type:
- scitex_scholar.to_connected_papers(graph, *, output=None)[source]
Export a CitationGraph as BibTeX/JSON for Connected Papers.
- Parameters:
graph (CitationGraph) – Citation graph to export.
output (str or Path, optional) – Output directory. Defaults to current directory.
- Returns:
{success, bibtex_path, json_path, paper_count} or {success: False, error}.
- Return type:
- scitex_scholar.apply_filters(papers, filters=None, parsed_operators=None)[source]
Filter a list of paper dicts by various criteria.
- Parameters:
papers (List[Dict[str, Any]]) – List of paper dicts. Each dict should contain the keys described in the module docstring; missing keys are treated as empty / zero values.
filters (Optional[Dict[str, Any]]) – Dict of filter criteria extracted from a search form or URL parameters. Supported keys:
year_from, year_to – year range (int)
min_citations, max_citations – citation range (int)
min_impact_factor – minimum IF (float)
max_impact_factor – maximum IF (float)
authors – list of author name strings (legacy)
journal – journal name substring (legacy, str)
open_access – bool
doc_type – "review" | "preprint" | other
language – language string ("english" passes)
parsed_operators (Optional[Dict[str, Any]]) – Dict produced by SearchQueryParser.from_shell_syntax() or the equivalent parse_query_operators() function from scitex-cloud. Supported keys:
title_includes, title_excludes – list[str]
author_includes, author_excludes – list[str]
journal_includes, journal_excludes – list[str]
year_min, year_max – int
citations_min, citations_max – int
impact_factor_min, impact_factor_max – float
- Returns:
Filtered list of paper dicts (same objects, not copies).
- Return type:
- scitex_scholar.clean_abstract(text)
Strip HTML/JATS XML tags from a CrossRef-style abstract.
- Return type:
Semantic Highlighter
Semantic PDF highlighter for academic papers.
Overlays rhetorical-role highlights (claim / method / limitation / supportive / contradictive) onto a copy of a PDF without modifying its underlying text. Highlights are standard PDF annotation objects compatible with any viewer.
- class scitex_scholar.pdf_highlight.Block(id, page, bbox, text, category=None, confidence=0.0)[source]
Bases:
object
A unit of classification: either a paragraph or a sentence.
bbox is always the paragraph-level clip rectangle. For sentence units this is used only as the search region when locating the sentence's glyphs on the page at annotation time.
- __init__(id, page, bbox, text, category=None, confidence=0.0)
- class scitex_scholar.pdf_highlight.HighlightResult(input_path, output_path, blocks, pages, annotations_added)[source]
Bases:
object
- __init__(input_path, output_path, blocks, pages, annotations_added)
- scitex_scholar.pdf_highlight.apply_classifications(blocks, classifications)[source]
Assign offline-produced labels to already-extracted blocks.
Each entry must contain at least id and category; confidence is optional (defaults to 0.0). Categories outside CATEGORIES are silently dropped.
Returns the number of blocks that received a label.
- Return type:
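The labelling rules described for apply_classifications (match by id, default confidence 0.0, drop unknown categories, return the count) can be sketched as below. The simplified `Block` dataclass omits page and bbox, the string ids are hypothetical, and `CATEGORIES` is assumed to be the five rhetorical roles named in this section.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed to mirror the five roles named in the module description.
CATEGORIES = {"claim", "method", "limitation", "supportive", "contradictive"}


@dataclass
class Block:
    id: str
    text: str
    category: Optional[str] = None
    confidence: float = 0.0


def apply_classifications(blocks, classifications):
    """Write offline labels onto blocks in place; return how many were labelled."""
    by_id = {block.id: block for block in blocks}
    labelled = 0
    for entry in classifications:
        block = by_id.get(entry["id"])
        if block is None or entry["category"] not in CATEGORIES:
            continue  # unknown id or category: drop silently
        block.category = entry["category"]
        block.confidence = float(entry.get("confidence", 0.0))
        labelled += 1
    return labelled
```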
- scitex_scholar.pdf_highlight.extract_blocks(pdf_path, min_chars=40, *, sentence_level=True)[source]
Open a PDF and return (document, units-of-classification).
sentence_level=True (default) yields one unit per sentence, which gives much tighter highlights and avoids painting a whole paragraph green when only its last two sentences state the claim. sentence_level=False yields one unit per paragraph.
Units shorter than min_chars are dropped (filters page numbers, running headers, short captions, and sentence fragments).
- scitex_scholar.pdf_highlight.highlight_pdf(pdf_path, output_path=None, *, model='claude-haiku-4-5-20251001', use_stub=False, dry_run=False, max_blocks=0, batch_size=25, min_chars=40, sentence_level=True, add_legend=True, min_confidence=0.0, concurrency=4, on_info=None, on_warning=None)[source]
Annotate a PDF with rhetorical-role highlights.
- Parameters:
output_path (Union[str, PathLike, None]) – Output PDF. Defaults to <input>.highlighted.pdf.
model (str) – Anthropic model ID used by the LLM classifier.
use_stub (bool) – If True, classify with a keyword heuristic (no API call).
dry_run (bool) – If True, classify but do not write the output PDF.
max_blocks (int) – If >0, truncate to the first N extracted units.
batch_size (int) – Classifier batch size (units per API call).
min_chars (int) – Minimum text length for an extracted unit.
sentence_level (bool) – If True (default), classify and highlight at sentence granularity. If False, use paragraph-level (less precise but ~5× cheaper on long papers).
add_legend (bool) – If True, prepend a colour legend + signature page.
- Return type:
- Returns:
HighlightResult with the classified units and annotation count.
- scitex_scholar.pdf_highlight.save_with_highlights(doc, blocks, output_path, *, add_legend=True, signature=None, model_label=None, source_name=None, min_confidence=0.0, on_info=None)[source]
Write doc with highlight annotations for all labelled blocks.
When add_legend=True (default) a colour legend + signature page is prepended so readers can see which colour means what. min_confidence suppresses highlights below that confidence. on_info (optional callable) receives progress messages.
The save deliberately uses garbage=0, deflate=False. The earlier garbage=3, deflate=True recompressed every stream of the source PDF, which on a large (20 MB+) image-heavy paper ran for minutes entirely inside pymupdf's C code. Because CPython only delivers KeyboardInterrupt between bytecode ops, that made the run both slow and uninterruptible (Ctrl-C queued but never fired). Appending the annotation objects without recompression is near-instant and keeps the C calls short enough to stay responsive to signals.
Returns the number of highlight annotations added (not counting the legend page).
- Return type:
Colour Scheme
5-category rhetorical colour scheme for the semantic highlighter.
Block Extraction
Block extraction — paragraph-level layout + sentence-level splitting.
- class scitex_scholar.pdf_highlight._blocks.Block(id, page, bbox, text, category=None, confidence=0.0)[source]
A unit of classification: either a paragraph or a sentence.
bbox is always the paragraph-level clip rectangle. For sentence units this is used only as the search region when locating the sentence's glyphs on the page at annotation time.
- __init__(id, page, bbox, text, category=None, confidence=0.0)
- scitex_scholar.pdf_highlight._blocks._split_sentences(text)[source]
Naive academic-aware sentence splitter.
Splits on sentence-ending punctuation followed by whitespace and a capital/digit/opening quote, then re-joins splits that follow common abbreviations (Fig., e.g., et al., single-initial J.).
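The split-then-rejoin approach can be sketched in a few lines. The two regular expressions and the abbreviation list below are illustrative assumptions rather than the library's actual patterns, and `split_sentences` is a hypothetical stand-alone name.

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital, digit, or quote.
_SPLIT_RE = re.compile(r'(?<=[.!?])\s+(?=[A-Z0-9"\u201c])')
# A piece ending in one of these was probably split too early.
_ABBREV_RE = re.compile(r"(?:\b(?:Fig|Figs|Eq|e\.g|i\.e|et al|vs)\.|\b[A-Z]\.)$")


def split_sentences(text):
    """Naive splitter: split on sentence punctuation, then undo bad splits."""
    sentences = []
    for part in _SPLIT_RE.split(text.strip()):
        if sentences and _ABBREV_RE.search(sentences[-1]):
            sentences[-1] += " " + part  # re-join after an abbreviation
        else:
            sentences.append(part)
    return sentences
```

The single-initial rule is deliberately crude: it also re-joins after any sentence that happens to end in a lone capital letter, which is the kind of trade-off a naive splitter accepts.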
- scitex_scholar.pdf_highlight._blocks.extract_blocks(pdf_path, min_chars=40, *, sentence_level=True)[source]
Open a PDF and return (document, units-of-classification).
sentence_level=True (default) yields one unit per sentence, which gives much tighter highlights and avoids painting a whole paragraph green when only its last two sentences state the claim. sentence_level=False yields one unit per paragraph.
Units shorter than min_chars are dropped (filters page numbers, running headers, short captions, and sentence fragments).
Classifier
LLM and offline classifiers for the semantic highlighter.
- scitex_scholar.pdf_highlight._classifier._available_models(client)[source]
Best-effort list of model IDs the account can call. Empty on failure.
- scitex_scholar.pdf_highlight._classifier._model_not_found_error(model, client)[source]
Build a helpful error that hints the available model IDs.
- Return type:
- scitex_scholar.pdf_highlight._classifier._retry_wait_seconds(exc, attempt, base, cap)[source]
Seconds to wait before the next retry.
Prefers the server's Retry-After header; otherwise uses exponential backoff (base * 2**attempt, capped) with full jitter so concurrent callers don't retry in lockstep.
- Return type:
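A minimal sketch of that wait calculation follows. It assumes the Retry-After value has already been pulled out of the exception as a string or None (the real helper receives the exception itself), and the `rng` parameter is a hypothetical addition for testability.

```python
import random


def retry_wait_seconds(retry_after, attempt, base=2.0, cap=60.0, rng=random):
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # the server's hint wins
        except ValueError:
            pass  # e.g. an HTTP-date value; fall through to backoff
    ceiling = min(cap, base * 2 ** attempt)
    # Full jitter: pick uniformly in [0, ceiling] so callers spread out.
    return rng.uniform(0.0, ceiling)
```

With full jitter the expected wait is half the ceiling, which trades a slightly longer worst case for far fewer synchronized retries across concurrent batches.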
- scitex_scholar.pdf_highlight._classifier._classify_one_batch(client, anthropic, model, batch, retryable, max_retries, backoff_base, backoff_cap, info)[source]
Call the API for one batch, retrying transient errors with backoff.
Returns the message on success, or None if it stays unrecoverable after max_retries. Raises for non-retryable errors (bad model, other 4xx) so the whole run aborts on those.
- scitex_scholar.pdf_highlight._classifier._apply_predictions(batch, raw)[source]
Parse the model's JSON reply and write categories onto batch.
- Return type:
- scitex_scholar.pdf_highlight._classifier.classify_llm(blocks, model, batch_size=25, on_warning=None, on_info=None, max_retries=8, backoff_base=2.0, backoff_cap=60.0, concurrency=4)[source]
Classify blocks in-place by calling the Anthropic Messages API.
Batches are sent concurrently (up to concurrency in flight) to cut wall-clock time, while each batch independently retries rate-limit (429) and transient server/connection errors with exponential backoff that honors any Retry-After header. A batch that stays unrecoverable is skipped (its units stay unclassified) so the run still produces a partial result. Per-batch progress is reported via on_info.
- Return type:
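The batching behaviour (bounded concurrency, skipped batches leave units unlabelled) can be sketched with a thread pool. Whether the library uses threads or another mechanism is not stated here, so treat this as one possible shape: `classify_batches` and the `classify_batch` callback are hypothetical, and units are plain dicts for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_batches(units, classify_batch, batch_size=25, concurrency=4):
    """Label units in place, batch by batch; return the number of skipped batches."""
    batches = [units[i:i + batch_size] for i in range(0, len(units), batch_size)]
    skipped = 0
    # max_workers bounds how many batches are in flight at once.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for batch, labels in zip(batches, pool.map(classify_batch, batches)):
            if labels is None:  # batch stayed unrecoverable: leave it unlabelled
                skipped += 1
                continue
            for unit, label in zip(batch, labels):
                unit["category"] = label
    return skipped
```

`classify_batch` is expected to return one label per unit, or None once its own retries are exhausted, which is how a partial result survives a failing batch.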
Annotator
PDF annotation — tight per-sentence highlights + legend/signature page.
- scitex_scholar.pdf_highlight._annotator._chunk_quads_for_sentence(page, rect, sentence)[source]
Locate a sentence piecewise via short word-window probes.
A whole-sentence search_for usually fails when the sentence wraps across lines (the on-page text has line breaks / hyphenation that the whitespace-normalised sentence string does not). We instead walk the sentence in ~60-char word windows; each window almost always lives on a single line, so search_for matches it and returns a tight quad. Concatenating the windows' quads highlights only the sentence's glyphs, never the surrounding paragraph.
- Return type:
list[Quad]
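The windowing step on its own, without any PDF library, might look like the sketch below. `word_windows` is a hypothetical helper; in the real function each window would then be passed to the page's text search within the paragraph rectangle.

```python
def word_windows(sentence, max_chars=60):
    """Split a sentence into consecutive word groups of at most max_chars each."""
    windows, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_chars:
            windows.append(current)  # window is full: start a new one
            current = word
        else:
            current = candidate
    if current:
        windows.append(current)
    return windows
```

Windows break only at word boundaries, so a single word longer than max_chars becomes its own oversized window instead of being cut in half.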
- scitex_scholar.pdf_highlight._annotator._search_quads_for_sentence(page, rect, sentence)[source]
Locate a sentence’s glyphs, tightest-match first.
1. Whole-sentence probes (fast path for single-line sentences).
2. Word-window chunks (handles sentences that wrap across lines).
Returns an empty list when the sentence cannot be located at all; the caller decides whether to skip it. We deliberately do NOT fall back to the paragraph’s line boxes — that paints the entire paragraph and is the source of the over-large block highlights.
- Return type:
list[Quad]
- scitex_scholar.pdf_highlight._annotator.apply_highlights(doc, blocks, *, min_confidence=0.0, on_info=None)[source]
Overlay one highlight annotation per classified block. Returns count.
min_confidence skips any classified block whose confidence is below the threshold, so a reader can thin out low-certainty highlights. on_info (optional callable) receives periodic progress messages while the per-sentence text search runs; this phase is CPU-bound and otherwise silent, so without it a long PDF looks hung.
- Return type:
- scitex_scholar.pdf_highlight._annotator._corner_rect(page, corner, w=210, h=112)[source]
Return a rect anchored to corner of page ("lr", "ll", "lc").
- Return type:
Rect
- scitex_scholar.pdf_highlight._annotator._draw_legend_overlay(page, rect, *, signature, model_label, source_name)[source]
Paint a small opaque legend panel into rect on page.
Opaque white background so the panel remains readable even if it overlays text underneath. Kept intentionally small: the information density is high and the goal is unobtrusive reference.
- Return type:
- scitex_scholar.pdf_highlight._annotator.add_legend(doc, *, signature, model_label, source_name, corner='lr')[source]
Stamp a compact legend overlay in a corner of the last page.
Default corner is lower-right (“lr”); valid alternatives are lower-left (“ll”) and lower-centre (“lc”). No new pages are added — the overlay sits on top of any existing content (opaque background).
- Return type:
- scitex_scholar.pdf_highlight._annotator.add_legend_page(doc, *, signature, model_label, source_name, corner='lr')
Stamp a compact legend overlay in a corner of the last page.
Default corner is lower-right (“lr”); valid alternatives are lower-left (“ll”) and lower-centre (“lc”). No new pages are added — the overlay sits on top of any existing content (opaque background).
- Return type:
MCP Tool Spec
MCP tool spec for the semantic PDF highlighter.
Provides a JSON-schema tool definition and a handler function that
any MCP server implementation can import and register. The handler
itself is synchronous; wrap with asyncio.to_thread inside the server.
- scitex_scholar.pdf_highlight._mcp.run_tool(arguments)[source]
Execute the highlighter and return a summary suitable for an MCP reply.
The returned dict has output_path, pages, annotations_added, and counts (per-category). Raises FileNotFoundError or RuntimeError as the highlighter does; MCP servers should translate these into tool-level errors.
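Wrapping the synchronous handler with asyncio.to_thread, as suggested in the module description, could look like the sketch below. The `run_tool` here is a stand-in that returns a fixed dict, the `pdf_path` argument key and the error-reply shape are hypothetical, and no MCP server framework is assumed.

```python
import asyncio


def run_tool(arguments):
    """Stand-in for the synchronous handler; the real one runs the highlighter."""
    if "pdf_path" not in arguments:
        raise FileNotFoundError("no input PDF given")
    return {"output_path": arguments["pdf_path"] + ".highlighted.pdf"}


async def handle_call(arguments):
    """Run the blocking handler off the event loop and map errors to a reply."""
    try:
        # to_thread keeps the event loop free while the handler blocks.
        summary = await asyncio.to_thread(run_tool, arguments)
    except (FileNotFoundError, RuntimeError) as exc:
        return {"ok": False, "error": str(exc)}  # tool-level error, not a crash
    return {"ok": True, "result": summary}


ok_reply = asyncio.run(handle_call({"pdf_path": "paper"}))
error_reply = asyncio.run(handle_call({}))
```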