scitex_scholar.core

class scitex_scholar.core.Paper(*args, **kwargs)[source]

Bases: BaseModel

Complete paper with metadata and container.

model_dump(**kwargs)[source]

Custom serialization to ensure all nested models use aliases.

Return type:: Dict[str, Any]

classmethod from_dict(data)[source]

Create from dictionary (for loading from JSON).

Uses Pydantic’s model_validate, which handles:
- Type validation
- Type coercion (e.g., “2024” -> 2024)
- Field aliases (e.g., “2025” -> y2025)

Return type:: Paper

to_dict()[source]

Convert to dictionary for JSON serialization.

Alias for model_dump() for backward compatibility.

Return type:: Dict[str, Any]

detect_open_access(use_unpaywall=False, update_metadata=True)[source]

Detect open access status for this paper.

Uses identifiers (DOI, arXiv ID, PMCID) and known OA sources to determine if the paper is freely available.

Parameters:

use_unpaywall (bool) – If True, query Unpaywall API for uncertain cases
update_metadata (bool) – If True, update self.metadata.access with results

Return type:

OAResult

Returns:

OAResult with detection results

property is_open_access: bool: Check if paper is open access (quick check without API calls).

class scitex_scholar.core.Papers(papers=None, project=None, config=None)[source]

Bases: object

A simple collection of Paper objects.

This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.

Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.

__init__(papers=None, project=None, config=None)[source]

Initialize Papers collection.

Parameters:

papers (Union[List[Paper], List[Dict], None]) – List of Paper objects or dicts to convert to Papers
project (Optional[str]) – Project name for organizing papers
config (Optional[ScholarConfig]) – Scholar configuration

__len__()[source]

Number of papers in collection.

Return type:: int

__iter__()[source]

Iterate over papers.

Return type:: Iterator[Paper]

__getitem__(index)[source]

Get paper(s) by index or slice.

Parameters:: index (Union[int, slice]) – Integer index or slice
Return type:: Union[Paper, Papers]
Returns:: Single Paper if integer index, Papers collection if slice
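
The integer-versus-slice behavior described above can be sketched with a small stand-in class. `MiniPapers` is a hypothetical name used only for illustration, and its items are plain strings rather than Paper objects; this is not the library's implementation.

```python
from typing import Iterator, List, Union


class MiniPapers:
    """Hypothetical stand-in for a Papers-like collection (items are strings)."""

    def __init__(self, items: List[str]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[str]:
        return iter(self._items)

    def __getitem__(self, index: Union[int, slice]) -> Union[str, "MiniPapers"]:
        if isinstance(index, slice):
            # A slice yields a new collection of the same type.
            return MiniPapers(self._items[index])
        # An integer yields a single element.
        return self._items[index]


collection = MiniPapers(["paper-a", "paper-b", "paper-c"])
first = collection[0]   # single element
tail = collection[1:]   # new MiniPapers holding the last two items
```

Returning the collection type from a slice keeps methods such as filter and sort_by available on the result.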

__repr__()[source]

String representation.

Return type:: str

__str__()[source]

Human-readable string.

Return type:: str

__dir__()[source]

Custom dir for better discoverability.

Return type:: List[str]

property papers: List[Paper]: Get the underlying papers list.

append(paper)[source]

Add a paper to the collection.

Parameters:: paper (Paper) – Paper to add
Return type:: None

extend(papers)[source]

Add multiple papers to the collection.

Parameters:: papers (Union[List[Paper], Papers]) – List of papers or another Papers collection
Return type:: None

to_list()[source]

Get papers as a list.

Return type:: List[Paper]
Returns:: List of Paper objects

filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)[source]

Filter papers by condition or criteria.

Parameters:

condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.
year_min (Optional[int]) – Minimum year.
year_max (Optional[int]) – Maximum year.
has_doi (Optional[bool]) – Filter papers with/without DOI.
has_abstract (Optional[bool]) – Filter papers with/without abstract.
has_pdf (Optional[bool]) – Filter papers with/without PDF URL.
min_citations (Optional[int]) – Minimum citation count.
max_citations (Optional[int]) – Maximum citation count.
min_impact_factor (Optional[float]) – Minimum journal impact factor.
max_impact_factor (Optional[float]) – Maximum journal impact factor.
journal (Optional[str]) – Journal name (partial match).
author (Optional[str]) – Author name (partial match).
keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).
publisher (Optional[str]) – Publisher name (partial match).
**kwargs – Additional keyword arguments for backward compatibility.

Returns:

New Papers collection with filtered papers.

Return type:

Papers

Examples

Filter using a lambda condition:

high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
recent = papers.filter(lambda p: p.year and p.year >= 2020)

Filter using built-in parameters:

high_impact_v2 = papers.filter(min_impact_factor=10.0)
highly_cited_v2 = papers.filter(min_citations=500)
recent_v2 = papers.filter(year_min=2020)

Combine multiple parameters:

filtered = papers.filter(
    min_impact_factor=5.0,
    min_citations=100,
    year_min=2015,
    year_max=2023,
    journal="Nature",
    has_doi=True,
)

Chain filters for AND logic:

elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)
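
One way such criteria can combine with AND logic is to collect a predicate per supplied argument and keep only the papers that pass all of them. The sketch below is illustrative only: `MiniPaper` and `filter_papers` are hypothetical names, the DOIs are made-up placeholders, and the treatment of missing values (a paper without a year fails any year bound; a missing citation count is treated as 0) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MiniPaper:
    """Hypothetical stand-in with a few of the documented fields."""
    title: str
    year: Optional[int] = None
    citation_count: Optional[int] = None
    doi: Optional[str] = None


def filter_papers(
    papers: List[MiniPaper],
    condition: Optional[Callable[[MiniPaper], bool]] = None,
    year_min: Optional[int] = None,
    year_max: Optional[int] = None,
    min_citations: Optional[int] = None,
    has_doi: Optional[bool] = None,
) -> List[MiniPaper]:
    """Keep papers that satisfy every criterion that was supplied."""
    checks: List[Callable[[MiniPaper], bool]] = []
    if condition is not None:
        checks.append(condition)
    if year_min is not None:
        checks.append(lambda p: p.year is not None and p.year >= year_min)
    if year_max is not None:
        checks.append(lambda p: p.year is not None and p.year <= year_max)
    if min_citations is not None:
        checks.append(lambda p: (p.citation_count or 0) >= min_citations)
    if has_doi is not None:
        checks.append(lambda p: bool(p.doi) == has_doi)
    # all() over an empty list is True, so no criteria means no filtering.
    return [p for p in papers if all(check(p) for check in checks)]


sample = [
    MiniPaper("A", year=2014, citation_count=900, doi="10.1000/a"),
    MiniPaper("B", year=2021, citation_count=40),
    MiniPaper("C", year=2022, citation_count=600, doi="10.1000/c"),
    MiniPaper("D", citation_count=700, doi="10.1000/d"),
]
combined = filter_papers(sample, year_min=2020, min_citations=500, has_doi=True)
chained = filter_papers(filter_papers(sample, year_min=2020), min_citations=500)
```

In this sketch, passing several criteria in one call and chaining two calls both narrow the result, which matches the AND semantics described for chained filters.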

sort_by(*criteria, reverse=False, **kwargs)[source]

Sort papers by criteria.

Parameters:

*criteria – Field names (as strings) or lambda functions to sort by.
reverse (bool) – Sort in descending order (default: False).
**kwargs – Additional options.

Returns:

New sorted Papers collection.

Return type:

Papers

Notes

Available Paper fields for sorting:

title – Paper title
year – Publication year
citation_count – Number of citations
journal_impact_factor – Journal impact factor
journal – Journal name
publisher – Publisher name
doi – Digital Object Identifier
created_at – When record was created
updated_at – When record was last updated

Examples

Sort by a single field:

by_year = papers.sort_by('year')
by_citations_desc = papers.sort_by('citation_count', reverse=True)

Sort by multiple fields (primary, secondary, etc.):

by_year_then_citations = papers.sort_by('year', 'citation_count')

Sort using a lambda function:

by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)

Sort by a computed value:

by_citation_per_year = papers.sort_by(
    lambda p: (p.citation_count or 0) / max(1, 2024 - p.year) if p.year else 0,
    reverse=True,
)
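
Sorting by several criteria, where each criterion is either a field name or a callable, can be sketched with a tuple-valued key function. `MiniPaper` and `sort_papers` are hypothetical names, and placing missing values last in ascending order is an assumption of this sketch, not documented behavior of sort_by.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple, Union

Criterion = Union[str, Callable[[Any], Any]]


@dataclass
class MiniPaper:
    """Hypothetical stand-in with a few of the documented fields."""
    title: str
    year: Optional[int] = None
    citation_count: Optional[int] = None


def sort_papers(
    papers: List[MiniPaper], *criteria: Criterion, reverse: bool = False
) -> List[MiniPaper]:
    """Return a new list sorted by each criterion in turn (primary first)."""

    def key(paper: MiniPaper) -> Tuple[Any, ...]:
        parts = []
        for criterion in criteria:
            if callable(criterion):
                value = criterion(paper)
            else:
                value = getattr(paper, criterion, None)
            # The leading flag keeps None from being compared with real values
            # and places missing values last when sorting in ascending order.
            parts.append((value is None, 0 if value is None else value))
        return tuple(parts)

    return sorted(papers, key=key, reverse=reverse)


sample = [
    MiniPaper("A", year=2020, citation_count=10),
    MiniPaper("B", year=2019, citation_count=50),
    MiniPaper("C", year=2020, citation_count=30),
    MiniPaper("D", citation_count=5),
]
by_year_then_citations = sort_papers(sample, "year", "citation_count")
by_citations_desc = sort_papers(sample, lambda p: p.citation_count or 0, reverse=True)
```

With reverse=True the missing-value flag is reversed as well, so papers lacking a value would come first in this sketch.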

classmethod from_bibtex(bibtex_input)[source]

Load papers from BibTeX.

DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.

Parameters:: bibtex_input (Union[str, Path]) – Path to BibTeX file or BibTeX string
Return type:: Papers
Returns:: Papers collection

classmethod _from_bibtex_file(file_path)[source]

Load papers from BibTeX file.

Parameters:: file_path (Union[str, Path]) – Path to BibTeX file
Return type:: Papers
Returns:: Papers collection

classmethod _from_bibtex_text(bibtex_content)[source]

Load papers from BibTeX text.

Parameters:: bibtex_content (str) – BibTeX content as string
Return type:: Papers
Returns:: Papers collection

static _bibtex_entry_to_paper(entry)[source]

Convert BibTeX entry to Paper object.

Parameters:: entry (Dict[str, Any]) – BibTeX entry dictionary
Return type:: Paper
Returns:: Paper object

save(output_path, format='auto', **kwargs)[source]

Save papers to file.

DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.

Parameters:

output_path (Union[str, Path]) – Path to save file
format (Optional[str]) – Output format (auto, bibtex, json, csv)
**kwargs – Additional options

Return type:

None

to_dict()[source]

Convert to dictionary.

DEPRECATED: Use papers_utils.papers_to_dict() for new code.

Return type:: List[Dict[str, Any]]
Returns:: Dictionary representation

to_dataframe()[source]

Convert to pandas DataFrame.

DEPRECATED: Use papers_utils.papers_to_dataframe() for new code.

Return type:: Any
Returns:: DataFrame with papers data

summary()[source]

Get summary statistics.

DEPRECATED: Use papers_utils.papers_statistics() for new code.

Return type:: Dict[str, Any]
Returns:: Dictionary with statistics

class scitex_scholar.core.Scholar(config=None, project=None, project_description=None, browser_mode=None)[source]

Bases: EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin

Main interface for SciTeX Scholar - scientific literature management made simple.

By default, papers are automatically enriched with:

Journal impact factors from the impact_factor package (2024 JCR data)
Citation counts from Semantic Scholar (via DOI/title matching)

Examples

Basic search with automatic enrichment:

scholar = Scholar()
papers = scholar.search("deep learning neuroscience")
# Papers now have journal_impact_factor and citation_count populated
papers.save("my_pac.bib")

Disable automatic enrichment if needed:

config = ScholarConfig(enable_auto_enrich=False)
scholar = Scholar(config=config)

Search a specific source:

papers = scholar.search("transformer models", sources='arxiv')

Advanced workflow:

papers = (
    scholar.search("transformer models", year_min=2020)
           .filter(min_citations=50)
           .sort_by("journal_impact_factor")
)
papers.save("transformers.bib")  # save() returns None, so call it outside the chain

Local library:

scholar._index_local_pdfs("./my_papers")
local_papers = scholar.search_local("attention mechanism")

property name: Class name for logging.

__init__(config=None, project=None, project_description=None, browser_mode=None)[source]

Initialize Scholar with configuration.

Parameters:

config (Union[ScholarConfig, str, Path, None]) –
One of:
- ScholarConfig instance
- Path to YAML config file (str or Path)
- None (uses ScholarConfig.load() to find config)
project (Optional[str]) – Default project name for operations.
project_description (Optional[str]) – Optional description for the project.
browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').

class scitex_scholar.core.OAStatus(value)[source]

Bases: Enum

Open Access status categories (aligned with Unpaywall).

GOLD = 'gold'

GREEN = 'green'

HYBRID = 'hybrid'

BRONZE = 'bronze'

CLOSED = 'closed'

UNKNOWN = 'unknown'

class scitex_scholar.core.OAResult(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)[source]

Bases: object

Result of open access detection.

is_open_access: bool

status: OAStatus

oa_url: str | None = None

source: str | None = None

license: str | None = None

confidence: float = 1.0

__init__(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)

scitex_scholar.core.detect_oa_from_identifiers(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None)[source]

Detect open access status from paper identifiers without API calls.

This is fast but may miss some OA papers (e.g., hybrid articles). For comprehensive detection, use check_oa_status_async() with Unpaywall.

Parameters:

doi (Optional[str]) – Paper DOI
arxiv_id (Optional[str]) – arXiv identifier
pmcid (Optional[str]) – PubMed Central ID (starts with PMC)
source (Optional[str]) – Source database (arxiv, pmc, biorxiv, etc.)
journal (Optional[str]) – Journal name
is_open_access_flag (Optional[bool]) – Pre-existing OA flag from search API

Return type:

OAResult

Returns:

OAResult with detection results
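
A rough sketch of identifier-only detection follows. The `OAStatus` and `OAResult` definitions mirror the fields listed on this page, but `detect_locally`, the rule order, the status assigned to each case, and the confidence values are all assumptions made for illustration; the library's actual rules are not shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OAStatus(Enum):
    GOLD = "gold"
    GREEN = "green"
    HYBRID = "hybrid"
    BRONZE = "bronze"
    CLOSED = "closed"
    UNKNOWN = "unknown"


@dataclass
class OAResult:
    is_open_access: bool
    status: OAStatus
    oa_url: Optional[str] = None
    source: Optional[str] = None
    license: Optional[str] = None
    confidence: float = 1.0


def detect_locally(
    doi: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    pmcid: Optional[str] = None,
    is_open_access_flag: Optional[bool] = None,
) -> OAResult:
    """Hypothetical heuristic: decide from identifiers alone, no network access."""
    if arxiv_id:
        # arXiv preprints are freely readable; "green" is this sketch's choice.
        return OAResult(True, OAStatus.GREEN,
                        oa_url=f"https://arxiv.org/abs/{arxiv_id}", source="arxiv")
    if pmcid and pmcid.upper().startswith("PMC"):
        return OAResult(True, OAStatus.GREEN, source="pmc")
    if is_open_access_flag:
        # Trust an upstream flag, but with reduced confidence (assumed value).
        return OAResult(True, OAStatus.UNKNOWN, source="search_api", confidence=0.8)
    # A bare DOI says nothing about access, so the result stays uncertain.
    return OAResult(False, OAStatus.UNKNOWN, confidence=0.5)


preprint = detect_locally(arxiv_id="2101.00001")
uncertain = detect_locally(doi="10.1000/placeholder")
```

An uncertain result like the second one is the kind of case where a follow-up API query would be worthwhile.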

scitex_scholar.core.check_oa_status(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=False)[source]

Synchronous wrapper for OA detection.

By default, only local detection is used (no API calls). Set use_unpaywall=True to query the Unpaywall API (requires an event loop).

Return type:: OAResult

async scitex_scholar.core.check_oa_status_async(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=True, unpaywall_email=None)[source]

Comprehensive open access detection.

First tries fast local detection, then falls back to Unpaywall API if the status is uncertain.

Parameters:

doi (Optional[str]) – Paper DOI
arxiv_id (Optional[str]) – arXiv identifier
pmcid (Optional[str]) – PubMed Central ID
source (Optional[str]) – Source database
journal (Optional[str]) – Journal name
is_open_access_flag (Optional[bool]) – Pre-existing OA flag
use_unpaywall (bool) – Whether to query Unpaywall for uncertain cases
unpaywall_email (Optional[str]) – Email for Unpaywall API

Return type:

OAResult

Returns:

OAResult with best available OA information
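
The local-first, remote-fallback flow and its synchronous wrapper can be sketched as below. Every name here (`query_remote`, `detect_locally`, `check_async`, `check_sync`) is hypothetical, statuses are plain strings for brevity, and the remote call is a stub that performs no network access.

```python
import asyncio
from typing import Optional


async def query_remote(doi: str) -> str:
    """Hypothetical stand-in for a network lookup; always answers 'closed'."""
    await asyncio.sleep(0)
    return "closed"


def detect_locally(arxiv_id: Optional[str]) -> Optional[str]:
    """Return a status when identifiers settle the question, else None."""
    return "green" if arxiv_id else None


async def check_async(
    doi: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    use_remote: bool = True,
) -> str:
    local = detect_locally(arxiv_id)
    if local is not None:
        return local                      # fast path, no remote call
    if use_remote and doi:
        return await query_remote(doi)    # fallback for uncertain cases
    return "unknown"


def check_sync(
    doi: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    use_remote: bool = False,
) -> str:
    # asyncio.run starts a fresh event loop and raises RuntimeError if it is
    # called from a thread that already has a running loop.
    return asyncio.run(check_async(doi, arxiv_id, use_remote))


local_hit = check_sync(arxiv_id="2101.00001")
no_remote = check_sync(doi="10.1000/placeholder")
with_remote = check_sync(doi="10.1000/placeholder", use_remote=True)
```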

scitex_scholar.core.is_open_access_source(source)[source]

Check if source is a known open access repository.

Sources are loaded from config/default.yaml → OPENACCESS_SOURCES

Return type:: bool

scitex_scholar.core.is_open_access_journal(journal_name, use_cache=True)[source]

Check if journal is a known open access journal.

Uses three-tier lookup:
1. Fast check against config/default.yaml → OPENACCESS_JOURNALS (pattern matching)
2. Comprehensive check against cached OpenAlex OA sources (exact match, 62K+ journals)
3. Journal normalizer check (handles abbreviations, variants, historical names)

Parameters:

journal_name (str) – Journal name to check
use_cache (bool) – Whether to use OpenAlex cache (default True)

Return type:

bool

Returns:

True if journal is known to be Open Access
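
The tiered lookup can be pictured as a chain of progressively more expensive checks that stops at the first hit. In the sketch below, `tiered_oa_lookup` and its three data arguments are hypothetical stand-ins for the config patterns, the cached source names, and the normalizer; the assumption that use_cache=False skips both cache-backed tiers is this sketch's choice and is not stated in the documentation.

```python
from typing import Dict, Iterable, Set


def tiered_oa_lookup(
    journal_name: str,
    patterns: Iterable[str],
    cached_names: Set[str],
    variant_index: Dict[str, str],
    use_cache: bool = True,
) -> bool:
    """Hypothetical three-tier check; all inputs are lowercase strings."""
    name = journal_name.strip().lower()
    if not name:
        return False
    # Tier 1: cheap substring patterns.
    if any(pattern in name for pattern in patterns):
        return True
    if not use_cache:
        return False
    # Tier 2: exact match against the cached list.
    if name in cached_names:
        return True
    # Tier 3: resolve an abbreviation or variant to a canonical name, then retry.
    canonical = variant_index.get(name)
    return canonical is not None and canonical in cached_names


PATTERNS = ["plos", "frontiers in"]
CACHED = {"elife", "scientific reports"}
VARIANTS = {"sci rep": "scientific reports"}

tier1 = tiered_oa_lookup("PLOS ONE", PATTERNS, CACHED, VARIANTS)
tier2 = tiered_oa_lookup("eLife", PATTERNS, CACHED, VARIANTS)
tier3 = tiered_oa_lookup("Sci Rep", PATTERNS, CACHED, VARIANTS)
miss = tiered_oa_lookup("Some Closed Journal", PATTERNS, CACHED, VARIANTS)
no_cache = tiered_oa_lookup("eLife", PATTERNS, CACHED, VARIANTS, use_cache=False)
```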

scitex_scholar.core.is_arxiv_id(identifier)[source]

Check if identifier looks like an arXiv ID.

Return type:: bool
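
A pattern-based check of this kind might look like the following. The regular expressions cover the two public arXiv identifier schemes (new style such as 2101.00001, old style such as hep-th/9901001), but `looks_like_arxiv_id` is a hypothetical helper and the library's actual pattern may differ.

```python
import re

# New-style IDs (from April 2007): YYMM.NNNN or YYMM.NNNNN, optional version.
_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive[.SUBJECT]/YYMMNNN, optional version.
_OLD_STYLE = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def looks_like_arxiv_id(identifier: str) -> bool:
    """Return True when the string matches either arXiv identifier scheme."""
    candidate = identifier.strip()
    # Accept an optional "arXiv:" prefix in any letter case.
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):]
    return bool(_NEW_STYLE.match(candidate) or _OLD_STYLE.match(candidate))
```

A DOI such as 10.1038/nature12373 fails both patterns, so it would not be mistaken for an arXiv identifier here.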

class scitex_scholar.core.OASourcesCache(cache_dir=None)[source]

Bases: object

Manages cached Open Access sources from OpenAlex.

Features:
- Lazy loading on first access
- 1-day TTL with automatic refresh
- Thread-safe singleton pattern
- Fallback to config YAML if API fails
- Journal name normalization via ISSN-L
- Handles abbreviations, variants, and historical names

__init__(cache_dir=None)[source]

classmethod get_instance(cache_dir=None)[source]

Get singleton instance.

Return type:: OASourcesCache
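
A thread-safe singleton accessor is commonly written with a class-level lock and a double check. The sketch below shows that general pattern with a hypothetical `CacheSingleton` class; it is not taken from the library, and ignoring cache_dir on later calls is an assumption of the sketch.

```python
import threading
from typing import Optional


class CacheSingleton:
    """Hypothetical class illustrating a lock-guarded singleton accessor."""

    _instance: Optional["CacheSingleton"] = None
    _lock = threading.Lock()

    def __init__(self, cache_dir: Optional[str] = None) -> None:
        self.cache_dir = cache_dir

    @classmethod
    def get_instance(cls, cache_dir: Optional[str] = None) -> "CacheSingleton":
        # First check avoids taking the lock once the instance exists.
        if cls._instance is None:
            with cls._lock:
                # Second check handles two threads racing past the first one.
                if cls._instance is None:
                    cls._instance = cls(cache_dir)
        return cls._instance


first = CacheSingleton.get_instance("/tmp/cache-a")
second = CacheSingleton.get_instance("/tmp/cache-b")  # argument ignored here
```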

_is_cache_valid()[source]

Check if cache exists and is within TTL.

Return type:: bool
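
A TTL check of this kind can be as simple as comparing a file's modification time with the current time. The sketch below assumes the cache is a single file whose mtime marks the last refresh; the real implementation may store its timestamp differently, and `is_cache_valid` and the file name are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path

ONE_DAY_SECONDS = 24 * 60 * 60


def is_cache_valid(cache_file: Path, ttl_seconds: float = ONE_DAY_SECONDS) -> bool:
    """True when the file exists and was modified less than ttl_seconds ago."""
    if not cache_file.exists():
        return False
    age_seconds = time.time() - cache_file.stat().st_mtime
    return age_seconds < ttl_seconds


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "oa_sources.json"
    missing = is_cache_valid(cache_file)          # no file yet
    cache_file.write_text("{}")
    fresh = is_cache_valid(cache_file)            # just written
    # Backdate the file by two days to simulate an expired cache.
    two_days_ago = time.time() - 2 * ONE_DAY_SECONDS
    os.utime(cache_file, (two_days_ago, two_days_ago))
    stale = is_cache_valid(cache_file)
```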

_load_from_cache()[source]

Load cached data from file.

Return type:: bool

_save_to_cache()[source]

Save current data to cache file.

Return type:: None

async _fetch_oa_sources_async(max_pages=100)[source]

Fetch OA sources from OpenAlex API.

Parameters:: max_pages (int) – Maximum pages to fetch (200 sources per page)
Return type:: None

_fetch_oa_sources_sync(max_pages=100)[source]

Synchronous wrapper for fetching OA sources.

Return type:: None

ensure_loaded(force_refresh=False)[source]

Ensure cache is loaded, fetching from API if needed.

Parameters:: force_refresh (bool) – Force refresh even if cache is valid
Return type:: None

is_oa_source(source_name)[source]

Check if a source/journal name is in the OA list.

Parameters:: source_name (str) – Journal or source name to check
Return type:: bool
Returns:: True if source is known to be Open Access

is_oa_issn(issn)[source]

Check if an ISSN belongs to an OA journal.

Parameters:: issn (str) – ISSN to check
Return type:: bool
Returns:: True if ISSN belongs to an OA journal

property source_count: int: Get number of cached OA sources.

property cache_age_hours: float: Get cache age in hours.

scitex_scholar.core.get_oa_cache(cache_dir=None)[source]

Get the OA sources cache singleton.

Return type:: OASourcesCache

scitex_scholar.core.is_oa_journal_cached(journal_name)[source]

Check if journal is OA using cached OpenAlex data.

Return type:: bool

scitex_scholar.core.refresh_oa_cache()[source]

Force refresh the OA sources cache.

Return type:: None

class scitex_scholar.core.JournalNormalizer(cache_dir=None)[source]

Bases: object

Journal name normalizer using ISSN-L as unique identifier.

Handles:
- Full names ↔ abbreviations
- Name variants (spelling, punctuation, capitalization)
- Historical/former names
- Publisher variations

Data is cached locally with daily refresh from OpenAlex.

__init__(cache_dir=None)[source]

classmethod get_instance(cache_dir=None)[source]

Get singleton instance.

Return type:: JournalNormalizer

_is_cache_valid()[source]

Check if cache exists and is within TTL.

Return type:: bool

_load_from_cache()[source]

Load cached data from file.

Return type:: bool

_save_to_cache()[source]

Save current data to cache file.

Return type:: None

_add_journal(source_data)[source]

Add a journal to the normalizer from OpenAlex source data.

Parameters:: source_data (Dict[str, Any]) – OpenAlex source object with display_name, issn_l, etc.
Return type:: None

async _fetch_journals_async(max_pages=500, filter_oa_only=False)[source]

Fetch journal data from OpenAlex API.

Parameters:

max_pages (int) – Maximum pages to fetch (200 per page)
filter_oa_only (bool) – If True, only fetch OA journals

Return type:

None

_fetch_journals_sync(max_pages=500, filter_oa_only=False)[source]

Synchronous wrapper for fetching journals (handles nested event loops).

Return type:: None
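
Handling nested event loops usually means detecting whether a loop is already running in the current thread and, if so, running the coroutine elsewhere. The sketch below shows one common approach using a worker thread; `run_coroutine_sync` and `fetch_page_count` are hypothetical names, and the library may solve this differently.

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coroutine_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (for example in a notebook). asyncio.run would
    # raise here, so run the coroutine on a worker thread with its own loop.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fetch_page_count() -> int:
    """Hypothetical stand-in for an async fetch."""
    await asyncio.sleep(0)
    return 3


plain_result = run_coroutine_sync(fetch_page_count())


async def caller_inside_loop() -> int:
    # Called while a loop is running, which exercises the worker-thread path.
    return run_coroutine_sync(fetch_page_count())


nested_result = asyncio.run(caller_inside_loop())
```

The worker-thread path blocks the calling loop until the coroutine finishes, which is acceptable for a one-off cache refresh but not for latency-sensitive code.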

ensure_loaded(force_refresh=False, max_pages=500)[source]

Ensure cache is loaded, fetching from API if needed.

Parameters:

force_refresh (bool) – Force refresh even if cache is valid
max_pages (int) – Max pages to fetch if refreshing

Return type:

None

get_issn_l(journal_name)[source]

Get ISSN-L for a journal name.

Parameters:: journal_name (str) – Any journal name variant, abbreviation, or ISSN
Return type:: Optional[str]
Returns:: ISSN-L if found, None otherwise

normalize(journal_name)[source]

Normalize journal name to canonical form.

Parameters:: journal_name (str) – Any journal name variant
Return type:: Optional[str]
Returns:: Canonical journal name, or original if not found

get_abbreviation(journal_name)[source]

Get abbreviated title for a journal.

Parameters:: journal_name (str) – Any journal name variant
Return type:: Optional[str]
Returns:: Abbreviated title if available

get_journal_info(journal_name)[source]

Get full journal metadata.

Parameters:: journal_name (str) – Any journal name variant
Return type:: Optional[Dict[str, Any]]
Returns:: Dict with canonical_name, abbreviated_title, alternate_titles, issns, is_oa, publisher

is_same_journal(name1, name2)[source]

Check if two names refer to the same journal.

Parameters:

name1 (str) – First journal name
name2 (str) – Second journal name

Return type:

bool

Returns:

True if both names resolve to the same ISSN-L
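
Resolving names through a shared identifier can be sketched with a lookup table keyed by a cleaned-up form of each name. `MiniNormalizer` is a hypothetical stand-in, the journal names and ISSN-L values are made-up placeholders, and the cleaning rule (lowercase, punctuation removed) is an assumption.

```python
import re
from typing import Dict, Optional


def _key(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split())


class MiniNormalizer:
    """Hypothetical index mapping every known name variant to one ISSN-L."""

    def __init__(self) -> None:
        self._issn_l_by_name: Dict[str, str] = {}

    def add(self, issn_l: str, *names: str) -> None:
        for name in names:
            self._issn_l_by_name[_key(name)] = issn_l

    def get_issn_l(self, name: str) -> Optional[str]:
        return self._issn_l_by_name.get(_key(name))

    def is_same_journal(self, name1: str, name2: str) -> bool:
        first = self.get_issn_l(name1)
        # Two unknown names are not treated as the same journal.
        return first is not None and first == self.get_issn_l(name2)


normalizer = MiniNormalizer()
# Placeholder identifiers and titles, not real journals.
normalizer.add("0000-0001", "Journal of Example Studies", "J. Example Stud.")
normalizer.add("0000-0002", "Example Letters")

same = normalizer.is_same_journal("Journal of Example Studies", "J Example Stud")
different = normalizer.is_same_journal("Journal of Example Studies", "Example Letters")
unknown = normalizer.is_same_journal("Unlisted Journal", "Unlisted Journal")
```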

is_open_access(journal_name)[source]

Check if journal is Open Access.

Parameters:: journal_name (str) – Any journal name variant
Return type:: bool
Returns:: True if journal is OA

search(query, limit=10)[source]

Search for journals by name (prefix/substring match).

Parameters:

query (str) – Search query
limit (int) – Maximum results

Return type:

List[Dict[str, Any]]

Returns:

List of matching journal info dicts

property journal_count: int: Get number of cached journals.

property cache_age_hours: float: Get cache age in hours.

scitex_scholar.core.get_journal_normalizer(cache_dir=None)[source]

Get the journal normalizer singleton.

Return type:: JournalNormalizer

scitex_scholar.core.normalize_journal_name(name)[source]

Normalize journal name to canonical form.

Return type:: Optional[str]

scitex_scholar.core.get_journal_issn_l(name)[source]

Get ISSN-L for a journal name.

Return type:: Optional[str]

scitex_scholar.core.is_same_journal(name1, name2)[source]

Check if two names refer to the same journal.

Return type:: bool

scitex_scholar.core.refresh_journal_cache()[source]

Force refresh the journal normalizer cache.

Return type:: None