scitex_scholar.core
- class scitex_scholar.core.Paper(*args, **kwargs)
Bases: BaseModel
Complete paper with metadata and container.
- classmethod from_dict(data)
Create from dictionary (for loading from JSON).
Uses Pydantic’s model_validate which handles:
  - Type validation
  - Type coercion (e.g., “2024” -> 2024)
  - Field aliases (e.g., “2025” -> y2025)
- Return type:
Paper
- to_dict()
Convert to dictionary for JSON serialization.
Alias for model_dump() for backward compatibility.
- class scitex_scholar.core.Papers(papers=None, project=None, config=None)
Bases: object
A simple collection of Paper objects.
This is a minimal collection class. Most business logic (loading, saving, enrichment, etc.) is handled by Scholar.
Methods have been reduced from 39 to ~15 for simplicity. Complex operations should use Scholar or utility functions.
- filter(condition=None, year_min=None, year_max=None, has_doi=None, has_abstract=None, has_pdf=None, min_citations=None, max_citations=None, min_impact_factor=None, max_impact_factor=None, journal=None, author=None, keyword=None, publisher=None, **kwargs)
Filter papers by condition or criteria.
- Parameters:
condition (Optional[Callable[[Paper], bool]]) – Function that takes a Paper and returns bool.
has_abstract (Optional[bool]) – Filter papers with/without abstract.
has_pdf (Optional[bool]) – Filter papers with/without PDF URL.
min_impact_factor (Optional[float]) – Minimum journal impact factor.
max_impact_factor (Optional[float]) – Maximum journal impact factor.
keyword (Optional[str]) – Keyword (searches in keywords, title, abstract).
**kwargs – Additional keyword arguments for backward compatibility.
- Returns:
New Papers collection with filtered papers.
- Return type:
Papers
Examples
Filter using a lambda condition:
    high_impact = papers.filter(lambda p: p.journal_impact_factor and p.journal_impact_factor > 10)
    highly_cited = papers.filter(lambda p: p.citation_count and p.citation_count > 500)
    recent = papers.filter(lambda p: p.year and p.year >= 2020)
Filter using built-in parameters:
    high_impact_v2 = papers.filter(min_impact_factor=10.0)
    highly_cited_v2 = papers.filter(min_citations=500)
    recent_v2 = papers.filter(year_min=2020)
Combine multiple parameters:
    filtered = papers.filter(
        min_impact_factor=5.0,
        min_citations=100,
        year_min=2015,
        year_max=2023,
        journal="Nature",
        has_doi=True,
    )
Chain filters for AND logic:
    elite_recent = papers.filter(min_impact_factor=10).filter(year_min=2020)
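As a rough illustration of how the built-in criteria combine with AND semantics, here is a self-contained sketch. `PaperStub` and `filter_papers` are hypothetical stand-ins, not part of scitex_scholar, and the real `Papers.filter` may treat missing values differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PaperStub:
    """Hypothetical stand-in for Paper with only the fields used below."""
    title: str
    year: Optional[int] = None
    citation_count: Optional[int] = None
    doi: Optional[str] = None


def filter_papers(
    papers: List[PaperStub],
    condition: Optional[Callable[[PaperStub], bool]] = None,
    year_min: Optional[int] = None,
    year_max: Optional[int] = None,
    min_citations: Optional[int] = None,
    has_doi: Optional[bool] = None,
) -> List[PaperStub]:
    """Keep the papers that satisfy every criterion that was supplied."""
    kept = []
    for p in papers:
        if condition is not None and not condition(p):
            continue
        # Assumption: a paper with a missing year fails any year bound.
        if year_min is not None and (p.year is None or p.year < year_min):
            continue
        if year_max is not None and (p.year is None or p.year > year_max):
            continue
        # Assumption: a missing citation count is treated as zero.
        if min_citations is not None and (p.citation_count or 0) < min_citations:
            continue
        if has_doi is not None and bool(p.doi) != has_doi:
            continue
        kept.append(p)
    return kept  # a new list; the input list is left untouched


papers = [
    PaperStub("A", year=2021, citation_count=120, doi="10.1000/a"),
    PaperStub("B", year=2014, citation_count=900),
    PaperStub("C", year=2022, citation_count=10, doi="10.1000/c"),
    PaperStub("D"),
]

recent_cited = filter_papers(papers, year_min=2020, min_citations=100)
chained = filter_papers(filter_papers(papers, year_min=2020), min_citations=100)
without_doi = filter_papers(papers, has_doi=False)
print([p.title for p in recent_cited])
```

Passing several criteria in one call and chaining two calls give the same result in this sketch, because each criterion only ever removes papers.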
- sort_by(*criteria, reverse=False, **kwargs)
Sort papers by criteria.
- Parameters:
*criteria – Field names (as strings) or lambda functions to sort by.
reverse (bool) – Sort in descending order (default: False).
**kwargs – Additional options.
- Returns:
New sorted Papers collection.
- Return type:
Papers
Notes
Available Paper fields for sorting:
  title – Paper title
  year – Publication year
  citation_count – Number of citations
  journal_impact_factor – Journal impact factor
  journal – Journal name
  publisher – Publisher name
  doi – Digital Object Identifier
  created_at – When record was created
  updated_at – When record was last updated
Examples
Sort by a single field:
    by_year = papers.sort_by('year')
    by_citations_desc = papers.sort_by('citation_count', reverse=True)
Sort by multiple fields (primary, secondary, etc.):
    by_year_then_citations = papers.sort_by('year', 'citation_count')
Sort using a lambda function:
    by_citations = papers.sort_by(lambda p: p.citation_count or 0, reverse=True)
    by_year_safe = papers.sort_by(lambda p: p.year if p.year else 9999)
Sort by a computed value:
    by_citation_per_year = papers.sort_by(
        # max(1, ...) avoids division by zero for papers published in 2024
        lambda p: (p.citation_count or 0) / max(1, 2024 - p.year) if p.year else 0,
        reverse=True,
    )
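The multi-criteria behaviour described above can be sketched with plain dictionaries. `sort_records` is a hypothetical helper written for illustration; the placement of missing values (after present ones in ascending order) is an assumption, not documented behaviour of `Papers.sort_by`.

```python
from typing import Any, Callable, Dict, List, Union

Criterion = Union[str, Callable[[Dict[str, Any]], Any]]


def sort_records(
    records: List[Dict[str, Any]], *criteria: Criterion, reverse: bool = False
) -> List[Dict[str, Any]]:
    """Return a new list sorted by each criterion in order (primary first)."""

    def key(record: Dict[str, Any]) -> tuple:
        parts = []
        for criterion in criteria:
            value = criterion(record) if callable(criterion) else record.get(criterion)
            # Assumption: missing values sort last in ascending order.
            # The leading flag keeps None from being compared with real values.
            parts.append((value is None, 0 if value is None else value))
        return tuple(parts)

    return sorted(records, key=key, reverse=reverse)


records = [
    {"title": "A", "year": 2021, "citation_count": 5},
    {"title": "B", "year": 2019, "citation_count": 50},
    {"title": "C", "year": 2021, "citation_count": 30},
    {"title": "D", "year": None, "citation_count": 7},
]

# Field names: year is the primary key, citation_count breaks ties.
by_year_then_citations = sort_records(records, "year", "citation_count")

# A callable criterion, sorted in descending order.
by_citations_desc = sort_records(
    records, lambda r: r["citation_count"] or 0, reverse=True
)

print([r["title"] for r in by_year_then_citations])
print([r["title"] for r in by_citations_desc])
```

Note that with `reverse=True` the whole key is reversed, so in this sketch records with a missing value would come first rather than last.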
- classmethod from_bibtex(bibtex_input)
Load papers from BibTeX.
DEPRECATED: Use Scholar.from_bibtex() instead. This method is kept for backward compatibility.
- save(output_path, format='auto', **kwargs)
Save papers to file.
DEPRECATED: Use Scholar.save_papers() or Scholar.export_bibtex() instead. This method is kept for backward compatibility.
- to_dict()
Convert to dictionary.
DEPRECATED: Use papers_utils.papers_to_dict() for new code.
- class scitex_scholar.core.Scholar(config=None, project=None, project_description=None, browser_mode=None)
Bases: EnricherMixin, URLFindingMixin, PDFDownloadMixin, LoaderMixin, SearchMixin, SaverMixin, ProjectHandlerMixin, LibraryHandlerMixin, PipelineMixin, ServiceMixin
Main interface for SciTeX Scholar - scientific literature management made simple.
By default, papers are automatically enriched with:
  - Journal impact factors from the impact_factor package (2024 JCR data)
  - Citation counts from Semantic Scholar (via DOI/title matching)
Examples
Basic search with automatic enrichment:
    scholar = Scholar()
    papers = scholar.search("deep learning neuroscience")
    # Papers now have impact_factor and citation_count populated
    papers.save("my_pac.bib")
Disable automatic enrichment if needed:
    config = ScholarConfig(enable_auto_enrich=False)
    scholar = Scholar(config=config)
Search a specific source:
    papers = scholar.search("transformer models", sources='arxiv')
Advanced workflow:
    papers = (
        scholar.search("transformer models", year_min=2020)
        .filter(min_citations=50)
        .sort_by("journal_impact_factor")
        .save("transformers.bib")
    )
Local library:
    scholar._index_local_pdfs("./my_papers")
    local_papers = scholar.search_local("attention mechanism")
- property name
Class name for logging.
- __init__(config=None, project=None, project_description=None, browser_mode=None)
Initialize Scholar with configuration.
- Parameters:
config (Union[ScholarConfig, str, Path, None]) – One of:
  - ScholarConfig instance
  - Path to YAML config file (str or Path)
  - None (uses ScholarConfig.load() to find config)
project (Optional[str]) – Default project name for operations.
project_description (Optional[str]) – Optional description for the project.
browser_mode (Optional[str]) – Browser mode ('stealth', 'interactive', 'manual').
- class scitex_scholar.core.OAStatus(value)
Bases: Enum
Open Access status categories (aligned with Unpaywall).
- GOLD = 'gold'
- GREEN = 'green'
- HYBRID = 'hybrid'
- BRONZE = 'bronze'
- CLOSED = 'closed'
- UNKNOWN = 'unknown'
- class scitex_scholar.core.OAResult(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)
Bases: object
Result of open access detection.
- __init__(is_open_access, status, oa_url=None, source=None, license=None, confidence=1.0)
- scitex_scholar.core.detect_oa_from_identifiers(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None)
Detect open access status from paper identifiers without API calls.
This is fast but may miss some OA papers (e.g., hybrid articles). For comprehensive detection, use check_oa_status_async() with Unpaywall.
- scitex_scholar.core.check_oa_status(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=False)
Synchronous wrapper for OA detection.
By default only uses local detection (no API calls). Set use_unpaywall=True to use Unpaywall API (requires event loop).
- Return type:
- async scitex_scholar.core.check_oa_status_async(doi=None, arxiv_id=None, pmcid=None, source=None, journal=None, is_open_access_flag=None, use_unpaywall=True, unpaywall_email=None)
Comprehensive open access detection.
First tries fast local detection, then falls back to Unpaywall API if the status is uncertain.
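The "local first, remote only if uncertain" flow described here can be sketched as below. Everything in the block is a stand-in: `OAResultStub`, `detect_locally`, `check_with_fallback`, and `fake_unpaywall` are hypothetical names, the local heuristic is invented for illustration, and no network request is made.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class OAResultStub:
    """Hypothetical stand-in for OAResult with only the fields used here."""
    is_open_access: bool
    status: str
    source: Optional[str] = None


def detect_locally(doi: Optional[str], arxiv_id: Optional[str]) -> OAResultStub:
    """Illustrative fast check: an arXiv ID is treated as green OA."""
    if arxiv_id:
        return OAResultStub(True, "green", source="identifier")
    return OAResultStub(False, "unknown", source="identifier")


async def check_with_fallback(
    doi: Optional[str] = None,
    arxiv_id: Optional[str] = None,
    remote_lookup: Optional[Callable[[str], Awaitable[OAResultStub]]] = None,
) -> OAResultStub:
    """Try the local check first; call the remote lookup only if uncertain."""
    result = detect_locally(doi, arxiv_id)
    if result.status != "unknown" or not doi or remote_lookup is None:
        return result
    return await remote_lookup(doi)


async def fake_unpaywall(doi: str) -> OAResultStub:
    """Placeholder for a real Unpaywall request; returns a canned answer."""
    await asyncio.sleep(0)
    return OAResultStub(True, "hybrid", source="remote-stub")


local = asyncio.run(
    check_with_fallback(arxiv_id="2301.00001", remote_lookup=fake_unpaywall)
)
remote = asyncio.run(
    check_with_fallback(doi="10.1000/example", remote_lookup=fake_unpaywall)
)
print(local.status, remote.status)
```

The remote lookup is injected as a parameter so the sketch can run offline; a real implementation would presumably call the Unpaywall API at that point.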
- scitex_scholar.core.is_open_access_source(source)
Check if source is a known open access repository.
Sources are loaded from config/default.yaml → OPENACCESS_SOURCES
- Return type:
bool
- scitex_scholar.core.is_open_access_journal(journal_name, use_cache=True)
Check if journal is a known open access journal.
Uses three-tier lookup:
  1. Fast check against config/default.yaml → OPENACCESS_JOURNALS (pattern matching)
  2. Comprehensive check against cached OpenAlex OA sources (exact match, 62K+ journals)
  3. Journal normalizer check (handles abbreviations, variants, historical names)
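A tiered lookup of this kind could look roughly like the sketch below. The patterns, cached names, and abbreviation table are made-up sample data, and `is_oa_journal_sketch` is a hypothetical function; the real configuration keys and matching rules may differ.

```python
import re
from typing import Dict, List, Set

# Made-up sample data standing in for the three tiers.
CONFIG_PATTERNS: List[str] = [r"^plos\b", r"^frontiers in\b"]      # tier 1
CACHED_OA_NAMES: Set[str] = {"elife", "scientific reports"}         # tier 2
ABBREVIATIONS: Dict[str, str] = {"sci rep": "scientific reports"}   # tier 3


def _clean(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", "", name.lower())
    return " ".join(text.split())


def is_oa_journal_sketch(journal_name: str) -> bool:
    name = _clean(journal_name)
    if not name:
        return False
    # Tier 1: cheap pattern match against a short configured list.
    if any(re.search(pattern, name) for pattern in CONFIG_PATTERNS):
        return True
    # Tier 2: exact match against a larger cached set.
    if name in CACHED_OA_NAMES:
        return True
    # Tier 3: resolve abbreviations or variants, then retry the exact match.
    canonical = ABBREVIATIONS.get(name)
    return canonical is not None and canonical in CACHED_OA_NAMES


for candidate in ["PLOS ONE", "eLife", "Sci. Rep.", "Some Subscription Journal"]:
    print(candidate, is_oa_journal_sketch(candidate))
```

Ordering the tiers from cheapest to most expensive means most lookups can return before the larger data sets are consulted.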
- scitex_scholar.core.is_arxiv_id(identifier)
Check if identifier looks like an arXiv ID.
- Return type:
bool
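One plausible way to recognise arXiv identifiers is with two regular expressions, one for the current `YYMM.NNNNN` scheme and one for the older `archive/YYMMNNN` scheme. The sketch below is an assumption about what such a check might look like; `looks_like_arxiv_id` is a hypothetical name and the library's own rules may be stricter or looser.

```python
import re

# Current scheme: four digits, a dot, four or five digits, optional version.
_NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
# Older scheme: archive name, optional subject class, slash, seven digits.
_OLD_STYLE = re.compile(r"^[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?$")


def looks_like_arxiv_id(identifier: str) -> bool:
    """Return True if the string matches either arXiv identifier scheme."""
    if not identifier:
        return False
    candidate = identifier.strip()
    # Accept an optional "arXiv:" prefix in any letter case.
    if candidate.lower().startswith("arxiv:"):
        candidate = candidate[len("arxiv:"):]
    return bool(_NEW_STYLE.match(candidate) or _OLD_STYLE.match(candidate))


for candidate in ["2301.00001", "arXiv:1706.03762v5", "hep-th/9901001", "10.1038/nature12373"]:
    print(candidate, looks_like_arxiv_id(candidate))
```

A check like this only inspects the shape of the string; it does not confirm that the identifier exists on arXiv.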
- class scitex_scholar.core.OASourcesCache(cache_dir=None)
Bases: object
Manages cached Open Access sources from OpenAlex.
Features:
  - Lazy loading on first access
  - 1-day TTL with automatic refresh
  - Thread-safe singleton pattern
  - Fallback to config YAML if API fails
  - Journal name normalization via ISSN-L
  - Handles abbreviations, variants, and historical names
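Several of the features listed above (lazy loading, a TTL, a thread-safe singleton, and a fallback when fetching fails) can be combined as in the following sketch. `TTLCacheSketch` and `get_cache_sketch` are hypothetical names, the fetch function is injected so the example runs offline, and keeping stale data after a failed refresh is an assumption rather than documented behaviour.

```python
import threading
import time
from typing import Callable, Iterable, Optional, Set


class TTLCacheSketch:
    """Lazily loaded set of names, refreshed once it is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[], Iterable[str]],
        fallback: Iterable[str],
        ttl_seconds: float = 24 * 60 * 60,  # one day
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._fallback = set(fallback)
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._names: Optional[Set[str]] = None
        self._loaded_at = 0.0

    def names(self) -> Set[str]:
        with self._lock:  # one thread at a time may load or refresh
            stale = self._names is None or self._clock() - self._loaded_at > self._ttl
            if stale:
                try:
                    self._names = set(self._fetch())
                except Exception:
                    # Assumption: keep stale data if present, else use the fallback.
                    if self._names is None:
                        self._names = set(self._fallback)
                self._loaded_at = self._clock()
            return self._names


_instance: Optional[TTLCacheSketch] = None
_instance_lock = threading.Lock()


def get_cache_sketch(fetch, fallback) -> TTLCacheSketch:
    """Return the process-wide instance, creating it on first use."""
    global _instance
    with _instance_lock:
        if _instance is None:
            _instance = TTLCacheSketch(fetch, fallback)
        return _instance


calls = []


def flaky_fetch():
    """Succeeds on the first call, then simulates an API failure."""
    calls.append(1)
    if len(calls) > 1:
        raise RuntimeError("simulated API failure")
    return {"elife"}


now = [0.0]  # fake clock so the TTL can be exercised without waiting
cache = TTLCacheSketch(flaky_fetch, fallback={"plos one"}, ttl_seconds=10, clock=lambda: now[0])
first = cache.names()    # first access triggers the fetch
second = cache.names()   # still fresh, no fetch
now[0] = 11.0
third = cache.names()    # expired; the fetch fails, stale data is kept
print(first, len(calls), third)
```

Injecting the clock and the fetch function keeps the sketch testable; a real cache would also persist its data to `cache_dir`, which is omitted here.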
- _fetch_oa_sources_sync(max_pages=100)
Synchronous wrapper for fetching OA sources.
- Return type:
- scitex_scholar.core.get_oa_cache(cache_dir=None)
Get the OA sources cache singleton.
- Return type:
OASourcesCache
- scitex_scholar.core.is_oa_journal_cached(journal_name)
Check if journal is OA using cached OpenAlex data.
- Return type:
bool
- class scitex_scholar.core.JournalNormalizer(cache_dir=None)
Bases: object
Journal name normalizer using ISSN-L as unique identifier.
Handles:
  - Full names ↔ abbreviations
  - Name variants (spelling, punctuation, capitalization)
  - Historical/former names
  - Publisher variations
Data is cached locally with daily refresh from OpenAlex.
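The idea of using ISSN-L as the unique key behind many name variants can be sketched as follows. `NormalizerSketch` and `normalize_key` are hypothetical, and the journal names and the ISSN-L value are placeholders invented for the example, not real records.

```python
import re
import unicodedata
from typing import Dict, Optional


def normalize_key(name: str) -> str:
    """Fold case, accents, and punctuation so that variants share one key."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


class NormalizerSketch:
    """Maps any known name variant to an ISSN-L, then to a canonical name."""

    def __init__(self) -> None:
        self._issn_l_by_key: Dict[str, str] = {}
        self._canonical_by_issn_l: Dict[str, str] = {}

    def add(self, issn_l: str, canonical: str, *variants: str) -> None:
        self._canonical_by_issn_l[issn_l] = canonical
        for name in (canonical, *variants):
            self._issn_l_by_key[normalize_key(name)] = issn_l

    def issn_l(self, name: str) -> Optional[str]:
        return self._issn_l_by_key.get(normalize_key(name))

    def canonical(self, name: str) -> Optional[str]:
        key = self.issn_l(name)
        return self._canonical_by_issn_l.get(key) if key else None


normalizer = NormalizerSketch()
# Placeholder record: an abbreviation and a former name for one journal.
normalizer.add(
    "0000-0001",
    "Journal of Example Studies",
    "J. Example Stud.",
    "Example Studies Quarterly",
)

print(normalizer.canonical("j example stud"))
print(normalizer.issn_l("EXAMPLE STUDIES QUARTERLY"))
print(normalizer.canonical("Unrelated Journal"))
```

Because every variant resolves to the same ISSN-L, two differently spelled names can be compared by identifier instead of by string similarity.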
- async _fetch_journals_async(max_pages=500, filter_oa_only=False)
Fetch journal data from OpenAlex API.
- _fetch_journals_sync(max_pages=500, filter_oa_only=False)
Synchronous wrapper for fetching journals (handles nested event loops).
- Return type:
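A synchronous wrapper that copes with an already running event loop is commonly written along the lines below. This is a general asyncio pattern, not necessarily how `_fetch_journals_sync` is implemented; `run_sync` and `fetch_page` are hypothetical names.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_sync(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (for example in a notebook). asyncio.run
    # would raise here, so run the coroutine in a worker thread that gets
    # its own event loop, and block until it finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def fetch_page(n: int) -> int:
    """Stand-in for an async API call."""
    await asyncio.sleep(0)
    return n * 2


plain = run_sync(fetch_page(21))  # called with no loop running


async def caller() -> int:
    # Called from inside a running loop: takes the worker-thread path.
    return run_sync(fetch_page(4))


nested = asyncio.run(caller())
print(plain, nested)
```

The worker-thread path blocks the calling loop until the result is ready, which is acceptable for a one-off cache refresh but would be a poor fit for latency-sensitive code.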
- ensure_loaded(force_refresh=False, max_pages=500)
Ensure cache is loaded, fetching from API if needed.
- scitex_scholar.core.get_journal_normalizer(cache_dir=None)
Get the journal normalizer singleton.
- Return type:
JournalNormalizer