# International Student Fees Extractor - Standalone Version

This notebook is a standalone version of the international fees extraction strategy from the AI Data Extractor project.
It includes all the necessary components: vector search, reranking, LLM extraction, and result processing.

## Requirements
- Conda environment: `rag_env` (activate before running: `conda activate rag_env`)
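- Required API keys: `OPENAI_API_KEY` (embeddings), the key for the selected LLM provider (`DEEPSEEK_API_KEY` by default; `GROQ_API_KEY`, `CEREBRAS_API_KEY`, or `OPENAI_API_KEY` for the other providers), and `CO_API_KEY` (Cohere reranking)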
- Vector database with indexed university data
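
A quick pre-flight check can catch missing keys before any service is initialized. The sketch below is not part of the original project: `missing_env_vars` is a hypothetical helper, and the list of names passed to it is an assumption that should match whatever the configuration cell actually reads for the chosen provider and reranker.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(required: Iterable[str],
                     env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


# Explicit mapping so the result does not depend on the real environment.
sample_env = {"OPENAI_API_KEY": "sk-example", "DEEPSEEK_API_KEY": ""}
print(missing_env_vars(["OPENAI_API_KEY", "DEEPSEEK_API_KEY", "CO_API_KEY"], sample_env))
```

Calling `missing_env_vars([...])` without the second argument checks the live process environment, which is the form to use after `load_dotenv()` has run.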


## 1. Import Libraries

In [1]:
#!/usr/bin/env python3
import os
import sys
import json
import time
import datetime
import json_repair
from typing import List, Dict
from pathlib import Path
from pydantic import BaseModel, Field
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# LangChain imports
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from langchain_cerebras import ChatCerebras
from langchain_deepseek import ChatDeepSeek





# Optional imports
try:
    from sentence_transformers import CrossEncoder
    CROSSENCODER_AVAILABLE = True
except ImportError:
    CROSSENCODER_AVAILABLE = False
    print("⚠️ CrossEncoder not available. Install sentence-transformers for local reranking.")

try:
    import cohere
    COHERE_AVAILABLE = True
    print("✅ Cohere library imported successfully")
except ImportError:
    COHERE_AVAILABLE = False
    print("⚠️ Cohere not available. Install cohere for API-based reranking.")
    print("💡 Run: pip install cohere")
    print("💡 Or in conda: conda install -c conda-forge cohere")

print("✅ All basic dependencies imported successfully!")


✅ Cohere library imported successfully
✅ All basic dependencies imported successfully!


## 2. Configuration

In [None]:
class Config:
    """Standalone configuration based on the original config.py"""
    
    # ============================================================================
    # UNIVERSITY SETTINGS
    # ============================================================================
    # Set this to your target university for testing
    UNIVERSITY_NAME = "cyberjaya.edu.my_urls_cleaned.txt"  # Change this for different university
    
    # Project paths (adjust if needed)
    _PROJECT_ROOT = Path.cwd()  # Current working directory
    CLEANED_DATA_DIR = str(_PROJECT_ROOT / "Cleaned Data")
    OUTPUT_DIRECTORY = str(_PROJECT_ROOT / "programs_names_output_test")
    
    # ============================================================================
    # LLM SETTINGS
    # ============================================================================
    # Provider comes from the environment; "deepseek" is the default when LLM_PROVIDER is unset
    LLM_PROVIDER = os.getenv("LLM_PROVIDER", "deepseek").lower()  # "groq", "deepseek", "cerebras", or "openai"
    LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.01"))
    # ============================================================================
    # PROVIDER-SPECIFIC SETTINGS
    # ============================================================================
    
    # GROQ SETTINGS
    GROQ_LLM_MODEL = os.getenv("GROQ_LLM_MODEL", "openai/gpt-oss-20b")
    GROQ_API_KEY = os.getenv("GROQ_API_KEY")
    GROQ_API_BASE = os.getenv("GROQ_API_BASE", "https://api.groq.com/openai/v1")
    
    # CEREBRAS SETTINGS 
    CEREBRAS_LLM_MODEL = os.getenv("CEREBRAS_LLM_MODEL", "qwen-3-235b-a22b-thinking-2507")
    CEREBRAS_API_KEY = os.getenv("CEREBRAS_API_KEY")
    
    # DEEPSEEK SETTINGS 
    DEEPSEEK_LLM_MODEL = os.getenv("DEEPSEEK_LLM_MODEL", "deepseek-reasoner")
    DEEPSEEK_API_KEY = os.getenv("DEEPSEEK_API_KEY")
    DEEPSEEK_API_BASE = os.getenv("DEEPSEEK_API_BASE", "https://api.deepseek.com/v1")
    
    # OPENAI SETTINGS
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    OPENAI_LLM_MODEL = os.getenv("OPENAI_LLM_MODEL", "gpt-4o")
    
    # ============================================================================
    # VECTOR STORE SETTINGS
    # ============================================================================
    EMBEDDING_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-large")
    VECTOR_DB_PATH = str(_PROJECT_ROOT / "agents" / "chroma_langchain_db")
    
    # ============================================================================
    # RERANKER SETTINGS
    # ============================================================================
    CROSSENCODER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
    COHERE_MODEL = "rerank-english-v3.0"
    COHERE_API_KEY = os.getenv("CO_API_KEY") 
    USE_COHERE_RERANKER = True  # Set to False to use CrossEncoder
    
    # Debug Cohere settings
    print(f"🔍 Debug - COHERE_API_KEY exists: {bool(COHERE_API_KEY)}")
    if COHERE_API_KEY:
        print(f"🔍 Debug - COHERE_API_KEY length: {len(COHERE_API_KEY)}")
    print(f"🔍 Debug - USE_COHERE_RERANKER: {USE_COHERE_RERANKER}")
    
    # ============================================================================
    # PROCESSING SETTINGS
    # ============================================================================
    DEFAULT_BATCH_SIZE = 20
    DEFAULT_VECTOR_SEARCH_K = 30
    DEFAULT_RERANK_TOP_N = 5
    BATCH_DELAY_SECONDS = 2
    
    @classmethod
    def get_collection_name(cls, university_name=None):
        """Get collection name for specified university"""
        university = university_name or cls.UNIVERSITY_NAME
        clean_name = university.replace('_urls_cleaned.txt', '')
        return f"{clean_name}_collection"
    
    @classmethod
    def get_programs_file(cls, university_name=None):
        """Get the programs file path for specified university"""
        university = university_name or cls.UNIVERSITY_NAME
        clean_domain = university.replace('_urls_cleaned.txt', '')
        filename = f"{university}_programs.json"
        return Path(cls.OUTPUT_DIRECTORY) / clean_domain / filename
    
    @classmethod
    def setup_environment(cls):
        """Setup environment variables"""
        if cls.OPENAI_API_KEY:
            os.environ['OPENAI_API_KEY'] = cls.OPENAI_API_KEY
        print(f"🤖 LLM Provider: {cls.LLM_PROVIDER}")
        print(f"🏫 University: {cls.UNIVERSITY_NAME}")
        
        # Debug provider-specific settings
        provider = cls.LLM_PROVIDER.lower()
        if provider == "groq":
            print(f"📊 GROQ Model: {cls.GROQ_LLM_MODEL}")
        elif provider == "cerebras":
            print(f"📊 CEREBRAS Model: {cls.CEREBRAS_LLM_MODEL}")
        elif provider == "deepseek":
            print(f"📊 DEEPSEEK Model: {cls.DEEPSEEK_LLM_MODEL}")
        elif provider == "openai":
            print(f"📊 OPENAI Model: {cls.OPENAI_LLM_MODEL}")

# Initialize configuration
Config.setup_environment()
print("✅ Configuration initialized successfully!")


🔍 Debug - COHERE_API_KEY exists: True
🔍 Debug - COHERE_API_KEY length: 40
🔍 Debug - USE_COHERE_RERANKER: True
🤖 LLM Provider: deepseek
🏫 University: cyberjaya.edu.my_urls_cleaned.txt
📊 DEEPSEEK Model: deepseek-reasoner
✅ Configuration initialized successfully!
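
The collection name and the programs file path are both derived from `UNIVERSITY_NAME`. The standalone sketch below mirrors the string handling in `get_collection_name` and `get_programs_file` so the resulting names can be checked without loading the full configuration; the function names and the `SUFFIX` constant are introduced here for illustration only.

```python
from pathlib import Path

SUFFIX = "_urls_cleaned.txt"  # suffix stripped to obtain the bare domain


def collection_name(university: str) -> str:
    """Chroma collection name: bare domain plus '_collection'."""
    return f"{university.replace(SUFFIX, '')}_collection"


def programs_file(university: str, output_dir: str) -> Path:
    """Programs JSON path: <output_dir>/<domain>/<full university name>_programs.json."""
    domain = university.replace(SUFFIX, "")
    return Path(output_dir) / domain / f"{university}_programs.json"


name = "cyberjaya.edu.my_urls_cleaned.txt"
print(collection_name(name))
print(programs_file(name, "programs_names_output_test").as_posix())
```

Note that the file name keeps the full `UNIVERSITY_NAME`, including the `_urls_cleaned.txt` suffix, while the directory uses the bare domain, so the programs file must be saved under that exact name for the loader to find it.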


## 3. Pydantic Models for International Student Fees

In [4]:
# Pydantic models for international student fees (from international_fees_strategy.py)
class DiscountItem(BaseModel):
    name: str = Field(description="The name/label of the discount (e.g., 'Siblings', 'Early admission')")
    type: str = Field(description="Discount type: 'percentage' or 'exact'")
    amount: float = Field(description="Numeric amount. For percentage, use numbers like 5 for 5%.")

class InternationalStudentFees(BaseModel):
    program_name: str = Field(description="The name of the academic program")
    international_tuition_fees: Dict = Field(
        description="Structured international tuition fees with keys: _unit, _currency, _official_amount, _discounted_amount"
    )
    international_deposit_fees: Dict = Field(
        description="Structured international deposit fees with keys: _currency, _amount"
    )
    international_application_fees: Dict = Field(
        description="Structured international application fees with keys: _currency, _amount"
    )
    other_international_fees: str = Field(default="", description="Any other fees not categorized as tuition/deposit/application (free text)")
    discounts: List[DiscountItem] = Field(default_factory=list, description="List of named discounts with type and amount")
    confidence: float = Field(description="Confidence score (0.0 to 1.0)", ge=0.0, le=1.0)

class BatchInternationalStudentFees(BaseModel):
    programs: List[InternationalStudentFees] = Field(description="List of international student fees information")

print("✅ Pydantic models defined successfully!")


✅ Pydantic models defined successfully!
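
`DiscountItem` stores each discount as a `type` (`'percentage'` or `'exact'`) plus a numeric `amount`. One way a consumer of the extracted data might apply such a discount to an official tuition amount is sketched below. `apply_discount` is a hypothetical helper that does not appear in the original project, and clamping the result at zero is an assumption.

```python
def apply_discount(official_amount: float, discount_type: str, amount: float) -> float:
    """Return the price after applying one discount.

    'percentage' treats `amount` as percent points (5 means 5%);
    'exact' treats `amount` as an absolute value in the same currency.
    """
    if discount_type == "percentage":
        discounted = official_amount - official_amount * amount / 100
    elif discount_type == "exact":
        discounted = official_amount - amount
    else:
        raise ValueError(f"Unknown discount type: {discount_type!r}")
    # Assumption: a discount never pushes the price below zero.
    return max(float(discounted), 0.0)


print(apply_discount(5000, "percentage", 5))
print(apply_discount(5000, "exact", 100))
```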


## 4. UniversityServices Class

In [5]:
class UniversityServices:
    """Services for vector search, reranking, and LLM calls"""
    
    def __init__(self, collection_name=None):
        print("🔄 Initializing services...")
        self.collection_name = collection_name or Config.get_collection_name()
        self.embeddings = self._init_embeddings()
        self.vector_store = self._init_vector_store()
        self.cross_encoder = self._init_cross_encoder()
        self.cohere_client = self._init_cohere()
        self.llm = self._init_llm()
        print("✅ All services initialized successfully!")
    
    def _init_embeddings(self):
        """Initialize OpenAI embeddings"""
        return OpenAIEmbeddings(model=Config.EMBEDDING_MODEL)
    
    def _init_vector_store(self):
        """Initialize Chroma vector store"""
        print(f"🏫 Using collection: {self.collection_name}")
        return Chroma(
            collection_name=self.collection_name,
            embedding_function=self.embeddings,
            persist_directory=Config.VECTOR_DB_PATH
        )
    
    def _init_cross_encoder(self):
        """Initialize CrossEncoder for reranking"""
        if not CROSSENCODER_AVAILABLE:
            return None
        try:
            cross_encoder = CrossEncoder(Config.CROSSENCODER_MODEL)
            print(f"✅ CrossEncoder ({Config.CROSSENCODER_MODEL}) initialized successfully.")
            return cross_encoder
        except Exception as e:
            print(f"⚠️ CrossEncoder initialization error: {e}")
            return None
    
    def _init_cohere(self):
        """Initialize Cohere client for reranking"""
        if not COHERE_AVAILABLE:
            return None

        # Warn explicitly when the key is missing instead of returning silently
        if not Config.COHERE_API_KEY:
            print("⚠️ CO_API_KEY not found in environment variables")
            return None

        try:
            cohere_client = cohere.Client(Config.COHERE_API_KEY)
            print(f"✅ Cohere client ({Config.COHERE_MODEL}) initialized successfully.")
            return cohere_client
        except Exception as e:
            print(f"⚠️ Cohere initialization error: {e}")
            return None
    
    def _init_llm(self):
        """Initialize LLM based on provider"""
        provider = Config.LLM_PROVIDER.lower()
        
        print(f"🔍 Debug - Initializing LLM with provider: {provider}")
        
        if provider == "openai":
            print(f"🔍 Debug - OPENAI_API_KEY exists: {bool(Config.OPENAI_API_KEY)}")
            llm = ChatOpenAI(
                model=Config.OPENAI_LLM_MODEL,
                api_key=Config.OPENAI_API_KEY,
                temperature=Config.LLM_TEMPERATURE
            )
            print("✅ OpenAI LLM initialized successfully.")
            return llm
        
        elif provider == "cerebras":
            print(f"🔍 Debug - CEREBRAS_API_KEY exists: {bool(Config.CEREBRAS_API_KEY)}")
            llm = ChatCerebras(
                model=Config.CEREBRAS_LLM_MODEL,
                api_key=Config.CEREBRAS_API_KEY,
                temperature=Config.LLM_TEMPERATURE,
                max_tokens=40000,
            )
            print("✅ Cerebras LLM initialized successfully.")
            return llm
        
        elif provider == "deepseek":
            print(f"🔍 Debug - DEEPSEEK_API_KEY exists: {bool(Config.DEEPSEEK_API_KEY)}")
            if Config.DEEPSEEK_API_KEY:
                print(f"🔍 Debug - DEEPSEEK_API_KEY length: {len(Config.DEEPSEEK_API_KEY)}")
            else:
                print("⚠️ DEEPSEEK_API_KEY is not set in environment!")
            
            llm = ChatDeepSeek(
                model=Config.DEEPSEEK_LLM_MODEL,
                api_key=Config.DEEPSEEK_API_KEY,
                temperature=Config.LLM_TEMPERATURE,
                max_tokens=40000,
            )
            print("✅ DeepSeek LLM initialized successfully.")
            return llm
        
        else:  # Default to Groq
            print(f"🔍 Debug - GROQ_API_KEY exists: {bool(Config.GROQ_API_KEY)}")
            llm = ChatGroq(
                model=Config.GROQ_LLM_MODEL,
                api_key=Config.GROQ_API_KEY,
                temperature=Config.LLM_TEMPERATURE,
                max_tokens=40000,
                reasoning_format="hidden",
                timeout=None
            )
            print("✅ Groq LLM initialized successfully.")
            return llm
    
    
    def vector_search(self, query: str, k: int = None) -> List[Dict]:
        """Perform vector search and return contexts"""
        if k is None:
            k = Config.DEFAULT_VECTOR_SEARCH_K
        
        print("📚 Performing vector search...")
        docs = self.vector_store.similarity_search(query, k=k)
        
        # Convert to contexts format
        contexts = []
        for doc in docs:
            contexts.append({
                'text': doc.page_content,
                'metadata': doc.metadata
            })
        
        print(f"✅ Found {len(contexts)} documents")
        return contexts
    
    def rerank_with_cross_encoder(self, query: str, contexts: List[Dict], top_n: int = None) -> List[Dict]:
        """Rerank contexts using CrossEncoder"""
        if top_n is None:
            top_n = Config.DEFAULT_RERANK_TOP_N
        
        if not contexts or not self.cross_encoder:
            return contexts[:top_n]
        
        print(f"🔄 Reranking {len(contexts)} documents with CrossEncoder...")
        
        try:
            query_context_pairs = [[query, context['text']] for context in contexts]
            scores = self.cross_encoder.predict(query_context_pairs)
            
            for i, score in enumerate(scores):
                contexts[i]['rerank_score'] = float(score)
            
            reranked_contexts = sorted(contexts, key=lambda x: x['rerank_score'], reverse=True)[:top_n]
            print(f"✅ Reranked to top {len(reranked_contexts)} documents")
            return reranked_contexts
            
        except Exception as e:
            print(f"❌ CrossEncoder Error: {e}")
            return contexts[:top_n]
    
    def rerank_with_cohere(self, query: str, contexts: List[Dict], top_n: int = None) -> List[Dict]:
        """Rerank contexts using Cohere Rerank API"""
        if top_n is None:
            top_n = Config.DEFAULT_RERANK_TOP_N
        
        if not contexts or not self.cohere_client:
            return contexts[:top_n]
        
        documents_to_rerank = [context['text'] for context in contexts]
        print(f"🔄 Sending {len(documents_to_rerank)} documents to Cohere Rerank API...")
        
        try:
            rerank_results = self.cohere_client.rerank(
                query=query,
                documents=documents_to_rerank,
                top_n=top_n,
                model=Config.COHERE_MODEL
            )
            
            reranked_contexts = []
            for result in rerank_results.results:
                original_context = contexts[result.index]
                original_context['rerank_score'] = result.relevance_score
                reranked_contexts.append(original_context)
            
            print(f"✅ Reranked to top {len(reranked_contexts)} documents")
            return reranked_contexts
            
        except Exception as e:
            print(f"❌ Cohere API Error: {e}")
            return contexts[:top_n]

print("✅ UniversityServices class defined successfully!")


✅ UniversityServices class defined successfully!
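
Both rerankers follow the same pattern: score every retrieved context against the query, attach the score as `rerank_score`, sort in descending order, and keep the top `top_n`. The sketch below isolates that pattern with a toy word-overlap scorer standing in for CrossEncoder or Cohere. `overlap_score` and `rerank` are hypothetical names, and unlike the class methods this version copies each context rather than mutating it.

```python
from typing import Callable, Dict, List


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that also appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query: str, contexts: List[Dict],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> List[Dict]:
    """Score each context, sort by score (highest first), and keep the top_n."""
    scored = [{**ctx, "rerank_score": float(score_fn(query, ctx["text"]))} for ctx in contexts]
    return sorted(scored, key=lambda ctx: ctx["rerank_score"], reverse=True)[:top_n]


contexts = [
    {"text": "Campus parking rules", "metadata": {}},
    {"text": "International tuition fees per semester", "metadata": {}},
    {"text": "Tuition payment deadlines", "metadata": {}},
]
top = rerank("international tuition fees", contexts, overlap_score, top_n=2)
print([ctx["text"] for ctx in top])
```

Swapping `overlap_score` for a function that wraps a real model keeps the rest of the flow unchanged, which is why a failed reranker call can fall back to the unranked `contexts[:top_n]`.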


## 5. International Fees Strategy Implementation


In [6]:
class InternationalFeesStrategy:
    """International Student Fees Extraction Strategy - Standalone Implementation"""
    
    def get_description(self) -> str:
        return "International Student Fees Information Extractor (BATCHED)"
    
    def get_search_description(self) -> str:
        return "international student fees"
    
    def get_search_query(self, program_batch: List[str]) -> str:
        """Create comprehensive search query for international student fees"""
        all_programs_query = " OR ".join([f'"{program}"' for program in program_batch])
        return f"international student tuition fees foreign overseas global deposit application ({all_programs_query})"
    
    def get_extraction_messages(self, context: str, program_batch: List[str]) -> List[Dict]:
        """Create messages for international student fees extraction"""
        schema_str = json.dumps(BatchInternationalStudentFees.model_json_schema(), ensure_ascii=False)
        programs_list = "\n".join([f"- {program}" for program in program_batch])
        
        system_instructions = [
            "Your task is to find INTERNATIONAL STUDENT tuition and fees for ALL the programs mentioned.",
            "IMPORTANT: Focus ONLY on INTERNATIONAL/FOREIGN student fees, NOT local/domestic student fees.",
            "CRITICAL: You must return ACTUAL DATA, not schema definitions or examples.",
            "Do NOT return the JSON schema structure - return the actual extracted data.",
            "OUTPUT FORMAT: For each program, return OBJECTS with STRICT keys:",
            "international_tuition_fees: { '_unit': 'Years|Semesters|Annual|Per credit', '_currency': 'USD|EUR|TRY|GBP|...', '_official_amount': 5000, '_discounted_amount': 4500|null }",
            "international_deposit_fees: { '_currency': 'USD|EUR|TRY|GBP|...', '_amount': 1000|null }",
            "international_application_fees: { '_currency': 'USD|EUR|TRY|GBP|...', '_amount': 100|null }",
            "other_international_fees: 'Free text summary of any other fees (e.g., lab fees, insurance); include amounts/currency if available'",
            "discounts: [ { 'name': 'Siblings', 'type': 'exact', 'amount': 100 }, { 'name': 'Early admission', 'type': 'percentage', 'amount': 5 } ]",
            "Amounts must be NUMBERS (no symbols). Use currency CODES. If not found, set the field to null and keep the object.",
            "IMPORTANT: '_unit' and '_currency' MUST be STRINGS. If unknown, use empty string '', NOT false, true, or null.",
            "DEFINITION: '_discounted_amount' means the price AFTER discount and MUST be less than or equal to '_official_amount'.",
            "CONSTRAINTS: '_discounted_amount' MUST use the SAME '_unit' and '_currency' as '_official_amount'. If you are not certain or values are inconsistent, set '_discounted_amount' to null. Do NOT guess or swap meanings.",
            "KEY DEFINITIONS - Understand the difference:",
            "",
            "• APPLICATION FEES:",
            "  - Timing: Paid BEFORE acceptance/enrollment (during application submission)",
            "  - Purpose: To process and review your application",
            "  - Common terms: 'application fee', 'processing fee', 'registration fee' (when applying)",
            "  - Example: '$100 application fee to submit your application'",
            "",
            "• DEPOSIT FEES:",
            "  - Timing: Paid AFTER acceptance (to confirm enrollment)",
            "  - Purpose: To reserve/secure your spot, partial payment toward tuition",
            "  - Common terms: 'enrollment deposit', 'seat deposit', 'confirmation fee', 'reservation fee'",
            "  - Example: '$1000 deposit to secure your place, deducted from first semester tuition'",
            "",
            "⚠️ DISAMBIGUATION RULES:",
            "  - If fee is paid 'to apply' or 'with application' → APPLICATION FEE",
            "  - If fee is paid 'after acceptance' or 'to confirm enrollment' → DEPOSIT FEE",
            "  - If fee is 'deducted from tuition' or 'refundable' → DEPOSIT FEE",
            "  - If unclear or ambiguous → set to null",
            "OTHER FEES:",
            "  - Any fee that is not clearly tuition/application/deposit (e.g., lab/materials, insurance, administrative fees, Student Services and Amenities Fees, Visa validation fee, student services) → put details under other_international_fees (free text)",
            "  - IMPORTANT: If no information is found for this free-text field, return empty string (\"\") - do NOT return null, 'Not found', 'N/A', or any placeholder text.",
            ""
        ]
        
        search_terms = [
            "IMPORTANT DISTINCTIONS:",
            "- Look for terms like 'international', 'foreign', 'overseas', 'non-citizen', 'global'",
            "- Exclude 'local', 'domestic', 'national', 'citizen' student fees",
            "- Identify units (annual, per year, per semester) and convert to a normalized label",
            "- Extract currency CODE and numeric AMOUNT",
            "DISCOUNTS: Look for 'discount', 'early bird', 'merit discount', 'family/siblings', 'alumni', 'early admission'",
            "DISCOUNTS PARSING: classify '5%' as percentage (amount: 5). Classify '100 USD' as exact (amount: 100)."
        ]
        
        guidelines = [
            "- Extract numeric amounts only (e.g., 5,000 USD → amount: 5000, currency: USD)",
            "- Normalize units to one of: 'Years', 'Semesters', 'Annual', 'Per credit'",
            "- If only one tuition amount is present, set it as _official_amount and leave _discounted_amount null",
            "- Ensure '_discounted_amount' ≤ '_official_amount'. If not, set '_discounted_amount' to null",
            "- Ensure both amounts (if present) share the SAME unit and currency; otherwise set '_discounted_amount' to null",
            "- If fees are 'free' or 'waived', set amount to 0",
            "- If information spans multiple sections, combine logically; prefer the most common values"
        ]
        
        fields_to_extract = [
            "International Tuition Fees (unit, currency, official amount, discounted amount)",
            "International Deposit Fees (currency, amount)",
            "International Application Fees (currency, amount)",
            "Other International Fees (any non tuition/deposit/application fees such as administrative fees, Student Services and Amenities Fees, Visa validation fee, lab/materials, insurance; free text including numeric amounts if known)",
            "Discount Programs (name, type: percentage|exact, amount numeric)"
        ]
        
        return self._create_base_messages(
            schema_str, programs_list, context, 
            system_instructions, search_terms, guidelines, fields_to_extract
        )
    
    def _create_base_messages(self, schema_str: str, programs_list: str, context: str, 
                             system_instructions: List[str], search_terms: List[str], 
                             guidelines: List[str], fields_to_extract: List[str]) -> List[Dict]:
        """Helper method to create standardized messages"""
        
        system_content = [
            "You are an expert at extracting academic program information.",
            f"Your task is to find {', '.join(fields_to_extract).upper()} for ALL the programs mentioned.",
            "IMPORTANT: Process ALL programs in the list and return information for each one.",
            "LANGUAGE POLICY: Always respond in English. If the provided context is not in English, translate all extracted information and labels into clear English. Preserve numeric values and currency codes; transliterate proper nouns if needed.",
            "no_think: Only return the final answer. Do not include chain-of-thought or <think> tags.",
            ""
        ] + system_instructions + [
            "",
            "SEARCH TERMS TO LOOK FOR:"
        ] + search_terms + [
            "",
            "IMPORTANT GUIDELINES:"
        ] + guidelines + [
            "",
            "If specific information is not found for a program, return empty strings for those fields.",
            "Assess your confidence based on how complete and relevant the found information is for each program.",
            "",
            "Please respond in exactly the following JSON format:",
            "",
            "```json",
            schema_str,
            "```"
        ]
        
        human_content = [
            "## Programs to Process:",
            programs_list,
            "",
            "## Context:",
            context,
            "",
            "## Output Schema:",
            schema_str,
            "",
            f"## Find {', '.join(fields_to_extract).upper()} for ALL programs listed above:",
            "Extract the following information for each program:"
        ] + [f"{i+1}. {field}" for i, field in enumerate(fields_to_extract)] + [
            "",
            "Return the information for ALL programs in the specified JSON format with a 'programs' array."
        ]
        
        return [
            {
                "role": "system",
                "content": "\n".join(system_content)
            },
            {
                "role": "user", 
                "content": "\n".join(human_content)
            }
        ]
    
    def create_result_from_data(self, found_data: Dict, program_name: str) -> Dict:
        """Create result dictionary from found data"""
        result = {
            "program_name": found_data.get("program_name", program_name),
            "international_tuition_fees": found_data.get("international_tuition_fees", {
                "_unit": "",
                "_currency": "",
                "_official_amount": None,
                "_discounted_amount": None
            }),
            "international_deposit_fees": found_data.get("international_deposit_fees", {
                "_currency": "",
                "_amount": None
            }),
            "international_application_fees": found_data.get("international_application_fees", {
                "_currency": "",
                "_amount": None
            }),
            "other_international_fees": found_data.get("other_international_fees", found_data.get("other_fees", "")),
            "discounts": found_data.get("discounts", []),
            "confidence": float(found_data.get("confidence", 0.0))
        }

        # Post-validation: Clean up invalid types and validate business rules
        try:
            it = result.get("international_tuition_fees", {})
            
            # Fix _unit if it's not a string (e.g., boolean false)
            if not isinstance(it.get("_unit"), str):
                it["_unit"] = ""
            
            # Fix _currency if it's not a string
            if not isinstance(it.get("_currency"), str):
                it["_currency"] = ""
            
            # Fix amounts if they're not numbers
            if not isinstance(it.get("_official_amount"), (int, float, type(None))):
                it["_official_amount"] = None
            if not isinstance(it.get("_discounted_amount"), (int, float, type(None))):
                it["_discounted_amount"] = None
            
            # Validate discounted amount business rules
            off = it.get("_official_amount")
            disc = it.get("_discounted_amount")
            unit = it.get("_unit")
            cur = it.get("_currency")

            if isinstance(off, (int, float)) and isinstance(disc, (int, float)):
                same_unit = isinstance(unit, str) and unit.strip() != ""
                same_currency = isinstance(cur, str) and cur.strip() != ""
                if disc > off or not (same_unit and same_currency):
                    it["_discounted_amount"] = None
        except Exception:
            pass
        
        # Clean up deposit and application fees
        try:
            # Named `dep` to avoid shadowing the built-in id()
            dep = result.get("international_deposit_fees", {})
            if not isinstance(dep.get("_currency"), str):
                dep["_currency"] = ""
            if not isinstance(dep.get("_amount"), (int, float, type(None))):
                dep["_amount"] = None
                
            ia = result.get("international_application_fees", {})
            if not isinstance(ia.get("_currency"), str):
                ia["_currency"] = ""
            if not isinstance(ia.get("_amount"), (int, float, type(None))):
                ia["_amount"] = None
        except Exception:
            pass

        return result
    
    def create_empty_result(self, program_name: str) -> Dict:
        """Create empty result when no data found"""
        return {
            "program_name": program_name,
            "international_tuition_fees": {"_unit": "", "_currency": "", "_official_amount": None, "_discounted_amount": None},
            "international_deposit_fees": {"_currency": "", "_amount": None},
            "international_application_fees": {"_currency": "", "_amount": None},
            "other_international_fees": "",
            "discounts": [],
            "confidence": 0.0
        }
    
    def create_empty_results(self, program_batch: List[str]) -> List[Dict]:
        """Create empty results for entire batch when no data found"""
        return [self.create_empty_result(program) for program in program_batch]
    
    def display_result(self, result: Dict) -> None:
        """Display result in a formatted way"""
        print(f"\n🌍 INTERNATIONAL STUDENT FEES for {result['program_name']}:")
        
        # Tuition fees - handle None values
        tuition = result.get('international_tuition_fees') or {}
        if tuition.get('_official_amount') is not None:  # 0 is a valid amount (free/waived)
            official = f"{tuition.get('_official_amount', 'N/A')} {tuition.get('_currency', '')} per {tuition.get('_unit', '')}"
            print(f"   Official Tuition: {official}")
            if tuition.get('_discounted_amount'):
                discounted = f"{tuition.get('_discounted_amount', 'N/A')} {tuition.get('_currency', '')} per {tuition.get('_unit', '')}"
                print(f"   Discounted Tuition: {discounted}")
        else:
            print(f"   Tuition: Not found")
        
        # Deposit fees - handle None values
        deposit = result.get('international_deposit_fees') or {}
        if deposit.get('_amount') is not None:  # 0 is a valid amount (free/waived)
            deposit_str = f"{deposit.get('_amount', 'N/A')} {deposit.get('_currency', '')}"
            print(f"   Deposit Fees: {deposit_str}")
        else:
            print(f"   Deposit Fees: Not found")
        
        # Application fees - handle None values
        application = result.get('international_application_fees') or {}
        if application.get('_amount') is not None:  # 0 is a valid amount (free/waived)
            app_str = f"{application.get('_amount', 'N/A')} {application.get('_currency', '')}"
            print(f"   Application Fees: {app_str}")
        else:
            print(f"   Application Fees: Not found")
        
        print(f"   Confidence: {result.get('confidence', 0.0):.2f}")

print("✅ InternationalFeesStrategy class defined successfully!")


✅ InternationalFeesStrategy class defined successfully!
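
The post-validation step in `create_result_from_data` drops a discounted tuition amount when it exceeds the official amount, or when the unit or currency is missing. The standalone sketch below restates that rule for a single tuition dictionary so it can be tried on sample values. `sanitize_discount` is a hypothetical helper, and it returns a copy instead of editing the dictionary in place.

```python
from typing import Any, Dict


def _non_empty_str(value: Any) -> bool:
    """True when value is a string with non-whitespace content."""
    return isinstance(value, str) and value.strip() != ""


def sanitize_discount(tuition: Dict[str, Any]) -> Dict[str, Any]:
    """Null out '_discounted_amount' when it cannot be trusted."""
    cleaned = dict(tuition)
    official = cleaned.get("_official_amount")
    discounted = cleaned.get("_discounted_amount")
    if isinstance(official, (int, float)) and isinstance(discounted, (int, float)):
        has_context = _non_empty_str(cleaned.get("_unit")) and _non_empty_str(cleaned.get("_currency"))
        if discounted > official or not has_context:
            cleaned["_discounted_amount"] = None
    return cleaned


valid = {"_unit": "Years", "_currency": "USD", "_official_amount": 5000, "_discounted_amount": 4500}
swapped = {"_unit": "Years", "_currency": "USD", "_official_amount": 5000, "_discounted_amount": 6000}
print(sanitize_discount(valid)["_discounted_amount"])
print(sanitize_discount(swapped)["_discounted_amount"])
```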


## 6. load_programs_from_json Function

In [7]:
import re
import json
import json_repair

# IMPROVED JSON PARSER - Handles empty responses
def _parse_json_safely(raw: str):
    """Improved JSON parser that handles empty responses and better error detection"""
    try:
        # Handle empty or None input
        if not raw or not raw.strip():
            print(f"[JSON Parser] Empty or None response received")
            return None
            
        # Strip whitespace and common patterns
        text = raw.strip()
        
        # Check for very short responses that are likely empty
        if len(text) < 10:
            print(f"[JSON Parser] Response too short: '{text}'")
            return None
        
        # Handle multiple code fence variations
        if text.startswith("```json"):
            text = text[7:].strip()
            if text.endswith("```"):
                text = text[:-3].strip()
        elif text.startswith("```"):
            # Split on real newlines so the opening fence line can be dropped
            lines = text.split("\n")
            if len(lines) > 1:
                text = "\n".join(lines[1:])
            else:
                text = text[3:]
            if text.endswith("```"):
                text = text[:-3].strip()
        
        # Remove thinking tags
        if "<think>" in text and "</think>" in text:
            import re
            text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
        
        # Extract JSON object region
        text = text.strip()
        
        # Check if we still have content after cleaning
        if not text or len(text) < 5:
            print(f"[JSON Parser] No content after cleaning")
            return None
            
        if "{" in text and "}" in text:
            start = text.find("{")
            end = text.rfind("}") + 1
            text = text[start:end]
        else:
            print(f"[JSON Parser] No JSON braces found in text")
            return None
        
        # Try to parse with json_repair
        result = json_repair.loads(text)
        print(f"[JSON Parser] Successfully parsed {type(result)}")
        return result
        
    except Exception as e:
        print(f"[JSON Parser] Parsing failed: {e}")
        print(f"[JSON Parser] Raw text (first 200 chars): {raw[:200] if raw else 'None'}")
        return None

print("✅ Utility functions defined successfully!")
print("✅ Improved JSON parser added!")



def load_programs_from_json(filepath: str = None) -> List[str]:
    """Load programs from university-specific programs file"""
    if filepath is None:
        filepath = Config.get_programs_file()
    
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        programs = data.get('programs', [])
        print(f"✅ Loaded {len(programs)} programs from {filepath}")
        return programs
        
    except FileNotFoundError:
        print(f"❌ File {filepath} not found!")
        return []
    except json.JSONDecodeError as e:
        print(f"❌ Error parsing JSON file: {e}")
        return []
    except Exception as e:
        print(f"❌ Error loading programs: {e}")
        return []

def extract_url_from_text(text: str) -> str:
    """Extract the actual URL from text content"""
    import re
    # \S+ stops at the first whitespace character (including newlines)
    url_pattern = r'https?://\S+'
    urls = re.findall(url_pattern, text)
    return urls[0] if urls else 'unknown'

def prepare_shared_sources(reranked_contexts: List[Dict], top_n: int = 3) -> List[Dict]:
    """Prepare shared source information from reranked contexts"""
    shared_sources = []
    for i, chunk in enumerate(reranked_contexts[:top_n], 1):
        metadata = chunk.get('metadata', {})
        text = chunk.get('text', '')
        actual_url = extract_url_from_text(text)
        
        shared_sources.append({
            "rank": i,
            "score": chunk.get('rerank_score', 0.0),
            "source": metadata.get('source', 'unknown'),
            "url": actual_url,
            "chunk_id": metadata.get('chunk_id', 'unknown'),
            "text_preview": text[:200] + "..." if len(text) > 200 else text
        })
    
    return shared_sources

def match_program_in_results(program: str, parsed_results: Dict) -> Dict:
    """Find matching program in parsed results using fuzzy matching"""
    program_clean = program.strip()
    
    # Try to find exact match or partial match
    for parsed_name, parsed_data in parsed_results.items():
        if (parsed_name.lower() == program_clean.lower() or 
            program_clean.lower() in parsed_name.lower() or
            parsed_name.lower() in program_clean.lower()):
            return parsed_data
    
    return None

def save_results_to_json(results: List[Dict], university_name: str, output_filename: str = None) -> Dict:
    """Save results to JSON file"""
    if output_filename is None:
        timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
        output_filename = f"international_fees_results_{timestamp}.json"
    
    # Prepare output data  
    output_data = {
        "timestamp": datetime.datetime.now().isoformat(),
        "university": university_name,
        "agent_type": "international_student_fees",
        "total_programs": len(results),
        "average_confidence": sum(r.get('confidence', 0) for r in results) / len(results) if results else 0,
        # Use `or {}` so fee fields explicitly set to None do not raise AttributeError
        "programs_with_tuition": len([r for r in results if (r.get('international_tuition_fees') or {}).get('_official_amount')]),
        "programs_with_deposit": len([r for r in results if (r.get('international_deposit_fees') or {}).get('_amount')]),
        "programs_with_application": len([r for r in results if (r.get('international_application_fees') or {}).get('_amount')]),
        "results": results
    }
    
    # Save to file
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(output_data, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Results saved to {output_filename}")
    return output_data



✅ Utility functions defined successfully!
✅ Improved JSON parser added!
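
`match_program_in_results` above relies on case-insensitive substring checks, so a short name can match whichever longer variant happens to come first in the parsed results. A minimal sketch of a stricter alternative follows, assuming the standard-library `difflib` is acceptable here. `best_program_match` and its `cutoff` value are hypothetical and not part of the original notebook.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional


def best_program_match(program: str, parsed_results: Dict[str, dict],
                       cutoff: float = 0.8) -> Optional[dict]:
    """Return the parsed entry whose name is most similar to `program`.

    Hypothetical helper: an exact case-insensitive match wins immediately;
    otherwise the highest similarity ratio at or above `cutoff` is used.
    """
    target = program.strip().lower()
    best_score, best_data = 0.0, None
    for name, data in parsed_results.items():
        candidate = name.strip().lower()
        if candidate == target:
            return data
        score = SequenceMatcher(None, target, candidate).ratio()
        if score >= cutoff and score > best_score:
            best_score, best_data = score, data
    return best_data


# Illustrative data only; the "id" values are made up for this sketch.
parsed = {
    "Foundation in Science (Leading to Medicine)": {"id": "medicine"},
    "Foundation in Science (Lead to Pharmacy)": {"id": "pharmacy"},
}
match = best_program_match("foundation in science (lead to pharmacy)", parsed)
no_match = best_program_match("Diploma in Nursing", parsed)
```

Returning `None` for weak candidates keeps the existing fallback to `create_empty_result` intact; the cutoff would need tuning against real program names.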


## 7- extract_international_fees_batch function

In [8]:
# UPDATED VERSION WITH SIMPLE RETRY LOGIC
def extract_international_fees_batch(services: UniversityServices, strategy: InternationalFeesStrategy, 
                                   program_batch: List[str]) -> List[Dict]:
    """Extract international fees for a batch of programs with simple retry logic"""
    
    print(f"\n🔍 Searching {strategy.get_search_description()} for batch of {len(program_batch)} programs")
    print(f"Programs: {', '.join(program_batch[:3])}{'...' if len(program_batch) > 3 else ''}")
    
    # Get search query from strategy
    search_query = strategy.get_search_query(program_batch)
    print(f"🔍 Search query: {search_query[:100]}...")
    
    # Single vector search for all programs
    contexts = services.vector_search(search_query)
    
    if not contexts:
        print("❌ No documents found in vector search")
        return strategy.create_empty_results(program_batch)
    
    # Reranking - choose between CrossEncoder or Cohere
    if Config.USE_COHERE_RERANKER and services.cohere_client:
        print("🚀 Applying Cohere reranking for batch...")
        reranked_contexts = services.rerank_with_cohere(search_query, contexts)
    else:
        print("🚀 Applying CrossEncoder reranking for batch...")
        if not Config.USE_COHERE_RERANKER:
            print("   Reason: USE_COHERE_RERANKER is False")
        elif not services.cohere_client:
            print("   Reason: Cohere client is not available")
        reranked_contexts = services.rerank_with_cross_encoder(search_query, contexts)
    
    if not reranked_contexts:
        print("❌ No documents returned from reranking")
        return strategy.create_empty_results(program_batch)
    
    # Combine contexts for batch processing (joined with real newlines)
    combined_context = "\n".join([ctx['text'] for ctx in reranked_contexts])
    print(f"📝 Combined context length: {len(combined_context):,} characters")
    
    # Single LLM call for ALL programs in the batch - WITH RETRY LOGIC
    print(f"🤖 Processing ALL {len(program_batch)} programs with SINGLE LLM call...")
    
    # Get LLM response for ALL programs at once using strategy
    messages = strategy.get_extraction_messages(combined_context, program_batch)
    
    # SIMPLE RETRY LOGIC - Only retry if schema found
    max_retries = 2
    programs_data = []
    
    for attempt in range(max_retries + 1):
        try:
            if attempt > 0:
                print(f"🔄 Retry attempt {attempt + 1}/{max_retries + 1}")
            
            response = services.llm.invoke(messages)
            raw_content = response.content
            print(f"Raw response preview: {raw_content[:200]}...")
            
            # Check if response contains schema instead of actual data
            if raw_content.strip().startswith('{"$defs":') or '"$defs":' in raw_content[:100]:
                print("⚠️ Got schema response instead of data")
                if attempt < max_retries:
                    print("🔄 Retrying...")
                    time.sleep(2)
                    continue
                else:
                    print("⚠️ Final attempt - processing schema response")
            
            # ✅ Log the raw response to file
            llm_logger.info("=== Raw LLM Response for Batch ===")
            llm_logger.info(f"Programs: {program_batch}")
            llm_logger.info(raw_content)
            llm_logger.info("=================================")
            
            # Parse response for all programs using improved JSON parser
            try:
                print("🔧 Using improved JSON parser for LLM response...")
                data = _parse_json_safely(response.content)
                
                if data is None:
                    print("❌ JSON parsing returned None - creating empty results")
                    programs_data = []
                else:
                    # Handle different response formats
                    if isinstance(data, dict):
                        # Expected format: {"programs": [...]}
                        programs_data = data.get("programs", [])
                        print(f"📊 Found 'programs' key with {len(programs_data)} entries")
                    elif isinstance(data, list):
                        # Direct list format: [...]
                        programs_data = data
                        print(f"📊 Found direct list with {len(programs_data)} entries")
                    else:
                        # Unexpected format
                        print(f"⚠️ Unexpected response format: {type(data)}")
                        programs_data = []
                    
                    if not programs_data:
                        print("⚠️ No programs data found in response, creating empty results")
                        programs_data = []
                    
                    print(f"✅ Successfully parsed data for {len(programs_data)} programs")
                
            except Exception as e:
                print(f"❌ Error in improved JSON parsing: {e}")
                programs_data = []
            
            # Break out of retry loop if we got here
            break
        
        except Exception as e:
            print(f"❌ Error calling LLM (attempt {attempt + 1}): {e}")
            if attempt < max_retries:
                print("🔄 Retrying...")
                time.sleep(2)
                continue
            else:
                print("❌ All attempts failed")
                programs_data = []
    
    # Process results and ensure we have data for all programs
    return process_batch_results(strategy, program_batch, programs_data, reranked_contexts)

print("✅ Updated function with simple retry logic created!")



def process_batch_results(strategy: InternationalFeesStrategy, program_batch: List[str], 
                         programs_data: List[Dict], reranked_contexts: List[Dict]) -> List[Dict]:
    """Process batch results and ensure we have data for all programs"""
    batch_results = []
    
    # Create a mapping of parsed results
    parsed_results = {}
    for prog_data in programs_data:
        if isinstance(prog_data, dict):
            prog_name = prog_data.get("program_name", "").strip()
            if prog_name:
                parsed_results[prog_name] = prog_data
        else:
            print(f"⚠️ Skipping non-dictionary program data: {type(prog_data)} - {prog_data}")
            continue
    
    # Prepare shared source information
    shared_sources = prepare_shared_sources(reranked_contexts)
    
    # Ensure we have results for all requested programs
    for program in program_batch:
        program_clean = program.strip()
        
        # Try to find matching result
        found_result = match_program_in_results(program_clean, parsed_results)
        
        if found_result:
            result = strategy.create_result_from_data(found_result, program_clean)
        else:
            # No result found for this program
            result = strategy.create_empty_result(program_clean)
        
        # Add shared sources to each program
        result['sources'] = shared_sources.copy()
        batch_results.append(result)
        
        # Display results immediately using strategy
        strategy.display_result(result)
    
    return batch_results

print("✅ Main extraction functions defined successfully!")


✅ Updated function with simple retry logic created!
✅ Main extraction functions defined successfully!
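
The retry loop in `extract_international_fees_batch` combines the LLM call, schema detection and parsing in one block. The sketch below isolates only the retry pattern. `call_with_retry`, `is_acceptable` and the delay value are hypothetical names chosen for illustration, and the fake callable stands in for `services.llm.invoke`.

```python
import time
from typing import Callable, Optional


def call_with_retry(call: Callable[[], str],
                    is_acceptable: Callable[[str], bool],
                    max_retries: int = 2,
                    delay_seconds: float = 0.0) -> Optional[str]:
    """Call `call` up to max_retries + 1 times.

    Returns the first acceptable response. If every attempt is rejected or
    raises, returns the last response received (None if there was none).
    """
    last_response = None
    for attempt in range(max_retries + 1):
        try:
            last_response = call()
            if is_acceptable(last_response):
                return last_response
        except Exception as exc:  # broad on purpose: provider errors vary
            print(f"Attempt {attempt + 1} failed: {exc}")
        if attempt < max_retries:
            time.sleep(delay_seconds)
    return last_response


# Stand-in for the LLM: echoes a schema once, then returns data.
responses = iter(['{"$defs": {}}', '{"programs": []}'])
result = call_with_retry(lambda: next(responses),
                         lambda text: '"$defs"' not in text)
```

As in the notebook's loop, a rejected response on the final attempt is still handed back to the caller, which can then decide whether to parse it or fall back to empty results.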


## 8- Initialize services and strategy

In [10]:
# Initialize services and strategy
print("🚀 Initializing International Student Fees Extractor...")
print("=" * 80)

# Initialize services
services = UniversityServices()

# Initialize strategy
strategy = InternationalFeesStrategy()

print(f"\n✅ Initialized: {strategy.get_description()}")
print(f"📊 Collection: {services.collection_name}")
# print(f"🤖 LLM: {Config.LLM_PROVIDER} - {Config.LLM_MODEL}")
print(f"🔄 Reranker: {'Cohere' if Config.USE_COHERE_RERANKER else 'CrossEncoder'}")


🚀 Initializing International Student Fees Extractor...
🔄 Initializing services...
🏫 Using collection: cyberjaya.edu.my_collection
✅ CrossEncoder (cross-encoder/ms-marco-MiniLM-L-6-v2) initialized successfully.
✅ Cohere client (rerank-english-v3.0) initialized successfully.
🔍 Debug - Initializing LLM with provider: deepseek
🔍 Debug - DEEPSEEK_API_KEY exists: True
🔍 Debug - DEEPSEEK_API_KEY length: 35
✅ DeepSeek LLM initialized successfully.
✅ All services initialized successfully!
\n✅ Initialized: International Student Fees Information Extractor (BATCHED)
📊 Collection: cyberjaya.edu.my_collection
🔄 Reranker: Cohere


## 9- Load programs

In [11]:
# Load programs from file
print("\n📚 Loading programs...")
programs = load_programs_from_json()

if not programs:
    print("❌ No programs found! Please check the programs file.")
else:
    print(f"✅ Loaded {len(programs)} programs for processing")
    print(f"Sample programs: {programs[:3]}")
    
    # For testing, let's process a small batch first
    test_batch_size = 5  # Adjust this number as needed
    test_programs = programs[:test_batch_size]
    
    print(f"\n🧪 Testing with {len(test_programs)} programs:")
    for i, prog in enumerate(test_programs, 1):
        print(f"  {i}. {prog}")


\nüìö Loading programs...
‚úÖ Loaded 85 programs from d:\AI_Data_Extractor\programs_names_output_test\cyberjaya.edu.my\cyberjaya.edu.my_urls_cleaned.txt_programs.json
‚úÖ Loaded 85 programs for processing
Sample programs: ['Foundation in Science (Leading to Medicine) (English)', 'Foundation in Science (Lead to Pharmacy) (English)', 'Foundation in Allied Health Science (English)']
\nüß™ Testing with 5 programs:
  1. Foundation in Science (Leading to Medicine) (English)
  2. Foundation in Science (Lead to Pharmacy) (English)
  3. Foundation in Allied Health Science (English)
  4. Foundation in Arts (English)
  5. Diploma in Information Technology (English)


## 9.1- Enhanced Debug Functions for Batch Processing

This cell configures a dedicated file logger (`llm_responses`) that records every raw LLM response, so malformed or failed batches can be inspected after the run.


In [12]:
import logging
import datetime

# Configure logger for LLM responses
llm_logger = logging.getLogger("llm_responses")
llm_logger.setLevel(logging.INFO)

# Create file handler with timestamp in filename
# (single quotes inside the f-string keep this valid on Python versions before 3.12)
log_filename = f"international_fees_llm_responses_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}.log"
file_handler = logging.FileHandler(log_filename, mode="w", encoding="utf-8")
file_handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))

llm_logger.addHandler(file_handler)

print(f"📂 LLM responses will be logged to: {log_filename}")


📂 LLM responses will be logged to: international_fees_llm_responses_20251023_131050.log
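
Re-running the cell above attaches another `FileHandler` to the same logger, so later messages are written to every log file opened so far. A minimal sketch of an idempotent setup follows, assuming a single active log file is what is wanted. `get_llm_logger` is a hypothetical helper name, and the temporary directory is used only to keep the example self-contained.

```python
import logging
import os
import tempfile


def get_llm_logger(log_path: str) -> logging.Logger:
    """Return the 'llm_responses' logger with exactly one file handler."""
    logger = logging.getLogger("llm_responses")
    logger.setLevel(logging.INFO)
    # Close and detach handlers left over from earlier runs of the cell.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = os.path.join(tempfile.mkdtemp(), "llm_responses.log")
llm_logger = get_llm_logger(log_path)
llm_logger = get_llm_logger(log_path)  # simulates running the cell twice
llm_logger.info("=== Raw LLM Response for Batch ===")
```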


## 10- Execute Extraction (test)


In [None]:
# Execute the extraction
if programs:
    print("\n" + "=" * 80)
    print("🚀 STARTING INTERNATIONAL STUDENT FEES EXTRACTION")
    print("=" * 80)
    
    # Extract fees for the test batch
    results = extract_international_fees_batch(services, strategy, test_programs)
    
    print("\n" + "=" * 80)
    print("📊 EXTRACTION COMPLETED")
    print("=" * 80)
    
    # Display summary statistics
    total_programs = len(results)
    avg_confidence = sum(r.get('confidence', 0) for r in results) / total_programs if results else 0

    # Handle None values in statistics
    programs_with_tuition = len([r for r in results if (r.get('international_tuition_fees') or {}).get('_official_amount')])
    programs_with_deposit = len([r for r in results if (r.get('international_deposit_fees') or {}).get('_amount')])
    programs_with_application = len([r for r in results if (r.get('international_application_fees') or {}).get('_amount')])
        
    print(f"\n📈 SUMMARY STATISTICS:")
    print(f"   Total programs processed: {total_programs}")
    print(f"   Average confidence: {avg_confidence:.2f}")
    print(f"   Programs with tuition fees: {programs_with_tuition}")
    print(f"   Programs with deposit fees: {programs_with_deposit}")
    print(f"   Programs with application fees: {programs_with_application}")
    
    # Save results to JSON file
    output_data = save_results_to_json(results, Config.UNIVERSITY_NAME)
    
    print(f"\n✅ International student fees extraction completed successfully!")
else:
    print("❌ Cannot proceed without programs data.")


🚀 STARTING INTERNATIONAL STUDENT FEES EXTRACTION
\n🔍 Searching international student fees for batch of 5 programs
Programs: Certificate in Visual Design, Foundation in Science, Bachelor in Mass Communication (Honours)...
🔍 Search query: international student tuition fees foreign overseas global deposit application ("Certificate in Visu...
📚 Performing vector search...
✅ Found 30 documents
🚀 Applying Cohere reranking for batch...
🔄 Sending 30 documents to Cohere Rerank API...
✅ Reranked to top 4 documents
📝 Combined context length: 30,368 characters
🤖 Processing ALL 5 programs with SINGLE LLM call...
Raw response preview: ```json
{
  "$defs": {
    "InternationalStudentFees": {
      "properties": {
        "program_name": {
          "description": "The name of the academic program",
          "title": "Program Name",...
🔧 Using improved JSON parser for LLM response...
📊 Found direct list with 2 entries
✅ Successfully parsed data for 2 programs
⚠️
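
The preview above shows the model echoing a JSON schema (`"$defs"`) inside a fenced block instead of returning fee data. The check in `extract_international_fees_batch` looks for the literal text `"$defs":` near the start of the response, which depends on the model's exact spacing. A sketch of a structural check follows. `looks_like_schema_echo` is a hypothetical helper, it is a heuristic only, and it uses the standard `json` module rather than `json_repair`.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built indirectly for readability


def looks_like_schema_echo(raw: str) -> bool:
    """Heuristic: True if the response is a JSON Schema rather than data."""
    text = raw.strip()
    # Drop an optional leading fence (with language tag) and a trailing fence.
    text = re.sub(r"^" + FENCE + r"[a-zA-Z]*\s*", "", text)
    text = re.sub(r"\s*" + FENCE + r"$", "", text)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Unparseable (for example truncated) text: fall back to a substring test.
        return '"$defs"' in text[:200]
    return isinstance(data, dict) and ("$defs" in data or "properties" in data)


schema_reply = FENCE + 'json\n{"$defs": {"InternationalStudentFees": {}}}\n' + FENCE
data_reply = '{"programs": [{"program_name": "Foundation in Arts"}]}'
is_schema = looks_like_schema_echo(schema_reply)
is_data_schema = looks_like_schema_echo(data_reply)
```

A genuine data payload that happens to contain a top-level `properties` key would be misclassified, so the key list is an assumption to adjust for the actual response format.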

## 11- Process all programs

In [13]:
# Run this cell to process ALL programs in PARALLEL batches

from concurrent.futures import ThreadPoolExecutor, as_completed
import threading

if programs and len(programs) > test_batch_size:
    print(f"\n🔄 Processing ALL {len(programs)} programs in PARALLEL batches...")
    
    batch_size = Config.DEFAULT_BATCH_SIZE
    total_batches = (len(programs) + batch_size - 1) // batch_size
    
    # Create batches
    batches = []
    for i in range(0, len(programs), batch_size):
        batch_num = (i // batch_size) + 1
        batch_programs = programs[i:i + batch_size]
        batches.append({
            'batch_num': batch_num,
            'total_batches': total_batches,
            'programs': batch_programs
        })
    
    print(f"📦 Total batches: {total_batches}")
    print(f"🚀 Starting PARALLEL processing with {min(4, total_batches)} concurrent workers...")
    print("=" * 80)
    
    # Lock for thread-safe printing
    print_lock = threading.Lock()
    
    def process_batch_parallel(batch_info):
        """Process a single batch with thread-safe output"""
        batch_num = batch_info['batch_num']
        total_batches = batch_info['total_batches']
        batch_programs = batch_info['programs']
        
        with print_lock:
            print(f"\n📦 BATCH {batch_num}/{total_batches} [STARTED]")
            print(f"Programs in this batch: {len(batch_programs)}")
            print("-" * 60)
        
        try:
            batch_results = extract_international_fees_batch(services, strategy, batch_programs)
            
            with print_lock:
                print(f"\n✅ BATCH {batch_num}/{total_batches} [COMPLETED]")
                print(f"Results from this batch: {len(batch_results)}")
            
            return batch_results
        except Exception as e:
            with print_lock:
                print(f"\n❌ BATCH {batch_num}/{total_batches} [FAILED]: {e}")
            return []
    
    # Process batches in parallel
    all_results = []
    max_workers = min(4, total_batches)  # Use up to 4 concurrent workers, matching the message printed above
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all batch jobs
        future_to_batch = {executor.submit(process_batch_parallel, batch): batch for batch in batches}
        
        # Collect results as they complete
        completed = 0
        for future in as_completed(future_to_batch):
            completed += 1
            batch_results = future.result()
            all_results.extend(batch_results)
            print(f"\n📊 Progress: {completed}/{total_batches} batches completed")
    
    print("\n" + "=" * 80)
    print(f"\n🎉 ALL PROGRAMS PROCESSED SUCCESSFULLY IN PARALLEL!")
    print(f"Total results: {len(all_results)}")
    
    # Save final results
    final_output_data = save_results_to_json(all_results, Config.UNIVERSITY_NAME, "international_fees_all_programs_parallel.json")
    print(f"✅ Results saved!")
else:
    print("\n💡 Nothing to process: no programs are loaded, or all of them fit in the test batch.")

print("\n💡 Parallel processing enabled - all batches are processed concurrently!")


\nüîÑ Processing ALL 85 programs in PARALLEL batches...
üì¶ Total batches: 5
üöÄ Starting PARALLEL processing with 4 concurrent workers...
\nüì¶ BATCH 1/5 [STARTED]
Programs in this batch: 20
------------------------------------------------------------
\nüîç Searching international student fees for batch of 20 programs
Programs: Foundation in Science (Leading to Medicine) (English), Foundation in Science (Lead to Pharmacy) (English), Foundation in Allied Health Science (English)...
üîç Search query: international student tuition fees foreign overseas global deposit application ("Foundation in Scien...
üìö Performing vector search...
\nüì¶ BATCH 2/5 [STARTED]
Programs in this batch: 20
------------------------------------------------------------
\nüîç Searching international student fees for batch of 20 programs
Programs: Bachelor of International Business Management (Honours) (English), Bachelor of Occupational Safety and Health (Honours) (English), Bachelor of Occupational Sa
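
The batch count reported above follows from ceiling division: 85 programs in batches of 20 give 5 batches, the last holding 5 programs. A small sketch of that arithmetic follows. `make_batches` is a hypothetical helper, the program names are placeholders, and the batch size of 20 is inferred from the output rather than read from `Config`.

```python
from typing import Dict, List


def make_batches(programs: List[str], batch_size: int) -> List[Dict]:
    """Split programs into consecutive batches, numbering them from 1."""
    total_batches = (len(programs) + batch_size - 1) // batch_size  # ceiling
    return [
        {
            "batch_num": i // batch_size + 1,
            "total_batches": total_batches,
            "programs": programs[i:i + batch_size],
        }
        for i in range(0, len(programs), batch_size)
    ]


programs = [f"Program {n}" for n in range(1, 86)]  # 85 placeholder names
batches = make_batches(programs, batch_size=20)
print(len(batches), len(batches[-1]["programs"]))  # prints "5 5"
```

Because `as_completed` yields futures in completion order, `all_results` in the cell above is not guaranteed to follow the original program order; keeping `batch_num` alongside each result and sorting on it afterwards would restore that order.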

In [160]:
%pip install -qU langchain-cerebras

Note: you may need to restart the kernel to use updated packages.





In [1]:
from langchain_cerebras import ChatCerebras

llm = ChatCerebras(
    model="qwen-3-32b",
    # other params...
)

In [2]:
import getpass
import os

if "CEREBRAS_API_KEY" not in os.environ:
    os.environ["CEREBRAS_API_KEY"] = getpass.getpass("Enter your Cerebras API key: ")

In [3]:
messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
ai_msg

AIMessage(content='<think>\nOkay, the user wants to translate "I love programming." into French. Let me start by breaking down the sentence. "I love" is straightforward. In French, that\'s "J\'adore" or "Je love" isn\'t used. Programming is "programming" which in French can be "la programmation" or sometimes "le développement" depending on the context. The user\'s sentence is pretty general, so "la programmation" is probably safer. Now, combining them: "J\'adore la programmation." Wait, let me check the contraction. Since it\'s "Je adore", but in French, "j\'adore" (with the accent) is correct because of the vowel sound. No other articles are needed. The sentence structure in French is subject + verb + noun. So the translation should be "J\'adore la programmation." I should make sure there\'s no typo. \'Adore\' is spelled correctly here. Yeah, that\'s right. The user might be looking for a simple, direct translation, so I don\'t need to add anything extra. Just the translated sentence