Today I was fixing an issue with my assistant having a hard time accessing web search. I fixed that, but I was not entirely satisfied with the result. Browsh is great for that use case, but I am using StartPage on my PC and on my phone. The ability to see similar, or even the same, web search results for me and my assistant would be great.
How hard can it be to actually fetch search results via the CLI? Actually, it is not that hard. I was able to figure it out, though it requires a very specific set of parameters and headers being sent to a very specific address.
Inspecting a StartPage search, I noticed that there is a POST HTTP call being made to https://www.startpage.com/sp/search with form parameters like below:

```
query={query}&t=device&lui=polski&sc=mgAAkVBCMhaz20&cat=web&abd=0&abe=0&qsr=all&qadf=moderate&with_date=
```
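Instead of gluing that body string together by hand, the same form body can be assembled with `urllib.parse.urlencode` — a minimal sketch, where the `sc` value is just the one captured from my browser session and may not be stable:

```python
from urllib.parse import urlencode


def build_body(query: str) -> str:
    # Same parameters as captured from the browser; the sc value is
    # session-specific, so treat it as a placeholder.
    params = {
        "query": query,
        "t": "device",
        "lui": "polski",
        "sc": "mgAAkVBCMhaz20",
        "cat": "web",
        "abd": "0",
        "abe": "0",
        "qsr": "all",
        "qadf": "moderate",
        "with_date": "",
    }
    # urlencode percent-encodes values and joins them with '&',
    # turning spaces into '+' (same as quote_plus)
    return urlencode(params)


print(build_body("nvidia blackwell cena"))
# → query=nvidia+blackwell+cena&t=device&lui=polski&...
```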
I was able to recreate the result of this call in the Rider HTTP client with the same headers as in the browser. This means it is generally OK to run it programmatically. I am a bit worried about the sc form parameter, but StartPage does not require you to log in or anything like that, so it is most probably a settings cookie, if anything.
At the same time, I was playing with Junie, the JetBrains code assistant. I asked it to write a script for me that would call that endpoint and then extract the data from the HTML. It was quite capable, but I am a bit sad that I was unable to force it to work with my local model instead.
This is what it was able to come up with in the end:
```python
import urllib.request
import urllib.parse
import re
import html
import json
import sys


def clean_html(text):
    # Remove style blocks
    text = re.sub(r'<style.*?>.*?</style>', '', text, flags=re.DOTALL)
    # Remove all HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Decode HTML entities
    text = html.unescape(text)
    # Trim whitespace
    text = text.strip()
    return text


def extract_links(query):
    url = "https://www.startpage.com/sp/search"
    headers = {
        "Host": "www.startpage.com",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:140.0) Gecko/20100101 Firefox/140.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "pl,en-US;q=0.7,en;q=0.3",
        "Referer": "https://www.startpage.com/",
        "Content-Type": "application/x-www-form-urlencoded",
        "Origin": "https://www.startpage.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Startpage-Extension": "ext-ff",
        "Startpage-Extension-Version": "2.0.3",
        "Startpage-Extension-Segment": "startpage.defaultffx",
        "Priority": "u=0, i",
        "TE": "trailers"
    }
    # URL-encode the query
    query_encoded = urllib.parse.quote_plus(query)
    body = f"query={query_encoded}&t=device&lui=polski&sc=mgAAkVBCMhaz20&cat=web&abd=0&abe=0&qsr=all&qadf=moderate&with_date="
    data = body.encode('utf-8')
    req = urllib.request.Request(url, data=data, headers=headers, method='POST')
    try:
        with urllib.request.urlopen(req) as response:
            content = response.read().decode('utf-8', errors='replace')
    except Exception as e:
        print(f"Error fetching page: {e}")
        return []

    results = []
    # We want to find result containers to match title, link and description correctly.
    # Looking at the HTML, each result seems to be in a div with class "result",
    # but regex might be easier if we look for the title link and then the following description.
    # Pattern to find the title link and its content
    title_pattern = re.compile(r'<a[^>]+class=[^>]*result-title[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', re.DOTALL)
    # Pattern to find the description after the title link
    desc_pattern = re.compile(r'<p[^>]+class=[^>]*description[^>]*>(.*?)</p>', re.DOTALL)
    # Find all occurrences of result-title links
    for match in title_pattern.finditer(content):
        link = match.group(1)
        title_raw = match.group(2)
        title = clean_html(title_raw)
        # Look for the description starting from the end of the current title match
        search_start = match.end()
        desc_match = desc_pattern.search(content, search_start)
        description = ""
        if desc_match:
            # Check if this description belongs to this result (not the next one).
            # Typically descriptions follow titles closely.
            # We can also check if there's another result-title between them.
            next_title_match = title_pattern.search(content, search_start)
            if not next_title_match or desc_match.start() < next_title_match.start():
                description = clean_html(desc_match.group(1))
        results.append({
            "link": link,
            "title": title,
            "description": description
        })
    return results


if __name__ == "__main__":
    if len(sys.argv) > 1:
        search_query = sys.argv[1]
    else:
        search_query = "nvidia blackwell cena"
    links = extract_links(search_query)
    print(json.dumps(links, indent=2, ensure_ascii=False))
```
A bit messy, I would say, but the flow is as below:
- read the HTML from the endpoint
- find `a` tags with the `result-title` class
- extract the title of the link and the address of the link
- extract the description from the `p` tag with the `description` class
- glue it all together in JSON
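The extraction steps above can be sanity-checked offline, without hitting the network, by running the same patterns against a small hand-made snippet. The HTML below is made up to match what the regexes expect, not real StartPage markup:

```python
import re
import html

# Same patterns and cleanup as in the script above
title_pattern = re.compile(r'<a[^>]+class=[^>]*result-title[^>]*href=["\']([^"\']+)["\'][^>]*>(.*?)</a>', re.DOTALL)
desc_pattern = re.compile(r'<p[^>]+class=[^>]*description[^>]*>(.*?)</p>', re.DOTALL)


def clean_html(text):
    text = re.sub(r'<[^>]+>', '', text)   # strip tags
    return html.unescape(text).strip()    # decode entities, trim


# Made-up snippet imitating the structure the regexes look for
sample = '''
<div class="result">
  <a class="result-title" href="https://example.com/gpu">NVIDIA <b>Blackwell</b> &amp; prices</a>
  <p class="description">Latest <em>Blackwell</em> pricing news.</p>
</div>
'''

m = title_pattern.search(sample)
print(m.group(1))              # https://example.com/gpu
print(clean_html(m.group(2)))  # NVIDIA Blackwell & prices
d = desc_pattern.search(sample)
print(clean_html(d.group(1)))  # Latest Blackwell pricing news.
```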
After that, the chatbot sends the results to the chat, where they can be further refined or used for something else.

