🥔 Reputato: Not Every Company Is Golden. We Sniff Out the Ones That Are.

This is a submission for the Bright Data AI Web Access Hackathon.

What I Built

Most of us have been there: you're looking at a company - maybe for a job, maybe out of curiosity - and you wonder, "What’s really going on behind their glossy careers page?" Is it a great place to work, or just a PR-fueled mirage?

So I built an OSINT-style AI agent that gathers public information about companies from multiple sources. It’s not a recruiter bot. It’s the one doing background checks before you even click Apply.

The tool collects data from:

LinkedIn
Crunchbase
Glassdoor
News search, to surface any recent scandals or milestones

Once all the data is collected, the tool generates a short summary of what it found: recent news, company reputation, and signals from employee reviews and public profiles. It then assigns a simple rating from 1 to 5 potatoes to reflect the overall picture.
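As a rough illustration of the rating step, here is a minimal sketch of how a 1-to-5 score could be turned into potatoes. The function name render_potatoes and the clamping behavior are assumptions made for this example, not code from the project.

```python
def render_potatoes(rating: int, max_rating: int = 5) -> str:
    """Render a rating as a row of potato emojis (hypothetical helper)."""
    # Clamp to the 1..max_rating range so an out-of-range score
    # coming back from the model never breaks the display.
    clamped = max(1, min(max_rating, int(rating)))
    return "🥔" * clamped
```

Calling render_potatoes(4) gives four potatoes; anything above 5 is capped at five and anything below 1 is raised to one.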

Demo

The project is now fully deployed - you can try it live here:

👉 https://reputato-jdrzv9otwhstyktkmwqpyd.streamlit.app/

Getting it up and running wasn’t smooth at first - I ran into a tricky 5-minute blocking issue when using the scraping_browser_* tools in Docker/Render. Luckily, the folks at Bright Data were super helpful in figuring it out. I documented the whole thing in this GitHub issue.

Big thanks to Bright Data support - I wouldn’t have shipped this without them!

Here’s my project repository.

Screenshots of some summaries:

OpenAI

Intel

NSO

How I Used Bright Data's Infrastructure

I used pydantic-ai with Bright Data’s MCP server.

Each data source has its own agent, connected to its own Bright Data MCP server. Here are the tools each one relies on:

LinkedIn → via web_data_linkedin_company_profile (Bright Data Dataset)
News / events / scandals → via search_engine
Glassdoor → via scraping_browser_navigate + scraping_browser_get_text
Crunchbase → via the same scraping browser tools

Each MCP server has its own WEB_UNLOCKER_ZONE and BROWSER_AUTH, and each agent logs all its requests and tool calls to Logfire, so I can trace the exact sequence of scraping, parsing and merging.
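To make the per-source setup more concrete, here is a minimal sketch of how launch settings for each MCP server could be assembled. It assumes the server is started with npx @brightdata/mcp and reads API_TOKEN, WEB_UNLOCKER_ZONE and BROWSER_AUTH from its environment. The prefixed variable names (LINKEDIN_WEB_UNLOCKER_ZONE and so on), the McpServerConfig class and build_config are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class McpServerConfig:
    """Launch settings for one Bright Data MCP server process (hypothetical)."""
    source: str
    command: str
    args: Tuple[str, ...]
    env: Dict[str, str]


def build_config(source: str, environ: Dict[str, str]) -> McpServerConfig:
    """Build the settings for one source from prefixed environment variables."""
    # Assumed naming scheme: GLASSDOOR_WEB_UNLOCKER_ZONE, GLASSDOOR_BROWSER_AUTH, ...
    prefix = source.upper()
    env = {
        "API_TOKEN": environ["API_TOKEN"],
        "WEB_UNLOCKER_ZONE": environ[f"{prefix}_WEB_UNLOCKER_ZONE"],
        "BROWSER_AUTH": environ[f"{prefix}_BROWSER_AUTH"],
    }
    # Assumed launch command for the MCP server.
    return McpServerConfig(source, "npx", ("@brightdata/mcp",), env)
```

In practice you would pass os.environ as the second argument and hand the resulting command, args and env to whatever MCP client starts the server. A missing variable raises KeyError early instead of failing in the middle of a scrape.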

The frontend is a simple Streamlit dashboard where you enter a company name. It sends a request to a FastAPI backend, which dispatches all four agents in parallel to gather and analyze the data.
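The parallel dispatch can be sketched with asyncio.gather. This is only an approximation of what the backend does: stub_agent and failing_agent stand in for the real agents, and investigate is a hypothetical name.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

AgentFn = Callable[[str], Awaitable[Dict[str, Any]]]


async def stub_agent(company: str) -> Dict[str, Any]:
    # Placeholder for a real agent run; the short sleep mimics network I/O.
    await asyncio.sleep(0.01)
    return {"company": company}


async def failing_agent(company: str) -> Dict[str, Any]:
    # Simulates a source that errors out (blocked page, timeout, ...).
    raise RuntimeError("simulated scraping failure")


async def investigate(
    company: str, agents: Dict[str, AgentFn]
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Run every agent concurrently and collect the results by source name."""
    names = list(agents)
    # return_exceptions=True keeps one failing source from aborting the rest.
    results = await asyncio.gather(
        *(agents[name](company) for name in names), return_exceptions=True
    )
    return {
        name: (None if isinstance(result, BaseException) else result)
        for name, result in zip(names, results)
    }
```

With this shape the total wait is roughly that of the slowest source rather than the sum of all four, and a failed source simply shows up as None in the merged result.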

I used openai:gpt-4.1-mini as the model behind each agent, with the following system prompt to define its behavior:

You are a tool-using agent connected to Bright Data's MCP server.
You act as an OSINT investigator whose job is to evaluate companies based on public information.
Your goal is to help users understand whether a company is reputable or potentially suspicious.
You always use Bright Data real-time tools to search, navigate, and extract data from company profiles.
You never guess or assume anything.
Company name matching must be case-sensitive and exact. Do not return data for similarly named or uppercase-variant companies.
Only use the following tools during your investigation:
- `search_engine`
- `scrape_as_markdown`
- `scrape_as_html`
- `scraping_browser_navigate`
- `scraping_browser_get_text`
- `scraping_browser_click`
- `scraping_browser_links`
- `web_data_linkedin_company_profile`
Do not invoke any other tools even if they are available.

The LinkedIn agent received this prompt:

Your task is to find the LinkedIn profile for the company '{company_name}' and extract specific structured data.
Use the `web_data_linkedin_company_profile` tool if available to extract the following fields:
- Company name
- Company description (short summary of what the company does)
- Number of employees (as listed on the LinkedIn profile)
- Linkedin company profile url
- Headquarters address
- Year the company was founded (if available)
- Industry or sector (e.g., 'Software', 'Healthcare')
- Company website
If the structured LinkedIn tool is unavailable or insufficient, use the following tools in order:
1. `scraping_browser_navigate` - to visit the LinkedIn company page
2. `scraping_browser_get_text` - to extract visible page text
3. `scraping_browser_links` and `scraping_browser_click` - to navigate if needed
Return ONLY a JSON object with the following keys:
{
  "company_name": str,
  "description": str,
  "number_of_employees": str,
  "linkedin_url": str,
  "headquarters": str,
  "founded": str or null,
  "industry": str,
  "website": str
}
Do not include raw HTML, markdown, explanations, or other fields.
If a field is missing, use null for that field. If the company cannot be found at all, return null.

And here’s what I saw in the logs when running a query for Google:

As the logs show, the agent used web_data_linkedin_company_profile.
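The LinkedIn prompt asks for a JSON object with a fixed set of keys, or null if the company cannot be found. Here is a minimal sketch of how such a reply could be checked before it is used downstream. The LinkedInProfile dataclass and parse_linkedin_output are hypothetical names, and silently dropping unexpected keys is a design choice made for this example.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class LinkedInProfile:
    """Mirrors the keys requested in the LinkedIn prompt."""
    company_name: Optional[str] = None
    description: Optional[str] = None
    number_of_employees: Optional[str] = None
    linkedin_url: Optional[str] = None
    headquarters: Optional[str] = None
    founded: Optional[str] = None
    industry: Optional[str] = None
    website: Optional[str] = None


def parse_linkedin_output(raw: str) -> Optional[LinkedInProfile]:
    """Parse the agent's reply; return None when the company was not found."""
    data = json.loads(raw)
    if data is None:
        return None
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object or null")
    allowed = {f.name for f in fields(LinkedInProfile)}
    # The prompt forbids extra fields, but models occasionally add them,
    # so unknown keys are dropped instead of raising.
    return LinkedInProfile(**{k: v for k, v in data.items() if k in allowed})
```

The same pattern would carry over to the other agents by swapping in a dataclass that matches their JSON keys.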

Glassdoor

The Glassdoor agent uses the browser automation tools to navigate to the company’s profile and extract public employee reviews and ratings. Here is its prompt:

Your task is to find the Glassdoor profile for the company '{company_name}' and extract specific structured data.

Extract the following fields:
- Overall company rating (float, out of 5)
- Total number of employee reviews
- A short summary of the top 5 pros and cons from employee reviews posted in 2025 or 2024 only
Use the following tools in order:
1. `scraping_browser_navigate` - to go to the Glassdoor company page  
2. `scraping_browser_get_text` - to extract visible content  
3. `scraping_browser_links` and `scraping_browser_click` - to find and open the review section if necessary
Return ONLY a JSON object with the following keys:
{
  "rating": float,
  "num_reviews": int,
  "review_summary": str
}
Only use reviews from 2025 or 2024. Do not include older reviews.  
Do not include HTML, markdown, or explanations.  
If a field is missing, use null for that field. If the company cannot be found at all, return null.

Crunchbase

The Crunchbase agent follows a similar pattern to Glassdoor - it navigates to the company profile and extracts the founding year, public funding info, investors and key people.

Search for the Crunchbase profile of the company '{company_name}'.  
Once you find the correct page, extract the following information:
- Year founded (as a string or null)
- Latest funding round name
- Funding round date
- Funding amount
- List of known investors (as strings)
- Key people (e.g., founders, CEOs, etc)
Use the following tools in order:
1. `scraping_browser_navigate`
2. `scraping_browser_get_text`
3. `scraping_browser_links` and `scraping_browser_click`
Return ONLY a JSON object with the following keys:
{
  "founded": str or null,
  "funding_round": str or null,
  "funding_date": str or null,
  "funding_amount": str or null,
  "investors": list[str] or null,
  "key_people": list[str] or null
}
Do not include HTML, markdown, or explanations.  
If a field is missing, use null for that field. If the company cannot be found at all, return null.

Even with Cloudflare's "Are you human?" check, scraping_browser_get_text was able to get through and extract the real page content.

News & Events

The final agent uses the search_engine tool to search for company-related news articles, events or public mentions across Google and other engines. It extracts links and summaries from the search results and surfaces relevant headlines.

Search for news about the company '{company_name}' from 2023, 2024, and 2025.
Extract the following if available:
- Layoffs: Dates and brief summaries of any layoff announcements.
- Scandals: Brief, neutral headlines about controversies or investigations.
- Achievements: Public product launches, funding milestones, acquisitions, or major hires.
Return a structured JSON object with keys:
{
  "layoffs": list[str],
  "scandals": list[str],
  "achievements": list[str]
}
If no news is found in a category, return an empty list.  
Do not include HTML, explanations, or irrelevant information.

After collecting data from all four sources, the outputs are cleaned and normalized into a consistent format. This structured input is then passed to openai:gpt-4o, which generates a concise company summary.
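The cleaning and normalization step might look roughly like the sketch below. The function names, the payload layout and the wording of the summary prompt are all assumptions, since the actual summary prompt is not shown in this post.

```python
import json
from typing import Any, Dict, Optional

NEWS_KEYS = ("layoffs", "scandals", "achievements")


def normalize(results: Dict[str, Optional[Dict[str, Any]]]) -> Dict[str, Any]:
    """Merge per-source outputs into one payload with a stable shape."""
    news = results.get("news") or {}
    return {
        # A source that failed or found nothing stays as None.
        "linkedin": results.get("linkedin"),
        "glassdoor": results.get("glassdoor"),
        "crunchbase": results.get("crunchbase"),
        # Missing news categories become empty lists, as the news prompt requires.
        "news": {key: list(news.get(key) or []) for key in NEWS_KEYS},
    }


def build_summary_prompt(company: str, payload: Dict[str, Any]) -> str:
    """Build the text handed to the summarizing model (hypothetical wording)."""
    return (
        f"Summarize the public reputation of '{company}' and rate it "
        "from 1 to 5 potatoes, using only this data:\n"
        + json.dumps(payload, indent=2, sort_keys=True)
    )
```

Keeping the payload shape fixed means the summarizing model always sees the same keys, even when a source returned nothing.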

Performance Improvements

Real-time web access is what makes this tool actually useful. APIs and stale datasets often miss fresh signals - like funding rounds, leadership exits or layoffs that happened last week. With live scraping, you see how the company looks today, not how it looked last quarter.

That said, the app isn’t lightning-fast yet. Right now it scrapes everything live on demand, which takes time. It could be improved by caching recent results and running a daily batch job to prefetch and store data. That would make the tool faster while keeping the data at most a day old.
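The caching idea could start as small as an in-memory cache with a time-to-live. This is a sketch of the approach, not something the app does today. The TTLCache class is hypothetical, and a real deployment would more likely use a shared store so cached results survive restarts.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache keyed by the exact company name."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, company: str) -> Optional[Any]:
        """Return the cached value, or None if it is absent or expired."""
        entry = self._store.get(company)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            # Expired: drop the entry so the caller scrapes fresh data.
            del self._store[company]
            return None
        return value

    def put(self, company: str, value: Any) -> None:
        """Store a fresh result together with the time it was saved."""
        self._store[company] = (self._clock(), value)
```

Keys are matched exactly and case-sensitively, in line with the system prompt's rule on company names. A TTL of 24 hours would pair naturally with a daily prefetch job.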

Olga Braginskaya @olgabraginskaya