BOSS294/Atlas-Engine

Atlas Engine

Business lead generation, lead intelligence, and local business data enrichment platform

Python desktop app + scraping engine + local database + browser signals

[THE SYSTEM IS STILL UNDER DEVELOPMENT]

Overview

Atlas Engine is a desktop lead generation and lead enrichment system that collects business data from public sources, scores lead quality, and stores everything in a local SQLite database for review and export. It targets local businesses, enriches contact data (phone, email, website, socials), and provides a lead intelligence UI for approval, rejection, and notes. This repository also includes multiple version folders that show the evolution from API-based discovery to public data scraping and a Streamlit-based UI.


Current Working Version (Root)

This is the active version in the root folder (main.py, app/, extension/).

What it includes (based on current code):

  • Desktop UI (PySide6) with login screen and admin credential update.
  • Lead discovery engine that builds niche/city/area query sets.
  • Public web discovery using Google search HTML + DuckDuckGo + JustDial category pages.
  • Website crawling & extraction (JSON-LD + HTML text patterns for email/phone/socials/address).
  • Lead scoring & tiers (Hot / Warm / Cold) with configurable filters.
  • Website health checks (HTTPS, missing title/description, thin content, block detection).
  • Local SQLite database (atlas.sqlite3) for leads, statuses, notes, and page signals.
  • Review tools: approve/reject, copy JSON, open source page, export CSV.
  • Local HTTP bridge on 127.0.0.1:8765 to receive browser extension signals.
  • Browser extension that detects scrapable business pages and reports signals.
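
The website crawling and extraction bullet above can be illustrated with a minimal sketch. It assumes only the standard library; the class name `JsonLdCollector`, the function `extract_contacts`, and both regular expressions are hypothetical and are not taken from the Atlas Engine code. Matching on raw HTML like this is deliberately crude and will produce false positives on real pages.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical patterns; the real extractor's rules are not shown in this README.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


class JsonLdCollector(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed JSON-LD instead of failing the whole page.


def extract_contacts(html):
    """Return JSON-LD items plus regex-matched emails and phones from raw HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    return {
        "json_ld": collector.items,
        "emails": sorted(set(EMAIL_RE.findall(html))),
        "phones": sorted(set(PHONE_RE.findall(html))),
    }
```

Structured JSON-LD data is usually the more reliable signal when a page provides it, so a real extractor would likely prefer those fields and fall back to text patterns only when they are absent.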

How Atlas Engine Works

  1. User defines niche + location in the UI.
  2. Query generator expands search phrases and sources.
  3. Discovery engine gathers candidate URLs from public search & directories.
  4. Scraper & extractor parse pages, JSON-LD, and link signals.
  5. Lead scoring calculates a quality score and tier.
  6. Storage writes leads and signals to SQLite.
  7. Review & export in the dashboard: approve, reject, add notes, export CSV.
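
Step 2 above (query expansion) could look roughly like the following sketch. The phrase templates and the function name `build_queries` are assumptions made for illustration; the README does not document the wording the actual generator uses.

```python
from itertools import product

# Hypothetical phrase templates; the real generator's wording is not documented here.
TEMPLATES = (
    "{niche} in {area} {city}",
    "best {niche} near {area} {city}",
    "{niche} {city} contact number",
)


def build_queries(niche, city, areas):
    """Expand one niche/city pair into de-duplicated search phrases per area."""
    queries = []
    seen = set()
    for area, template in product(areas or [""], TEMPLATES):
        # Splitting and re-joining collapses the gap left by an empty area.
        query = " ".join(template.format(niche=niche, city=city, area=area).split())
        key = query.lower()
        if key not in seen:
            seen.add(key)
            queries.append(query)
    return queries
```

For example, `build_queries("dentist", "Pune", ["Baner", "Kothrud"])` yields five phrases in this sketch: two area-specific phrases per area, plus the shared "dentist Pune contact number", which is kept only once.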

System Architecture Diagrams

1) High-Level Architecture

flowchart LR
  UI[PySide6 Desktop UI] --> Engine[Scrape Worker]
  Engine --> Sources[Public Sources]
  Sources -->|Google HTML + DuckDuckGo| WebSearch
  Sources -->|JustDial category pages| Directory
  Engine --> Extract[HTML + JSON-LD Extractors]
  Extract --> Score[Lead Scoring]
  Score --> DB[(SQLite: atlas.sqlite3)]
  DB --> UI
  Extension[Chrome Extension] --> Bridge[Local Bridge :8765]
  Bridge --> DB

2) Data Pipeline Flow

sequenceDiagram
  participant User
  participant UI as Atlas UI
  participant Engine as Scrape Worker
  participant Web as Public Web
  participant DB as SQLite

  User->>UI: Enter niche/city/area + filters
  UI->>Engine: Start scraping
  Engine->>Web: Search + directory queries
  Web-->>Engine: Candidate pages
  Engine->>Web: Crawl target pages
  Web-->>Engine: HTML + JSON-LD
  Engine->>Engine: Extract & score
  Engine->>DB: Save leads + status
  UI->>DB: Load dashboard & filters

Key Features (Current Version)

  • Multi-source discovery: Google search HTML, DuckDuckGo fallback, JustDial pages.
  • Contact enrichment: phones, emails, socials, website detection.
  • Lead scoring: quality scoring + Hot/Warm/Cold tiers.
  • Lead management UI: approve/reject, notes, quick actions.
  • Website health analysis: HTTPS checks, thin-content detection.
  • Local-first: no cloud dependency; data stays on the device.
  • Browser signals: extension detects business-like pages and sends signals to the app.
  • CSV export: built-in export for pipelines or CRM imports.
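
As an illustration of the lead scoring feature listed above, here is a minimal sketch of a weighted completeness score mapped onto Hot/Warm/Cold tiers. The `Lead` fields, the weights, and the thresholds are all hypothetical; the actual scoring rules and configurable filters are not described in this README.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    name: str
    phone: str = ""
    email: str = ""
    website: str = ""
    socials: tuple = ()


# Hypothetical weights and thresholds, chosen only for illustration.
WEIGHTS = {"phone": 30, "email": 30, "website": 25, "socials": 15}
HOT_THRESHOLD = 70
WARM_THRESHOLD = 40


def score_lead(lead):
    """Sum the weight of every contact field that is present (0 to 100)."""
    score = 0
    for field, weight in WEIGHTS.items():
        if getattr(lead, field):
            score += weight
    return score


def tier_for(score):
    """Map a numeric score onto the Hot / Warm / Cold tiers."""
    if score >= HOT_THRESHOLD:
        return "Hot"
    if score >= WARM_THRESHOLD:
        return "Warm"
    return "Cold"
```

Keeping the weights in a single dictionary is one way to make the scoring configurable without touching the scoring function itself.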

Version Folders & Differences

Comparison Table

| Folder | UI | Data Sources | Enrichment | Storage/Export | Notes |
|---|---|---|---|---|---|
| Root (Current) | PySide6 desktop + browser signals | Google HTML + DuckDuckGo + JustDial | JSON-LD + HTML scraping + website health | SQLite + CSV export | Includes local bridge + extension signals |
| V1_COLLECTTS_FULL_DATA | PySide6 desktop | Google Places API + Google CSE API + OSM (Overpass/Nominatim) | API + website crawl | SQLite + run tracking | Requires API keys; more API-driven |
| USA_SCRAPPER_VERSION_2 | PySide6 desktop | OpenStreetMap (Overpass + Nominatim) | Optional website crawl | SQLite + CSV/JSON | Focused on USA/Canada public data |
| USA_SCRAPPER_VERSION_3 | Streamlit web UI | OpenStreetMap (Overpass + Nominatim) | Website crawl + scoring | CSV + Excel | Dashboard filters, Streamlit export |

Key Differences vs V1

  • V1 = API-first (Google Places + CSE + OSM), requires keys and quotas.
  • Current = scraping-first (public search + JustDial), no API keys required, but higher risk of blocks.
  • V1 has run tracking + settings table; current has browser extension signals and website health checks.
  • Current adds review workflows (approve/reject, notes) and live dashboard widgets.

Problems, Risks & Disadvantages

  • Scraping fragility: Google/JustDial HTML changes can break extraction.
  • Anti-bot risk: search engines may throttle or block requests.
  • No proxy/rotation layer: large-scale scraping can fail or be rate-limited.
  • Local-only storage: no built-in cloud sync or multi-user access.
  • Security concern: default login (root / 1234) is weak unless changed.
  • Extension issue: popup.html references popup.js, but the file is missing, so popup UI cannot function fully.
  • Limited compliance tooling: no built-in consent, GDPR workflows, or audit logs.
  • No automated tests: regression risk when modifying scraping logic.
  • Platform bias: UI tested mainly on Windows; macOS/Linux may need tweaks.

Why This System

  • Local-first lead intelligence: keep sensitive lead data on your machine.
  • Rapid lead discovery: generate and score leads without paid APIs.
  • Modular pipeline: easy to swap or extend sources and extraction logic.
  • UI-driven workflow: validate, score, and export in one desktop app.

Who Can Benefit

  • Sales & lead gen teams building outbound lists.
  • Local marketing agencies targeting SMBs.
  • Business development reps needing quick prospect research.
  • Freelancers or growth teams without budget for paid lead APIs.
  • Researchers working with local business datasets.

Install & Run (Current Version)

# 1) Install dependencies (py is the Windows launcher; on macOS/Linux use your Python 3 interpreter, e.g. python3)
py -3.13 -m pip install -r requirements.txt

# 2) Run the app
py -3.13 main.py

Default login: root / 1234 (change in Settings after first run)


Running Older Versions

# V1
cd V1_COLLECTTS_FULL_DATA
pip install -r requirements.txt
python main.py

# USA Scrapper v2
cd USA_SCRAPPER_VERSION_2
pip install -r requirements.txt
python main.py

# USA Scrapper v3 (Streamlit)
cd USA_SCRAPPER_VERSION_3
pip install -r requirements.txt
streamlit run app.py

Browser Extension

The Chrome extension lives in extension/ and posts page signals to the local bridge.

  • Bridge endpoint: http://127.0.0.1:8765/page-signal
  • Purpose: detect business-like pages and mark them as scrapable

Note: the popup references popup.js which is missing, so the popup UI is currently incomplete.
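
To make the bridge described above more concrete, here is a minimal sketch of a local HTTP endpoint that accepts POSTed page signals. Only the address `127.0.0.1:8765` and the `/page-signal` path come from this README; the payload fields (`url`, `scrapable`), the in-memory `SIGNALS` list standing in for the database write, and all function and class names are assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

SIGNALS = []  # In-memory stand-in for the SQLite write (assumption).


def handle_page_signal(path, raw_body, store):
    """Validate one request and append the signal to `store`; returns (status, reply)."""
    if path != "/page-signal":
        return 404, {"ok": False, "error": "unknown path"}
    try:
        payload = json.loads(raw_body or b"{}")
    except (json.JSONDecodeError, UnicodeDecodeError):
        return 400, {"ok": False, "error": "invalid JSON"}
    # The "url" field is a hypothetical requirement for this sketch.
    if not isinstance(payload, dict) or "url" not in payload:
        return 400, {"ok": False, "error": "missing url"}
    store.append(payload)
    return 200, {"ok": True}


class PageSignalHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, reply = handle_page_signal(self.path, self.rfile.read(length), SIGNALS)
        body = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the console quiet in this sketch.


def make_bridge(host="127.0.0.1", port=8765):
    """Create (but do not start) the bridge server bound to the loopback address."""
    return HTTPServer((host, port), PageSignalHandler)
```

Calling `make_bridge().serve_forever()` would start listening. Binding to `127.0.0.1` rather than `0.0.0.0` keeps the endpoint unreachable from other machines, which fits the local-first design.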


Data, Privacy & Compliance

  • All leads are stored locally in SQLite.
  • Data is gathered from public sources; ensure compliance with website terms and local regulations.
  • You are responsible for ethical use, consent, and compliance (GDPR/CCPA/etc).
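
Because all lead data lives in a local SQLite file, it can be inspected and exported with standard tooling. The sketch below shows one way the store, review status, and CSV export could fit together. The `leads` table layout, the status values, and every function name are hypothetical; the real `atlas.sqlite3` schema is not documented in this README.

```python
import csv
import io
import sqlite3

# Hypothetical schema; the real atlas.sqlite3 layout is not documented here.
SCHEMA = """
CREATE TABLE IF NOT EXISTS leads (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    phone   TEXT,
    email   TEXT,
    status  TEXT NOT NULL DEFAULT 'pending',
    notes   TEXT NOT NULL DEFAULT ''
)
"""


def open_db(path=":memory:"):
    """Open a database (in memory by default) and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_lead(conn, name, phone="", email=""):
    cur = conn.execute(
        "INSERT INTO leads (name, phone, email) VALUES (?, ?, ?)",
        (name, phone, email),
    )
    conn.commit()
    return cur.lastrowid


def set_status(conn, lead_id, status, notes=""):
    """Record a review decision (approve/reject) with an optional note."""
    if status not in {"pending", "approved", "rejected"}:
        raise ValueError(f"unknown status: {status}")
    conn.execute(
        "UPDATE leads SET status = ?, notes = ? WHERE id = ?",
        (status, notes, lead_id),
    )
    conn.commit()


def export_csv(conn, status="approved"):
    """Return the leads with the given status as CSV text, header row first."""
    cur = conn.execute(
        "SELECT name, phone, email, status, notes FROM leads "
        "WHERE status = ? ORDER BY id",
        (status,),
    )
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return out.getvalue()
```

Parameterized queries (`?` placeholders) are used throughout so that scraped strings are never interpolated into SQL.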

SEO Keywords

Business lead generation, lead intelligence software, local business scraper, lead enrichment tool, B2B lead collection, business directory scraping, offline lead database, PySide6 lead management app.


License

Apache License 2.0. See LICENSE.
