Add web viewer feature for browsing downloaded messages #23

Open
jakemuk wants to merge 16 commits into tsale:main from jakemuk:main
Conversation

jakemuk commented Nov 11, 2025

Overview

This PR adds a modern web-based viewer that provides a Telegram-style interface for browsing and searching downloaded Telegram messages. The viewer makes it easier to analyze collected intelligence from Telegram channels.

Features Added

Web Viewer Interface

  • 📱 Telegram-style UI - Beautiful dark theme matching Telegram's native design
  • 🔄 Dynamic Chat Discovery - Automatically detects and displays all chats from the Downloads folder
  • 📝 Message Display - Shows message text, metadata, entities (URLs, emails, mentions), and forwarded messages
  • 🔍 Search Functionality - Real-time search through messages by text, sender name, or forwarded content with visual highlighting
  • Infinite Scroll - Automatically loads more messages as you scroll (100 messages per page)
  • 🎨 Responsive Design - Works seamlessly on desktop and mobile devices
  • 💾 Performance Optimized - Caching and pagination for fast loading of large message files
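
The search feature is described as matching message text, sender name, and forwarded content. A minimal Python sketch of such a filter is shown here for illustration. The field names `text`, `from_user`, `forward_from_chat`, `username`, `first_name`, and `title` are assumptions modeled on Pyrogram-style message dumps and are not taken from this PR's code.

```python
def message_matches(message: dict, query: str) -> bool:
    """Return True if the query occurs in the text, sender, or forward source."""
    needle = query.casefold()
    if not needle:
        return True  # an empty query matches every message

    sender = message.get("from_user") or {}          # assumed field name
    forwarded = message.get("forward_from_chat") or {}  # assumed field name
    haystacks = [
        message.get("text"),
        sender.get("username"),
        sender.get("first_name"),
        forwarded.get("title"),
        forwarded.get("username"),
    ]
    # Only string fields are searched; missing or non-string values are skipped.
    return any(needle in value.casefold() for value in haystacks if isinstance(value, str))


messages = [
    {"id": 1, "text": "Deploy finished", "from_user": {"username": "alice"}},
    {"id": 2, "text": "ok", "from_user": {"first_name": "Bob"}},
    {"id": 3, "text": "fwd", "forward_from_chat": {"title": "News Channel"}},
]
hits = [m["id"] for m in messages if message_matches(m, "news")]
```

Case folding on both sides keeps the match case-insensitive without building a regular expression from user input.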

Integration

  • Added menu option 8 to TeleGatherer.py to launch the web viewer directly
  • Viewer can also be launched manually: python web_viewer.py
  • Automatically opens browser to http://localhost:5000

Technical Details

New Files

  • web_viewer.py - Flask backend server with REST API endpoints
  • web_viewer_static/index.html - Main HTML page
  • web_viewer_static/styles.css - Telegram-style CSS styling
  • web_viewer_static/app.js - Frontend JavaScript with infinite scroll and search

Modified Files

  • TeleGatherer.py - Added menu option 8 to launch web viewer
  • helpers/TeleViewer.py - Fixed UTF-8 encoding issues for Windows compatibility (handles emojis and special characters)
  • requirements.txt - Added Flask==3.0.0 and flask-cors==4.0.0
  • README.md - Added comprehensive web viewer documentation
  • .gitignore - Added Downloads folder exclusion

API Endpoints

  • GET /api/chats - List all available chats
  • GET /api/chats/<chat_id>/messages - Get messages with pagination
  • GET /api/chats/<chat_id>/info - Get chat information
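
The messages endpoint is described as paginated, 100 messages at a time. As an illustration only, the slicing behind one page request might look like the sketch below. The parameter names `page` and `per_page` and the response keys are hypothetical, since the PR description does not list them.

```python
def paginate(messages: list, page: int = 1, per_page: int = 100) -> dict:
    """Return one page of messages plus the bookkeeping a client needs to scroll."""
    page = max(page, 1)          # treat page 0 or negative values as the first page
    per_page = max(per_page, 1)  # avoid empty or negative page sizes
    start = (page - 1) * per_page
    return {
        "messages": messages[start:start + per_page],
        "page": page,
        "per_page": per_page,
        "total": len(messages),
        "has_more": start + per_page < len(messages),
    }


result = paginate(list(range(250)), page=3)
```

A `has_more` flag of this kind lets an infinite-scroll client stop requesting pages without a separate count call.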

Key Improvements

  1. Smart Chat Grouping - Messages are grouped by actual chat ID (from message data), not folder names, so messages from different chats stored in the same folder are displayed separately

  2. Search Capabilities - Search across:

    • Message text content
    • Sender usernames and first names
    • Forwarded message sources
  3. Performance -

    • Message caching based on file modification time
    • Efficient JSON parsing for concatenated message objects
    • Pagination to load 100 messages at a time
  4. Windows Compatibility - Fixed encoding issues to properly handle emojis and non-ASCII characters on Windows
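
Caching keyed on file modification time can be sketched as follows. This is an assumption about the general shape of such a cache, not the code in `web_viewer.py`; the names `load_cached` and `_cache` are made up for the example.

```python
import os

# Maps a file path to (modification time, parsed data). Hypothetical module-level cache.
_cache = {}


def load_cached(path, loader):
    """Return loader(path), re-running the loader only when the file's mtime changes."""
    mtime = os.path.getmtime(path)
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]  # file unchanged since the last load
    data = loader(path)
    _cache[path] = (mtime, data)
    return data
```

Comparing the stored mtime on each request costs one `stat` call, which is far cheaper than re-parsing a large log file.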

Usage

  1. Download messages using TeleGatherer.py (option 6)
  2. Launch web viewer:
    • From menu: Run TeleGatherer.py and select option 8
    • Manually: Run python web_viewer.py
  3. Open browser to http://localhost:5000
  4. Select a chat from the sidebar to view messages
  5. Use the search bar to find specific messages

Dependencies

  • Flask==3.0.0
  • flask-cors==4.0.0

Notes

  • No breaking changes to existing functionality
  • Web viewer works with existing Downloads folder structure
  • All existing features remain unchanged

- Add Flask-based web viewer with Telegram-style UI
- Implement infinite scroll and search functionality
- Add menu option 8 to TeleGatherer.py to launch viewer
- Fix UTF-8 encoding issues for Windows compatibility
- Update documentation and requirements
Removed duplicate downloads directory exclusion

Copilot AI left a comment

Pull Request Overview

This PR adds a web-based viewer for browsing downloaded Telegram messages with a Telegram-style dark UI. The feature includes a Flask backend API, responsive frontend with search and infinite scroll, and integration into the main TeleGatherer menu.

Key Changes

  • New web viewer with Flask REST API for serving messages with pagination and caching
  • Telegram-style responsive UI with search functionality and infinite scroll
  • UTF-8 encoding fixes for Windows compatibility with emojis and special characters

Reviewed Changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 26 comments.

| File | Description |
| --- | --- |
| web_viewer_static/styles.css | Telegram-themed CSS with dark color scheme and responsive layout |
| web_viewer_static/index.html | Single-page HTML structure for the viewer interface |
| web_viewer_static/app.js | Frontend JavaScript handling chat loading, search, and infinite scroll |
| web_viewer.py | Flask backend with API endpoints for chats and messages, includes caching |
| TeleGatherer.py | Added menu option 8 to launch web viewer via subprocess |
| helpers/TeleViewer.py | Fixed UTF-8 encoding for file operations |
| requirements.txt | Added Flask 3.0.0 and flask-cors 4.0.0 dependencies |
| .gitignore | Added log file exclusion for web viewer static directory |
| README.md | Added web viewer documentation and usage instructions |

Comment thread web_viewer_static/app.js
Comment on lines +36 to +39
chatItem.innerHTML = `
<div class="chat-item-name">${escapeHtml(chat.name)}</div>
<div class="chat-item-meta">${chat.message_count || 0} messages</div>
`;

Copilot AI Nov 19, 2025

The HTML is constructed using string concatenation with user-supplied data (chat.name). While escapeHtml() is used here, this pattern is error-prone. Consider using textContent property or DOM manipulation methods to safely set the chat name instead of innerHTML.

Suggested change
- chatItem.innerHTML = `
-     <div class="chat-item-name">${escapeHtml(chat.name)}</div>
-     <div class="chat-item-meta">${chat.message_count || 0} messages</div>
- `;
+ const chatNameDiv = document.createElement('div');
+ chatNameDiv.className = 'chat-item-name';
+ chatNameDiv.textContent = chat.name;
+ const chatMetaDiv = document.createElement('div');
+ chatMetaDiv.className = 'chat-item-meta';
+ chatMetaDiv.textContent = `${chat.message_count || 0} messages`;
+ chatItem.appendChild(chatNameDiv);
+ chatItem.appendChild(chatMetaDiv);

Copilot uses AI. Check for mistakes.
Comment thread web_viewer_static/app.js
if (!searchQuery) return html;

const query = escapeHtml(searchQuery).toLowerCase();
const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')})`, 'gi');

Copilot AI Nov 19, 2025

[nitpick] The regex pattern replace(/[.*+?^${}()|[\]\\]/g, '\\$&') is used to escape regex special characters, but the character class is missing escaping for the closing bracket ]. While this works in practice due to how JavaScript handles it, it's clearer to escape it as \\] for consistency and clarity.

Suggested change
- const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')})`, 'gi');
+ const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\\]]/g, '\\$&')})`, 'gi');
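
For comparison, the same escape-then-highlight step can be sketched in Python, where `re.escape` covers every metacharacter and no hand-written character class is needed. This is an illustration of the technique, not code from the PR. The `<mark>` wrapper is an assumption, since the viewer's actual highlight markup is not shown in this thread.

```python
import html
import re


def highlight(text: str, query: str) -> str:
    """HTML-escape the text, then wrap case-insensitive matches of query in <mark>."""
    safe_text = html.escape(text)
    if not query:
        return safe_text
    # Escape the query the same way as the text so both sides compare like for like,
    # then neutralize regex metacharacters before compiling.
    pattern = re.compile(re.escape(html.escape(query)), re.IGNORECASE)
    return pattern.sub(lambda m: f"<mark>{m.group(0)}</mark>", safe_text)


example = highlight("Price is $5 (approx.)", "$5 (")
```

One known limitation of matching against already-escaped text is that a query such as `lt` can match inside an entity like `&lt;`.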

Comment thread web_viewer.py
Comment on lines +417 to +418
app.run(debug=True, port=5000, host='0.0.0.0')

Copilot AI Nov 19, 2025

The server runs with debug=True and host='0.0.0.0' in production. This is a significant security risk as it:

  1. Exposes the Werkzeug debugger which can execute arbitrary code
  2. Binds to all network interfaces, making it accessible from outside the local machine
  3. Exposes sensitive file paths and code

For production use, set debug=False and host='127.0.0.1' (localhost only), or make these configurable via environment variables.

Suggested change
- app.run(debug=True, port=5000, host='0.0.0.0')
+ debug = os.environ.get('FLASK_DEBUG', 'False').lower() in ('1', 'true', 'yes')
+ host = os.environ.get('FLASK_HOST', '127.0.0.1')
+ port = int(os.environ.get('FLASK_PORT', '5000'))
+ app.run(debug=debug, port=port, host=host)

Comment thread web_viewer.py
Comment on lines +91 to +99
content = f.read()

remaining = content
while remaining.strip():
    message, remaining = parse_message_json(remaining)
    if message:
        messages.append(message)
    else:
        break

Copilot AI Nov 19, 2025

The entire file content is read into memory with f.read() before parsing. For very large message files (potentially gigabytes), this could cause memory issues. Consider using streaming/chunked reading to parse JSON objects incrementally.

Suggested change
- content = f.read()
- remaining = content
- while remaining.strip():
-     message, remaining = parse_message_json(remaining)
-     if message:
-         messages.append(message)
-     else:
-         break
+ buffer = ""
+ while True:
+     chunk = f.read(8192)
+     if not chunk:
+         break
+     buffer += chunk
+     # Try to parse as many messages as possible from the buffer
+     while buffer.strip():
+         message, remaining = parse_message_json(buffer)
+         if message:
+             messages.append(message)
+             buffer = remaining
+         else:
+             # Not enough data for a complete message, read more
+             break
+ # After file is read, try to parse any remaining message in buffer
+ while buffer.strip():
+     message, remaining = parse_message_json(buffer)
+     if message:
+         messages.append(message)
+         buffer = remaining
+     else:
+         break
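
A self-contained variant of the chunked approach can be sketched with the standard library's `json.JSONDecoder.raw_decode`, which reports where each decoded value ends. This is an alternative illustration, not the PR's `parse_message_json`; it assumes every top-level value in the log is a JSON object, and the name `iter_json_objects` is made up for the example.

```python
import io
import json


def iter_json_objects(stream, chunk_size=8192):
    """Yield concatenated JSON objects from a text stream, reading it in chunks."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # object is incomplete so far: read another chunk
            yield obj
            buffer = buffer[end:]
        if not chunk:
            break  # end of stream; leftover text is truncated or malformed


sample = io.StringIO('{"id": 1}\n{"id": 2, "text": "a}b"}{"id": 3}')
parsed = list(iter_json_objects(sample, chunk_size=5))
```

Because the generator yields messages one at a time, a caller can filter or count them without holding the whole file in memory. A malformed object in the middle of the file would cause the rest to accumulate in the buffer, so real code may want a size limit.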

Comment thread web_viewer.py
Comment on lines +223 to +261
# Sample messages to discover chats (check first 100 and last 100)
sample_size = min(200, len(messages))
if sample_size > 0:
    sample_indices = list(range(min(100, len(messages)))) + list(range(max(0, len(messages) - 100), len(messages)))
    for idx in sample_indices:
        if idx < len(messages):
            msg = messages[idx]
            if 'chat' in msg and msg['chat']:
                chat_data = msg['chat']
                chat_key = get_chat_key(chat_data)
                if chat_key and chat_key not in seen_chats:
                    seen_chats.add(chat_key)
                    # Create chat info
                    chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown'))
                    chats_dict[chat_key] = {
                        'id': chat_key,
                        'chat_id': chat_data.get('id'),
                        'username': chat_data.get('username'),
                        'name': chat_name,
                        'type': chat_data.get('type', ''),
                        'message_count': 0, # Will be calculated on demand
                        'source_folder': chat_dir.name
                    }

# Count messages per chat (do this efficiently)
chat_counts = {}
for msg in messages:
    if 'chat' in msg and msg['chat']:
        chat_data = msg['chat']
        chat_key = get_chat_key(chat_data)
        if chat_key:
            chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1

# Update message counts
for chat_key, count in chat_counts.items():
    if chat_key in chats_dict:
        chats_dict[chat_key]['message_count'] = count
    elif chat_key not in seen_chats:
        # Chat discovered during counting

Copilot AI Nov 19, 2025

The function iterates through ALL messages twice - once to discover chats (lines 249-254) and then again to count messages per chat (lines 247-259). This is inefficient for large message files. Consider combining these operations into a single pass through the messages.

Suggested change
- # Sample messages to discover chats (check first 100 and last 100)
- sample_size = min(200, len(messages))
- if sample_size > 0:
-     sample_indices = list(range(min(100, len(messages)))) + list(range(max(0, len(messages) - 100), len(messages)))
-     for idx in sample_indices:
-         if idx < len(messages):
-             msg = messages[idx]
-             if 'chat' in msg and msg['chat']:
-                 chat_data = msg['chat']
-                 chat_key = get_chat_key(chat_data)
-                 if chat_key and chat_key not in seen_chats:
-                     seen_chats.add(chat_key)
-                     # Create chat info
-                     chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown'))
-                     chats_dict[chat_key] = {
-                         'id': chat_key,
-                         'chat_id': chat_data.get('id'),
-                         'username': chat_data.get('username'),
-                         'name': chat_name,
-                         'type': chat_data.get('type', ''),
-                         'message_count': 0, # Will be calculated on demand
-                         'source_folder': chat_dir.name
-                     }
- # Count messages per chat (do this efficiently)
- chat_counts = {}
- for msg in messages:
-     if 'chat' in msg and msg['chat']:
-         chat_data = msg['chat']
-         chat_key = get_chat_key(chat_data)
-         if chat_key:
-             chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1
- # Update message counts
- for chat_key, count in chat_counts.items():
-     if chat_key in chats_dict:
-         chats_dict[chat_key]['message_count'] = count
-     elif chat_key not in seen_chats:
-         # Chat discovered during counting
+ # Discover chats and count messages in a single pass
+ chat_counts = {}
+ seen_chats = set()
+ for msg in messages:
+     if 'chat' in msg and msg['chat']:
+         chat_data = msg['chat']
+         chat_key = get_chat_key(chat_data)
+         if chat_key:
+             # Count messages per chat
+             chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1
+             # Discover chat if not already seen
+             if chat_key not in seen_chats:
+                 seen_chats.add(chat_key)
+                 chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown'))
+                 chats_dict[chat_key] = {
+                     'id': chat_key,
+                     'chat_id': chat_data.get('id'),
+                     'username': chat_data.get('username'),
+                     'name': chat_name,
+                     'type': chat_data.get('type', ''),
+                     'message_count': 0, # Will be updated below
+                     'source_folder': chat_dir.name
+                 }
+ # Update message counts
+ for chat_key, count in chat_counts.items():
+     if chat_key in chats_dict:
+         chats_dict[chat_key]['message_count'] = count
+     else:
+         # Chat discovered during counting (should not happen, but for safety)
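
The single-pass idea can also be written as a small standalone function using `collections.Counter`. This is a sketch under stated assumptions: `str(chat["id"])` stands in for the PR's `get_chat_key`, whose definition is not shown in this thread, and `summarize_chats` is a hypothetical name.

```python
from collections import Counter


def summarize_chats(messages):
    """Discover chats and count their messages in one pass over the list."""
    counts = Counter()
    chats = {}
    for msg in messages:
        chat = msg.get("chat")
        if not chat or chat.get("id") is None:
            continue  # message carries no usable chat object
        key = str(chat["id"])  # stand-in for get_chat_key(chat)
        counts[key] += 1
        if key not in chats:
            chats[key] = {
                "id": key,
                "name": chat.get("username") or chat.get("first_name") or chat.get("title") or key,
                "type": chat.get("type", ""),
            }
    # Attach the totals once counting is finished.
    for key, info in chats.items():
        info["message_count"] = counts[key]
    return chats


summary = summarize_chats([
    {"chat": {"id": -100, "title": "Ops"}},
    {"chat": {"id": -100, "title": "Ops"}},
    {"chat": {"id": 7, "username": "alice"}},
    {"text": "no chat object"},
])
```

Since every chat is registered the first time its key is counted, the "discovered during counting" fallback branch is not needed in this form.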

Comment thread web_viewer_static/index.html
</div>
</div>
<div class="search-bar" id="searchBar" style="display: none;">
<input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off">

Copilot AI Nov 19, 2025

The search input field lacks an associated <label> element. While it has a placeholder, screen readers cannot properly announce the purpose of the input field. Add a <label> element (can be visually hidden with CSS if needed) or use aria-label attribute for better accessibility.

Suggested change
- <input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off">
+ <input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off" aria-label="Search messages">

Comment thread web_viewer_static/app.js
function setupSearch() {
const searchInput = document.getElementById('searchInput');
const searchClear = document.getElementById('searchClear');
const searchResultsInfo = document.getElementById('searchResultsInfo');

Copilot AI Nov 19, 2025

Unused variable searchResultsInfo.

Suggested change
- const searchResultsInfo = document.getElementById('searchResultsInfo');

Comment thread TeleGatherer.py
Comment on lines +288 to +294
    viewer_process = subprocess.Popen(
        [sys.executable, 'web_viewer.py'],
        creationflags=subprocess.CREATE_NEW_CONSOLE
    )
else:
    # On Unix-like systems, run in background
    viewer_process = subprocess.Popen(

Copilot AI Nov 19, 2025

This assignment to 'viewer_process' is unnecessary as it is redefined before this value is used.

Suggested change
-     viewer_process = subprocess.Popen(
-         [sys.executable, 'web_viewer.py'],
-         creationflags=subprocess.CREATE_NEW_CONSOLE
-     )
- else:
-     # On Unix-like systems, run in background
-     viewer_process = subprocess.Popen(
+     subprocess.Popen(
+         [sys.executable, 'web_viewer.py'],
+         creationflags=subprocess.CREATE_NEW_CONSOLE
+     )
+ else:
+     # On Unix-like systems, run in background
+     subprocess.Popen(

Comment thread TeleGatherer.py
    )
else:
    # On Unix-like systems, run in background
    viewer_process = subprocess.Popen(

Copilot AI Nov 19, 2025

This assignment to 'viewer_process' is unnecessary as it is redefined before this value is used.

Comment thread web_viewer.py
@@ -0,0 +1,418 @@
import os
import json
import re

Copilot AI Nov 19, 2025

Import of 're' is not used.

Suggested change
- import re

jakemuk and others added 14 commits June 24, 2026 09:34
Pyrogram is stuck at 2.0.106 on PyPI with outdated 32-bit peer-id ranges,
which made get_messages raise "Peer id invalid" for channels/groups with
modern 64-bit ids (menu option 6, Download ALL messages).

kurigram is a maintained drop-in fork that imports under the same `pyrogram`
namespace but ships 64-bit peer support, so no application code changes are
needed. Verified end-to-end: the bot now authenticates, resolves the peer,
and downloads real messages from the target chat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README: fix install command (requirements -> requirements.txt), add venv
  step, and document the kurigram dependency / Peer id invalid fix.
- Stop tracking .bot-history: it is already in .gitignore but was committed
  before that rule, so it kept leaking the bot_token:chat_id pairs into git.
  Removed from the index only; the local file is preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Since .bot-history is no longer tracked in git, a fresh clone won't have it,
and check_file_for_token_and_chat_id opened it unconditionally for reading -
crashing with FileNotFoundError before the menu was ever shown. Treat a
missing file as "no prior entry"; it is then created on first run as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
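
The "missing file means no prior entry" behavior described in the commit above can be sketched as follows. The function name `has_entry` and the one-pair-per-line `bot_token:chat_id` layout are assumptions for illustration; the real `check_file_for_token_and_chat_id` signature is not shown here.

```python
def has_entry(path, bot_token, chat_id):
    """Return True if the history file already lists this bot_token:chat_id pair."""
    entry = f"{bot_token}:{chat_id}"
    try:
        with open(path, encoding="utf-8") as f:
            return any(line.strip() == entry for line in f)
    except FileNotFoundError:
        # A fresh clone has no history file yet, so nothing has been recorded.
        return False
```

Catching only `FileNotFoundError` keeps other problems, such as permission errors, visible instead of silently treating them as an empty history.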
Wrapper that runs TeleGatherer.py with the project's .venv interpreter (which
has kurigram and the other dependencies installed), so it doesn't accidentally
use a global Python that lacks them. Forwards all args through.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kurigram is API-compatible with Pyrogram except for a few spots the download
path hit:

- Client.run() no longer accepts a coroutine (it takes only keyword args).
  Run the coroutine via app.loop.run_until_complete(), which is what the old
  Pyrogram run(coroutine) did internally.
- get_messages() returns None for a missing/deleted message id (old Pyrogram
  returned a stub with date=None). The `messages.date is None` check therefore
  raised AttributeError and - because the try/except wraps the whole loop -
  aborted the entire download at the first gap. Guard for None as well.
- With gaps now skipped, the counter-decrement logic could push the loop past
  message id 1 into negative ids forever; bound the loop at message_id > 1 to
  guarantee termination.

Verified end-to-end: downloads real messages from the target chat and writes
the .txt/.json logs, with no peer error, no abort on gaps, and clean exit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The downloader saved each message under its sender's username folder, so a
group chat's history was scattered across multiple folders (one per sender)
and could not be read as a single conversation. Save every message under the
chat's own folder (Downloads/<chat_id>/) instead, keeping the sender recorded
per-message in the log. Media now lands under downloads/<chat_id>/ as well.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Serve a chat's messages by scanning all download folders and filtering by
chat id, rather than reading only the single recorded source_folder. This
shows the full history even for older data that was split across per-sender
folders. De-duplicate by message id, since logs are append-only (re-downloads
repeat messages) and aggregation can surface the same message twice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
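
De-duplication by message id, as described in the commit above, can be sketched like this. It assumes each message is a dict with an `id` key and that the list has already been filtered to a single chat, since message ids are only meaningful within one chat. The name `dedupe_by_id` is hypothetical.

```python
def dedupe_by_id(messages):
    """Keep the first occurrence of each message id, preserving input order."""
    seen = set()
    unique = []
    for msg in messages:
        msg_id = msg.get("id")
        if msg_id is not None:
            if msg_id in seen:
                continue  # repeated entry from a re-download or a second folder
            seen.add(msg_id)
        unique.append(msg)  # entries without an id cannot be compared, so keep them
    return unique


deduped = dedupe_by_id([
    {"id": 5, "text": "first copy"},
    {"id": 6, "text": "other"},
    {"id": 5, "text": "second copy"},
])
```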
The web viewer derived the chat name/type by parsing the chat object out of
message bodies, which is fragile (fails for media-only or service-message-only
chats, and the on-disk folder is just a numeric chat_id).

The downloader now writes Downloads/<chat_id>/metadata.json (name, title, type,
username, members_count) once per download, and the viewer prefers it for the
chat name/type shown in the sidebar and header, falling back to message-derived
values when it's absent. This keeps the chat name/avatar/header intact and makes
the chat folders self-describing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
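
The "prefer metadata.json, fall back to message-derived values" rule from the commit above might look like the sketch below. The metadata keys (`name`, `title`, `username`, `type`) come from the commit message; the function name `chat_display_info`, the fallback order, and the dict-shaped messages are assumptions.

```python
import json
from pathlib import Path


def chat_display_info(folder, messages):
    """Resolve a chat's display name and type, preferring the folder's metadata.json."""
    meta = {}
    try:
        loaded = json.loads((Path(folder) / "metadata.json").read_text(encoding="utf-8"))
        if isinstance(loaded, dict):
            meta = loaded
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # no usable metadata: fall back to what the messages contain

    # First chat object found in the messages, or an empty dict if there is none.
    chat = next((m["chat"] for m in messages if m.get("chat")), {})
    name = (
        meta.get("name") or meta.get("title") or meta.get("username")
        or chat.get("username") or chat.get("first_name") or chat.get("title")
        or Path(folder).name  # last resort: the numeric folder name
    )
    return {"name": name, "type": meta.get("type") or chat.get("type") or ""}
```

With this ordering, a media-only chat whose messages carry no usable chat object still gets a readable name as long as the downloader wrote its metadata file.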
The downloader opened a new connection per message and reconnected for every
id, which made downloading a whole chat impractically slow (hundreds of
logins + flood-waits). Fetch in batches of 200 ids (Telegram's per-call limit)
over a single open connection instead.

Also isolate each message: previously a single problematic message - e.g. a
web-page-preview "media" with no real file - raised out of the loop and aborted
the entire download. Now each message is processed in its own try/except,
un-resolvable media (no file_id) is skipped, and media download failures are
logged per-message rather than killing the run.

Note: option 6 still uses the Bot API (get_latest_messageid) to estimate the
message count before the MTProto download; heavy repeated MTProto logins can
make Telegram log the bot out of the Bot API session temporarily.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
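
The batching and per-message isolation described in the commit above can be illustrated without any Telegram client. In this sketch `id_batches` and `process_each` are hypothetical helpers, messages are plain dicts, and the batch size of 200 is taken from the commit message.

```python
def id_batches(latest_id, batch_size=200):
    """Yield lists of message ids from newest to oldest, batch_size ids at most."""
    high = latest_id
    while high >= 1:
        low = max(high - batch_size + 1, 1)
        yield list(range(high, low - 1, -1))
        high = low - 1  # strictly decreasing, so the loop always terminates


def process_each(messages, handle):
    """Run handle on every message; collect failures instead of aborting the run."""
    failures = []
    for msg in messages:
        if msg is None:
            continue  # gap: deleted or missing message id
        try:
            handle(msg)
        except Exception as exc:  # broad on purpose: one bad message must not stop the rest
            failures.append((msg.get("id"), str(exc)))
    return failures


batches = list(id_batches(450))
```

Each yielded list would be passed to one fetch call over a single open connection, and `process_each` would then save whatever came back.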
Repeat/update runs previously re-fetched and re-saved the entire chat (and
appended duplicate log entries). Now the downloader reads the ids already in
the chat's log up front, skips them, and stops early once it reaches a batch
whose messages are all already downloaded (going newest-first, everything older
is saved too). An update run therefore fetches roughly one batch of new
messages instead of the whole history.

The first full download is unchanged (nothing saved yet, so nothing skipped);
the win is on every subsequent run. Bot history can only be read by message id
(bots can't use get_chat_history), so the 200-ids-per-call batching already in
place remains the call-optimal path for the initial download.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
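
The incremental update run from the commit above, which skips saved ids and stops at the first fully saved batch, can be sketched as below. `fetch_batch` is a hypothetical callable that takes a list of ids and returns one dict or `None` per id; it stands in for the real client call, which is not shown here.

```python
def fetch_new_messages(fetch_batch, latest_id, saved_ids, batch_size=200):
    """Walk ids newest-first and return only messages not already in saved_ids."""
    new_messages = []
    high = latest_id
    while high >= 1:
        low = max(high - batch_size + 1, 1)
        ids = list(range(high, low - 1, -1))
        fetched = [m for m in fetch_batch(ids) if m is not None]
        fresh = [m for m in fetched if m["id"] not in saved_ids]
        if fetched and not fresh:
            break  # whole batch already saved, so everything older is saved too
        new_messages.extend(fresh)
        high = low - 1
    return new_messages


calls = []


def fake_fetch(ids):
    """Pretend server holding message ids 1..500."""
    calls.append(ids)
    return [{"id": i} for i in ids]


new = fetch_new_messages(fake_fetch, latest_id=500, saved_ids=set(range(1, 401)))
```

In this example the first batch contributes the 100 unsaved messages, and the second batch, being fully saved, ends the run after two fetch calls instead of three.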
process_messages built its pyrogram Client without a bot_token and without a
fixed workdir, so it only worked when an already-authorized session file
happened to exist at a path derived from sys.argv[0]'s directory. On a clean
checkout (or when invoked from outside the repo) it failed with
"unable to open database file" / could not authenticate.

Pass bot_token so a fresh session authenticates on its own, and pin workdir="."
so the session is created under the working directory next to the sessions/
folder the function already makes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A discussion group and its linked channel have different chat ids, so messages
in a single downloaded folder were split into two same-named sidebar entries
(e.g. the group's 639 messages and the channel's 196 posts shown separately).

Folders written by the downloader carry a metadata.json; present each such
folder as ONE chat holding all of its messages (group + linked-channel posts),
named from the metadata. Legacy per-sender folders (no metadata.json) keep the
previous chat-id aggregation, so they are not mis-merged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every message rendered its raw entity list (e.g. "MessageEntityType.BOLD:
offset 3, length 26"), cluttering the conversation. Hide .message-entities by
default and add a "Show entity details" checkbox in the search bar that toggles
a .show-entities class on the container to reveal them when needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
formatDate showed time-only for messages from the current day (and relative
labels for recent ones), so today's messages appeared without a date even
though the full date is in the logs. Always render date + time
(e.g. "Jun 24, 2026, 14:17").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
3 participants