Add web viewer feature for browsing downloaded messages #23
- Add Flask-based web viewer with Telegram-style UI
- Implement infinite scroll and search functionality
- Add menu option 8 to TeleGatherer.py to launch viewer
- Fix UTF-8 encoding issues for Windows compatibility
- Update documentation and requirements
Removed duplicate downloads directory exclusion
Pull Request Overview
This PR adds a web-based viewer for browsing downloaded Telegram messages with a Telegram-style dark UI. The feature includes a Flask backend API, responsive frontend with search and infinite scroll, and integration into the main TeleGatherer menu.
Key Changes
- New web viewer with Flask REST API for serving messages with pagination and caching
- Telegram-style responsive UI with search functionality and infinite scroll
- UTF-8 encoding fixes for Windows compatibility with emojis and special characters
Reviewed Changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 26 comments.
| File | Description |
|---|---|
| web_viewer_static/styles.css | Telegram-themed CSS with dark color scheme and responsive layout |
| web_viewer_static/index.html | Single-page HTML structure for the viewer interface |
| web_viewer_static/app.js | Frontend JavaScript handling chat loading, search, and infinite scroll |
| web_viewer.py | Flask backend with API endpoints for chats and messages, includes caching |
| TeleGatherer.py | Added menu option 8 to launch web viewer via subprocess |
| helpers/TeleViewer.py | Fixed UTF-8 encoding for file operations |
| requirements.txt | Added Flask 3.0.0 and flask-cors 4.0.0 dependencies |
| .gitignore | Added log file exclusion for web viewer static directory |
| README.md | Added web viewer documentation and usage instructions |
| chatItem.innerHTML = ` | ||
| <div class="chat-item-name">${escapeHtml(chat.name)}</div> | ||
| <div class="chat-item-meta">${chat.message_count || 0} messages</div> | ||
| `; |
The HTML is constructed using string concatenation with user-supplied data (chat.name). While escapeHtml() is used here, this pattern is error-prone. Consider using textContent property or DOM manipulation methods to safely set the chat name instead of innerHTML.
| chatItem.innerHTML = ` | |
| <div class="chat-item-name">${escapeHtml(chat.name)}</div> | |
| <div class="chat-item-meta">${chat.message_count || 0} messages</div> | |
| `; | |
| const chatNameDiv = document.createElement('div'); | |
| chatNameDiv.className = 'chat-item-name'; | |
| chatNameDiv.textContent = chat.name; | |
| const chatMetaDiv = document.createElement('div'); | |
| chatMetaDiv.className = 'chat-item-meta'; | |
| chatMetaDiv.textContent = `${chat.message_count || 0} messages`; | |
| chatItem.appendChild(chatNameDiv); | |
| chatItem.appendChild(chatMetaDiv); |
| if (!searchQuery) return html; | ||
| const query = escapeHtml(searchQuery).toLowerCase(); | ||
| const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')})`, 'gi'); |
[nitpick] The regex pattern replace(/[.*+?^${}()|[\]\\]/g, '\\$&') is used to escape regex special characters, but the character class is missing escaping for the closing bracket ]. While this works in practice due to how JavaScript handles it, it's clearer to escape it as \\] for consistency and clarity.
| const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')})`, 'gi'); | |
| const regex = new RegExp(`(${query.replace(/[.*+?^${}()|[\]\\\]]/g, '\\$&')})`, 'gi'); |
| app.run(debug=True, port=5000, host='0.0.0.0') | ||
The server runs with debug=True and host='0.0.0.0' in production. This is a significant security risk as it:
- Exposes the Werkzeug debugger which can execute arbitrary code
- Binds to all network interfaces, making it accessible from outside the local machine
- Exposes sensitive file paths and code
For production use, set debug=False and host='127.0.0.1' (localhost only), or make these configurable via environment variables.
| app.run(debug=True, port=5000, host='0.0.0.0') | |
| debug = os.environ.get('FLASK_DEBUG', 'False').lower() in ('1', 'true', 'yes') | |
| host = os.environ.get('FLASK_HOST', '127.0.0.1') | |
| port = int(os.environ.get('FLASK_PORT', '5000')) | |
| app.run(debug=debug, port=port, host=host) |
| content = f.read() | ||
| remaining = content | ||
| while remaining.strip(): | ||
| message, remaining = parse_message_json(remaining) | ||
| if message: | ||
| messages.append(message) | ||
| else: | ||
| break |
The entire file content is read into memory with f.read() before parsing. For very large message files (potentially gigabytes), this could cause memory issues. Consider using streaming/chunked reading to parse JSON objects incrementally.
| content = f.read() | |
| remaining = content | |
| while remaining.strip(): | |
| message, remaining = parse_message_json(remaining) | |
| if message: | |
| messages.append(message) | |
| else: | |
| break | |
| buffer = "" | |
| while True: | |
| chunk = f.read(8192) | |
| if not chunk: | |
| break | |
| buffer += chunk | |
| # Try to parse as many messages as possible from the buffer | |
| while buffer.strip(): | |
| message, remaining = parse_message_json(buffer) | |
| if message: | |
| messages.append(message) | |
| buffer = remaining | |
| else: | |
| # Not enough data for a complete message, read more | |
| break | |
| # After file is read, try to parse any remaining message in buffer | |
| while buffer.strip(): | |
| message, remaining = parse_message_json(buffer) | |
| if message: | |
| messages.append(message) | |
| buffer = remaining | |
| else: | |
| break |
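The incremental approach suggested above can also be sketched with the standard library alone. This is a minimal sketch, assuming the log is a sequence of concatenated top-level JSON objects; the actual on-disk format and the behaviour of `parse_message_json` are not shown in this PR, and `iter_json_objects` is a hypothetical name. It relies on `json.JSONDecoder.raw_decode`, which returns the decoded value together with the index where decoding stopped.

```python
import json


def iter_json_objects(stream, chunk_size=8192):
    """Yield JSON objects from a text stream of concatenated JSON documents.

    Assumption: every top-level value is an object ({...}), so a partially
    read value always fails to decode instead of being mistaken for a
    shorter complete value.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        # Decode as many complete objects as the buffer currently holds.
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # Incomplete (or malformed) object: wait for more data.
                break
            yield obj
            buffer = buffer[end:]
        if not chunk:
            # End of file. Any undecodable tail left in the buffer is dropped.
            break
```

Because it is a generator, the caller decides whether to collect everything or process messages one at a time. Note that a malformed object in the middle of the file would make this sketch buffer the rest of the file, so real code would likely need a recovery strategy.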
| # Sample messages to discover chats (check first 100 and last 100) | ||
| sample_size = min(200, len(messages)) | ||
| if sample_size > 0: | ||
| sample_indices = list(range(min(100, len(messages)))) + list(range(max(0, len(messages) - 100), len(messages))) | ||
| for idx in sample_indices: | ||
| if idx < len(messages): | ||
| msg = messages[idx] | ||
| if 'chat' in msg and msg['chat']: | ||
| chat_data = msg['chat'] | ||
| chat_key = get_chat_key(chat_data) | ||
| if chat_key and chat_key not in seen_chats: | ||
| seen_chats.add(chat_key) | ||
| # Create chat info | ||
| chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown')) | ||
| chats_dict[chat_key] = { | ||
| 'id': chat_key, | ||
| 'chat_id': chat_data.get('id'), | ||
| 'username': chat_data.get('username'), | ||
| 'name': chat_name, | ||
| 'type': chat_data.get('type', ''), | ||
| 'message_count': 0, # Will be calculated on demand | ||
| 'source_folder': chat_dir.name | ||
| } | ||
| # Count messages per chat (do this efficiently) | ||
| chat_counts = {} | ||
| for msg in messages: | ||
| if 'chat' in msg and msg['chat']: | ||
| chat_data = msg['chat'] | ||
| chat_key = get_chat_key(chat_data) | ||
| if chat_key: | ||
| chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1 | ||
| # Update message counts | ||
| for chat_key, count in chat_counts.items(): | ||
| if chat_key in chats_dict: | ||
| chats_dict[chat_key]['message_count'] = count | ||
| elif chat_key not in seen_chats: | ||
| # Chat discovered during counting |
The function iterates through ALL messages twice - once to discover chats (lines 249-254) and then again to count messages per chat (lines 247-259). This is inefficient for large message files. Consider combining these operations into a single pass through the messages.
| # Sample messages to discover chats (check first 100 and last 100) | |
| sample_size = min(200, len(messages)) | |
| if sample_size > 0: | |
| sample_indices = list(range(min(100, len(messages)))) + list(range(max(0, len(messages) - 100), len(messages))) | |
| for idx in sample_indices: | |
| if idx < len(messages): | |
| msg = messages[idx] | |
| if 'chat' in msg and msg['chat']: | |
| chat_data = msg['chat'] | |
| chat_key = get_chat_key(chat_data) | |
| if chat_key and chat_key not in seen_chats: | |
| seen_chats.add(chat_key) | |
| # Create chat info | |
| chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown')) | |
| chats_dict[chat_key] = { | |
| 'id': chat_key, | |
| 'chat_id': chat_data.get('id'), | |
| 'username': chat_data.get('username'), | |
| 'name': chat_name, | |
| 'type': chat_data.get('type', ''), | |
| 'message_count': 0, # Will be calculated on demand | |
| 'source_folder': chat_dir.name | |
| } | |
| # Count messages per chat (do this efficiently) | |
| chat_counts = {} | |
| for msg in messages: | |
| if 'chat' in msg and msg['chat']: | |
| chat_data = msg['chat'] | |
| chat_key = get_chat_key(chat_data) | |
| if chat_key: | |
| chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1 | |
| # Update message counts | |
| for chat_key, count in chat_counts.items(): | |
| if chat_key in chats_dict: | |
| chats_dict[chat_key]['message_count'] = count | |
| elif chat_key not in seen_chats: | |
| # Chat discovered during counting | |
| # Discover chats and count messages in a single pass | |
| chat_counts = {} | |
| seen_chats = set() | |
| for msg in messages: | |
| if 'chat' in msg and msg['chat']: | |
| chat_data = msg['chat'] | |
| chat_key = get_chat_key(chat_data) | |
| if chat_key: | |
| # Count messages per chat | |
| chat_counts[chat_key] = chat_counts.get(chat_key, 0) + 1 | |
| # Discover chat if not already seen | |
| if chat_key not in seen_chats: | |
| seen_chats.add(chat_key) | |
| chat_name = chat_data.get('username') or chat_data.get('first_name') or chat_data.get('title') or str(chat_data.get('id', 'Unknown')) | |
| chats_dict[chat_key] = { | |
| 'id': chat_key, | |
| 'chat_id': chat_data.get('id'), | |
| 'username': chat_data.get('username'), | |
| 'name': chat_name, | |
| 'type': chat_data.get('type', ''), | |
| 'message_count': 0, # Will be updated below | |
| 'source_folder': chat_dir.name | |
| } | |
| # Update message counts | |
| for chat_key, count in chat_counts.items(): | |
| if chat_key in chats_dict: | |
| chats_dict[chat_key]['message_count'] = count | |
| else: | |
| # Chat discovered during counting (should not happen, but for safety) |
| </div> | ||
| </div> | ||
| <div class="search-bar" id="searchBar" style="display: none;"> | ||
| <input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off"> |
The search input field lacks an associated <label> element. While it has a placeholder, screen readers cannot properly announce the purpose of the input field. Add a <label> element (can be visually hidden with CSS if needed) or use aria-label attribute for better accessibility.
| <input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off"> | |
| <input type="text" id="searchInput" placeholder="Search messages..." autocomplete="off" aria-label="Search messages"> |
| function setupSearch() { | ||
| const searchInput = document.getElementById('searchInput'); | ||
| const searchClear = document.getElementById('searchClear'); | ||
| const searchResultsInfo = document.getElementById('searchResultsInfo'); |
Unused variable searchResultsInfo.
| const searchResultsInfo = document.getElementById('searchResultsInfo'); |
| viewer_process = subprocess.Popen( | ||
| [sys.executable, 'web_viewer.py'], | ||
| creationflags=subprocess.CREATE_NEW_CONSOLE | ||
| ) | ||
| else: | ||
| # On Unix-like systems, run in background | ||
| viewer_process = subprocess.Popen( |
This assignment to 'viewer_process' is unnecessary as it is redefined before this value is used.
| viewer_process = subprocess.Popen( | |
| [sys.executable, 'web_viewer.py'], | |
| creationflags=subprocess.CREATE_NEW_CONSOLE | |
| ) | |
| else: | |
| # On Unix-like systems, run in background | |
| viewer_process = subprocess.Popen( | |
| subprocess.Popen( | |
| [sys.executable, 'web_viewer.py'], | |
| creationflags=subprocess.CREATE_NEW_CONSOLE | |
| ) | |
| else: | |
| # On Unix-like systems, run in background | |
| subprocess.Popen( |
| @@ -0,0 +1,418 @@ | |||
| import os | |||
| import json | |||
| import re | |||
Import of 're' is not used.
| import re |
Pyrogram is stuck at 2.0.106 on PyPI with outdated 32-bit peer-id ranges, which made get_messages raise "Peer id invalid" for channels/groups with modern 64-bit ids (menu option 6, Download ALL messages). kurigram is a maintained drop-in fork that imports under the same `pyrogram` namespace but ships 64-bit peer support, so no application code changes are needed. Verified end-to-end: the bot now authenticates, resolves the peer, and downloads real messages from the target chat. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README: fix install command (requirements -> requirements.txt), add venv step, and document the kurigram dependency / Peer id invalid fix.
- Stop tracking .bot-history: it is already in .gitignore but was committed before that rule, so it kept leaking the bot_token:chat_id pairs into git. Removed from the index only; the local file is preserved.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Since .bot-history is no longer tracked in git, a fresh clone won't have it, and check_file_for_token_and_chat_id opened it unconditionally for reading - crashing with FileNotFoundError before the menu was ever shown. Treat a missing file as "no prior entry"; it is then created on first run as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
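The "missing file means no prior entry" behaviour described in this commit could look roughly like the sketch below. It assumes the history file holds one `bot_token:chat_id` pair per line; the real `check_file_for_token_and_chat_id` is not shown here, so `has_history_entry` and its signature are hypothetical.

```python
def has_history_entry(history_path, bot_token, chat_id):
    """Return True if 'bot_token:chat_id' is already recorded in the history file.

    Assumption: one 'bot_token:chat_id' pair per line.
    """
    entry = f"{bot_token}:{chat_id}"
    try:
        with open(history_path, "r", encoding="utf-8") as f:
            return any(line.strip() == entry for line in f)
    except FileNotFoundError:
        # Fresh clone: the file does not exist yet, so there is no prior entry.
        return False
```

Catching `FileNotFoundError` rather than checking `os.path.exists` first avoids a race between the check and the open.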
Wrapper that runs TeleGatherer.py with the project's .venv interpreter (which has kurigram and the other dependencies installed), so it doesn't accidentally use a global Python that lacks them. Forwards all args through. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kurigram is API-compatible with Pyrogram except for a few spots the download path hit:
- Client.run() no longer accepts a coroutine (it takes only keyword args). Run the coroutine via app.loop.run_until_complete(), which is what the old Pyrogram run(coroutine) did internally.
- get_messages() returns None for a missing/deleted message id (old Pyrogram returned a stub with date=None). The `messages.date is None` check therefore raised AttributeError and, because the try/except wraps the whole loop, aborted the entire download at the first gap. Guard for None as well.
- With gaps now skipped, the counter-decrement logic could push the loop past message id 1 into negative ids forever; bound the loop at message_id > 1 to guarantee termination.
Verified end-to-end: downloads real messages from the target chat and writes the .txt/.json logs, with no peer error, no abort on gaps, and clean exit.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
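The gap handling and loop bound described above can be illustrated without a Telegram connection. This is a sketch under stated assumptions: `fetch` is a hypothetical stand-in for a single-id `get_messages` call, `collect_existing` is an invented name, and the lower bound used here (stop once the id drops below 1) is illustrative and may differ from the exact condition in the commit.

```python
def collect_existing(latest_id, fetch):
    """Walk message ids from latest_id downwards and keep the ones that exist.

    `fetch` is a hypothetical callable taking a message id and returning
    either None (missing/deleted id) or an object with a `date` attribute.
    """
    found = []
    message_id = latest_id
    # Explicit lower bound: the loop cannot run into zero or negative ids.
    while message_id > 0:
        message = fetch(message_id)
        # Guard both shapes of "missing": None, and a stub whose date is None.
        if message is not None and getattr(message, "date", None) is not None:
            found.append(message)
        message_id -= 1
    return found
```

Checking for `None` before touching `.date` is what prevents the `AttributeError` that previously aborted the whole run at the first gap.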
The downloader saved each message under its sender's username folder, so a group chat's history was scattered across multiple folders (one per sender) and could not be read as a single conversation. Save every message under the chat's own folder (Downloads/<chat_id>/) instead, keeping the sender recorded per-message in the log. Media now lands under downloads/<chat_id>/ as well. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Serve a chat's messages by scanning all download folders and filtering by chat id, rather than reading only the single recorded source_folder. This shows the full history even for older data that was split across per-sender folders. De-duplicate by message id, since logs are append-only (re-downloads repeat messages) and aggregation can surface the same message twice. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
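De-duplication by message id, as described in this commit, could be sketched as follows. The sketch assumes each parsed message is a dict with an `id` key and that all messages belong to one chat; `dedupe_by_id` is a hypothetical name, not the function in `web_viewer.py`.

```python
def dedupe_by_id(messages):
    """Keep the first occurrence of each message id, preserving input order.

    Assumption: messages are dicts with an 'id' key and come from one chat.
    """
    seen = set()
    unique = []
    for msg in messages:
        msg_id = msg.get("id")
        if msg_id is None:
            # Without an id there is nothing to compare on, so keep the message.
            unique.append(msg)
            continue
        if msg_id in seen:
            continue
        seen.add(msg_id)
        unique.append(msg)
    return unique
```

Keeping the first occurrence is a design choice of this sketch; if re-downloads can carry edited text, keeping the last occurrence instead may be preferable.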
The web viewer derived the chat name/type by parsing the chat object out of message bodies, which is fragile (fails for media-only or service-message-only chats, and the on-disk folder is just a numeric chat_id). The downloader now writes Downloads/<chat_id>/metadata.json (name, title, type, username, members_count) once per download, and the viewer prefers it for the chat name/type shown in the sidebar and header, falling back to message-derived values when it's absent. This keeps the chat name/avatar/header intact and makes the chat folders self-describing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
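The "prefer `metadata.json`, fall back to message-derived values" lookup might be sketched like this. The field names `title`, `name`, and `username` come from the commit message; the order of preference among them and the function name `resolve_chat_name` are assumptions.

```python
import json
from pathlib import Path


def resolve_chat_name(chat_dir, fallback_name):
    """Return the chat name from <chat_dir>/metadata.json, else fallback_name.

    Assumption: metadata.json is a JSON object that may contain 'title',
    'name', and 'username'; the preference order below is illustrative.
    """
    meta_path = Path(chat_dir) / "metadata.json"
    try:
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Legacy folder or unreadable metadata: use the message-derived name.
        return fallback_name
    if not isinstance(meta, dict):
        return fallback_name
    return meta.get("title") or meta.get("name") or meta.get("username") or fallback_name
```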
The downloader opened a new connection per message and reconnected for every id, which made downloading a whole chat impractically slow (hundreds of logins + flood-waits). Fetch in batches of 200 ids (Telegram's per-call limit) over a single open connection instead. Also isolate each message: previously a single problematic message - e.g. a web-page-preview "media" with no real file - raised out of the loop and aborted the entire download. Now each message is processed in its own try/except, un-resolvable media (no file_id) is skipped, and media download failures are logged per-message rather than killing the run. Note: option 6 still uses the Bot API (get_latest_messageid) to estimate the message count before the MTProto download; heavy repeated MTProto logins can make Telegram log the bot out of the Bot API session temporarily. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
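Splitting the id range into batches of 200, newest first, can be sketched as a small generator. `id_batches` is a hypothetical helper; how the real downloader builds its id lists is not shown in this PR.

```python
def id_batches(latest_id, batch_size=200):
    """Yield lists of message ids from latest_id down to 1, newest first.

    Each list holds at most batch_size ids, matching the per-call limit
    mentioned in the commit message.
    """
    high = latest_id
    while high >= 1:
        low = max(1, high - batch_size + 1)
        yield list(range(high, low - 1, -1))
        high = low - 1
```

Each yielded list would then be passed to one fetch call over the already open connection, instead of one connection per message.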
Repeat/update runs previously re-fetched and re-saved the entire chat (and appended duplicate log entries). Now the downloader reads the ids already in the chat's log up front, skips them, and stops early once it reaches a batch whose messages are all already downloaded (going newest-first, everything older is saved too). An update run therefore fetches roughly one batch of new messages instead of the whole history. The first full download is unchanged (nothing saved yet, so nothing skipped); the win is on every subsequent run. Bot history can only be read by message id (bots can't use get_chat_history), so the 200-ids-per-call batching already in place remains the call-optimal path for the initial download. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
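The skip-and-stop-early logic for update runs could be sketched as below. This is an illustration under assumptions: `fetch_batch` is a hypothetical callable that returns message dicts (or `None` for gaps) for a list of ids, and a batch counts as "already downloaded" here when it returned at least one message and none of them is new. The downloader's actual stopping rule may differ.

```python
def fetch_new_messages(batches, saved_ids, fetch_batch):
    """Collect messages not yet saved, stopping at the first fully saved batch.

    batches: iterable of id lists, newest first.
    saved_ids: set of message ids already present in the chat's log.
    fetch_batch: hypothetical callable mapping an id list to messages
                 (dicts with an 'id' key, or None for missing ids).
    """
    new_messages = []
    for batch in batches:
        existing = [m for m in fetch_batch(batch) if m is not None]
        fresh = [m for m in existing if m["id"] not in saved_ids]
        if existing and not fresh:
            # Going newest-first, everything older was saved by an earlier run.
            break
        new_messages.extend(fresh)
    return new_messages
```

On a first full download `saved_ids` is empty, so no batch is ever skipped and the behaviour matches a plain batched download.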
process_messages built its pyrogram Client without a bot_token and without a fixed workdir, so it only worked when an already-authorized session file happened to exist at a path derived from sys.argv[0]'s directory. On a clean checkout (or when invoked from outside the repo) it failed with "unable to open database file" / could not authenticate. Pass bot_token so a fresh session authenticates on its own, and pin workdir="." so the session is created under the working directory next to the sessions/ folder the function already makes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A discussion group and its linked channel have different chat ids, so messages in a single downloaded folder were split into two same-named sidebar entries (e.g. the group's 639 messages and the channel's 196 posts shown separately). Folders written by the downloader carry a metadata.json; present each such folder as ONE chat holding all of its messages (group + linked-channel posts), named from the metadata. Legacy per-sender folders (no metadata.json) keep the previous chat-id aggregation, so they are not mis-merged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Every message rendered its raw entity list (e.g. "MessageEntityType.BOLD: offset 3, length 26"), cluttering the conversation. Hide .message-entities by default and add a "Show entity details" checkbox in the search bar that toggles a .show-entities class on the container to reveal them when needed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
formatDate showed time-only for messages from the current day (and relative labels for recent ones), so today's messages appeared without a date even though the full date is in the logs. Always render date + time (e.g. "Jun 24, 2026, 14:17"). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Overview
This PR adds a modern web-based viewer that provides a Telegram-style interface for browsing and searching downloaded Telegram messages. The viewer makes it easier to analyze collected intelligence from Telegram channels.
Features Added
Web Viewer Interface
Integration
- `TeleGatherer.py` to launch the web viewer directly
- `python web_viewer.py`
- `http://localhost:5000`
Technical Details
New Files
- `web_viewer.py` - Flask backend server with REST API endpoints
- `web_viewer_static/index.html` - Main HTML page
- `web_viewer_static/styles.css` - Telegram-style CSS styling
- `web_viewer_static/app.js` - Frontend JavaScript with infinite scroll and search
Modified Files
- `TeleGatherer.py` - Added menu option 8 to launch web viewer
- `helpers/TeleViewer.py` - Fixed UTF-8 encoding issues for Windows compatibility (handles emojis and special characters)
- `requirements.txt` - Added Flask==3.0.0 and flask-cors==4.0.0
- `README.md` - Added comprehensive web viewer documentation
- `.gitignore` - Added Downloads folder exclusion
API Endpoints
- `GET /api/chats` - List all available chats
- `GET /api/chats/<chat_id>/messages` - Get messages with pagination
- `GET /api/chats/<chat_id>/info` - Get chat information
Key Improvements
Smart Chat Grouping - Messages are grouped by actual chat ID (from message data), not folder names, so messages from different chats stored in the same folder are displayed separately
Search Capabilities - Search across:
Performance -
Windows Compatibility - Fixed encoding issues to properly handle emojis and non-ASCII characters on Windows
Usage
- `TeleGatherer.py` (option 6)
- `TeleGatherer.py` and select option 8
- `python web_viewer.py`
- `http://localhost:5000`
Dependencies
Notes