OpenSource/telegram-crawler: 🕷 Automatically detect changes made to the official Telegram sites, clients and servers.

mirror of https://github.com/MarshalX/telegram-crawler.git synced 2026-07-21 02:21:34 +00:00

🕷 Automatically detect changes made to the official Telegram sites, clients and servers. https://t.me/tgcrawl

crawler crawling crawling-python parser telegram telegram-org telegram-updates

Python 100%

Ilya (Marshal) 81837f9f06 Update README		2026-07-21 02:34:30 +02:00
.github	Restore data branch checkout lost in workflow refactor	2026-07-21 01:36:07 +02:00
.gitignore	fixed fetching tgmacos beta and tgandroid beta (#22 )	2025-08-16 13:59:47 +02:00
ccl_bplist.py	add crawling of Telegram for iOS from TestFlight;	2022-05-09 12:05:55 +02:00
LICENSE	add readme and license file	2021-04-24 17:34:55 +02:00
make_and_send_alert.py	Clean up lint and type errors	2026-07-21 02:21:07 +02:00
make_files_tree.py	Clean up lint and type errors	2026-07-21 02:21:07 +02:00
make_tracked_links_list.py	Clean up lint and type errors	2026-07-21 02:21:07 +02:00
README.md	Update README	2026-07-21 02:34:30 +02:00
requirements.txt	Migrate from aiohttp to httpx and log download progress (#25 )	2026-07-21 02:11:04 +02:00
tracked_links.txt	Update tracked links	2026-07-20 23:53:29 +00:00
tracked_res_links.txt	Update tracked links	2026-07-20 23:53:29 +00:00
tracked_tr_links.txt	Update tracked links	2026-01-22 12:08:11 +00:00
unwebpack_sourcemap.py	fix encoding	2023-09-15 22:57:06 +02:00

README.md

🕷 Telegram Crawler

This project automatically detects changes made to the official Telegram sites, beta clients, MTProto servers and even mini apps. It helps to anticipate future updates and other news (new vacancies, API updates, etc.).

Name	Commits	Status
Data tracker	Commits
Site links collector	Commits

✅ passing means the run went fine (with or without new changes)
❌ failing means a real error: a crawl failure, a timeout or a push conflict

To see new changes, look at the fresh commits in the data branch and at the alert channels, not at the badge color.

Subscribe to the channel with alerts to stay updated, or to the forum version, where every tracked area has its own topic. Alerts are also mirrored to a Discord channel. A copy of the Telegram websites and client resources is stored here.

What is tracked

Target	How	Data folder	Alert hashtags
Websites. telegram.org and its subdomains (core, corefork, blogfork, td, desktop, osx, instantview, promote and more), contest.com, web app betas, plus TL schemas of TDLib and tdesktop straight from their GitHub repos	A recursive link crawler collects the pages, then the content crawler snapshots them with all dynamic parts removed	`data/web`	`#web`
Web resources. CSS, JS, images, videos and other static files of the sites above	Text resources are stored as is, binary ones as SHA-256 hashes	`data/web_res`	`#web_res`
Translations. All keys and values of translations.telegram.org for every client platform	Paginated AJAX collection of each category, plus an aggregated `translation_keys.json`	`data/web_tr`	`#web_tr`
MTProto servers. Production and test DCs	Real MTProto sessions (via Pyrogram) call `help.GetConfig`, `help.GetCdnConfig`, `help.GetCountriesList`, `help.GetAppConfig`, `messages.GetAvailableReactions` and `help.GetPremiumPromo`. They also fetch a set of official sticker packs. Volatile fields (access hashes, file references, dates) are scrubbed	`data/server`, `data/server/test`	`#server`, `#test_server`
Android client. Public beta and stable APKs from telegram.org/dl	apktool decompiles resources only (`-s` flag), then `strings.xml` and `public.xml` are tracked	`data/client/android-beta`, `data/client/android-stable-dl`	`#android`, `#android_dl`
macOS beta client.	The Sparkle `versions.xml` feed points to the latest build. `Localizable.strings` is tracked as text, other resources as hashes, and `Assets.car` is unpacked with acextract	`data/client/macos-beta`	`#macos`
iOS beta client.	A decrypted IPA from decrypted-telegram-ios is unpacked. Binary plists are decoded to JSON with `ccl_bplist`, other files are tracked as hashes, `Assets.car` goes through acextract as well	`data/client/ios-beta`	`#ios`
Wallet mini app. walletbot.me	unwebpack_sourcemap.py restores the original `src` tree from webpack source maps	`data/mini_app/wallet`	`#mini_app`, `#wallet`
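
The "text as is, binary as SHA-256" rule from the table can be sketched roughly like this. It is a minimal illustration, not the project's actual code: the function name `snapshot_body` and the `TEXT_EXTENSIONS` set are assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath

# Hypothetical list of extensions treated as text; the real crawler may decide differently.
TEXT_EXTENSIONS = {'.css', '.js', '.json', '.svg', '.txt', '.xml', '.html'}


def snapshot_body(filename: str, content: bytes) -> str:
    """Return what would be written to the data folder for one downloaded resource."""
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        # Text stays readable, so a git diff shows the actual change.
        return content.decode('utf-8', errors='replace')
    # Binary files are reduced to a 64-character hex digest to keep the repo small.
    return hashlib.sha256(content).hexdigest()
```

With this approach a changed image still produces a one-line diff (the digest changes), while the image itself never lands in the repository.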

How it works

Link crawling runs as often as possible. It starts from the home page of telegram.org, detects relative and absolute sub links and recursively repeats the operation. There is also a set of HIDDEN_URLS to help the crawler find pages that nothing links to, and ADDITIONAL_URLS for external files like TL schemas. Exceptions are managed by a system of rules. The result is three lists of unique links: tracked_links.txt for pages, tracked_res_links.txt for static resources and tracked_tr_links.txt for translations. A safety check aborts the run if a list suddenly shrinks too much, so a temporary Telegram outage cannot wipe the link base.
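
The safety check could look something like the sketch below. The function name and the 50% threshold are assumptions for illustration only; the real limit lives in the crawler script.

```python
def list_shrunk_too_much(old_count: int, new_count: int, max_drop: float = 0.5) -> bool:
    """Return True when the new list lost more than `max_drop` of the old one.

    `max_drop` is a hypothetical threshold, not the value used by the project.
    """
    if old_count == 0:
        # Nothing to compare against on the very first run.
        return False
    return (old_count - new_count) / old_count > max_drop


# Example: abort instead of overwriting a healthy list with a broken one.
if list_shrunk_too_much(old_count=1000, new_count=400):
    print('Link list shrank too much, refusing to commit')
```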
Content crawling is launched as often as possible and uses the lists collected by the link crawler. It fetches every link and builds a tree of subfolders and files. All dynamic content is removed from files (page generation timestamps, API hashes, nonces, passport SSIDs, Sparkle signatures, APK tokens, TON rate and so on), so a git diff only ever shows real changes. Binary files are stored as SHA-256 hashes to keep the repo small. The same script also runs all the client, server and mini app trackers from the table above. The MODE environment variable selects what to run: all, web, web_res, web_tr, server, client or mini_app.
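
Removing dynamic content usually comes down to a list of regex substitutions applied before a file is written. The sketch below shows the idea; the three patterns and the name `scrub` are made up for the example and are not the project's real patterns.

```python
import re

# Hypothetical patterns: each pair is (what to find, what to put instead).
SCRUB_PATTERNS = [
    (re.compile(r'<!-- page generated in [\d.]+ms -->'), ''),
    (re.compile(r'nonce="[^"]*"'), 'nonce=""'),
    (re.compile(r'\?hash=[0-9a-f]+'), '?hash='),
]


def scrub(text: str) -> str:
    """Strip values that change on every request so that diffs stay meaningful."""
    for pattern, replacement in SCRUB_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


first = scrub('<script nonce="abc123">x</script><!-- page generated in 12.5ms -->')
second = scrub('<script nonce="zzz999">x</script><!-- page generated in 3.1ms -->')
print(first == second)  # both snapshots are identical after scrubbing
```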
Everything runs on GitHub Actions, so no servers are needed. You can simply fork this repository and run your own tracker. The content workflow is a matrix of six jobs, one per mode, with macOS runners where Apple tooling is needed. Each job commits its own folder to the data branch, so parallel jobs never conflict. When there is nothing new, the job simply has nothing to commit and finishes successfully. The link crawler also has a safety net: if the list of links suddenly shrinks too much, the run fails on purpose instead of committing a broken link base.
Sending alerts. A push to the data branch triggers an alert workflow (it lives on the data branch too). The script fetches the commit diff from the GitHub API, converts changed paths into hashtags and posts a summary to the @tgcrawl channel, to the matching topics of the @tfcrawl forum and to Discord.
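
Converting changed paths into hashtags can be sketched as a longest-prefix lookup. The mapping below is assembled from the "What is tracked" table; the exact structure, the function name and the sample file paths are assumptions, and the alert script may do it differently.

```python
# Folder prefix -> hashtags, taken from the table above (structure is hypothetical).
HASHTAGS_BY_PREFIX = {
    'data/web/': ['#web'],
    'data/web_res/': ['#web_res'],
    'data/web_tr/': ['#web_tr'],
    'data/server/': ['#server'],
    'data/server/test/': ['#test_server'],
    'data/client/android-beta/': ['#android'],
    'data/client/android-stable-dl/': ['#android_dl'],
    'data/client/macos-beta/': ['#macos'],
    'data/client/ios-beta/': ['#ios'],
    'data/mini_app/wallet/': ['#mini_app', '#wallet'],
}


def hashtags_for(changed_paths: list[str]) -> list[str]:
    """Collect the hashtags for every changed file, preferring the longest prefix."""
    tags = set()
    for path in changed_paths:
        matches = [prefix for prefix in HASHTAGS_BY_PREFIX if path.startswith(prefix)]
        if matches:
            tags.update(HASHTAGS_BY_PREFIX[max(matches, key=len)])
    return sorted(tags)


# Sample paths are invented for the example.
print(hashtags_for(['data/web/telegram.org/blog.html', 'data/server/test/config.json']))
```

Picking the longest prefix keeps `data/server/test` files out of the plain `#server` bucket.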

FAQ

Q: How often is "as often as possible"?

A: TL;DR: the content update action runs roughly every 10 minutes.

Q: Why are there two separate crawl scripts instead of one?

A: Because the original idea was to update the tracked links once an hour, and separate scripts and workflows were more convenient for that. After the Telegram 7.7 update, I realized that finding new blog posts that slowly was a bad idea.

Q: Why does the alert script have a while loop?

A: Because the GitHub API doesn't return commit information immediately after a push to the repository. The script waits for it to appear...
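
A waiting loop of this kind can be sketched as follows. This version is bounded by a number of attempts instead of an open `while` loop, and `fetch_commit` is a hypothetical callable standing in for the real GitHub API request, so treat it as an illustration of the idea rather than the script's code.

```python
import time
from typing import Callable, Optional


def wait_for_commit(
    fetch_commit: Callable[[], Optional[dict]],
    attempts: int = 5,
    delay: float = 1.0,
) -> dict:
    """Call `fetch_commit` until it returns data or the attempts run out."""
    for _ in range(attempts):
        commit = fetch_commit()
        if commit is not None:
            return commit
        # The API has not caught up with the push yet, wait and retry.
        time.sleep(delay)
    raise TimeoutError(f'commit did not appear after {attempts} attempts')


# Simulated API: the commit becomes visible only on the third request.
responses = iter([None, None, {'sha': 'abc'}])
result = wait_for_commit(lambda: next(responses), delay=0)
print(result)
```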

Q: Why are you using a GitHub Personal Access Token in the actions/checkout step?

A: So pushes made by the workflow can trigger other workflows. More info: "Action does not trigger another on push tag action".

Q: Why are you using a GitHub PAT in make_and_send_alert.py?

A: To increase the GitHub API rate limits.

Q: Why are you decompiling the .apk file on each run?

A: Because it doesn't take much time. Only resources are decompiled (the -s flag of apktool disables disassembly of dex files). Writing a check that skips decompilation based on the hash of the apk file would cost more time than it saves.

Example of link crawler rules configuration

Every rule is a regex. Allow rules have higher priority than deny ones, and an empty string matches any URL. For example, this keeps only the root, the first level pages and the English categories of the translations platform:

CRAWL_RULES = {
    'translations.telegram.org': {
        'allow': {
            r'^[^/]*$',  # root
            r'org/[^/]*/$',  # 1 lvl sub
            r'/en/[a-z_]+/$'  # 1 lvl after /en/
        },
        'deny': {
            '',  # all
        }
    },
}

The current configuration always lives in make_tracked_links_list.py: see CRAWL_RULES for the rules and HIDDEN_URLS for the manually added links.
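
To see how the priorities play out, here is a small sketch that evaluates the example rules above. The `should_crawl` helper is hypothetical, and it assumes that URLs are handled without a scheme (for example `translations.telegram.org/en/`) and that hosts without rules are always crawled.

```python
import re

CRAWL_RULES = {
    'translations.telegram.org': {
        'allow': {
            r'^[^/]*$',  # root
            r'org/[^/]*/$',  # 1 lvl sub
            r'/en/[a-z_]+/$',  # 1 lvl after /en/
        },
        'deny': {
            '',  # all
        },
    },
}


def should_crawl(url: str) -> bool:
    """Allow rules win over deny rules; a host without rules is crawled (assumption)."""
    host = url.split('/', 1)[0]
    rules = CRAWL_RULES.get(host)
    if rules is None:
        return True
    if any(re.search(pattern, url) for pattern in rules['allow']):
        return True
    if any(re.search(pattern, url) for pattern in rules['deny']):
        return False
    return True


for url in (
    'translations.telegram.org',
    'translations.telegram.org/en/android/',
    'translations.telegram.org/de/android/',
):
    print(url, should_crawl(url))
```

The German category is rejected because no allow rule matches it and the empty deny pattern matches everything.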

License

Licensed under the MIT License.