mirror of
https://github.com/MarshalX/telegram-crawler.git
synced 2025-03-15 13:22:43 +01:00
cleanup readme file; disable tests of link crawler.
This commit is contained in:
parent
72b7efcde5
commit
ae43e9e442
3 changed files with 27 additions and 28 deletions
@@ -2,7 +2,7 @@ name: Generate or update list of tracked links
 on:
   schedule:
-    - cron: '* * * * *'
+    - cron: '0 * * * *'
   push:
     branches:
       - main
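The change above replaces a schedule that fires every minute with one that fires only at minute 0 of each hour. As a quick illustration of the difference, here is a toy matcher for the minute field of a cron expression (a hypothetical sketch for explanation only, not GitHub's scheduler):

```python
def minute_matches(cron_expr, minute):
    """Toy check of a cron expression's first (minute) field.

    Supports only '*' and comma-separated literal minutes; real cron
    syntax (ranges, steps) is richer.
    """
    field = cron_expr.split()[0]
    if field == '*':
        return True  # '*' matches every minute
    return minute in {int(part) for part in field.split(',')}
```

Under `'* * * * *'` every minute matches, so the workflow would be queued 60 times an hour; `'0 * * * *'` matches only at the top of each hour.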
README.md (51 changed lines)
@@ -1,4 +1,4 @@
-## Telegram Web Crawler
+## 🕷 Telegram Web Crawler
 
 This project is developed to automatically detect changes made
 to the official Telegram sites. This is necessary for anticipating
@@ -10,52 +10,51 @@ future updates and other things (new vacancies, API updates, etc).
 | Site updates tracker | [Commits](https://github.com/MarshalX/telegram-crawler/commits/data) |  |
 | Site links tracker | [Commits](https://github.com/MarshalX/telegram-crawler/commits/main/tracked_links.txt) |  |
 
-* passing – new changes
-* failing – no changes
+* ✅ passing – new changes
+* ❌ failing – no changes
 
 You should subscribe to the **[channel with alerts](https://t.me/tgcrawl)**
 to stay updated, or watch (enable notifications) this repository with the "All Activity" setting.
 A copy of the Telegram websites is stored **[here](https://github.com/MarshalX/telegram-crawler/tree/data/data)**.
 
 
-### How it should work in dreams
+### How it works
 
 1. [Link crawling](make_tracked_links_list.py) runs once an hour.
    It starts crawling from the home page of the site,
    detects relative and absolute sub-links, and recursively repeats the operation.
    It writes a list of unique links for future content comparison.
    Additionally, links can be added by hand to help the script
-   find more hidden links (links to which no one refers).
+   find more hidden links (links to which no one refers). To manage exceptions,
+   there is a [system of rules](#example-of-link-crawler-rules-configuration)
+   for the link crawler.
 
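The link-detection step described above can be sketched roughly as follows. The function name, the naive `href` regex, and the `telegram.org` domain filter are illustrative assumptions, not the actual logic of `make_tracked_links_list.py`:

```python
import re
from urllib.parse import urljoin, urlparse

def extract_links(page_url, html):
    """Collect absolute sub-links found on a page (simplified sketch)."""
    links = set()
    for href in re.findall(r'href=["\']([^"\']+)["\']', html):
        absolute = urljoin(page_url, href)      # resolves relative links too
        if urlparse(absolute).netloc.endswith('telegram.org'):
            links.add(absolute.split('#')[0])   # drop fragments before deduping
    return links
```

Each newly discovered link would then be crawled in turn, and the final unique set written out for future content comparison.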
-2. [Content crawling](make_files_tree.py) is launched as often as
-   possible and uses the existing list of links collected in step 1.
+2. [Content crawling](make_files_tree.py) is launched **as often as
+   possible** and uses the existing list of links collected in step 1.
    Going through the link base, it fetches the contents and builds a tree
    of subfolders and files, removing all dynamic content from the files.
 
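A minimal sketch of the subfolder/file mapping and dynamic-content stripping described in step 2 (the hash-parameter regex and the file layout here are assumptions for illustration, not the real rules in `make_files_tree.py`):

```python
import re
from pathlib import Path
from urllib.parse import urlparse

def save_page(base_dir, url, html):
    """Store one fetched page under base_dir/<host>/<path>.html."""
    # strip content that changes on every fetch, e.g. cache-busting hash params
    html = re.sub(r'\?hash=[0-9a-f]+', '', html)
    parsed = urlparse(url)
    relative = parsed.path.strip('/') or 'index'
    target = Path(base_dir) / parsed.netloc / f'{relative}.html'
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(html)
    return target
```

Because volatile fragments are removed before writing, a page that has not really changed produces a byte-identical file and therefore no Git diff.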
-3. Works without own servers. Used [GitHub Actions](.github/workflows/).
-   All file changes are tracked by the GIT and are beautifully
-   displayed on the GitHub. Github Actions should be built
-   correctly only if there are changes on the Telegram website.
-   Otherwise, the workflow should fail.
-   If build was successful, we can send notifications to
-   Telegram channel and so on.
+3. Uses [GitHub Actions](.github/workflows/) and works without its own servers.
+   You can just fork this repository and run the tracker system yourself.
+   Workflows launch the scripts and commit the changes. All file changes are
+   tracked by Git and beautifully displayed on GitHub. The GitHub Actions
+   build should succeed only if there are changes on the Telegram website;
+   otherwise, the workflow should fail. If the build was successful, we can
+   send notifications to a Telegram channel and so on.
 
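The "fail when nothing changed" behaviour can be approximated by asking Git whether the working tree is dirty after a crawl. This is an illustrative sketch, not the project's actual workflow wiring:

```python
import subprocess

def repo_has_changes(repo_dir='.'):
    """True if `git status --porcelain` reports any modified or new files."""
    result = subprocess.run(
        ['git', 'status', '--porcelain'],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())
```

A workflow step could call this and exit non-zero when it returns `False`, so the run is marked failed and no notification is sent.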
-### The real world
+### FAQ
 
 **Q:** How often is "**as often as possible**"?
 
 **A:** TL;DR: the content update action runs every ~10 minutes. More info:
 - [Scheduled actions cannot be run more than once every 5 minutes.](https://github.blog/changelog/2019-11-01-github-actions-scheduled-jobs-maximum-frequency-is-changing/)
 - [GitHub Actions workflow not triggering at scheduled time](https://upptime.js.org/blog/2021/01/22/github-actions-schedule-not-working/). TL;DR: actions run every ~10 minutes.
 - GitHub Actions freeze for 5 minutes when updating the list of links.
   Locally, this takes less than 10 seconds for several hundred requests.
   **This is not really a big problem. The list of links is rarely updated.**
 - When updating links, ~one link may be lost, and the next list generation
   brings it back. This leads to a successful workflow run even when there
   were no server changes. Most likely this is a bug in my script that can
   be fixed. As a last resort, compare the old version of the link base
   with the new one.
-- [GitHub Actions workflow not triggering at scheduled time](https://upptime.js.org/blog/2021/01/22/github-actions-schedule-not-working/).
 
 ### TODO list
 
 - bug fixes;
 - alert system;
 - add storing history of content using hashes;
 - add storing hashes of image, svg, and video files.
 
 ### Example of link crawler rules configuration
 
@@ -64,7 +63,7 @@ Copy of Telegram websites stored **[here](https://github.com/MarshalX/telegram-crawler/tree/data/data)**.
 CRAWL_RULES = {
     # every rule is regex
     # empty string means match any url
-    # allow rules with high priority than deny
+    # allow rules with higher priority than deny
     'translations.telegram.org': {
         'allow': {
             r'^[^/]*$',  # root
@@ -29,7 +29,7 @@ BASE_URL_REGEX = r'telegram.org'
 CRAWL_RULES = {
     # every rule is regex
     # empty string means match any url
-    # allow rules with high priority than deny
+    # allow rules with higher priority than deny
     'translations.telegram.org': {
         'allow': {
             r'^[^/]*$',  # root
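One way the allow/deny rules above could be evaluated (a hypothetical sketch: the real deny set is truncated in the diff, so the empty-string deny pattern here is inferred from the comments, and `is_crawlable` is an invented helper name):

```python
import re

CRAWL_RULES = {
    'translations.telegram.org': {
        'allow': {
            r'^[^/]*$',  # root
        },
        'deny': {
            '',  # empty string matches any url
        },
    },
}

def is_crawlable(netloc, path):
    """Apply per-host regex rules; allow rules win over deny rules."""
    rules = CRAWL_RULES.get(netloc)
    if rules is None:
        return True  # no rules for this host: crawl everything
    if any(re.search(pattern, path) for pattern in rules.get('allow', ())):
        return True  # allow has higher priority than deny
    return not any(re.search(pattern, path) for pattern in rules.get('deny', ()))
```

With this rule set, the root of `translations.telegram.org` stays crawlable while deeper translation pages are skipped, and hosts without rules are crawled unconditionally.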