Clean up README file; disable tests of link crawler.

Il'ya (Marshal) 2021-04-25 14:13:44 +02:00
parent 72b7efcde5
commit ae43e9e442
3 changed files with 27 additions and 28 deletions


@@ -2,7 +2,7 @@ name: Generate or update list of tracked links
on:
  schedule:
    - cron: '* * * * *'
    - cron: '0 * * * *'
  push:
    branches:
      - main
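For context, this change moves the schedule from `* * * * *` (every minute, far more often than GitHub Actions actually allows) to `0 * * * *` (minute 0 of every hour). A minimal sketch to preview the new schedule, assuming the third-party `croniter` package (used here only for illustration):

```python
# Preview the next few runs of the new hourly schedule.
from datetime import datetime
from croniter import croniter  # assumed installed; not a project dependency

schedule = croniter('0 * * * *', datetime(2021, 4, 25, 14, 13))
for _ in range(3):
    print(schedule.get_next(datetime))  # 15:00, 16:00, 17:00 on 2021-04-25
```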


@@ -1,4 +1,4 @@
## Telegram Web Crawler
## 🕷 Telegram Web Crawler
This project automatically detects changes made
to the official Telegram sites. This is necessary for anticipating
@@ -10,52 +10,51 @@ future updates and other things (new vacancies, API updates, etc).
| Site updates tracker| [Commits](https://github.com/MarshalX/telegram-crawler/commits/data) | ![Fetch new content of tracked links to files](https://github.com/MarshalX/telegram-crawler/actions/workflows/make_files_tree.yml/badge.svg?branch=main) |
| Site links tracker | [Commits](https://github.com/MarshalX/telegram-crawler/commits/main/tracked_links.txt) | ![Generate or update list of tracked links](https://github.com/MarshalX/telegram-crawler/actions/workflows/make_tracked_links_list.yml/badge.svg?branch=main) |
* passing - new changes
* failing - no changes
You should subscribe to the **[channel with alerts](https://t.me/tgcrawl)**
to stay updated, or watch this repository (enable notifications) with the "All Activity" setting.
A copy of the Telegram websites is stored **[here](https://github.com/MarshalX/telegram-crawler/tree/data/data)**.
![GitHub pretty diff](https://i.imgur.com/BK8UAju.png)
### How it should work in dreams
### How it works
1. [Link crawling](make_tracked_links_list.py) runs once an hour.
Starts crawling from the home page of the site.
Detects relative and absolute sublinks and recursively repeats the operation.
Writes a list of unique links for future content comparison.
Additionally, there is the ability to add links by hand to help the script
find more hidden links (links that nothing else refers to).
find more hidden links (links that nothing else refers to). To manage exceptions,
there is a [system of rules](#example-of-link-crawler-rules-configuration)
for the link crawler.
2. [Content crawling](make_files_tree.py) is launched as often as
possible and uses the existing list of links collected in step 1.
2. [Content crawling](make_files_tree.py) is launched **as often as
possible** and uses the existing list of links collected in step 1.
Going through the linkbase, it fetches page contents and builds a tree
of subfolders and files. Removes all dynamic content from the files
(a condensed sketch of the whole cycle follows this list).
3. Works without own servers. Uses [GitHub Actions](.github/workflows/).
All file changes are tracked by Git and are beautifully
displayed on GitHub. The GitHub Actions build should succeed
only if there are changes on the Telegram websites.
Otherwise, the workflow should fail.
If the build succeeds, we can send notifications to a
Telegram channel and so on.
3. Uses [GitHub Actions](.github/workflows/). Works without own servers.
You can just fork this repository and run your own tracker system.
Workflows launch the scripts and commit the changes. All file changes are tracked
by Git and beautifully displayed on GitHub. The GitHub Actions build
should succeed only if there are changes on the Telegram websites.
Otherwise, the workflow should fail. If the build succeeds, we can
send notifications to a Telegram channel and so on.
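As a rough illustration of steps 1-3, here is a condensed, hypothetical sketch. All names below are invented for illustration only; the real logic lives in [make_tracked_links_list.py](make_tracked_links_list.py), [make_files_tree.py](make_files_tree.py), and the workflow files:

```python
# Hypothetical sketch of the crawl/fetch/commit cycle; not the project's code.
import re
import subprocess
import urllib.request
from pathlib import Path
from urllib.parse import urljoin, urlparse

HREF_RE = re.compile(r'href=["\'](.+?)["\']')

def crawl(url, seen):
    """Step 1: recursively collect unique sublinks, relative or absolute."""
    html = urllib.request.urlopen(url).read().decode('utf-8', 'ignore')
    for href in HREF_RE.findall(html):
        link = urljoin(url, href)  # resolves relative links against the page
        if urlparse(link).netloc.endswith('telegram.org') and link not in seen:
            seen.add(link)
            crawl(link, seen)
    return seen

def fetch_tree(links, out_dir='data'):
    """Step 2: mirror every tracked link into a tree of subfolders and files."""
    for link in sorted(links):
        parsed = urlparse(link)
        path = Path(out_dir, parsed.netloc, parsed.path.lstrip('/') or 'index.html')
        path.parent.mkdir(parents=True, exist_ok=True)
        page = urllib.request.urlopen(link).read().decode('utf-8', 'ignore')
        path.write_text(page)  # the real script also strips dynamic content first

def assert_changes():
    """Step 3: fail the run when nothing changed, so green builds mean updates."""
    status = subprocess.run(['git', 'status', '--porcelain'],
                            capture_output=True, text=True).stdout.strip()
    if not status:
        raise SystemExit('No changes on the Telegram websites; failing the build.')
```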
### The real world
### FAQ
**Q:** How often is "**as often as possible**"?
**A:** TL;DR: the content update action runs every ~10 minutes. More info:
- [Scheduled actions cannot be run more than once every 5 minutes.](https://github.blog/changelog/2019-11-01-github-actions-scheduled-jobs-maximum-frequency-is-changing/)
- [GitHub Actions workflow not triggering at scheduled time](https://upptime.js.org/blog/2021/01/22/github-actions-schedule-not-working/). TL;DR: actions run every ~10 minutes.
- GitHub Actions freezes for ~5 minutes when updating the list of links.
Locally, this takes less than 10 seconds for several hundred requests.
**This is not really a big problem: the list of links is rarely updated.**
- When updating links, roughly one link may be lost and then returned by the
next list generation. This leads to a successful workflow run even
when there were no server changes. Most likely this is a
bug in my script that can be fixed. As a last resort, compare
the old version of the linkbase with the new one.
### TODO list
- bug fixes;
- alert system;
- add storing history of content using hashes;
- add storing hashes of images, SVG, and video.
### Example of link crawler rules configuration
@@ -64,7 +63,7 @@ Copy of Telegram websites stored **[here](https://github.com/MarshalX/telegram-c
CRAWL_RULES = {
    # every rule is regex
    # empty string means match any url
    # allow rules with high priority than deny
    # allow rules have higher priority than deny
    'translations.telegram.org': {
        'allow': {
            r'^[^/]*$',  # root


@@ -29,7 +29,7 @@ BASE_URL_REGEX = r'telegram.org'
CRAWL_RULES = {
    # every rule is regex
    # empty string means match any url
    # allow rules with high priority than deny
    # allow rules have higher priority than deny
    'translations.telegram.org': {
        'allow': {
            r'^[^/]*$',  # root
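The snippet is cut off here by the diff, but as a hedged illustration of how rules in this shape could be evaluated: the helper below is invented for illustration, and the track-by-default fallback is an assumption, not the project's documented behavior.

```python
import re

def is_tracked(hostname, path, rules):
    """Illustrative only: check a URL path against per-host allow/deny regexes."""
    host_rules = rules.get(hostname, {})
    if any(re.match(p, path) for p in host_rules.get('allow', set())):
        return True   # allow rules have higher priority than deny rules
    if any(re.match(p, path) for p in host_rules.get('deny', set())):
        return False
    return True       # assumption: unmatched paths are tracked by default

# e.g. is_tracked('translations.telegram.org', '', CRAWL_RULES) -> True (root rule)
```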