run link crawler as often as possible; update readme; fix workflows

This commit is contained in:
Il'ya (Marshal) 2021-04-26 22:27:57 +02:00
parent f3efcdafac
commit 0afa3e61e6
5 changed files with 29 additions and 10 deletions

View file

@ -24,6 +24,7 @@ jobs:
if: ${{ github.event.head_commit.author.name == 'GitHub Action' }}
env:
COMMIT_SHA: ${{ github.sha }}
GITHUB_PAT: ${{ secrets.PAT_FOR_ALERTS }}
TELEGRAM_BOT_TOKEN: ${{ secrets.TELEGRAM_BOT_TOKEN }}
run: |
pip install -r requirements.txt

View file

@ -2,8 +2,9 @@ name: Fetch new content of tracked links to files
on:
schedule:
- cron: '* * * * * '
- cron: '* * * * *'
push:
# trigger on updated linkbase
branches:
- main

View file

@ -2,10 +2,7 @@ name: Generate or update list of tracked links
on:
schedule:
- cron: '0 * * * *'
push:
branches:
- main
- cron: '* * * * *'
jobs:
make_tracked_links_file:

View file

@ -20,7 +20,7 @@ Copy of Telegram websites stored **[here](https://github.com/MarshalX/telegram-c
### How it works
1. [Link crawling](make_tracked_links_list.py) runs once an hour.
1. [Link crawling](make_tracked_links_list.py) runs **as often as possible**.
Starts crawling from the home page of the site.
Detects relative and absolute sub links and recursively repeats the operation.
Writes a list of unique links for future content comparison.
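   A minimal sketch of that crawl step, assuming an aiohttp-based fetcher (the repository already depends on aiohttp); the start URL, parsing logic and function names below are illustrative, not the actual [make_tracked_links_list.py](make_tracked_links_list.py) implementation:
   ```python
   import asyncio
   from urllib.parse import urljoin, urlparse

   import aiohttp

   START_URL = 'https://core.telegram.org/'  # illustrative start page, not the real crawl list


   async def crawl(url, session, visited):
       if url in visited:  # keep the collected links unique
           return
       visited.add(url)

       async with session.get(url) as response:
           if 'text/html' not in response.headers.get('Content-Type', ''):
               return
           html = await response.text()

       # naive href extraction; the real script applies its own parsing and filtering rules
       for chunk in html.split('href="')[1:]:
           link = chunk.split('"', 1)[0]
           absolute = urljoin(url, link)  # resolves relative and absolute sub links alike
           if urlparse(absolute).netloc == urlparse(START_URL).netloc:
               await crawl(absolute, session, visited)


   async def main():
       visited = set()
       async with aiohttp.ClientSession() as session:
           await crawl(START_URL, session, visited)
       # write the list of unique links for future content comparison
       with open('tracked_links.txt', 'w') as f:
           f.write('\n'.join(sorted(visited)))


   asyncio.run(main())
   ```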
@ -44,12 +44,32 @@ Copy of Telegram websites stored **[here](https://github.com/MarshalX/telegram-c
### FAQ
**Q:** How many is "**as often as possible**"?
**Q:** How often is "**as often as possible**"?
**A:** TL;DR: the content update action runs every ~10 minutes. More info:
- [Scheduled actions cannot be run more than once every 5 minutes.](https://github.blog/changelog/2019-11-01-github-actions-scheduled-jobs-maximum-frequency-is-changing/)
- [GitHub Actions workflow not triggering at scheduled time](https://upptime.js.org/blog/2021/01/22/github-actions-schedule-not-working/).
**Q:** Why are there two separate crawl scripts instead of one?
**A:** Because the original idea was to update the tracked links once an hour.
It was convenient to use separate scripts and workflows.
After the Telegram 7.7 update, I realised that discovering new blog posts that slowly is a bad idea.
**Q:** Why does the script for sending alerts have a while loop?
**A:** Because the GitHub API doesn't return information about a commit immediately
after a push to the repository. Therefore, the script waits for the information to appear...
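A minimal sketch of such a waiting loop; the endpoint constant, ref and retry delay are assumptions, not the actual code from [make_and_send_alert.py](make_and_send_alert.py):
```python
import asyncio

import aiohttp

# illustrative endpoint; the real script builds its URL from its own constants
COMMIT_URL = 'https://api.github.com/repos/{repo}/commits/{ref}'


async def wait_for_commit(session, repo, ref, delay=5):
    # commit data is not available immediately after a push, so poll until it appears
    while True:
        async with session.get(COMMIT_URL.format(repo=repo, ref=ref)) as response:
            if response.status == 200:
                return await response.json()
        await asyncio.sleep(delay)


async def main():
    async with aiohttp.ClientSession() as session:
        commit = await wait_for_commit(session, 'MarshalX/telegram-crawler', 'main')
        print(commit['commit']['message'])


asyncio.run(main())
```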
**Q:** Why are you using a GitHub Personal Access Token in the actions/checkout workflow step?
**A:** To be able to trigger other workflows from the on-push trigger. More info:
- [Action does not trigger another on push tag action ](https://github.community/t/action-does-not-trigger-another-on-push-tag-action/17148)
**Q:** Why are you using a GitHub PAT in [make_and_send_alert.py](make_and_send_alert.py)?
**A:** To increase the GitHub API rate limits.
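As a rough illustration of the difference (unauthenticated requests are limited to 60 per hour, token-authenticated ones to 5,000); this snippet is a sketch, not the repository's code:
```python
import asyncio
import os

import aiohttp

GITHUB_PAT = os.environ['GITHUB_PAT']


async def main():
    headers = {'Authorization': f'token {GITHUB_PAT}'}
    async with aiohttp.ClientSession() as session:
        # the /rate_limit endpoint itself does not count against the quota
        async with session.get('https://api.github.com/rate_limit', headers=headers) as response:
            data = await response.json()
    # 'limit' reports 5000 per hour with a PAT and 60 per hour without one
    print(data['resources']['core']['limit'])


asyncio.run(main())
```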
### TODO list
- add storing history of content using hashes;

View file

@ -7,7 +7,7 @@ import aiohttp
COMMIT_SHA = os.environ['COMMIT_SHA']
TELEGRAM_BOT_TOKEN = os.environ['TELEGRAM_BOT_TOKEN']
GITHUB_PWA = os.environ['GITHUB_PWA']
GITHUB_PAT = os.environ['GITHUB_PAT']
REPOSITORY = os.environ.get('REPOSITORY', 'MarshalX/telegram-crawler')
CHAT_ID = os.environ.get('CHAT_ID', '@tgcrawl')
@ -58,7 +58,7 @@ async def main():
session=session,
url=f'{BASE_GITHUB_API}{GITHUB_LAST_COMMITS}'.format(repo=REPOSITORY, sha=COMMIT_SHA),
headers={
'Authorization': f'token {GITHUB_PWA}'
'Authorization': f'token {GITHUB_PAT}'
}
)