Scrape my site, get a free surprise
Fuck techbros
Why would you want to fuck techbros?
So you’ve published your latest pithy take on something to the Fediverse.
And a supremely arrogant and entitled white guy comes along (it’s always a white guy) and scrapes your aphorisms for their latest widget.
It’s also probably some AI bollocks.
The something doesn’t matter, suffice to say it won’t work without your output being its input.
How to fuck techbros with tech
Yes, stuff that you post on the Internet is generally viewable (and therefore scrapeable) by all.
Yes, any engineer (read: entitled white guy) with Beautiful Soup and pandas can scrape a Fedi timeline, but that is not the only way that your info can be grabbed.
Also, closing some of the ways that data is harvested boils down to a personal choice:
You can be nice.
Or you can choose violence.
Act accordingly!
Being nice
Any feed can become an unwanted RSS feed
Every Fedi feed is accessible through an RSS link.
Simply append .rss to a <domain>/<@username> combination.
e.g. https://mastodon.social/@consummatetinkerer.rss
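To make the pattern concrete, here is a minimal Python sketch that builds the RSS link from a domain and a username. The helper name rss_url is made up for illustration; only the URL pattern comes from the text above.

```python
def rss_url(domain: str, username: str) -> str:
    """Build the public RSS link for a Fedi account (hypothetical helper)."""
    # Accept the username with or without its leading "@".
    handle = username.lstrip("@")
    return f"https://{domain}/@{handle}.rss"


print(rss_url("mastodon.social", "consummatetinkerer"))
```

No login, token, or follow request is involved; the link is just a string anyone can assemble.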
Surprisingly, this is a feature, not a bug.
So, to those people who set the “Require follow requests” option in preferences, thinking that their accounts are unviewable by non-followers: sorry, you were misled.
The fix: Send the RSS requests back to the instance homepage
# Send any .rss request to the homepage
location ~* \.rss$ {
    return 301 $scheme://social.consummatetinkerer.net;
}
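If you want to see which paths that location block would catch before touching a live server, the pattern can be tried out in Python. This is only an approximation: nginx uses PCRE and matches against the request path (not the query string), and the helper name is_rss_request is an assumption of this sketch.

```python
import re

# Python stand-in for nginx's case-insensitive `~* \.rss$` match (an approximation).
RSS_LOCATION = re.compile(r"\.rss$", re.IGNORECASE)


def is_rss_request(path: str) -> bool:
    """True when the path would be caught by the .rss location block."""
    return RSS_LOCATION.search(path) is not None


for path in ("/@consummatetinkerer.rss", "/@consummatetinkerer.RSS",
             "/@consummatetinkerer", "/rss.html"):
    print(path, is_rss_request(path))
```

Only paths that end in .rss (in any letter case) are redirected; ordinary profile pages are left alone.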
The /api/v1/instance endpoint leaks (or rather, broadcasts) personal info
This one really annoys me.
If you append /api/v1/instance to the end of a Mastodon URL, you get the instance info.
Fine.
But part of the info returned is an Admin account name, presented as the ‘contact_account’.
(Which is usually the first account created when setting up the software.)
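A short Python sketch of how little effort it takes to lift that name out of the response. The JSON body here is made up and heavily trimmed (a real instance returns many more fields), and contact_username is a hypothetical helper.

```python
import json

# Made-up, heavily trimmed response body, for illustration only.
SAMPLE = """
{
  "uri": "social.example.com",
  "title": "Example Social",
  "contact_account": {"id": "1", "username": "admin", "acct": "admin"}
}
"""


def contact_username(body: str):
    """Return the username exposed as contact_account, or None if absent."""
    contact = json.loads(body).get("contact_account") or {}
    return contact.get("username")


print(contact_username(SAMPLE))
```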
The Mastodon helper code in /packages/backend/src/server/api/mastodon/helpers/misc.ts seems to me to be saying: “find the first Admin account that is not deleted or suspended and present that User as a contact”.
This is Internet Security 101.
Have separate Users and Admins.
Don’t broadcast any aspect of the Admin account login details.
export class MiscHelpers {
    public static async getInstance(ctx: MastoContext): Promise<MastodonEntity.Instance> {
        const userCount = Users.count({ where: { host: IsNull() } });
        const noteCount = Notes.count({ where: { userHost: IsNull() } });
        const instanceCount = Instances.count({ cache: 3600000 });
        const contact = await Users.findOne({
            where: {
                host: IsNull(),
                isAdmin: true,
                isDeleted: false,
                isSuspended: false,
            },
            order: { id: "ASC" },
        })
        // (excerpt ends here; the rest of the method is not shown)
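Read that way, the query boils down to something like the Python sketch below. It is a model of the logic as described above, not the project's code: the User record and pick_contact are made-up names, and a host of None stands in for "local account".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    id: str
    host: Optional[str]  # None means a local account (assumption of this sketch)
    is_admin: bool = False
    is_deleted: bool = False
    is_suspended: bool = False


def pick_contact(users):
    """Lowest-id local admin that is neither deleted nor suspended, else None."""
    candidates = [
        u for u in users
        if u.host is None and u.is_admin and not u.is_deleted and not u.is_suspended
    ]
    return min(candidates, key=lambda u: u.id, default=None)


users = [
    User(id="a1", host=None, is_admin=True),           # the first account created
    User(id="a2", host=None),                          # an ordinary local user
    User(id="a3", host="elsewhere.example", is_admin=True),
]
print(pick_contact(users).id)
```

With only one admin on the instance, that admin is always the one published.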
The fix: Reroute the traffic
In the same fashion that we reroute the RSS request, we send the traffic back to the instance homepage.
# Send the instance API URL to the homepage
location = /api/v1/instance {
    return 301 $scheme://social.example.com;
}
The /api/v1/peers endpoint leaks (or rather, broadcasts) instance info
The fix: Reroute the traffic once more
In the same fashion that we reroute the RSS and instance requests, we send the traffic back to the instance homepage.
# Send the peers API URL to the homepage
location = /api/v1/peers {
    return 301 $scheme://social.example.com;
}
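One way to confirm that the reroutes are in place is to request the endpoints without following redirects and look at the raw answer. A rough Python sketch using only the standard library; probe and sent_home are hypothetical names, and social.example.com is the same placeholder used in the config.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the 301 itself can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def probe(url: str):
    """Return (status, Location header) for one GET, without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=10) as resp:
            return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as err:
        # An unfollowed 301 surfaces here as an HTTPError.
        return err.code, err.headers.get("Location")


def sent_home(status, location, homepage: str) -> bool:
    """True when the response is a 301 pointing at the homepage."""
    return status == 301 and (location or "").rstrip("/") == homepage.rstrip("/")


# Example, left commented out because it needs network access:
# home = "https://social.example.com"
# for path in ("/api/v1/instance", "/api/v1/peers"):
#     print(path, sent_home(*probe(home + path), home))
```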
Choosing violence
Make it someone else’s problem
So far the ‘fixes’ have been non-adversarial.
Your server redirects requests for one URL to a different URL, also on your server.
However, nothing prevents you from rerouting the traffic to a third party, say, an intelligence agency with letters in its name.
Here’s the /api/v1/peers example again, but this time rerouting the traffic to an external web address.
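# Send the peers API URL to an external address
location = /api/v1/peers {
    # ThreeLetterAgency.Address is a placeholder; the scheme is needed
    # so that clients treat the target as an external URL
    return 301 https://ThreeLetterAgency.Address;
}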
Make it their problem
But what if you wanted to disrupt the incoming request?
Maybe respond somehow?
null bombs
Rather than rerouting the traffic to a URL, why not send a null bomb?
I found the following somewhere on the Internets at some point in the distant past.
I have NOT tried it out.
First, make a bomb
Create a 42.gz file.
This is a compressed archive that requires an excessive amount of time, disk space, or memory to unpack.
Execute the following in a Terminal:
dd if=/dev/zero bs=1M count=102400 | gzip -c - > 42.gz
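To get a feel for why zeros make such a good payload without filling a disk, the same idea can be tried at toy scale. This Python sketch compresses 10 MiB of zeros in memory; make_zero_gzip is a made-up helper, and the size is an arbitrary choice that is nowhere near the 100 GiB the dd command above produces.

```python
import gzip
import io


def make_zero_gzip(size_mib: int) -> bytes:
    """Gzip `size_mib` MiB of zero bytes in memory, one chunk at a time."""
    buf = io.BytesIO()
    chunk = bytes(1024 * 1024)  # 1 MiB of zeros
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for _ in range(size_mib):
            gz.write(chunk)
    return buf.getvalue()


small = make_zero_gzip(10)
raw_size = 10 * 1024 * 1024
print(f"{raw_size} bytes of zeros -> {len(small)} bytes compressed")
```

The compressed result is a tiny fraction of the input, which is the whole trick: cheap to store and send, expensive to unpack.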
Then, deploy the bomb
42.gz in the example below is in the root folder of the web server (which we can change).
This nginx.conf code example would send the 42.gz file when a request is made to the /wp-login.php location.
# Any location you want to deny
location = /wp-login.php {
    # Make the bot think it's just transport compression
    add_header Content-Encoding gzip;
    try_files /42.gz =404;
    # Tell nginx not to blow its own foot off if the client doesn't support gzip
    gunzip off;
    # Lie to the client that the null bomb is HTML so it doesn't immediately reject it
    types { text/html gz; }
}
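For a sense of what happens on the receiving end: a client that honours Content-Encoding: gzip inflates the body before looking at it. The Python sketch below shows the kind of size cap a careful client would apply, which is exactly the guard the bomb counts on a lazy bot not having. exceeds_limit is a hypothetical helper, and the payload is a toy-sized stand-in for 42.gz.

```python
import gzip
import zlib


def exceeds_limit(payload: bytes, limit: int) -> bool:
    """Inflate a gzip body, stopping as soon as more than `limit` bytes come out."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = inflater.decompress(payload, limit + 1)
    return len(out) > limit


toy_bomb = gzip.compress(bytes(5 * 1024 * 1024))  # 5 MiB of zeros
print(len(toy_bomb), exceeds_limit(toy_bomb, 1024 * 1024))
```

A client without a cap like this keeps inflating until it runs out of memory, disk, or patience.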
tarpits
Tarpits are a tried and tested way to devour incoming SSH requests.
Endlessh is an SSH tarpit that very slowly sends an endless, random SSH banner. It keeps SSH clients locked up for hours or even days at a time. The purpose is to put your real SSH server on another port and then let the script kiddies get stuck in this tarpit instead of bothering a real server.
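The idea is small enough to sketch in Python with asyncio. This is not Endlessh itself, just a rough imitation of the behaviour described above: the port, the delay, and the names banner_line, tarpit, and main are all assumptions of this sketch. It relies on the fact that an SSH client keeps reading lines until it sees one starting with "SSH-", so lines of lowercase hex keep it waiting.

```python
import asyncio
import random


def banner_line() -> bytes:
    """One junk line; lowercase hex can never start with "SSH-"."""
    return b"%x\r\n" % random.getrandbits(32)


async def tarpit(reader, writer, delay: float = 10.0):
    """Drip junk lines at a connected client until it gives up."""
    try:
        while True:
            await asyncio.sleep(delay)
            writer.write(banner_line())
            await writer.drain()
    except (ConnectionResetError, BrokenPipeError):
        pass  # the client finally hung up
    finally:
        writer.close()


async def main(port: int = 2222):
    """Listen on a hypothetical port and trap whoever connects."""
    server = await asyncio.start_server(tarpit, "0.0.0.0", port)
    async with server:
        await server.serve_forever()


# To run it: asyncio.run(main())
```

Each trapped connection costs the server one sleeping coroutine and a few bytes every ten seconds.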
Wrap it all up and present it with a bow
What if you could:
- Identify the scrapers
- Reroute the traffic
- Make it someone else’s problem
- Deploy a null bomb
All at once?
Well, Hetzner have a ‘speed test’ file you can use for such a project.
https://fsn1-speed.hetzner.com
(Other files are also available.)
Maybe add something like the below to your nginx.conf file?
# Send bots a present
if ($http_user_agent ~* "AdsBot-Google|Amazonbot") {
    return 307 https://fsn1-speed.hetzner.com/10GB.bin;
}
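Before letting a rule like that loose, it is worth checking which user agents it would actually catch. A Python approximation of the same case-insensitive match; the sample user agent strings are illustrative rather than captured from real traffic, and is_listed_bot is a made-up name.

```python
import re

# Same alternation as the nginx rule, matched case-insensitively like `~*`.
BOT_PATTERN = re.compile(r"AdsBot-Google|Amazonbot", re.IGNORECASE)


def is_listed_bot(user_agent: str) -> bool:
    """True when the User-Agent header would trigger the redirect."""
    return BOT_PATTERN.search(user_agent) is not None


samples = [
    "Mozilla/5.0 (compatible; Amazonbot/0.1)",            # illustrative
    "AdsBot-Google (+http://www.google.com/adsbot.html)",  # illustrative
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",       # an ordinary browser
]
for ua in samples:
    print(is_listed_bot(ua), ua)
```

The match relies entirely on the bot announcing itself honestly; a scraper that fakes a browser user agent sails straight past it.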
Resources
I’ll also add to this list as time goes by and more useful resources appear.
How to block ChatGPT
https://sizeof.cat/post/block-chatgpt-scraping