
Blocking Marketing Spiders from Crawling the Website

Today, while checking the website's access logs, I noticed that a client with the User-Agent Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) had been crawling the site very frequently. Judging from the log file of over twenty MB, this bot has clearly been at it for quite some time.

Website Logs

Judging from the pages it requests, the bot seems to randomly recombine parameters from previously crawled pages, producing long strings of URLs that mostly return 404. On top of that, it hits the site every few seconds, so the normal access records in the log are drowned out by tens of thousands of lines of junk bot entries.
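Before adding any rules, a quick tally of User-Agents makes it obvious which bots dominate the log. This is a minimal sketch assuming the default combined log format; the log path is only a placeholder in the usual Baota style, so substitute your own file:

# Count requests per User-Agent; with the combined format, the UA is the
# sixth field when a log line is split on double quotes.
awk -F'"' '{print $6}' /www/wwwlogs/example.com.log | sort | uniq -c | sort -rn | head -n 20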

At first, I thought that since it's a bot, it should comply with robots.txt, so I considered adding rules to robots.txt:

User-agent: SemrushBot
Disallow: /

However, after checking online, I found that this bot apparently does not respect robots.txt 😅 (the official page claims the bot strictly follows robots.txt, but reports around the internet suggest otherwise), and it's not just SemrushBot; many marketing bots ignore robots.txt. So I had no choice but to block it in Nginx. In Baota's Nginx Free Firewall, I opened User-Agent Filtering under Global Configuration and added the following regular expression (collected from around the internet; I didn't expect there to be so many, and some of the UAs are probably useless, so check whether any UA you actually need is in the list before using it):

(nmap|NMAP|HTTrack|sqlmap|Java|zgrab|Go-http-client|CensysInspect|leiki|webmeup|Python|python|curl|Curl|wget|Wget|toutiao|Barkrowler|AhrefsBot|a Palo Alto|ltx71|censys|DotBot|MauiBot|MegaIndex.ru|BLEXBot|ZoominfoBot|ExtLinksBot|hubspot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|Bytespider|Ezooms|JikeSpider|SemrushBot)
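For anyone not running Baota, the same filter can be written directly into the site's nginx server block. The snippet below is only a minimal sketch with a shortened UA list (swap in the full pattern above); the domain is a placeholder, and 444 is nginx's special code for dropping the connection without sending a response:

server {
    listen 80;
    server_name example.com;  # placeholder domain

    # Drop requests whose User-Agent matches the marketing-bot pattern
    if ($http_user_agent ~* "(SemrushBot|AhrefsBot|DotBot|MJ12bot|Bytespider)") {
        return 444;
    }

    # ... rest of the site configuration ...
}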

After a few seconds, I could see the intercepted data.

Intercepted Data

Later, watching the interception count keep climbing, I realized that merely handling these requests, whether they end in 404 or 444, still consumes server resources, and a hit every few seconds is not sustainable in the long run. Checking the IPs showed that these bots all come from overseas nodes, and since this site's overseas traffic is already resolved to Cloudflare, I could let Cloudflare block them in the middle so the requests never reach the site.

Bot IP

In the Cloudflare console for the corresponding domain, I went to the Security section, clicked the WAF item, then clicked Add Rule and entered the following expression in the Expression Preview editor (again, check whether any UA you actually need is on the list before using it):

(http.user_agent contains "SemrushBot") or (http.user_agent contains "FeedDemon") or (http.user_agent contains "Indy Library") or (http.user_agent contains "Alexa Toolbar") or (http.user_agent contains "AskTbFXTV") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "CrawlDaddy") or (http.user_agent contains "CoolpadWebkit") or (http.user_agent contains "Java") or (http.user_agent contains "Feedly") or (http.user_agent contains "UniversalFeedParser") or (http.user_agent contains "ApacheBench") or (http.user_agent contains "Microsoft URL Control") or (http.user_agent contains "Swiftbot") or (http.user_agent contains "ZmEu") or (http.user_agent contains "jaunty") or (http.user_agent contains "Python-urllib") or (http.user_agent contains "lightDeckReports Bot") or (http.user_agent contains "YYSpider") or (http.user_agent contains "DigExt") or (http.user_agent contains "HttpClient") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "heritrix") or (http.user_agent contains "Bytespider") or (http.user_agent contains "Ezooms") or (http.user_agent contains "JikeSpider") or (http.user_agent contains "HTTrack") or (http.user_agent contains "Apache-HttpClient") or (http.user_agent contains "harvest") or (http.user_agent contains "audit") or (http.user_agent contains "dirbuster") or (http.user_agent contains "pangolin") or (http.user_agent contains "nmap") or (http.user_agent contains "sqln") or (http.user_agent contains "hydra") or (http.user_agent contains "libwww") or (http.user_agent contains "BBBike") or (http.user_agent contains "sqlmap") or (http.user_agent contains "w3af") or (http.user_agent contains "owasp") or (http.user_agent contains "Nikto") or (http.user_agent contains "fimap") or (http.user_agent contains "havij") or (http.user_agent contains "BabyKrokodil") or (http.user_agent contains "netsparker") or (http.user_agent contains "httperf") or (http.user_agent contains " SF/") or (http.user_agent contains "zgrab") or (http.user_agent contains "NMAP") or (http.user_agent contains "Go-http-client") or (http.user_agent contains "CensysInspect") or (http.user_agent contains "leiki") or (http.user_agent contains "webmeup") or (http.user_agent contains "Python") or (http.user_agent contains "python") or (http.user_agent contains "wget") or (http.user_agent contains "Wget") or (http.user_agent contains "toutiao") or (http.user_agent contains "Barkrowler") or (http.user_agent contains "a Palo Alto") or (http.user_agent contains "ltx71") or (http.user_agent contains "censys") or (http.user_agent contains "DotBot") or (http.user_agent contains "MauiBot") or (http.user_agent contains "MegaIndex.ru") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "ZoominfoBot") or (http.user_agent contains "ExtLinksBot") or (http.user_agent contains "hubspot")

Then, I selected Block as the action and saved it.

Cloudflare Custom Rules

At this point, if the risk-interception count in Baota's Nginx Free Firewall stops rising while the hit count on the Cloudflare firewall rule shoots up, it shows that Cloudflare is now blocking these junk bots before they ever reach the server.

Wake Up Update

After waking up, I found that the rule had already blocked more than two thousand requests; this thing is really persistent 😅.

Cloudflare Custom Rules

Custom Block Page

Recently, I saw Cloudflare's default blocking page and thought it looked too ugly... I wanted to change it to a custom blocking page.

Cloudflare Default Blocking Page

It turns out a custom block page requires upgrading to the Pro plan. A freeloader like me is never going to upgrade just for a custom page, so I had to do it in a more roundabout way: simply redirect every visitor carrying one of the User-Agents above to a custom page.

Note

🔔 Please note: pages hosted on Vercel are not reachable from mainland China. Since only this site's overseas traffic goes through Cloudflare, there is no need to worry about whether the block page is reachable domestically. If you want it to be reachable from mainland China as well, use a different page-hosting platform.

We need to write a static block page ourselves and host it on Vercel.

Detailed Steps for Vercel Deployment

First, go to Vercel and create a new project.

Create a New Project

Then, you can choose to import your custom page from an existing repository on the left, or clone a template on the right. Since I don't have a custom page yet, I can click the Browse All Templates link on the right to choose a template.

Clone Template

Since I'm using React to write the page, I'll use the Create React App template. Click the Deploy button to use the template.

Next, click GitHub under Create Git Repository to connect your GitHub account, enter a custom repository name, check the Create private Git Repository option to make the repository private, and click Create to deploy automatically.

Deployment

When you see the Congratulations! prompt, it means the deployment was successful.

Successful Deployment

Finally, push the code to the newly created repository on GitHub; once the push is complete, Vercel will automatically update the page.
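Pushing the code is just the usual git flow; a minimal sketch, assuming the repository you created is called block-page under your own account (both names are placeholders):

# Clone the repository Vercel created, drop in the custom page, then push;
# Vercel redeploys automatically on every push.
git clone git@github.com:your-name/block-page.git
cd block-page
# ... replace src/App.js and public/index.html with the files below ...
git add .
git commit -m "Use custom block page"
git push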

Below is the custom page I wrote using React, which you can use directly.

/src/App.js file:

import React, { Component } from 'react'
import './App.css';

export default class App extends Component {
  state = {
    tran: {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center', // key must match what render() destructures
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
  }
  handleTranslations = () => {
    const { lang } = this.state.tran
    const newState = (lang === 'en') ? {
      lang: 'zh',
      title: '请求已被拦截',
      contents_1: '您的一些特征存在于黑名单中,此次请求已被拦截。',
      contents_2: '如果您觉得这是误报,请及时联系我。',
      symbols: '@ Vinking 安全中心',
      tips: '详细信息已保存以持续优化安全中心'
    } : {
      lang: 'en',
      title: 'This request has been blocked',
      contents_1: 'Some of your characteristics exist in the blacklist and this request has been blocked.',
      contents_2: 'If you think this is a false alarm, please contact me promptly.',
      symbols: '@Vinking Security Center',
      tips: 'Details have been saved to continuously optimize the Security Center'
    }
    document.title = newState.title
    this.setState({ tran: newState })
  }
  render() {
    const { title, contents_1, contents_2, symbols, tips } = this.state.tran
    return (
      <div className="content">
        <div className="card">
          <div className="cardHeader">
            <div>{title}</div>
            <div className='translation' onClick={this.handleTranslations}>
              <svg xmlns="http://www.w3.org/2000/svg" width="15" height="15" viewBox="0 0 1024 1024"><path fill="#f8f9fa" d="M608 416h288c35.36 0 64 28.48 64 64v416c0 35.36-28.48 64-64 64H480c-35.36 0-64-28.48-64-64V608H128c-35.36 0-64-28.48-64-64V128c0-35.36 28.48-64 64-64h416c35.36 0 64 28.48 64 64v288zm0 64v64c0 35.36-28.48 64-64 64h-64v256.032C480 881.696 494.304 896 511.968 896H864a31.968 31.968 0 0 0 31.968-31.968V512A31.968 31.968 0 0 0 864 480.032H608zM128 159.968V512c0 17.664 14.304 31.968 31.968 31.968H512A31.968 31.968 0 0 0 543.968 512V160a31.968 31.968 0 0 0-31.936-32H160a31.968 31.968 0 0 0-32 31.968zm64 244.288V243.36h112.736V176h46.752c6.4.928 9.632 1.824 9.632 2.752a10.56 10.56 0 0 1-1.376 4.128c-2.752 7.328-4.128 16.032-4.128 26.112v34.368h119.648v156.768h-50.88v-20.64h-68.768V497.76h-49.504V379.488h-67.36v24.768H192zm46.72-122.368v60.48h67.392V281.92h-67.36zm185.664 60.48V281.92h-68.768v60.48h68.768zm203.84 488H576L668.128 576h64.64l89.344 254.4h-54.976l-19.264-53.664H647.488l-19.232 53.632zm33.024-96.256h72.864l-34.368-108.608h-1.376l-37.12 108.608zM896 320h-64a128 128 0 0 0-128-128v-64a192 192 0 0 1 192 192zM128 704h64a128 128 0 0 0 128 128v64a192 192 0 0 1-192-192z" /></svg>
            </div>
          </div>
          <div className="cardDesc">
            <span className="red">{contents_1}</span>
            <br />
            {contents_2}
          </div>
          <div className="cardSymbols">
            <div>{symbols}</div>
          </div>
        </div>
        <div className="tips">{tips}</div>
      </div>
    )
  }
}

/public/index.html file:

<!doctype html>
<html>

<head>
  <meta charset="utf-8">
  <meta name="viewport"
    content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no, shrink-to-fit=no">
  <meta name="theme-color" content="#07092f">
  <title>This request has been blocked</title>
  <meta name="description" content="This request has been blocked.">
</head>

<body>
  <div id="root"></div>
</body>

</html>

Back in Cloudflare, click Redirect Rules under the Rules section and create a rule. Fill the marketing-bot expression from above into the expression editor, select Static as the Type, enter the URL of the page hosted on Vercel, set the Status Code to 301, and uncheck the option to preserve query strings. With that, blocked visitors land on the custom page instead of Cloudflare's default one.

Cloudflare Redirect
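Put together, the finished rule looks roughly like this (the Vercel URL is a placeholder for your own deployment):

If incoming requests match: the bot expression above, e.g. (http.user_agent contains "SemrushBot") or ...
Then: URL Redirect
  Type: Static
  URL: https://your-block-page.vercel.app
  Status code: 301
  Preserve query string: unchecked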

This article was synchronized to xLog by Mix Space
The original link is https://www.vinking.top/posts/codes/blocking-bots-with-cloudflare

