
Making Free Proxies with Tor and Ansible

2024-09-20

Est. 6m read

Was reading through Hacker News when I saw “We accidentally burned through 200GB of proxy bandwidth in 6 hours”. Brutal! 😅

I remember getting into Skyvern. Really interesting tech! Too bad the open-source models aren’t quite there yet. I’m not VC enough to spend AI credits on web scraping either.

The post was honest, and I certainly could’ve made the very same mistake! But recently I’ve been feeling more like the Chad pictured below:

Virgin API Consumer vs. Chad third-party scraper

My favorite trick at the moment for getting free proxies is to just use Tor. It doesn’t work with every website, but for the ones where it does, that’s 2k+ proxies free of charge!

Getting Tor to work well as a web scraper took some work, and that’s what this post will be about.

Avoiding Detection

For the PoC (proof-of-concept), I started with webfp/tor-browser-selenium. It seemed like the natural place to start. Real quickly though, it became apparent that sites were somehow detecting Selenium and rejecting my requests as bot-like.

I dove deep into the Firefox about:* pages looking for the cause. I spent quite a while trying things like privacy.resistFingerprinting = False, excluding domains, and so on. In the end, I believe it was a combination of Selenium and the browser telling the website that it was being automated: “Marionette”, as they call it.

The Solution

This library is the solution: kaliiiiiiiiii/Selenium-Driverless. It hides the fact that Selenium is “driving” the browser and comes with some other patches that Selenium is (intentionally?) missing.

Since we’re using Selenium-Driverless, we can’t use the Tor browser that was included in the previous library. Now might be a good time to pick a better browser than Tor’s browser anyhow since it’s probably not what the average user is using.

Blending in

The trick to bypassing detection is to look as average as possible.

So we combine this sneaky Selenium client with a regular Chrome browser, but how do we connect that to Tor? Well, there’s a launch argument for that! It looks like:

./google-chrome --proxy-server=socks5://<HOST>:<PORT>
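If you launch Chrome from a script, the same flag can be assembled in Python first. A minimal sketch: chrome_proxy_args is a hypothetical helper name (not part of any library), and the host and port are assumed to be one of the Tor containers and the SocksPort set up later in this post.

```python
def chrome_proxy_args(host: str, port: int = 9050) -> list[str]:
    """Build the Chrome launch argument that routes traffic through a SOCKS5 proxy."""
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return [f"--proxy-server=socks5://{host}:{port}"]


# Example: point Chrome at the first Tor container (address is an assumption).
print(chrome_proxy_args("10.0.0.201"))
# prints ['--proxy-server=socks5://10.0.0.201:9050']
```

Each string in the returned list is meant to be handed to whatever starts the browser, for example as an extra launch argument on the browser options object of your automation client.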

But How to Tor?

Tor is often used entirely through the “Tor Bundle”, which includes the browser. Alternatively, you can install the tor service on a standard Linux machine, and each instance gets its own connection to the network.

This is not an "Exit Node"

Exit Nodes are a lot more involved to set up. Installing tor just gives you a connection to the network, so you won’t have to worry about other people using your IP to surf the web.

Within my /etc/tor/torrc file¹:

SocksPort 0.0.0.0:9050
ControlPort 0.0.0.0:9051
Log notice stdout
DataDirectory /var/lib/tor
HashedControlPassword <HASHED_PASSWORD_HERE>

These are mostly defaults. The 0.0.0.0 is to ensure we can connect from machines on the local network. The SocksPort is our proxy and the ControlPort is used to control the tor service (e.g. renewing the IP). The password is generated with tor --hash-password password_here and is used to authenticate on :9051.
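A quick way to confirm the SocksPort is actually reachable from another machine is to send it a SOCKS5 greeting and check the reply. This is only a sketch using the standard library: socks5_reachable is a hypothetical helper name, and it assumes the port accepts unauthenticated SOCKS5 clients, as in the torrc above.

```python
import socket


def socks5_reachable(host: str, port: int = 9050, timeout: float = 5.0) -> bool:
    """Return True if host:port answers a SOCKS5 greeting with 'no auth required'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # SOCKS5 greeting (RFC 1928): version 5, one method offered, method 0 = no auth.
            sock.sendall(b"\x05\x01\x00")
            reply = sock.recv(2)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    # Expected reply: version 5, selected method 0.
    return reply == b"\x05\x00"
```

For example, socks5_reachable("10.0.0.201") run from a machine on the local network should return True once tor is up and listening on 0.0.0.0:9050.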

Creating Unlimited Proxies

In hindsight

I should’ve checked to see if there’s a Dockerized way of creating a Tor connection. Guess I needed an excuse to finally automate Proxmox. Feel free to deviate from what I did here, but the concepts will still apply.

To create a bunch of these tor services, I used Ansible and Proxmox to create 4 LXC containers, each one with Alpine and a static IP. I wanted them to be as lightweight as possible just in case I need more.

For your benefit, here’s the Ansible playbook:

- name: Create LXC containers for Tor proxies on Proxmox
  hosts: proxmox
  gather_facts: no
  vars:
    proxmox_api_host: "10.0.0.69"
    proxmox_api_user: "root@pam"
    proxmox_api_password: "hunter2"
    proxmox_node: "akon"
    container_password: "hunter2"
    containers:
      - { name: "torproxy1", id: 201, ip: "10.0.0.201" }
      - { name: "torproxy2", id: 202, ip: "10.0.0.202" }
      - { name: "torproxy3", id: 203, ip: "10.0.0.203" }
      - { name: "torproxy4", id: 204, ip: "10.0.0.204" }

  tasks:
    - name: Create LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        hostname: "{{ item.name }}"
        ostemplate: 'local:vztmpl/alpine-3.19-default_20240207_amd64.tar.xz'
        password: "{{ container_password }}"
        netif: '{"net0":"name=eth0,ip={{ item.ip }}/24,gw=10.0.0.1,bridge=vmbr0"}' # gateway may need to change
        storage: local-lvm
        unprivileged: no
        onboot: yes
        features:
          - nesting=1
      loop: "{{ containers }}"

    - name: Start LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        state: started
      loop: "{{ containers }}"

    - name: Configure SSH and install packages in containers
      ansible.builtin.command:
        cmd: >
          pct exec {{ item.id }} -- /bin/sh -c "
          apk update &&
          apk add openssh &&
          rc-update add sshd &&
          echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config &&
          echo 'root:{{ container_password }}' | chpasswd &&
          rc-service sshd start &&
          apk add tor python3 &&
          echo 'SocksPort 0.0.0.0:9050' > /etc/tor/torrc &&
          echo 'ControlPort 0.0.0.0:9051' >> /etc/tor/torrc &&
          echo 'Log notice stdout' >> /etc/tor/torrc &&
          echo 'DataDirectory /var/lib/tor' >> /etc/tor/torrc &&
          echo 'HashedControlPassword <HASHED_PASSWORD_HERE>' >> /etc/tor/torrc &&
          rc-update add tor default &&
          rc-service tor start
          "
      loop: "{{ containers }}"

    - name: Wait for LXC containers to be ready
      ansible.builtin.wait_for:
        host: "{{ item.ip }}"
        port: 22
        timeout: 300
      loop: "{{ containers }}"

    - name: Verify Tor is running in containers
      ansible.builtin.command:
        cmd: pct exec {{ item.id }} -- rc-service tor status
      loop: "{{ containers }}"
      register: tor_status

    - name: Display Tor status
      ansible.builtin.debug:
        var: tor_status

Running the playbook with ansible-playbook creates, starts, and configures all four containers in one go.

Scaling this up to 8x, 16x, 32x is no problem. We can add each IP as a proxy and round robin through all of them to distribute the load. After each request, we can use the ControlPort to renew the IP and essentially get a new proxy.
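To make the scaling and round-robin idea concrete, here is a small sketch. It assumes the containers keep getting consecutive static IPs starting at 10.0.0.201, as in the playbook; build_proxies and the example.com URLs are made up for illustration.

```python
from itertools import cycle


def build_proxies(count: int, start: int = 201, prefix: str = "10.0.0."):
    """Generate proxy entries for `count` containers with consecutive static IPs."""
    return [
        {"host": f"{prefix}{start + i}", "port": 9050, "control_port": 9051}
        for i in range(count)
    ]


proxies = build_proxies(8)
rotation = cycle(proxies)  # endless round-robin iterator over the proxies

# Hypothetical batch of URLs; each one is paired with the next proxy in the rotation.
urls = [f"https://example.com/page/{n}" for n in range(10)]
for url, proxy in zip(urls, rotation):
    print(url, "->", f"socks5://{proxy['host']}:{proxy['port']}")
```

With 8 proxies and 10 URLs, the ninth and tenth URLs wrap back around to the first and second proxy.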

In Python there’s a library called stem for controlling Tor over the ControlPort.

Here’s most of the code I’m using to round robin:

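import asyncio
from stem import Signal
from stem.control import Controller

# One entry per Tor container: SOCKS proxy on 9050, ControlPort on 9051.
PROXIES = [
  {"host": "10.0.0.201", "port": 9050, "control_port": 9051},
  {"host": "10.0.0.202", "port": 9050, "control_port": 9051},
  {"host": "10.0.0.203", "port": 9050, "control_port": 9051},
  {"host": "10.0.0.204", "port": 9050, "control_port": 9051}
]

async def renew_tor_ip(proxy):
  # Ask Tor for a fresh circuit over the ControlPort.
  # stem's calls are blocking, so they briefly pause the event loop.
  with Controller.from_port(address=proxy["host"], port=proxy["control_port"]) as controller:
    controller.authenticate("password_here")
    controller.signal(Signal.NEWNYM)

async def run_session(proxy):
  # launch_chrome() and do_scraping() are the scraper-specific parts not shown here.
  while True:
    await launch_chrome(proxy)
    await do_scraping()
    await renew_tor_ip(proxy)
    await asyncio.sleep(3)

async def main():
  # One long-running session per proxy, all running concurrently.
  sessions = [run_session(proxy) for proxy in PROXIES]
  await asyncio.gather(*sessions)

asyncio.run(main())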
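Tor may rate-limit NEWNYM requests, so a fixed sleep can be shorter than what Tor will actually honor. stem’s Controller has a get_newnym_wait() method that reports how many seconds remain. Here’s a sketch of using it: renew_when_allowed and FakeController are hypothetical names, the fake only exists so the example runs without Tor, and passing the string "NEWNYM" is assumed to be equivalent to stem’s Signal.NEWNYM.

```python
import time


def renew_when_allowed(controller, sleep=time.sleep) -> float:
    """Wait out Tor's NEWNYM rate limit, then request a new circuit.

    `controller` is assumed to be an authenticated stem Controller, or anything
    exposing get_newnym_wait() and signal(). Returns the seconds waited.
    """
    wait = controller.get_newnym_wait()
    if wait > 0:
        sleep(wait)
    controller.signal("NEWNYM")  # assumed equivalent to stem's Signal.NEWNYM
    return wait


class FakeController:
    """Hypothetical stand-in so the sketch runs without a Tor instance."""

    def __init__(self, wait: float = 0.0):
        self.wait = wait
        self.signals = []

    def get_newnym_wait(self) -> float:
        return self.wait

    def signal(self, name) -> None:
        self.signals.append(name)


fake = FakeController()
renew_when_allowed(fake)
print(fake.signals)  # prints ['NEWNYM']
```

In the real loop, this would be called inside the `with Controller.from_port(...)` block after authenticating, in place of the bare signal call.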

Wrapping Up

That’s it. Hopefully you got something out of this. There are still Cloudflare checks, IP blacklists, and so on that may cause trouble when scraping a web page, but I don’t think you’ll ever be hit with a $500 bill for using the Tor proxies!

One Final Trick

If you want to use a specific exit node’s IP, you can specify it with: ExitNodes IP in the torrc². You can even see the cached list of ExitNodes with:

sudo grep -B3 "^s.*Exit" /var/lib/tor/cached-microdesc-consensus | grep "^r" | awk '{print $6 ":" $7}'
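If you’d rather do this from Python, the same idea can be sketched as a small parser. It assumes the field layout the awk command relies on (IP and ORPort as the 6th and 7th fields of each “r” line); exit_relays is a hypothetical helper and SAMPLE is made-up data using documentation-range IPs, not real relays.

```python
def exit_relays(consensus_text: str) -> list[str]:
    """Return 'IP:ORPort' for every relay whose flags include Exit."""
    relays = []
    current = None  # fields of the most recent "r" line
    for line in consensus_text.splitlines():
        if line.startswith("r "):
            current = line.split()
        elif line.startswith("s ") and current is not None:
            flags = line.split()[1:]
            if "Exit" in flags and len(current) >= 7:
                # Assumed "r" layout: r nickname identity date time IP ORPort DirPort
                relays.append(f"{current[5]}:{current[6]}")
            current = None
    return relays


# Made-up sample in the same shape as the cached consensus file.
SAMPLE = """\
r relayA AAAA 2024-09-20 10:00:00 192.0.2.10 9001 0
m abc
s Exit Fast Running Stable Valid
r relayB BBBB 2024-09-20 10:00:00 192.0.2.20 443 0
m def
s Fast Guard Running Valid
"""
print(exit_relays(SAMPLE))  # prints ['192.0.2.10:9001']
```

One difference from the grep version: this matches the Exit flag exactly, so a relay flagged only as BadExit is skipped. To run it against the real file, read /var/lib/tor/cached-microdesc-consensus (likely as root) and pass the text in.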
