Making Free Proxies with Tor and Ansible
2024-09-20
Was reading through Hacker News when I saw “We accidentally burned through 200GB of proxy bandwidth in 6 hours”. Brutal! 😅
I remember getting into Skyvern. Really interesting tech! Too bad the open-source models aren’t quite there yet. I’m not VC enough to spend AI credits on web scraping either.
The post was honest, and I certainly could’ve made the very same mistake! But recently I’ve been feeling more like the Chad pictured below:
My favorite trick at the moment for getting free proxies is to just use Tor. It doesn’t work with every website, but for the ones where it does, that’s 2k+ proxies free of charge!
Getting Tor to work well as a web scraper took some work, and that’s what this post will be about.
Avoiding Detection
For the PoC (proof of concept) I started with webfp/tor-browser-selenium, which seemed like the natural place to begin. Real quickly though, it became apparent that sites were somehow detecting Selenium and rejecting my requests as bot-like.
I dove deep into the Firefox about:* pages looking for what could be the issue. I spent quite a while looking and trying things like privacy.resistFingerprinting = False, excluding domains, and so on. In the end, I believe it was a combination of Selenium and the browser telling the website that it was being automated. “Marionette”, as they call it.
The Solution
This library is the solution: kaliiiiiiiiii/Selenium-Driverless. It hides the fact that Selenium is “driving” the browser and comes with some other patches that Selenium is (intentionally?) missing.
Since we’re using Selenium-Driverless, we can’t use the Tor browser that was included in the previous library. Now might be a good time to pick a better browser than Tor’s browser anyhow since it’s probably not what the average user is using.
Blending in
The trick to bypassing detection is to look as average as possible.
So we combine this sneaky Selenium client with a regular Chrome browser, but how do we connect that to Tor? Well, there’s a launch argument for that! It looks like:
./google-chrome --proxy-server=socks5://<HOST>:<PORT>
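As a rough sketch, the same flag can be built in Python and handed to Selenium-Driverless. The helper names `chrome_proxy_args` and `page_title_via_tor` are hypothetical, and the Selenium-Driverless calls follow that project's README as I understand it, so treat them as an assumption to check against the version you install.

```python
def chrome_proxy_args(host: str, port: int = 9050) -> list[str]:
    """Build the Chrome launch argument that sends traffic through a SOCKS5 proxy."""
    return [f"--proxy-server=socks5://{host}:{port}"]


async def page_title_via_tor(host: str, url: str, port: int = 9050) -> str:
    """Open `url` in Chrome through the given Tor SOCKS port and return the page title."""
    # Imported lazily so the helper above works without the third-party package.
    from selenium_driverless import webdriver

    options = webdriver.ChromeOptions()
    for arg in chrome_proxy_args(host, port):
        options.add_argument(arg)

    async with webdriver.Chrome(options=options) as driver:
        await driver.get(url)
        return await driver.title
```

Something like `asyncio.run(page_title_via_tor("10.0.0.201", "https://check.torproject.org"))` would be one way to confirm the browser is really leaving through Tor.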
But How to Tor?
Tor is often used entirely through the “Tor Bundle” which includes the browser. But alternatively, you can install the tor service on a standard Linux machine and get a new connection on each instance.
This is not an "Exit Node"
Exit Nodes are a lot more involved to set up; installing tor only gives you a connection to the network. You won’t have to worry about other people using your IP to surf the web.
Within my /etc/tor/torrc file:
SocksPort 0.0.0.0:9050
ControlPort 0.0.0.0:9051
Log notice stdout
DataDirectory /var/lib/tor
HashedControlPassword <HASHED_PASSWORD_HERE>
These are mostly defaults. The 0.0.0.0 is to ensure we can connect from machines on the local network.
The SocksPort is our proxy and the ControlPort is used to control the tor service (e.g. renewing the IP).
The password is generated with tor --hash-password password_here and is used to authenticate on :9051.
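To make the ControlPort less of a black box, here is a minimal sketch of what "renewing the IP" looks like on the wire, using only the standard library. It assumes the plain-text control protocol commands AUTHENTICATE and SIGNAL NEWNYM; the function names are hypothetical, and in practice a library such as stem handles this for you.

```python
import socket


def control_commands(password: str) -> list[bytes]:
    """Return the raw control-port lines that authenticate and request new circuits."""
    # The password is sent as a quoted string, so escape backslashes and quotes.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f'AUTHENTICATE "{escaped}"\r\n'.encode(),
        b"SIGNAL NEWNYM\r\n",
    ]


def request_new_identity(host: str, password: str, port: int = 9051,
                         timeout: float = 10.0) -> None:
    """Send the commands to Tor and raise if any reply is not a 250 success code."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        for command in control_commands(password):
            sock.sendall(command)
            reply = reader.readline()
            if not reply.startswith(b"250"):
                raise RuntimeError(f"Tor rejected {command!r}: {reply!r}")
```

Note that the password sent here is the plain one, not the hash from the torrc. Tor may also rate-limit how often it honors NEWNYM, so a 250 reply does not guarantee a different exit IP on the very next request.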
Creating Unlimited Proxies
In hindsight
I should’ve checked to see if there’s a Dockerized way of creating a Tor connection. Guess I needed an excuse to finally automate Proxmox. Feel free to deviate from what I did here, but the concepts will still apply.
To create a bunch of these tor services, I used Ansible and Proxmox to create 4 LXC containers, each one with Alpine and a static IP. I wanted them to be as lightweight as possible just in case I need more.
For your benefit, here’s the Ansible playbook:
- name: Create LXC containers for Tor proxies on Proxmox
  hosts: proxmox
  gather_facts: no
  vars:
    proxmox_api_host: "10.0.0.69"
    proxmox_api_user: "root@pam"
    proxmox_api_password: "hunter2"
    proxmox_node: "akon"
    container_password: "hunter2"
    containers:
      - { name: "torproxy1", id: 201, ip: "10.0.0.201" }
      - { name: "torproxy2", id: 202, ip: "10.0.0.202" }
      - { name: "torproxy3", id: 203, ip: "10.0.0.203" }
      - { name: "torproxy4", id: 204, ip: "10.0.0.204" }
  tasks:
    - name: Create LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        hostname: "{{ item.name }}"
        ostemplate: 'local:vztmpl/alpine-3.19-default_20240207_amd64.tar.xz'
        password: "{{ container_password }}"
        netif: '{"net0":"name=eth0,ip={{ item.ip }}/24,gw=10.0.0.1,bridge=vmbr0"}' # gateway may need to change
        storage: local-lvm
        unprivileged: no
        onboot: yes
        features:
          - nesting=1
      loop: "{{ containers }}"
    - name: Start LXC containers
      community.general.proxmox:
        api_host: "{{ proxmox_api_host }}"
        api_user: "{{ proxmox_api_user }}"
        api_password: "{{ proxmox_api_password }}"
        node: "{{ proxmox_node }}"
        vmid: "{{ item.id }}"
        state: started
      loop: "{{ containers }}"
    - name: Configure SSH and install packages in containers
      ansible.builtin.command:
        cmd: >
          pct exec {{ item.id }} -- /bin/sh -c "
          apk update &&
          apk add openssh &&
          rc-update add sshd &&
          echo 'PermitRootLogin yes' >> /etc/ssh/sshd_config &&
          echo 'root:{{ container_password }}' | chpasswd &&
          rc-service sshd start &&
          apk add tor python3 &&
          echo 'SocksPort 0.0.0.0:9050' > /etc/tor/torrc &&
          echo 'ControlPort 0.0.0.0:9051' >> /etc/tor/torrc &&
          echo 'Log notice stdout' >> /etc/tor/torrc &&
          echo 'DataDirectory /var/lib/tor' >> /etc/tor/torrc &&
          echo 'HashedControlPassword <HASHED_PASSWORD_HERE>' >> /etc/tor/torrc &&
          rc-update add tor default &&
          rc-service tor start
          "
      loop: "{{ containers }}"
    - name: Wait for LXC containers to be ready
      ansible.builtin.wait_for:
        host: "{{ item.ip }}"
        port: 22
        timeout: 300
      loop: "{{ containers }}"
    - name: Verify Tor is running in containers
      ansible.builtin.command:
        cmd: pct exec {{ item.id }} -- rc-service tor status
      loop: "{{ containers }}"
      register: tor_status
    - name: Display Tor status
      ansible.builtin.debug:
        var: tor_status
Running the playbook:
Scaling this up to 8x, 16x, 32x is no problem. We can add each IP as a proxy and round robin through all of them to distribute the load. After each request, we can use the ControlPort to renew the IP and essentially get a new proxy.
In Python there’s a library called stem for controlling Tor over the ControlPort.
Here’s most of the code I’m using to round robin:
import asyncio

from stem import Signal
from stem.control import Controller

PROXIES = [
    {"host": "10.0.0.201", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.202", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.203", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.204", "port": 9050, "control_port": 9051},
]


async def renew_tor_ip(proxy):
    # Ask Tor for fresh circuits over the ControlPort (stem's calls are blocking).
    with Controller.from_port(address=proxy["host"], port=proxy["control_port"]) as controller:
        controller.authenticate("password_here")
        controller.signal(Signal.NEWNYM)


async def run_session(proxy):
    # launch_chrome() and do_scraping() are defined elsewhere and not shown here.
    while True:
        await launch_chrome(proxy)
        await do_scraping()
        await renew_tor_ip(proxy)
        await asyncio.sleep(3)


async def main():
    # One long-running session per proxy, all running concurrently.
    sessions = [run_session(proxy) for proxy in PROXIES]
    await asyncio.gather(*sessions)


asyncio.run(main())
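The code above runs one session per proxy. If you would rather hand a fixed list of URLs to the proxies in strict rotation, a minimal sketch could look like the following. The helper names `socks_url` and `assign_round_robin` are hypothetical, and the two-entry `PROXIES` list is only a shortened copy of the one above.

```python
import itertools

# Shortened copy of the PROXIES list, same shape as in the post.
PROXIES = [
    {"host": "10.0.0.201", "port": 9050, "control_port": 9051},
    {"host": "10.0.0.202", "port": 9050, "control_port": 9051},
]


def socks_url(proxy: dict) -> str:
    """Format a proxy entry as the URL Chrome expects in --proxy-server."""
    return f"socks5://{proxy['host']}:{proxy['port']}"


def assign_round_robin(urls: list[str], proxies: list[dict]) -> list[tuple[str, str]]:
    """Pair each URL with the next proxy in rotation, wrapping around at the end."""
    # itertools.cycle repeats the proxies forever; zip stops when the URLs run out.
    return [(url, socks_url(proxy)) for url, proxy in zip(urls, itertools.cycle(proxies))]
```

With three URLs and two proxies, the third URL lands back on the first proxy, which is the wrap-around behavior you want when there are more pages than containers.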
Wrapping Up
That’s it. Hopefully you got something out of this. There are still Cloudflare checks, IP blacklists, and so on that may cause trouble when scraping a web page, but I don’t think you’ll ever be hit with a $500 bill for using the Tor proxies!
One Final Trick
If you want to use a specific exit node’s IP, you can specify it with: ExitNodes IP in the torrc. You can even see the cached list of ExitNodes with:
sudo grep -B3 "^s.*Exit" /var/lib/tor/cached-microdesc-consensus | grep "^r" | awk '{print $6 ":" $7}'
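If you want the same list inside a Python script, here is a rough equivalent of that pipeline. It assumes the cached consensus layout the awk command relies on, where the IP and ORPort are the 6th and 7th fields of each `r` line and the flags follow on an `s` line. The function name and the sample text (which uses documentation-only IP addresses) are made up for illustration.

```python
def exit_relays(consensus_text: str) -> list[str]:
    """Return "IP:ORPort" for every relay whose flags line contains Exit."""
    relays = []
    current = None  # address taken from the most recent "r" line
    for line in consensus_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r" and len(fields) >= 7:
            # awk's $6 and $7 are fields[5] and fields[6] here.
            current = f"{fields[5]}:{fields[6]}"
        elif fields[0] == "s" and current is not None:
            # Exact flag match, so BadExit alone does not count (the grep would match it).
            if "Exit" in fields[1:]:
                relays.append(current)
            current = None
    return relays


# Made-up sample in the assumed layout; not real relay data.
SAMPLE = """\
r relayA AAAAAAAAAAAAAAAAAAAAAAAAAAA 2024-09-20 00:00:00 192.0.2.10 9001 0
m examplemicrodescdigest
s Exit Fast Running Stable Valid
r relayB BBBBBBBBBBBBBBBBBBBBBBBBBBB 2024-09-20 00:00:00 192.0.2.20 443 0
m examplemicrodescdigest
s Fast Running Valid
"""
```

Reading the real file would be something like `exit_relays(open("/var/lib/tor/cached-microdesc-consensus").read())`, run with enough permissions to read Tor's data directory.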