Regex Is Just the ^
2023-01-01
Est. 8m read
Regular expressions were my first “Aha!” moment in automation. They felt like a magical power. Unfortunately the shine has worn off for various reasons. Regex has its place but I often find myself reaching for other tools nowadays. Here are some of my favorite regex-like tools that I use nearly every day.
XPath for HTML
Regex for HTML is a nightmare. You probably want a tool that is better suited for the job. XPath is perfect for querying HTML, whether it’s attributes, tags, or text. It is widely supported and is crucial for advanced web scraping. I often use it in E2E tests as well.
My favorite way to use XPath is with the Chrome Inspector. If you press Cmd+F in the Elements panel you can search for XPath expressions. It’s a great way to test out your queries on any page.
A basic XPath query looks like this:
//div[@class="my-class"]
This will find all div tags whose class attribute is exactly my-class. The // means to search all descendants of the current node (at the start of an expression, that is everything under the document root). The @ means to match on an attribute.
There’s a lot more to XPath, similar to how there’s a lot more to regex. I typically just open a cheat sheet and the Chrome Inspector to get the job done.
Pipe | Everything
I don’t see people using pipes enough! It’s a great way to chain commands and it’s one of the first automation concepts I was excited by.
For those who don’t know, a pipe is the | character. It’s used to take the output of one command and use it as the input of another command.
I often use it for grep, less, pbcopy, and jq. I’m sure you’ll find many new uses for it as well.
Here’s a simple grep and less example for completeness:
$ ls | grep Library | less
This will list all files using ls and then pipe the output to grep, which will filter out all files that don’t contain the word Library. The output of grep is then piped to less, which will allow you to scroll through the results like a read-only vim buffer (press q to exit).
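To try the same chain without depending on what’s in your current directory, here’s a small sketch that feeds made-up names through a pipeline. The sample names are hypothetical, and wc -l stands in for less so the result can be captured in a variable:

```shell
# Hypothetical sample data standing in for the output of `ls`.
listing=$(printf '%s\n' Applications Library System "Library Backup" usr)

# Keep only the lines containing "Library".
matches=$(printf '%s\n' "$listing" | grep Library)

# Count the surviving lines (tr strips the padding some wc versions add).
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
echo "$count matching entries"
```

Each stage only sees the previous stage’s output, so you can build the chain one command at a time and check the result as you go.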
xargs is Simple
I’ve shied away from xargs my whole life. In my mind it was the final boss. Turns out it’s very straightforward for what I’ve always needed it for: Executing a shell command multiple times with different arguments using the output of another command (the output would be a list).
Let’s start with a very simple example. Let’s say you have a list of files:
$ ls
Library
System
usr
And we want to run echo File: {file} using that output. We can do that with:
$ ls | xargs -I {file} echo File: {file}
File: Library
File: System
File: usr
Typically I’d replace -I {file} with -I{} (notice there’s no space and no words) for brevity. The string that follows -I can be anything; it’s just a placeholder that xargs swaps for each line of input.
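Here’s the same idea as a sketch you can paste anywhere, with printf standing in for ls (the three names are just the sample listing from above) and the shorter -I{} placeholder:

```shell
# printf emits one name per line, like the ls output above.
# xargs runs echo once per line, replacing {} with that line.
result=$(printf '%s\n' Library System usr | xargs -I{} echo "File: {}")
echo "$result"
```

This should print one "File: ..." line per name, matching the output shown earlier.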
It’s a bit different if you want to run multiple commands, but it’s still simple.
Let’s say we want to run echo File: {file} and echo File: {file} is cool:
$ ls | xargs -I {file} sh -c "echo File: {file}; echo File: {file} is cool"
File: Library
File: Library is cool
File: System
File: System is cool
...
Notice the sh -c part. xargs runs a single command per input line, so to run several we wrap them in one sh -c invocation.
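One caveat worth knowing: when the placeholder sits inside the sh -c string, each input line gets pasted into shell code, so a name containing something like ; or $( ) would be executed. A variation I’d suggest (my own sketch, not the only way to do it) is to pass the value as a positional argument instead. The sample names below are hypothetical:

```shell
# Each input line is handed to sh as $1 (the "_" fills $0),
# so the name is treated as data and never parsed as shell code.
result=$(printf '%s\n' Library "My Files" |
  xargs -I{} sh -c 'echo "File: $1"; echo "File: $1 is cool"' _ {})
echo "$result"
```

The single quotes keep your outer shell from expanding $1 early, and names with spaces pass through intact.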
jq for JSON
I’ve been learning a lot of Kubernetes lately and the -o json flag is a lifesaver. It prints much more information than the default output and it’s easy to parse once you’ve learned the basics of jq.
Let’s say we’ve got kubectl set up and we’re happily running a cluster. We can get a list of namespaces in 2 ways:
$ kubectl get ns # normal output
NAME STATUS AGE
default Active 2d
my-app Active 2d
...
$ kubectl get ns -o json # json output
{
"apiVersion": "v1",
"items": [
{
"apiVersion": "v1",
"kind": "Namespace",
"metadata": {
"creationTimestamp": "2021-01-01T00:00:00Z",
"name": "default",
"resourceVersion": "123456",
"selfLink": "/api/v1/namespaces/default",
"uid": "123
...
We can see the JSON is a bit more verbose. It’s also parsable by jq, which accepts any valid JSON. For this example, let’s say we want to get the apiVersion:
$ kubectl get ns -o json | jq '.apiVersion'
"v1"
A jq filter typically starts with a period (.), which means we’re starting at the root of the JSON. There are other options, but this is the most common. Then, with simple dot notation, we can access the apiVersion key.
Let’s say we want to loop through each of the items in the items array, and print out that data in a new format, { version: ..., kind: ..., name: ... }:
$ kubectl get ns -o json | jq '.items[] | { version: .apiVersion, kind: .kind, name: .metadata.name }'
{
"version": "v1",
"kind": "Namespace",
"name": "default"
}
{
"version": "v1",
"kind": "Namespace",
"name": "my-app"
}
In that example, .items[] tells jq to iterate over each element of the items array. jq’s own pipe operator (|) then feeds each element into an object constructor that maps the kubectl data onto our new format. Since that is the final jq filter, its result is printed.
The output of jq can be piped into other commands or into jq again!
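If you don’t have a cluster handy, you can practice on a JSON string. The sketch below uses a hypothetical, trimmed-down document shaped like the namespace output above, plus two jq flags I find useful when piping onward: -r for raw strings and -c for compact one-line objects. It assumes jq is installed:

```shell
# Hypothetical JSON shaped like the `kubectl get ns -o json` output.
json='{"apiVersion":"v1","items":[
  {"apiVersion":"v1","kind":"Namespace","metadata":{"name":"default"}},
  {"apiVersion":"v1","kind":"Namespace","metadata":{"name":"my-app"}}]}'

# -r prints raw strings (no surrounding quotes), one per line.
names=$(printf '%s' "$json" | jq -r '.items[].metadata.name')
echo "$names"

# -c prints each resulting object on a single line.
compact=$(printf '%s' "$json" | jq -c '.items[] | {kind: .kind, name: .metadata.name}')
echo "$compact"
```

Raw output is handy when the next command in the pipe expects plain text rather than JSON strings.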
One last trick: to decode base64 using jq (this is often useful for viewing kube secrets), use the @base64d filter (available in jq 1.6 and newer):
$ echo '{"secret_message": "SG93ZHkh"}' | jq '.secret_message | @base64d'
"Howdy!"
Google Sheets for Math
I used to use Jupyter notebooks but they’re quite a bit of overhead for simple math. I’ve found that Google Sheets is powerful enough for most of my calculations and the sheets are easy to share.
I’m not going to bore you with how to use Google Sheets. I’m sure you’ve used it. I want to talk about the formulas. If you’re unfamiliar, you can use the = sign to start a formula. For example, =1+1 will return 2 in the cell.
=IMPORTXML(url, xpath)
This is my favorite formula. It allows you to import data from an HTML page and filter it using XPath. It’s a bit of a mouthful but it’s very useful for scraping the web for your calculations.
I was using this most when tracking crypto prices. You could say I was a step ahead of SBF.
=IMPORTXML("https://www.coingecko.com/en/coins/bitcoin", "//span[@data-target='price.price']")
If you put this into a cell, it will return a bunch of numbers. The first one is the current price. To constrain this data, we can add =ARRAY_CONSTRAIN(arr, rows, cols) to our formula:
=ARRAY_CONSTRAIN(IMPORTXML("https://www.coingecko.com/en/coins/bitcoin", "//span[@data-target='price.price']"), 1, 1)
This will return the current price of Bitcoin in that cell. This won’t work for pages that need JavaScript but there are typically multiple sources for data to choose from. So long as you understand your XPath selectors, you can build powerful Google Sheets. CoinGecko wrote a blog post on how to create a trigger that updates your sheet 24/7 if that’s something you’re interested in.
=ROW()
This is a simple one but it’s useful for using the same formula across multiple rows. Its counterpart is =COLUMN() and they simply return the row or column number of the cell.
Once you’ve got a formula that works (=ROW() is the minimum viable formula), you can copy it across multiple rows and it’ll just work!
To combine this number with a string, you can use the & operator:
=ROW()&" is the row number"
Having a formula that is reusable across cells can save a lot of time.
=INDIRECT(cell_name)
This is a powerful one. It allows you to reference a cell by name. For example, if you have the string "B6", you can use =INDIRECT("B6") to read the value of B6.
It’s the simplest way to reference a cell by name. You can also use it to reference a cell in a different sheet. For example, if you have a sheet called Sheet1 and a cell called B6, you can use =INDIRECT("Sheet1!B6") to read the value of B6 in Sheet1.
Conclusion
There are probably some good ones I’ve missed but these are my favorites. I often use these tricks to save myself from repetitive tasks. They’ve also been instrumental in web scraping and data analysis.
A word of caution: I do not like to automate destructive commands. There’s probably a better way.
If you’re reading this post I’d also advise you learn keyboard shortcuts for your OS and each of your apps. Clicking is for the birds.
Bonus Tips
pbcopy
For macOS users, you can pipe into pbcopy to copy the output to your clipboard:
$ ls | pbcopy
Give many arguments to mv, kill, etc.
$ kill -9 {123,456,789} # kills 3 processes
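The braces are shell brace expansion, a bash and zsh feature rather than plain POSIX sh: the shell rewrites the list into separate arguments before the command runs. A quick way to preview an expansion is to put echo in front of it, so nothing actually gets killed or moved. In this sketch the file name is hypothetical, and the commands are wrapped in bash -c only so they behave the same whatever your default shell is:

```shell
# echo shows what the shell would pass to the real command.
bash -c 'echo kill -9 {123,456,789}'
# prints: kill -9 123 456 789

# An empty first item repeats the bare prefix, handy for backups.
bash -c 'echo mv notes.txt{,.bak}'
# prints: mv notes.txt notes.txt.bak
```

Once the preview looks right, drop the echo to run the real command.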
Display every Kubernetes secret in every namespace
This will output every secret (decoded from base64) as { name: ..., password: ... } in every namespace:
$ kubectl get ns -o json | jq '.items[].metadata.labels."kubernetes.io/metadata.name"' | xargs -I{} sh -c "echo '\n{}'; kubectl get secrets -n {} -o json | jq '.items[] | { name: .metadata.name, password: .data | map_values(@base64d) }'"
Display the name of every Kubernetes secret in every namespace
This will output every secret name in every namespace:
$ kubectl get ns -o json | jq '.items[].metadata.labels."kubernetes.io/metadata.name"' | xargs -I{} sh -c "echo '\n{}'; kubectl get secrets -n {} -o json | jq '.items[].metadata.name'"
default
secret-1
secret-2
my-app
secret-3
secret-4
...