
Regex Is Just the ^

2023-01-01

Est. 8m read

Regular expressions were my first “Aha!” moment in automation. They felt like a magical power. Unfortunately, the shine has worn off for various reasons. Regex has its place, but I often find myself reaching for other tools nowadays. Here are some of my favorite regex-like tools that I use nearly every day.


XPath for HTML

Regex for HTML is a nightmare. You probably want a tool that is better suited to the job. XPath is perfect for querying HTML, whether you’re after attributes, tags, or text. It is widely supported and is crucial for advanced web scraping. I often use it in E2E tests as well.

My favorite way to use XPath is with the Chrome Inspector. If you press Cmd+F (Ctrl+F on Windows and Linux) in the Elements panel, you can search by XPath expression. It’s a great way to test out your queries on any page.

A basic XPath query looks like this:

//div[@class="my-class"]

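This will find all div tags whose class attribute is exactly my-class. A div with class="my-class other" won’t match, because the comparison is against the whole attribute value. The leading // means to search the whole document at any depth. The @ refers to an attribute.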
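If you’d rather try a query from the terminal, here is a minimal sketch. It assumes xmllint (part of libxml2) is installed, and the HTML snippet is made up for illustration.

```shell
# Assumption: xmllint (from libxml2) is installed. The HTML below is a made-up sample.
html='<html><body><div class="my-class">hit</div><div class="other">miss</div></body></html>'

# --html parses the input as HTML, --xpath evaluates the query, "-" reads from stdin.
# Appending /text() selects the text inside the matching div rather than the tag itself.
match=$(printf '%s' "$html" | xmllint --html --xpath '//div[@class="my-class"]/text()' -)
echo "$match"
```

Only the first div has a class attribute equal to my-class, so that is the only text selected.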

There’s a lot more to XPath, similar to how there’s a lot more to regex. I typically just open a cheat sheet and the Chrome Inspector to get the job done.

Preferred Cheat Sheet

Pipe | Everything

I don’t see people using pipes enough! It’s a great way to chain commands and it’s one of the first automation concepts I was excited by.

For those who don’t know, a pipe is the | character. It’s used to take the output of one command and use it as the input of another command.

I often use it for grep, less, pbcopy, and jq. I’m sure you’ll find many new uses for it as well.

Here’s a simple grep and less example for completeness:

$ ls | grep Library | less

This will list the entries in the current directory using ls and then pipe the output to grep, which keeps only the lines that contain the word Library. The output of grep is then piped to less, which lets you scroll through the results like a read-only vim buffer (press q to exit).
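The same idea works with any commands that read stdin and write stdout. Here is a small sketch that swaps less for wc so the result can be captured. The three directory names are stand-ins for real ls output.

```shell
# Three made-up directory names stand in for the output of ls.
listing=$(printf '%s\n' Library System usr)

# grep keeps only the lines containing "Library"; wc -l counts what is left.
# tr strips the padding that some wc implementations (e.g. on macOS) add.
count=$(printf '%s\n' "$listing" | grep Library | wc -l | tr -d ' ')
echo "$count"
```

Each stage only sees the previous stage’s output, so you can build the chain up one command at a time.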

xargs is Simple

I’ve shied away from xargs my whole life. In my mind it was the final boss. Turns out it’s very straightforward for what I’ve always needed it for: running a shell command once per item, where the items come from the output of another command.

Let’s start with a very simple example. Let’s say you have a list of files:

$ ls
Library
System
usr

And we want to run echo File: {file} using that output. We can do that with:

$ ls | xargs -I {file} echo File: {file}
File: Library
File: System
File: usr

Typically I’d replace -I {file} with -I{} (notice there’s no space and no words) for brevity. But the string after -I can be anything. It’s a placeholder, and xargs runs the command once per input line with the placeholder replaced by that line.
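To see what -I changes, here is a small sketch comparing it with plain xargs. printf stands in for ls so the input is predictable.

```shell
# With -I{}, echo runs once per input line.
per_line=$(printf '%s\n' Library System usr | xargs -I{} echo "File: {}")
echo "$per_line"

# Without -I, xargs appends all the items to a single echo invocation.
batched=$(printf '%s\n' Library System usr | xargs echo "File:")
echo "$batched"
```

The first form prints three lines, one per item. The second prints everything on one line because echo only ran once.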

It’s a bit different if you want to run multiple commands, but it’s still simple. Let’s say we want to run echo File: {file} and echo File: {file} is cool:

$ ls | xargs -I {file} sh -c "echo File: {file}; echo File: {file} is cool"
File: Library
File: Library is cool
File: System
File: System is cool
...

Notice the sh -c part. xargs runs a single command per item, so to run several we hand it one sh -c invocation that contains them all.
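One variation worth knowing, sketched here as a suggestion rather than a rule: instead of pasting {} into the script text, pass it as an argument and read it as $1. That way a name containing spaces or shell characters is treated as data, not as code. The _ is just a filler for $0, and "My Files" is a made-up name.

```shell
# The script is single-quoted so $1 is expanded by the inner sh, not the outer shell.
# "_" fills $0; the item from xargs arrives as $1.
out=$(printf '%s\n' Library "My Files" |
  xargs -I{} sh -c 'echo "File: $1"; echo "File: $1 is cool"' _ {})
echo "$out"
```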

jq for JSON

I’ve been learning a lot of Kubernetes lately and the -o json flag is a lifesaver. It prints much more information than the default output and it’s easy to parse once you’ve learned the basics of jq.

Let’s say we’ve got kubectl set up and we’re happily running a cluster. We can get a list of namespaces in two ways:

$ kubectl get ns # normal output
NAME              STATUS   AGE
default           Active   2d
my-app            Active   2d
...

$ kubectl get ns -o json # json output
{
  "apiVersion": "v1",
  "items": [
    {
      "apiVersion": "v1",
      "kind": "Namespace",
      "metadata": {
        "creationTimestamp": "2021-01-01T00:00:00Z",
        "name": "default",
        "resourceVersion": "123456",
        "selfLink": "/api/v1/namespaces/default",
        "uid": "123
...

We can see the JSON is a bit more verbose. It’s also parsable by jq. Any valid JSON is accepted by jq. For this example, let’s say we want to get the apiVersion:

$ kubectl get ns -o json | jq '.apiVersion'
"v1"

A jq filter typically starts with a period. The period is the identity filter: it refers to the current input, which here is the root of the JSON. There are other options, but this is the most common. Then, with simple dot notation, we can access the apiVersion key.

Let’s say we want to loop through each of the items in the items array, and print out that data in a new format, { version: ..., kind: ..., name: ... }:

$ kubectl get ns -o json | jq '.items[] | { version: .apiVersion, kind: .kind, name: .metadata.name }'
{
  "version": "v1",
  "kind": "Namespace",
  "name": "default"
}
{
  "version": "v1",
  "kind": "Namespace",
  "name": "my-app"
}

In that example, .items[] iterates over the items array and emits each element one at a time. Then we make use of jq’s pipe operator to map each element onto our new format. Since that object is the final jq filter, it is what gets printed.

The output of jq can be piped into other commands or into jq again!
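When the next command wants plain text, the -r flag is handy because it prints strings without the surrounding quotes. A minimal sketch, assuming jq is installed and using a trimmed, made-up stand-in for the kubectl output:

```shell
# Assumption: jq is installed. This JSON is a made-up, trimmed stand-in for `kubectl get ns -o json`.
json='{"items":[{"metadata":{"name":"default"}},{"metadata":{"name":"my-app"}}]}'

# -r prints raw strings (no surrounding quotes), one per line.
names=$(printf '%s' "$json" | jq -r '.items[].metadata.name')

# Each name then becomes one xargs invocation.
out=$(printf '%s\n' "$names" | xargs -I{} echo "namespace: {}")
echo "$out"
```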

One last trick, for decoding base64 using jq (this is often useful for viewing kube secrets), let’s use the @base64d filter:

$ echo '{"secret_message": "SG93ZHkh"}' | jq '.secret_message | @base64d'
"Howdy!"

Google Sheets for Math

I used to use Jupyter notebooks, but they’re quite a bit of overhead for simple math. I’ve found that Google Sheets is powerful enough for most of my calculations, and the sheets are easy to share.

I’m not going to bore you with how to use Google Sheets. I’m sure you’ve used it. I want to talk about the formulas. If you’re unfamiliar, you can use the = sign to start a formula. For example, =1+1 will return 2 in the cell.

=IMPORTXML(url, xpath)

This is my favorite formula. It imports data from an HTML or XML page and uses an XPath query to pick out the part you want. Both arguments are required. It’s a bit of a mouthful but it’s very useful for scraping the web for your calculations.

I was using this most when tracking crypto prices. You could say I was a step ahead of SBF.

=IMPORTXML("https://www.coingecko.com/en/coins/bitcoin", "//span[@data-target='price.price']")

If you put this into a cell, it will return a bunch of numbers. The first one is the current price. To constrain this data, we can wrap our formula in =ARRAY_CONSTRAIN(arr, rows, cols):

=ARRAY_CONSTRAIN(IMPORTXML("https://www.coingecko.com/en/coins/bitcoin", "//span[@data-target='price.price']"), 1, 1)

This will return the current price of Bitcoin in that cell. This won’t work for pages that need JavaScript but there are typically multiple sources for data to choose from. So long as you understand your XPath selectors, you can build powerful Google Sheets. CoinGecko wrote a blog post on how to create a trigger that updates your sheet 24/7 if that’s something you’re interested in.

=ROW()

This is a simple one but it’s useful for using the same formula across multiple rows. Its counterpart is =COLUMN(), and they simply return the row or column number of the cell that contains the formula.

Once you’ve got a formula that works (=ROW() is the minimum viable formula), you can copy it across multiple rows and it’ll just work!

To combine this number with a string, you can use the & operator:

=ROW()&" is the row number"

Having a formula that is reusable across cells can save a lot of time.

=INDIRECT(cell_name)

This is a powerful one. It takes a cell reference written as a string and returns that cell’s value. For example, =INDIRECT("B6") reads the value of B6. Because the reference is just a string, you can build it from other formulas: =INDIRECT("B"&ROW()) reads column B of the current row.

You can also use it to reference a cell in a different sheet. For example, if you have a sheet called Sheet1, you can use =INDIRECT("Sheet1!B6") to read the value of B6 in Sheet1.

Conclusion

There are probably some good ones I’ve missed but these are my favorites. I often use these tricks to save myself from repetitive tasks. They’ve also been instrumental in web scraping and data analysis.

A word of caution: I do not like to automate destructive commands. There’s probably a better way.

If you’re reading this post, I’d also advise you to learn keyboard shortcuts for your OS and each of your apps. Clicking is for the birds.

Bonus Tips

pbcopy

For macOS users, you can pipe into pbcopy to copy the output to your clipboard:

$ ls | pbcopy

Give many arguments to mv, kill, etc.

$ kill -9 {123,456,789} # kills 3 processes
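This works because of brace expansion, which shells like bash and zsh perform before the command runs. A harmless way to preview it is to put echo in front. The sketch below wraps the commands in bash -c, since a plain POSIX sh may not expand braces, and notes.txt is a hypothetical file name.

```shell
# echo shows what the shell expands the braces to; nothing is actually killed.
expanded=$(bash -c 'echo kill -9 {123,456,789}')
echo "$expanded"

# An empty first alternative repeats the prefix, which is handy for quick backups.
pair=$(bash -c 'echo cp notes.txt{,.bak}')
echo "$pair"
```

The first echo shows the three PIDs passed as separate arguments. The second shows cp receiving notes.txt and notes.txt.bak.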

Display every Kubernetes secret in every namespace

This will output every secret (decoded from base64) as { name: ..., password: ... } in every namespace:

$ kubectl get ns -o json | jq '.items[].metadata.labels."kubernetes.io/metadata.name"' | xargs -I{} sh -c "echo '\n{}'; kubectl get secrets -n {} -o json | jq '.items[] | { name: .metadata.name, password: .data | map_values(@base64d) }'"

Display the name of every Kubernetes secret in every namespace

This will output every secret name in every namespace:

$ kubectl get ns -o json | jq '.items[].metadata.labels."kubernetes.io/metadata.name"' | xargs -I{} sh -c "echo '\n{}'; kubectl get secrets -n {} -o json | jq '.items[].metadata.name'"
default
secret-1
secret-2

my-app
secret-3
secret-4
...
