search signals | SEO the /ha(rd|cker)/ way

Command Line SEO: Sorting & Filtering

Using Awk & Sort to Manipulate a List of URLs and SEO Data

A photo of colorful canoes on a lake symbolizing the columns and sorting of a CSV file. It’s fairly common to get a list of URLs or a csv dump from an SEO tool or even parsed server logs. When trying to analyze the data, the first reaction is typically to load the data into Excel or Google Docs where the columns can be sorted, deduped, or filtered to narrow the data set to something more manageable. However, a program like Excel can sometimes take several steps to import and selectively filter the data how you might want it. On top of it all, when dealing with large data sets, Excel can sometimes be very slow the mac beachball of death to simply open the file, let alone filter or sort columns & rows.

Using the command line, we can quickly whittle down larger data sets into something much more manageable before loading it into a spreadsheet program.

Large sites can have anywhere from 20,000 pages to over 500,000. So identifying problems at scale is an important task. In the following examples, we’ll work with the csv output from a popular SEO tool Screaming Frog that crawls a site and provides data points like:

  • URLs
  • Status Code
  • Title Tag
  • Number of characters in the Title Tag
  • Meta Description Tag
  • Canonical Tag
  • … additional tags & HTTP header important to SEO

Filtering: Find URLs Missing their Canonical Tags

After crawling a site and exporting the CSV data, internal_all.csv; I can paste the following into the Terminal app on my mac and get all the URLs that are missing their canonical tags.

List of URLs that are Missing the Canonical tag
1
2
awk -F'","' '$3 ~ /200/' internal_all.csv |
awk -F'","' '$21 ~ /^$/ {print $1"\""}' > no_canonicals.csv

Here’s the basic logic in plain english:

  1. Find all URLs that responded with a 200 status code
  2. Check the canonical tag column
  3. If the canonical is blank
  4. Then add the URL to the file no_canonicals.csv

The key is to learn how to use awk “Like A Boss”. Awk is an awesome tool for selecting rows and columns that we find useful and getting rid of everything else.

Let’s breakdown the awk commands:

  • awk -F'","' '$3 ~ /200/' internal_all.csv
    • -F'","' → separate columns by ","
    • $3 → using the status code in column 3
    • ~ /200/ → perform a regEx looking for 200
    • internal_all.csv → get the data from this file
  • awk -F'","' '$21 ~ /^$/ {print $1"\""}' > no_canonicals.csv
    • '$21 ~ /^$/' → using the canonical tag content in column 21, perform a regEx to find all rows where the content is blank
    • {print $1"\""} → output the URL in column 1 and add a closing double quote "
  • > no_canonicals.csv → then save to no_canonicals.csv

Parsing CSV Files with Awk

The -F is an optional flag designating the field separator. The Screaming Frog CSV file is delimited with commas and the text qualifiers are double quotes ". We could have simply told awk to use a comma as a separator, -F,, however, this would have caused problems; Within some of the fields, commas can occur naturally throughout the content such as the meta descriptions, titles, or even URLs. In order to parse the fields into the appropriate columns, we include the double quotes " that surround the fields. In order to tell awk that "," is the entire delimiter, we need to surround it with single quotes ' like this '","'.

Only Get Rows Where the Status Code is 200 OK

Once we have the appropriate columns, we use '$3 ~ /200/' to get all the valid URLs, since status codes like 301s and 404s will also have blank canonical tags. The $3 tells awk to select the status code from the 3rd column then perform a regular expression match, ~, using the pattern /200/.

Then we simply give awk the file which to perform this operation on internal_all.csv.

Checking if the Canonical Field is Blank

We then pipe the valid 200 URLs and related data to another awk search. It’s pretty much the same as the first, but this time we’re going to output the URLs into a usable CSV file. The $21 matches the canonical tags in the 21rd column, and then the regEx looks to see if that column is blank, /^$/. For all the blank URLs, we output {print $1"\""} the first URL in the first column, $1 and then append a double quote at the end, "\"".

Take note that in both awk statements, the regEx command is wrapped in single quotes '. This is significant because awk evaluates things differently when the commands are wrapped in a single quote versus a double quote. You’re almost always going to use a single quote and then escape out any double quotes like we’ve already done.

Outputting the Title Along with the URL

If you want to get more than just the URL, you can add another field like the title tag in column $5.

URLs and Titles for Missing Canonical tag
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5"\""}' \
internal_all.csv > no_canonicals.csv

Animated walkthrough of using awk on the command line to process a screaming frog CSV to output URLs with missing canonicals and their corresponding titles The statement is slightly different, but more concise this way. Instead of piping the results from the first statement (checking for 200 status code) to the next statement, we simply wrap the regEx with parentheses and use the double amperstands && to check for both the 200 status code AND the empty canonical tag.

For the CSV output, we tell awk that the “Output File Separator” OFS="\",\""; is ","; That way when we list out the columns we want printed $1, $5, awk automatically inserts "," between each field. We then cap off the end of the row by adding a double quote "\"".

Taking it a step further, if you wanted to also include the <h1> value in the 11th column, simply add it to the list.

URLs, Titles, and H1s for Missing Canonical tag
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5, $11"\""}' \
internal_all.csv > no_canonicals.csv

note: if executing the statement all on 1 line, you don’t need the backslashes \ . They’re inserted for readability reasons and are necessary when creating multi-line statements.

Super Fast Columns Sorting

Another common SEO task is to get a list of all the 404s + 301s, and you’ll probably want to start with the URLs that have the most inlinks because they’ll have the largest impact on the user experience (Ux).

To achieve this simply paste this on the command line:

A list of 404s and 301s, sorted by Number of Inlinks then by URL
1
2
awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv |
sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv

Because we’re doing this at the command line it’ll be super fast, even for a 100 MB file, compared to doing it in Excel or Google Spreadsheets.

Here’s the basic login in plain english:

  1. Find all URLs with a 404 or 301 response status code
  2. Get the status code, inlink count, and URL
  3. Sort the columns in descending order
  4. First by status code
  5. Then by # of inlinks
  6. Then by URL
  7. Then save it into the file 404s-301s-sorted-by-inlinks.csv

Let’s break down the bash commands:

  • awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv
    • -F'","' → separate columns by ","
    • $3 → using the status code in column 3
    • ~ /(404|301)/ → perform a regEx matching 404s or 301s
    • OFS="\t"; → for the output, use tabs to delimited the columns
    • $3 → output the status code in column 3
    • $25 → output the inlink count in column 25
    • $1"\"" → output the URL in column 1 and cap the row with a closing double quote "
    • internal_all.csv → get the data from this file
  • sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv
    • sort -k1,1n → sort rows by status code in the 1st column
    • -k2,2nr → then sort by inlink count in descending order r
    • -k3,3d → then sort the URLs alphabetically d
  • > 404s-301s-sorted-by-inlinks.csv → then save to filename 404s-301s-sorted-by-inlinks.csv

The sort command is super duper fast and flexible. The optional n and r flags sort the columns numerically and in reverse / descending order. Because we outputted the columns using a tab to delimit each field in the awk statement, this allowed us to sort the status codes and inbound links as numbers instead of strings, which is a BIG difference. If the numeric fields were encapsulated in double quotes " like "404","100","http://www.searchsignals.com/" then the status codes and inbound links would be evaluated as strings. So “1”, “10”, and “100” would group together and “2”, “22”, “200” would be sorted together, instead of being sorted numerically as 1, 2, 10, 22, 100, 200.

Using the -k flag, we’re telling sort to first sort by the status codes in the first column -k1,1n; the trailing n tells sort to order it numerically. Similarly we sort the inbound link counts in the 2nd column numerically -k2,2nr but in descending order, using the r command which means reverse order. Lastly we sort by URLs in the 3rd column -k3,3d alphabetically by using the d command which means “dictionary” order.

Other Cool Commands like Merging & Deduping

Although I won’t go as in-depth, you should definitely check out the other features of awk and sort. You can easily do this by typing man awk or man sort which are the “man pages” aka manual / documentation. Otherwise, stackoverflow has some great threads on it.

For example, awk has a feature to split a field into smaller parts and sort conveniently comes with a -u flag which means unique, which will dedupe a file based on the column you provide it.

Getting Example URLs for Common 404 Errors

So, assume we had 10,000+ 404 errors. There’s a high likelihood that many of the errors come from similar URLs (http://example.com/photos/100s-of-photos, http://example.com/users/lots-of-usernames). We can use split and unique to give us 1 relevant example from each directory.

Example URLs Causing 404 Errors
1
2
3
awk -F'","' \
'$3 ~ /404/ {OFS="\t"; split($1,a,"/"); print $1"\"", a[3], a[4], a[5]}' \
internal_all.csv | sort -u -k3,3 > 404_example_urls.csv

Merging Multiple Screaming Frog Files

Lastly, lots of people run multiple instances of Screaming Frog and it’s common to have multiple CSV files which you need to combine. Normally cat is a great command line tool to do this, and Christopher Yee has a nice tutorial on how to use cat. But Screaming Frog always outputs the header row, which we don’t need and can mess up our data. So instead, we’ll use tail.

Merging CSV Files Together with tail
1
tail -qn +2 *.csv > internal_all_merged.csv

The n flag followed by the +2 basically tells tail to exclude the first row of each file. Then the q flag tells tail not to output the filename. q works on the mac but might now always work for all command line systems. We’re also assuming that all the CSV files in the current directory are files you want to merge together. If you have them named as internal_all-1.csv, internal_all-2.csv, etc. you can change the command to internal_all*.csv instead.

Merging Specific CSV Files Together
1
tail -qn +2 internal_all*.csv > internal_all_merged.csv

If you want something a bit more universal, try using a combination of find and xargs.

Merging Specific CSV Files Together with find and xargs
1
2
find . -name "*.csv" |
xargs -n 1 tail -n +2 > internal_all_merged.csv

Command Line Cookbook for Screaming Frog

Here is a summary of the common SEO tasks from above, as well as, a few other quick commands you can do:

URLs Missing Canonical
1
2
awk -F'","' '($3 ~ /200/) && ($21 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_canonicals.csv
URLs & Titles for Missing Canonical
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5"\""}' \
internal_all.csv > no_canonicals_plus_titles.csv
URLs, Titles, & H1s for Missing Canonical
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5, $11"\""}' \
internal_all.csv > no_canonicals_plus_titles_h1.csv
URLs Missing Title Tags
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($5 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_title_tag.csv
URLs Missing Meta Descriptions
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($8 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_meta_description.csv
URLs Missing H1s
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($11 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_h1s.csv
URL, Description, and Length of Pages w/ Short Meta Descriptions (< 50 Chars)
1
2
3
awk -F'","' '$3 ~ /200/' internal_all.csv |
awk -F'","' '$8 > 0 && $8 < 50 {OFS="\t"; print $8, $9, $1"\""}' |
sort -k1,1n -k2,2d > short_meta_descriptions.csv
URLs that Have a Canonical tag
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /.+/) {print $1"\""}' \
internal_all.csv > has_canonicals.csv
Example URLs Causing 404 Errors
1
2
3
awk -F'","' '$3 ~ /404/' internal_all.csv |
awk -F'","' '{OFS="\t"; split($1,a,"/"); print $1"\"", a[3], a[4], a[5]}' |
sort -u -k3,3 > 404_example_urls.csv
404s & 301s sorted by Number of Inlinks then URL
1
2
awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv |
sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv
Sort Similar or Duplicate Title Tags Together
1
2
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$5"\"", $1"\""}' internal_all.csv |
sort > cluster_titles.csv
Sort Similar or Duplicate Meta Descriptions Together
1
2
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$8"\"", $1"\""}' internal_all.csv |
sort > cluster_titles.csv
Sort Similar or Duplicate H1 Tags Together
1
2
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$11"\"", $1"\""}' internal_all.csv |
sort > cluster_titles.csv
URL & # of Outbound Links w/ Less than 5 Outbound Links
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($26 < 5) {OFS="\t"; print $26, $1"\""}' \
internal_all.csv | sort -k1,1n -k2,2d > few_outbound_links.csv
URLs With Fewer than 5 Inbound Links
1
2
3
awk -F'","' \
'($3 ~ /200/) && ($25 < 5) {OFS="\t"; print $25, $1"\""}' \
internal_all.csv | sort -k1,1n -k2,2d > few_inbound_links.csv
Merging Specific CSV Files Together
1
tail -qn +2 internal_all*.csv > internal_all_merged.csv
Merging Specific CSV Files Together with find and xargs
1
2
find . -name "*.csv" |
xargs -n 1 tail -n +2 > internal_all_merged.csv

Update » Changed column numbers to match the latest version of Screaming Frog v2.11 (e.g. $23 to $21 for canonicals)

Comments