Using Awk & Sort to Manipulate a List of URLs and SEO Data
It’s fairly common to get a list of URLs or a CSV dump from an SEO tool, or even parsed server logs. When trying to analyze the data, the first reaction is typically to load it into Excel or Google Docs, where the columns can be sorted, deduped, or filtered to narrow the data set to something more manageable. However, a program like Excel can take several steps to import and selectively filter the data the way you want. On top of that, when dealing with large data sets, Excel can be very slow to simply open the file, let alone filter or sort columns and rows.
Using the command line, we can quickly whittle down larger data sets into something much more manageable before loading it into a spreadsheet program.
Large sites can have anywhere from 20,000 pages to over 500,000, so identifying problems at scale is an important task. In the following examples, we’ll work with the CSV output from Screaming Frog, a popular SEO tool that crawls a site and provides data points like the ones below (a quick way to confirm the column numbers in your own export follows the list):
- URLs
- Status Code
- Title Tag
- Number of characters in the Title Tag
- Meta Description Tag
- Canonical Tag
- … additional tags & HTTP headers important to SEO
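If you want to double-check which column number holds which field in your export (column positions have shifted between Screaming Frog versions), a quick way is to print the header row with each field’s number. This is my own addition, assuming the same internal_all.csv file and "," delimiter used throughout this post:
head -1 internal_all.csv |
awk -F'","' '{for (i = 1; i <= NF; i++) print i, $i}'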
Filtering: Find URLs Missing their Canonical Tags
After crawling a site and exporting the CSV data to internal_all.csv, I can paste the following into the Terminal app on my Mac and get all the URLs that are missing their canonical tags.
List of URLs that are Missing the Canonical tag
awk -F'","' '$3 ~ /200/' internal_all.csv |
awk -F'","' '$21 ~ /^$/ {print $1"\""}' > no_canonicals.csv
Here’s the basic logic in plain English:
- Find all URLs that responded with a 200 status code
- Check the canonical tag column
- If the canonical is blank
- Then add the URL to the file no_canonicals.csv (see the quick check below)
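As a quick sanity check (my own habit, not part of the original workflow), you can see how many URLs ended up in the file and eyeball a few of them:
wc -l no_canonicals.csv     # how many URLs are missing a canonical
head no_canonicals.csv      # preview the first few rows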
The key is to learn how to use awk “Like A Boss”. Awk is an awesome tool for selecting rows and columns that we find useful and getting rid of everything else.
Let’s break down the awk commands:
awk -F'","' '$3 ~ /200/' internal_all.csv
- -F'","' → separate columns by ","
- $3 → using the status code in column 3
- ~ /200/ → perform a regEx looking for 200
- internal_all.csv → get the data from this file
awk -F'","' '$21 ~ /^$/ {print $1"\""}' > no_canonicals.csv
- '$21 ~ /^$/' → using the canonical tag content in column 21, perform a regEx to find all rows where the content is blank
- {print $1"\""} → output the URL in column 1 and add a closing double quote "
- > no_canonicals.csv → then save to no_canonicals.csv
Parsing CSV Files with Awk
The -F is an optional flag designating the field separator. The Screaming Frog CSV file is delimited with commas, and the text qualifiers are double quotes ". We could have simply told awk to use a comma as the separator (-F,), but this would have caused problems: within some of the fields, commas occur naturally in the content, such as in meta descriptions, titles, or even URLs. In order to parse the fields into the appropriate columns, we include the double quotes " that surround the fields. To tell awk that "," is the entire delimiter, we surround it with single quotes ' like this: '","'.
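To see why the plain comma would fail, here’s a tiny, made-up row with a comma inside one of the quoted fields (not real Screaming Frog output):
row='"https://example.com/","200","Shoes, boots & more"'
echo "$row" | awk -F, '{print $3}'       # → "Shoes                (split at the embedded comma)
echo "$row" | awk -F'","' '{print $3}'   # → Shoes, boots & more"  (field kept intact)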
Only Get Rows Where the Status Code is 200 OK
Once we have the appropriate columns, we use '$3 ~ /200/' to get all the valid URLs, since status codes like 301s and 404s will also have blank canonical tags. The $3 tells awk to select the status code from the 3rd column and then perform a regular expression match, ~, using the pattern /200/. Then we simply give awk the file on which to perform this operation, internal_all.csv.
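As a small optional variant (my addition, not from the original workflow), you can make the status-code test an exact match rather than a substring match, which guards against any value that merely contains 200:
awk -F'","' '$3 == "200"' internal_all.csv    # exact string comparison
awk -F'","' '$3 ~ /^200$/' internal_all.csv   # anchored regEx, same effect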
Checking if the Canonical Field is Blank
We then pipe the valid 200 URLs and related data to another awk search. It’s pretty much the same as the first, but this time we’re going to output the URLs into a usable CSV file. The $21 matches the canonical tags in the 21st column, and then the regEx looks to see if that column is blank, /^$/. For all the blank rows, {print $1"\""} outputs the URL in the first column, $1, and appends a double quote at the end, "\"".
Take note that in both awk statements, the regEx command is wrapped in single quotes '. This is significant because the shell treats the program differently when it’s wrapped in single quotes versus double quotes. You’re almost always going to use single quotes and then escape any double quotes, like we’ve already done.
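Here’s a minimal illustration of the difference (a made-up one-liner, and it assumes $1 isn’t already set in your shell):
echo 'a b c' | awk '{print $1}'   # single quotes: awk sees {print $1} and prints "a"
echo 'a b c' | awk "{print $1}"   # double quotes: the shell expands $1 to nothing first,
                                  # so awk sees {print } and prints the whole line "a b c"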
Outputting the Title Along with the URL
If you want to get more than just the URL, you can add another field, like the title tag in column $5.
URLs and Titles for Missing Canonical tag
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5"\""}' \
internal_all.csv > no_canonicals.csv

The statement is slightly different, but more concise this way. Instead of piping the results from the first statement (checking for a 200 status code) into a second one, we simply wrap each regEx in parentheses and use the double ampersand && to check for both the 200 status code AND the empty canonical tag.
For the CSV output, we tell awk that the “Output Field Separator”, OFS="\",\"";, is ",". That way, when we list out the columns we want printed, $1, $5, awk automatically inserts "," between each field. We then cap off the end of the row by adding a double quote, "\"".
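Here’s a minimal sketch of what OFS does, using made-up whitespace-separated input rather than the Screaming Frog file; the comma between print’s arguments is replaced by whatever OFS holds:
echo 'https://example.com/page Example-Title' |
awk '{OFS="\",\""; print "\""$1, $2"\""}'
# → "https://example.com/page","Example-Title"
# (the real command doesn't need the leading \" because the URL field already starts with one)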
Taking it a step further, if you wanted to also include the <h1> value in the 11th column, simply add it to the list.
URLs, Titles, and H1s for Missing Canonical tag
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5, $11"\""}' \
internal_all.csv > no_canonicals.csv
Note: if executing the statement all on one line, you don’t need the backslashes \. They’re inserted for readability and are necessary when creating multi-line statements.
Super Fast Column Sorting
Another common SEO task is to get a list of all the 404s and 301s, and you’ll probably want to start with the URLs that have the most inlinks, because they’ll have the largest impact on the user experience (UX).
To achieve this, simply paste the following on the command line:
A list of 404s and 301s, sorted by Number of Inlinks then by URL
awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv |
sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv
Because we’re doing this at the command line it’ll be super fast, even for a 100 MB file, compared to doing it in Excel or Google Spreadsheets.
Here’s the basic logic in plain English:
- Find all URLs with a 404 or 301 response status code
- Get the status code, inlink count, and URL
- Sort the rows
- First by status code
- Then by # of inlinks (descending)
- Then by URL
- Then save it into the file 404s-301s-sorted-by-inlinks.csv
Let’s break down the bash commands:
awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv
- -F'","' → separate columns by ","
- $3 → using the status code in column 3
- ~ /(404|301)/ → perform a regEx matching 404s or 301s
- OFS="\t"; → for the output, use tabs to delimit the columns
- $3 → output the status code in column 3
- $25 → output the inlink count in column 25
- $1"\"" → output the URL in column 1 and cap the row with a closing double quote "
- internal_all.csv → get the data from this file
sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv
- sort -k1,1n → sort rows by status code in the 1st column
- -k2,2nr → then sort by inlink count in descending order (r)
- -k3,3d → then sort the URLs alphabetically (d)
- > 404s-301s-sorted-by-inlinks.csv → then save to the filename 404s-301s-sorted-by-inlinks.csv
Sorting URLs by Inbound Links
The sort command is super duper fast and flexible. The optional n and r flags sort the columns numerically and in reverse / descending order. Because we output the columns using a tab to delimit each field in the awk statement, the status codes and inbound links can be sorted as numbers instead of strings, which is a BIG difference. If the numeric fields were encapsulated in double quotes ", like "404","100","https://www.searchsignals.com/", then the status codes and inbound links would be evaluated as strings. So "1", "10", and "100" would group together and "2", "22", "200" would sort together, instead of being ordered numerically as 1, 2, 10, 22, 100, 200.
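Here’s a tiny illustration of the difference, using a few made-up numbers rather than Screaming Frog data:
printf '1\n10\n2\n22\n100\n200\n' | sort      # string order:  1, 10, 100, 2, 200, 22
printf '1\n10\n2\n22\n100\n200\n' | sort -n   # numeric order: 1, 2, 10, 22, 100, 200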
Using the -k flag, we’re telling sort to first sort by the status codes in the first column, -k1,1n; the trailing n tells sort to order it numerically. Similarly, we sort the inbound link counts in the 2nd column numerically, -k2,2nr, but in descending order, using the r flag, which means reverse order. Lastly, we sort the URLs in the 3rd column, -k3,3d, alphabetically, using the d flag, which means “dictionary” order.
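If it helps, here’s a small worked example of those three keys on a few made-up tab-separated rows in the same status code / inlinks / URL layout:
printf '404\t3\tc.html\n301\t5\tb.html\n404\t12\ta.html\n' | sort -k1,1n -k2,2nr -k3,3d
# → 301   5    b.html
#   404   12   a.html
#   404   3    c.html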
Other Cool Commands like Merging & Deduping
Although I won’t go as in-depth, you should definitely check out the other features of awk and sort. You can easily do this by typing man awk or man sort, which open the “man pages”, aka the manual / documentation. Otherwise, Stack Overflow has some great threads on them.
For example, awk has a split function that breaks a field into smaller parts, and sort conveniently comes with a -u flag, which means unique and will dedupe a file based on the column you provide it.
Getting Example URLs for Common 404 Errors
So, assume we had 10,000+ 404 errors. There’s a high likelihood that many of the errors come from similar URLs (https://example.com/photos/100s-of-photos, https://example.com/users/lots-of-usernames). We can use split and sort -u to give us one relevant example from each directory.
Example URLs Causing 404 Errors
awk -F'","' \
'$3 ~ /404/ {OFS="\t"; split($1,a,"/"); print $1"\"", a[3], a[4], a[5]}' \
internal_all.csv | sort -u -k3,3 > 404_example_urls.csv
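To make the split() step concrete, here’s roughly what it does to a single made-up URL; a[3], a[4], and a[5] end up holding the hostname and the first two path segments:
echo 'https://example.com/photos/12345' |
awk '{split($0, a, "/"); print a[3], a[4], a[5]}'
# → example.com photos 12345
# (a[1] is "https:" and a[2] is the empty string between the two slashes)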
Merging Multiple Screaming Frog Files
Lastly, lots of people run multiple instances of Screaming Frog, and it’s common to have multiple CSV files that you need to combine. Normally cat is a great command line tool for this, and Christopher Yee has a nice tutorial on how to use cat. But Screaming Frog always outputs the header row, which we don’t need and which can mess up our data. So instead, we’ll use tail.
Merging CSV Files Together with tail
tail -qn +2 *.csv > internal_all_merged.csv
The -n flag followed by +2 basically tells tail to exclude the first row of each file. The -q flag tells tail not to output the filename; -q works on the Mac but might not always work on other command line systems. We’re also assuming that all the CSV files in the current directory are files you want to merge together. If you have them named internal_all-1.csv, internal_all-2.csv, etc., you can change the command to internal_all*.csv instead.
Merging Specific CSV Files Together
tail -qn +2 internal_all*.csv > internal_all_merged.csv
If you want something a bit more universal, try using a combination of find and xargs.
Merging Specific CSV Files Together with find and xargs
find . -name "*.csv" |
xargs -n 1 tail -n +2 > internal_all_merged.csv
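And if your system’s tail doesn’t support -q at all, an awk one-liner (my own fallback, not from the original post) does the same job: FNR restarts at 1 for every input file, so FNR > 1 drops each file’s header row.
awk 'FNR > 1' internal_all*.csv > internal_all_merged.csv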
Command Line Cookbook for Screaming Frog
Here is a summary of the common SEO tasks from above, as well as a few other quick commands you can run:
URLs Missing Canonical
awk -F'","' '($3 ~ /200/) && ($21 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_canonicals.csv
URLs & Titles for Missing Canonical
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5"\""}' \
internal_all.csv > no_canonicals_plus_titles.csv
URLs, Titles, & H1s for Missing Canonical
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /^$/) {OFS="\",\""; print $1, $5, $11"\""}' \
internal_all.csv > no_canonicals_plus_titles_h1.csv
URLs Missing Title Tags
awk -F'","' \
'($3 ~ /200/) && ($5 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_title_tag.csv
URLs Missing Meta Descriptions
awk -F'","' \
'($3 ~ /200/) && ($8 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_meta_description.csv
URLs Missing H1s
awk -F'","' \
'($3 ~ /200/) && ($11 ~ /^$/) {print $1"\""}' \
internal_all.csv > no_h1s.csv
URL, Description, and Length of Pages w/ Short Meta Descriptions (< 50 Chars)
awk -F'","' '$3 ~ /200/' internal_all.csv |
awk -F'","' '$8 > 0 && $8 < 50 {OFS="\t"; print $8, $9, $1"\""}' |
sort -k1,1n -k2,2d > short_meta_descriptions.csv
URLs that Have a Canonical tag
awk -F'","' \
'($3 ~ /200/) && ($21 ~ /.+/) {print $1"\""}' \
internal_all.csv > has_canonicals.csv
Example URLs Causing 404 Errors
awk -F'","' '$3 ~ /404/' internal_all.csv |
awk -F'","' '{OFS="\t"; split($1,a,"/"); print $1"\"", a[3], a[4], a[5]}' |
sort -u -k3,3 > 404_example_urls.csv
404s & 301s sorted by Number of Inlinks then URL
awk -F'","' '$3 ~ /(404|301)/ {OFS="\t"; print $3, $25, $1"\""}' internal_all.csv |
sort -k1,1n -k2,2nr -k3,3d > 404s-301s-sorted-by-inlinks.csv
Sort Similar or Duplicate Title Tags Together
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$5"\"", $1"\""}' internal_all.csv |
sort > cluster_titles.csv
Sort Similar or Duplicate Meta Descriptions Together
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$8"\"", $1"\""}' internal_all.csv |
sort > cluster_meta_descriptions.csv
Sort Similar or Duplicate H1 Tags Together
awk -F'","' '$3 ~ /200/ {OFS="\t"; print "\""$11"\"", $1"\""}' internal_all.csv |
sort > cluster_h1s.csv
URL & # of Outbound Links w/ Less than 5 Outbound Links
awk -F'","' \
'($3 ~ /200/) && ($26 < 5) {OFS="\t"; print $26, $1"\""}' \
internal_all.csv | sort -k1,1n -k2,2d > few_outbound_links.csv
URLs With Fewer than 5 Inbound Links
awk -F'","' \
'($3 ~ /200/) && ($25 < 5) {OFS="\t"; print $25, $1"\""}' \
internal_all.csv | sort -k1,1n -k2,2d > few_inbound_links.csv
Merging Specific CSV Files Together
tail -qn +2 internal_all*.csv > internal_all_merged.csv
Merging Specific CSV Files Together with find and xargs
find . -name "*.csv" |
xargs -n 1 tail -n +2 > internal_all_merged.csv
Update » Changed column numbers to match the latest version of Screaming Frog v2.11 (e.g. $23 to $21 for canonicals)