Sed - How it works, useful techniques and patterns

To use sed well, it really helps to understand how it works and then to see some examples of putting it into practice, so I will try to show both.

Sed is a good tool for mass file editing

Despite having syntax that can quickly devolve into an unintelligible string of letters and symbols, it really does work well. It was designed specifically for text processing, which I think is the main reason it seems unusual and unfamiliar, but also the reason it works well for editing text files. When pulling data from various US Census sources to create maps (Data Processed by Sed), sed was my main tool. I used it to make many edits to the raw text data before adding it to MongoDB (see the MongoDB aggregation write-up).

Keep the individual commands simple

Sed can actually do some pretty complex stuff. Within 48 hours of writing that complex stuff as one long command, however, it will often look like gibberish. The better plan is to accomplish tasks with a number of simpler commands whose logic you can actually follow.

There are multiple ways to string small commands together

These are all equivalent. A line of input is passed to each command, it operates on the line, and then the line with any changes is passed to the next command.

printf '%s\n' abc | sed '{s/a/A/; s/b/B/; s/c/C/}'
printf '%s\n' abc | sed -e 's/a/A/' -e 's/b/B/' -e 's/c/C/'
printf '%s\n' abc | sed 's/a/A/; s/b/B/; s/c/C/'

So when you want to break commands into pieces, you have options.
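Because each command receives the line as the previous command left it, the order of the commands matters. A minimal sketch with made-up input:

```shell
# s/a/b/ turns "abc" into "bbc"; s/b/c/ then replaces the first b it sees,
# which is the one the first command just created.
chained=$(printf '%s\n' abc | sed 's/a/b/; s/b/c/')
echo "$chained"   # prints: cbc
```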

Commands can also be pulled from a file:

If you have a lot of commands to run or a nested structure that is easier to read on several lines, putting the commands in a file is a good choice. The search/replace commands here would all be run sequentially on the first ten lines of the file passed to them.
sedcommand.txt:

1,10{
s/regex1/replace1/g;
s/regex2/replace2/g;
s/regex3/replace3/g;
s/regex4/replace4/g;
s/regex5/replace5/g;
s/regex6/replace6/g;
s/regex7/replace7/g;
s/regex8/replace8/g;
}

Calling this file with sed:
Use the -f flag and provide the name of the file. Without -n, sed prints every line of the input, changed or not.
sed -f sedcommand.txt input.file
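Here is a small self-contained run of the same idea. The file names and sample text are made up for illustration, and everything lives in a temporary directory:

```shell
# Hypothetical file names, created in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' apple apple apple > "$workdir/input.txt"

# The script: both substitutions apply only to lines 1 through 2.
cat > "$workdir/commands.sed" <<'EOF'
1,2{
s/a/A/
s/p/P/g
}
EOF

from_file=$(sed -f "$workdir/commands.sed" "$workdir/input.txt")
echo "$from_file"
rm -r "$workdir"
```

The first two lines come out as APPle and the third is left as apple, since it falls outside the 1,2 range.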

Lines of input flow through the commands you specify

Sed is called data driven because each line of input from a file or files is fed into the script in turn, not unlike automatically looping through the file line by line.
Each line can then determine what sed does, including pulling in more data (lines) or deciding which parts of the script to execute. The data goes through the commands you specify and is then output with any changes those commands made. The next unused line of data is then fed in.

The current line/lines being acted on is called the pattern space

Referring to input lines gets confusing once you have added more lines. When I refer to ‘pattern space’ I mean all of the text currently being processed, whether that is one line or many.
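One way to peek at the pattern space is the l command, which prints it unambiguously. In this sketch (made-up input), N pulls a second line in, so the pattern space holds two lines at once:

```shell
# l shows an embedded newline as \n and marks the end of the pattern space with $.
pattern_space=$(printf '%s\n' one two | sed -n 'N;l')
echo "$pattern_space"   # prints: one\ntwo$
```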

Find and replace is the simplest use of sed

s/regex/replace/flags
It checks each line of input for a match of the regex and replaces the first one it finds. Input files are provided by find/xargs here.

# the g flag means global replace; replaces all instances in a line
find . -type f -print0 | xargs -0 sed -r "s/One/Two/g"

-r turns on extended regular expressions, which I almost always use. In GNU sed, -E is an equivalent and more portable spelling.
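The effect of the g flag is easiest to see on a line with more than one match. A quick sketch with made-up input:

```shell
# Without g only the first match on the line is replaced.
first_only=$(printf '%s\n' 'One One' | sed 's/One/Two/')
# With g every match on the line is replaced.
every_match=$(printf '%s\n' 'One One' | sed 's/One/Two/g')
echo "$first_only"    # prints: Two One
echo "$every_match"   # prints: Two Two
```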

Patterns can be used to select lines for further actions

A regex between slashes followed by a command means that the command runs only on lines matching that regex. To run multiple commands, put them inside braces. The examples below use -n so that only explicitly printed output appears.

sed '/regex/{ commands }'
# Print the line numbers of lines that match a regex
sed -n -r '/regex/=' inputfile.txt
# extract only lines matching the regex, insert an empty line after each
sed -n -r '/regex/{G;p;}' inputfile.txt
# same idea, but print only the lines NOT matching the regex
sed -n -r '/regex/!{p}' inputfile.txt
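To make the selection concrete, here is a sketch on three made-up lines, using -n so that only explicitly requested output appears:

```shell
sample=$(printf '%s\n' 'keep 1' 'skip' 'keep 2')

# = prints the line number of each selected line.
line_numbers=$(printf '%s\n' "$sample" | sed -n '/keep/=')
# p prints the selected lines themselves.
matching=$(printf '%s\n' "$sample" | sed -n '/keep/p')
# ! inverts the selection.
not_matching=$(printf '%s\n' "$sample" | sed -n '/keep/!p')

echo "$line_numbers"   # 1 and 3, one per line
echo "$matching"       # keep 1 and keep 2, one per line
echo "$not_matching"   # skip
```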

You can use more than one pattern match, nested patterns

In this case the mainregex selects lines and passes them to the commands in the braces, which use more pattern matching to determine whether to run the command(s) they specify.

sed -r '/mainregex/{ s/regex1/replace/g;  /subregex2/p;  /subregex3/{G;p;} }' inputfile.txt      
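A smaller sketch of the same nesting, with made-up log lines. The inner pattern is tested against the line as the earlier substitution left it:

```shell
# Only lines starting with "error" enter the braces. Inside, the substitution
# runs first, and the inner /DISK/ test sees its result.
nested=$(printf '%s\n' 'error: disk full' 'info: disk ok' 'error: net down' |
    sed -n '/^error/{ s/disk/DISK/; /DISK/p; }')
echo "$nested"   # prints: error: DISK full
```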

Run commands on a range of lines

You can specify a literal range, or use patterns that will mark the start and end of the range.

sed -n -r '50,300{p}' inputfile.txt
sed -n -r '/regex1/,/regex2/{p}' inputfile.txt
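Both kinds of range on a tiny made-up input, as a sketch:

```shell
lines=$(printf '%s\n' a START b END c)

# Literal range: lines 2 through 3.
by_number=$(printf '%s\n' "$lines" | sed -n '2,3p')
# Pattern range: from the first line matching START through the next line matching END.
by_pattern=$(printf '%s\n' "$lines" | sed -n '/START/,/END/p')

echo "$by_number"    # START and b
echo "$by_pattern"   # START, b and END
```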

Testing your commands

By default sed prints to stdout and does not modify the file, so you can see what effect a command had. If the files are large, use the -n option:
sed -n '/regex/p' inputfile.txt
This suppresses all printing of lines except the ones you explicitly print with p, so you don’t get the whole content.

Editing files in place

By default sed takes input from the file and outputs it to the terminal, which is useful for seeing the result. Once you are satisfied with your command, you can modify files directly with the -i flag.

sed -r -i.bak 's/regex/replace/g' inputfile

The .bak part means sed will create a backup file named inputfile.bak.
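A sketch of an in-place edit with a backup, run on a throwaway file. The file name and contents are made up:

```shell
# Hypothetical file name in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'colour chart' > "$workdir/notes.txt"

# Edit in place, keeping the original next to it with a .bak suffix.
sed -i.bak 's/colour/color/g' "$workdir/notes.txt"

edited=$(cat "$workdir/notes.txt")
backup=$(cat "$workdir/notes.txt.bak")
echo "edited: $edited"   # edited: color chart
echo "backup: $backup"   # backup: colour chart
rm -r "$workdir"
```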

You almost never want to use -i with -n!

If you are testing a command with the suppress automatic printing flag (-n) and then add the -i flag, you will usually wipe most of the file that you were working on. Remember to remove -n before modifying files in place!
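To see the damage safely, this sketch makes the mistake on purpose on a throwaway file (made-up name and contents). The backup is the only thing that keeps the original:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'keep me' 'change me' 'keep me too' > "$workdir/list.txt"

# With -n and the p flag only the changed line is printed, and with -i
# that printed output is all that gets written back to the file.
sed -n -i.bak 's/change/CHANGE/p' "$workdir/list.txt"

remaining=$(cat "$workdir/list.txt")
backup_lines=$(grep -c '' "$workdir/list.txt.bak")
echo "file now contains: $remaining"          # only: CHANGE me
echo "lines saved in backup: $backup_lines"   # 3
rm -r "$workdir"
```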

Actions besides find/replace

N - add the next line to the line being processed. Useful for anything you want to be able to work with multiple lines.

# find lines matching regex1, append the next line, print both only if the
# combined line matches regex2
sed -n -r "/^regex1/{N; /regex2/p } " inputfile 
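A concrete sketch of the same N pattern, with made-up records where each name line is followed by a status line:

```shell
# For every line starting with "name", pull in the following line,
# then print the pair only if the combined text contains "failed".
failed_pairs=$(printf '%s\n' 'name: a' 'status: ok' 'name: b' 'status: failed' |
    sed -n '/^name/{N; /failed/p;}')
echo "$failed_pairs"   # name: b and status: failed, on two lines
```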

d/D - delete the whole current pattern space (d), or only its first line (D). If the pattern space holds a single line, both do the same thing.

sed '$!N; /^\(.*\)\n\1$/!P; D'
# Deletes consecutive repeated lines
# $!N; - add the next line to pattern space if not at the last line
# /^\(.*\)\n\1$/ - Regex to check if the patterns before and after the 
# newline match, indicating repeated lines.
# !P - if the lines don't repeat, Print the first
# D - delete the first line no matter what 
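Running that command on made-up input shows what it does and does not remove. Only repeats that sit next to each other are dropped:

```shell
# Adjacent repeats collapse to a single line.
deduped=$(printf '%s\n' a a b b b c | sed '$!N; /^\(.*\)\n\1$/!P; D')
# Repeats separated by another line are left alone.
not_adjacent=$(printf '%s\n' a b a | sed '$!N; /^\(.*\)\n\1$/!P; D')
echo "$deduped"        # a, b, c
echo "$not_adjacent"   # a, b, a
```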

G - append a newline followed by the contents of the hold space to the pattern space. If you are NOT using the hold space, it is empty, so G simply adds a newline.
Don’t know what hold space is? Then you very likely aren’t using it.

# print the line matching the regex (p;)
# Then replace it with the next line (n;), append a newline (G;) and print again (p;)
# prints a matching line, the next line, and a blank line for prettiness.
sed -n '/regex/{p;n;G;p}' Full_Conversions.bat
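The same p;n;G;p sequence on made-up input, as a sketch. Counting the output lines confirms that the blank line is really there:

```shell
input=$(printf '%s\n' 'x' 'match 1' 'after 1' 'y')

# Prints the matching line, then the line after it followed by an empty line.
shown=$(printf '%s\n' "$input" | sed -n '/match/{p;n;G;p;}')
# grep -c '' counts every line, including the empty one.
shown_count=$(printf '%s\n' "$input" | sed -n '/match/{p;n;G;p;}' | grep -c '')

echo "$shown"
echo "lines printed: $shown_count"   # 3: match 1, after 1, and a blank line
```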

Acting on large numbers of files

I usually use find and pipe to xargs/sed.

# Run on all the files, but do not modify. Print only lines where
# the regex was found/replaced. For changing markdown header level.
sudo find ./src -type f -iname  "*.md" -print0 | xargs -0 sed -n -r 's/####/###/gp'
# To actually change files, use -i.bak, remove -n, and drop the p flag
# If -n is not removed, you will end up deleting most of the contents!
# If the p flag is kept without -n, every changed line is written twice.
sudo find ./src -type f -iname  "*.md" -print0 | xargs -0 sed -i.bak -r 's/####/###/g'
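The preview-then-apply workflow can be tried end to end in a scratch directory. This is a sketch with made-up file names and contents, and it runs without sudo because the files belong to the current user:

```shell
docs=$(mktemp -d)
printf '%s\n' '#### Title' 'body' > "$docs/a.md"
printf '%s\n' '# Top' > "$docs/b.md"

# Dry run: print only the lines that would change.
preview=$(find "$docs" -type f -iname '*.md' -print0 | xargs -0 sed -n 's/####/###/gp')

# Real run: no -n, no p flag, backups written with a .bak suffix.
find "$docs" -type f -iname '*.md' -print0 | xargs -0 sed -i.bak 's/####/###/g'

changed=$(cat "$docs/a.md")
original=$(cat "$docs/a.md.bak")
echo "$preview"   # ### Title
echo "$changed"   # ### Title, then body
rm -r "$docs"
```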

Hold space

This allows you to store away the pattern space for later. You can append to it (H) so you can essentially store up lines.
Hold space in action -

 sed -n -r "/regex1/{n;H};   \${x; s/regex/repl/g;p}" dataFile.txt  

/regex1/{n;H} - each time the regex is found, take the next line (n;) and append it to the hold space (H;).
$ - Special address that matches only the last line of input, so the commands after it run once at the end.
{x; s/regex/repl/g;p } - on the last line, swap the hold space into the pattern space (x;), run the find/replace on the collected lines, then print the result.
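A runnable sketch of that hold space pattern, with made-up input in which the line after each "key" line is collected:

```shell
# n moves on to the line after each "key"; H appends it to the hold space.
# On the last line, x swaps the collected text in, then s and p act on it.
collected=$(printf '%s\n' key 'value 1' other key 'value 2' end |
    sed -n '/key/{n;H;}; ${x; s/value/VALUE/g; p;}')
echo "$collected"
```

The output starts with an empty line, because H puts a newline in front of everything it appends, including the first line stored in an empty hold space.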

Loops

Usually if I am thinking about using a loop, I rethink the approach or don’t use sed. I find loops and labels counterintuitive in many cases, and hard to reason about.
With that said, here is an example anyway.
The goal here is to collect all lines between two patterns, deleting the whole thing any time a disqualifying pattern is found, and then run commands on the whole block of lines afterwards.

A more complex loop -

 sed -n -r '/^regexmain/{
    # loop starts here
    :loop
    # append the next line to the pattern space
    N
    # delete the whole pattern space if the disqualifying pattern is found
    /\nregex1/{d}
    # until the end pattern is found, jump back to the label;
    # once it is found, continue with whatever is in the pattern space
    /regexend/!b loop
    # delete text from str1, str2 or str3 through a following newline
    s/(str1|str2|str3).*\n//g
    # delete blank and very short lines (up to 10 characters)
    s/\n.{0,10}\n/\n/g
    # add a blank line afterwards: G appends a newline (hold space is empty)
    G
    # print what is left
    p
    }' <input.log >output.txt
     # I hope you are happy now
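A much smaller loop can make the control flow easier to follow. This sketch uses made-up BEGIN and END markers, gathers everything between them, and joins it onto one line:

```shell
joined=$(printf '%s\n' x BEGIN a b END y | sed -n '/^BEGIN/{
:loop
N
/\nEND/!b loop
s/\n/ /g
p
}')
echo "$joined"   # prints: BEGIN a b END
```

N keeps appending lines and the branch keeps jumping back to the label until a line starting with END has been appended. Only then do the substitution and the print run, on the whole block at once.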

Why I don’t like doing complex loops

It can get really hard to figure out what the pattern space holds, when the loop will restart, and how the commands will act on large chunks of data. Plus it is just a pain to read.

Summary of the good with using sed

  1. For relatively simple changes to be made in large numbers of files, I think sed is a great tool. It is available on just about any Linux system and I use it even on fully featured systems.
  2. It is designed to work with files and lines of data so there is no open, read, parse overhead. Find/replacing and pattern searching in general is a breeze since it was designed for that.
  3. Since it can integrate easily with find/xargs, you can run it on loads of files with detailed selection, and you can readily check the output/tune it before actually modifying the file contents.

Drawbacks

  1. When the task gets a little bit complex, sed commands can quickly become incomprehensible. It is hard to remember all the letters and symbols and to visualize how they work together, since you can’t step through. If I can’t break these commands up, I usually switch to using something other than sed.