Vitaly Parnas - Text Processing

Print every block that begins with a line matching $record_start_regex and contains a case-insensitive match for $pattern (GNU sed).

sed -n "/$record_start_regex/{x;/$pattern/Ip;d}; \${x;G;/$pattern/Ip}; {H}"
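
The awk idiom below is related but different: it prints only the lines strictly between a Pat1 line and the next Pat2 line.

awk '/Pat1/{g=1; next} /Pat2/{g=0} g'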
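
To see the sed block filter above in action, here is a minimal sketch on made-up data. The print_blocks function name, the ^BEGIN record marker, and the sample records are assumptions for illustration only; the I flag requires GNU sed.

```shell
# Hypothetical wrapper around the sed block filter; reads records from stdin.
# $1 = record start regex, $2 = pattern to search for (case-insensitive).
print_blocks() {
    record_start_regex=$1
    pattern=$2
    sed -n "/$record_start_regex/{x;/$pattern/Ip;d}; \${x;G;/$pattern/Ip}; {H}"
}

# Three sample records; only the second one mentions "bob".
printf '%s\n' \
    'BEGIN 1' 'name: Alice' \
    'BEGIN 2' 'name: Bob' 'city: Oslo' \
    'BEGIN 3' 'name: Carol' |
    print_blocks '^BEGIN' 'bob'
# Prints the whole second record:
# BEGIN 2
# name: Bob
# city: Oslo
```

Each record is collected in the hold space and tested only when the next record start (or the last line) arrives, which is why the whole block can be printed at once.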

Within the range of lines from $start_pattern to $end_pattern, insert $new_content and a line break before every match of $pattern.

sed "/$start_pattern/,/$end_pattern/s/$pattern/$new_content\n&/g"
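
As a rough illustration of the range-limited insert, the sketch below applies it to invented input. The variable values and the insert_before function name are assumptions. Note that $pattern and $new_content are spliced into the sed script as-is, so they must not contain unescaped / or & characters, and the \n in the replacement relies on GNU sed.

```shell
# Assumed sample values; substitute your own.
start_pattern='BEGIN'
end_pattern='END'
pattern='port'
new_content='# added'

# Hypothetical wrapper: reads stdin, edits only lines inside the range.
insert_before() {
    sed "/$start_pattern/,/$end_pattern/s/$pattern/$new_content\n&/g"
}

printf '%s\n' 'port=1' 'BEGIN' 'port=2' 'END' 'port=3' | insert_before
# Only the match inside the BEGIN..END range gets the extra line:
# port=1
# BEGIN
# # added
# port=2
# END
# port=3
```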

Convert simple CSV data (no quoted fields containing commas) to a Markdown table:

sed 's/,/ | /g; s/^.*$/| & |/; 1{p; s/ [^|]* / :---: /g}'
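
For comparison, the same conversion can be sketched in awk, which builds the header row and the alignment row explicitly. This is an illustrative variant, not the author's command; the csv_to_md name and the sample data are assumptions, and like the sed version it does not handle quoted fields that contain commas.

```shell
# Hypothetical helper: read simple CSV on stdin, write a Markdown table.
csv_to_md() {
    awk -F, '
    {
        row = "|"
        for (i = 1; i <= NF; i++) row = row " " $i " |"
        print row
        # After the header row, emit one centered-alignment cell per column.
        if (NR == 1) {
            sep = "|"
            for (i = 1; i <= NF; i++) sep = sep " :---: |"
            print sep
        }
    }'
}

printf '%s\n' 'name,age' 'Ann,30' 'Bo,25' | csv_to_md
# | name | age |
# | :---: | :---: |
# | Ann | 30 |
# | Bo | 25 |
```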

Print only the lines strictly between the two patterns, excluding the matching lines themselves. To delete those lines instead, replace ‘p’ with ‘d’ and drop the -n option.

sed -n '/Pat1/,/Pat2/{//!p}'
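
A small sketch on invented input shows the print and delete variants side by side. The sample lines and the sample function are assumptions. The empty regex // reuses the most recently applied regex, here whichever range boundary was just tested, so the Pat1 and Pat2 lines themselves are skipped.

```shell
# Hypothetical sample input with one Pat1..Pat2 range.
sample() { printf '%s\n' 'a' 'Pat1' 'b' 'c' 'Pat2' 'd'; }

# Print only the lines strictly between the markers.
sample | sed -n '/Pat1/,/Pat2/{//!p}'
# b
# c

# Delete those lines instead: 'd' replaces 'p', and -n is dropped.
sample | sed '/Pat1/,/Pat2/{//!d}'
# a
# Pat1
# Pat2
# d
```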

Extract only the From or Subject headers of an email:

sed -rn '1,/^$/{/^(from|subject):/Ip}'
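
The header filter can be tried on a made-up message as follows. The message text and the extract_headers name are assumptions. The range 1,/^$/ stops at the first blank line, so body lines that happen to look like headers are ignored. The -r option and the I flag assume GNU sed.

```shell
# Hypothetical wrapper: reads a raw email on stdin.
extract_headers() {
    sed -rn '1,/^$/{/^(from|subject):/Ip}'
}

printf '%s\n' \
    'From: alice@example.com' \
    'To: bob@example.com' \
    'Subject: Lunch' \
    '' \
    'from: this line is in the body' |
    extract_headers
# From: alice@example.com
# Subject: Lunch
```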

Convert an HTML table to CSV format:

# Strips out just the table from the html document
# Removes any html inner comments
# Converts to CSV, assuming first row to be header row
# Wraps each field in double quotes
# Removes double quotes in header line (in the sed '1 s/"//g' statement)
# Removes any formatting markup from header line

sed -re '/<table/,/<\/table/!d' \
-e "s/.*(<table)/\1/; s/(<\/table>).*/\1/;" \
-e 's/<!--.*-->//g; s/^[[:space:]]*//g; s/[[:space:]]*$//g' |\
tr -d '\n' |\
sed -re 's/<\/(TR|THEAD|TBODY)[^>]*>/\n/Ig' \
-e 's/<\/?(TABLE|TR|THEAD|TBODY)[^>]*>//Ig' \
-e "s/\"/'/g" |\
sed -re 's/^<T[DH][^>]*>|<\/?T[DH][^>]*>$/"/Ig' \
-e 's/[[:space:]]*<\/T[DH][^>]*><T[DH][^>]*>/","/Ig' \
-e '/^[[:space:]]*$/d' \
-e '1s/"//g; 1s/<\/?[^>]*>//g'

Extract the vCard records that match $pattern:

sed -rne "/BEGIN:VCARD/{x;/$pattern/I{s/END:VCARD/\0\n/;p};d};" \
    -e "\${x;G;/$pattern/Ip}; {H}"

Rearrange CSV columns:

awk 'BEGIN{FS=","; OFS=","}{print $4,$3,$2,$1}'
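
If the column order changes often, the same idea can be parameterized. The sketch below is a variant built on assumptions, not the author's command: the reorder_csv name and its order argument (a comma-separated list of 1-based column numbers) are hypothetical.

```shell
# Hypothetical helper: $1 is the desired column order, e.g. "4,3,2,1".
reorder_csv() {
    awk -v order="$1" '
    BEGIN { FS = OFS = ","; n = split(order, idx, ",") }
    {
        line = ""
        # Append the requested fields in the requested order.
        for (i = 1; i <= n; i++) line = line (i > 1 ? OFS : "") $(idx[i])
        print line
    }'
}

printf '%s\n' 'a,b,c,d' '1,2,3,4' | reorder_csv '4,3,2,1'
# d,c,b,a
# 4,3,2,1
```

Passing a shorter list such as '3,1' keeps only those columns.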

Count the most frequent words of at least a given minimum length:

# Replace the bracketed parameters with your preferred values
tr -cs A-Za-z '\012' | tr A-Z a-z | grep -E "\<.{<min_chars>,}\>" | sort | uniq -c | sort -nr | head -n <top_results>
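
With the placeholders filled in, the pipeline can be wrapped in a small function. The top_words name, its two arguments, and the sample sentence are assumptions for illustration; grep -E is used as the equivalent of egrep.

```shell
# Hypothetical wrapper: $1 = minimum word length, $2 = number of results.
top_words() {
    tr -cs A-Za-z '\012' | tr A-Z a-z |
        grep -E "\<.{$1,}\>" |
        sort | uniq -c | sort -nr | head -n "$2"
}

echo 'The quick brown fox. The quick dog; the QUICK brown cat.' | top_words 4 2
# Lists "quick" (3 occurrences) and then "brown" (2), each preceded by its count.
```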
