Frequently used words and word pairs

2020-03-03 @Technology
By Vitaly Parnas

Here’s a quick command-line procedure to compute the most frequently used single words, as well as two-word combinations, subject to some configurable filtering. It demands only the standard Unix utilities, and it runs far faster than any Python analysis package I’d used.

I was initially intrigued by the language I tend to overuse in my writing. Starting from the widely circulated set of piped commands for the task, I extended it to support filtering and two-word combinations.

Single words

  1. Extract the entire body of text we wish to feed into the procedure. For this, enter the top-level directory containing your writings and cat (concatenate) all the relevant files to the standard output:

    find . -type f -exec cat {} \+
    
  2. Generate the list of words, replacing all non-alphabetic characters with a newline ('\012') and converting everything to lowercase. Pipe the above into the following routine.

    ... | tr -cs A-Za-z '\012' | tr A-Z a-z 
    
  3. Include only words of a minimum length, say, four characters.

    ... | egrep "\<.{4,}\>"
    
  4. Count the occurrences of each word, sort by count, and display only the top 30 results.

    ... | sort | uniq -c  | sort -nr | head -30
    

And the complete procedure, combining the above steps:

$ find . -type f -exec cat {} \+ |
tr -cs A-Za-z '\012' | tr A-Z a-z | 
egrep "\<.{4,}\>" |
sort | uniq -c  | sort -nr | head -30
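
To sanity-check the pipeline before unleashing it on a whole directory, try it on a made-up sentence (here skipping the length filter and trimming to the top two results); it should yield:

$ echo "The cat saw the cat and the dog" |
tr -cs A-Za-z '\012' | tr A-Z a-z |
sort | uniq -c | sort -nr | head -2
3 the
2 cat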

Executing the above across all my blog posts:

1406 with
1240 that
 987 this
 665 more
 650 time
 572 your
 503 from
 486 have
 473 some
 407 language
 387 much
 384 what
 376 such
 363 even
 357 other
 332 which
 304 also
 295 only
 277 would
 262 these
 254 into
 247 than
 241 most
 235 find
 233 title
 233 over
 229 many
 223 content
 223 category
 215 status

And for words of at least five characters:

406 language
356 other
331 which
276 would
261 these
232 title
222 content
222 category
214 status
212 there
207 published
200 however
197 those
195 certain
188 could
181 having
176 rather
176 https
175 experience
165 among
164 first
152 without
148 point
145 entirely
143 years
142 about
140 problem
136 still
132 means
132 further

Pairs of words

Here’s the slightly longer, modified procedure in full.

$ find . -type f -exec cat {} \+ |
tr -cs A-Za-z '\012' | tr A-Z a-z | 
sed -n '1{h;d};{x;G;s/\n/ /;p}' |
egrep -v "\<.{,2}\>" |
sort | uniq -c  | sort -nr | head -30

The above sed invocation is the key to generating the word pairs. I won’t decompose it into individual steps, since that would require explaining too much of the sed syntax.
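
Without dissecting it, here’s a taste of its effect: the script stashes each word in the hold space and, upon reading the next one, prints the stashed word alongside the current word, a sliding window of two. Fed a short list, it should yield:

$ printf 'one\ntwo\nthree\nfour\n' | sed -n '1{h;d};{x;G;s/\n/ /;p}'
one two
two three
three four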

The modified egrep, rather than retaining lines that contain a word of at least three characters, discards any line containing a word of at most two. The exclusion is necessary because we want to drop a pair whenever either of its two words falls below the minimum length.

(Had we used the inclusion method, either of the two words reaching three characters would have admitted the pair into the list, regardless of the other word’s length.)
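
A contrived sample illustrates the difference: the inclusion filter passes of the on account of the, whereas the exclusion filter rejects it on account of of:

$ printf 'of the\nother hand\n' | egrep "\<.{3,}\>"
of the
other hand
$ printf 'of the\nother hand\n' | egrep -v "\<.{,2}\>"
other hand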

Here’s the result executed against my writings:

273 with the
248 for the
218 and the
190 status published
156 date category
139 from the
125 you can
109 the above
 98 not only
 94 the same
 93 the more
 80 the other
 79 the time
 78 into the
 77 and yet
 70 the respective
 70 over the
 69 rather than
 68 but the
 64 for example
 62 the language
 62 the following
 62 the first
 57 you may
 57 the most
 57 among the
 55 beyond the
 53 sub sub
 53 all the
 49 not the

Many of the above pairs contain the article the, which isn’t very useful in the hunt for overused pairs.

Let’s modify the egrep statement to also filter out pairs with the word the:

egrep -v "\<(.{,2}|the)\>"
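
A quick check against a few hand-picked pairs, of which only the last should survive:

$ printf 'of the\nthe other\nrather than\n' | egrep -v "\<(.{,2}|the)\>"
rather than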

And the outcome over the full corpus:

190 status published
156 date category
125 you can
 98 not only
 77 and yet
 69 rather than
 64 for example
 58 you may
 54 sub sub
 49 need not
 48 and not
 46 time track
 45 org wiki
 44 wikipedia org
 44 category lifestyle
 42 you will
 42 you are
 42 time and
 42 https www
 39 not necessarily
 38 you have
 38 but also
 37 what you
 37 modified category
 37 date modified
 35 you don
 35 other hand
 34 that you
 32 with respect
 32 too much

The above still isn’t perfect: it includes the markup metadata present within every post (status published, date category), along with other noise. But it does provide us with a bit more insight.

We could continue to feed additional filters into the egrep statement to exclude uninformative elements, such as the overly frequent you. And we could display far more than the top 30 results. But this should suffice for demonstrative purposes.
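
If you do care to push further, the filter might grow along these lines, the exact stop list being purely a matter of taste:

egrep -v "\<(.{,2}|the|you|and|https|www)\>"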

Questions, comments? Connect.