Here’s a quick command-line procedure for computing the most frequently used single words, as well as two-word combinations, with some configurable filtering. It requires only the standard Unix commands, and it runs far faster than any Python analytics package I’ve used.
I was initially curious about the language I tend to overuse in my writing. Starting from the well-known set of piped commands for this task, I extended it to support filtering and two-word combinations.
Single words
Extract the entire body of text we wish to feed into the procedure. For this, enter the top-level directory containing your writings and cat (concatenate) all the relevant text to standard output:
find . -type f -exec cat {} \+
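If the directory also holds non-text files, it may be worth narrowing the selection; assuming, for instance, that the posts are Markdown files:
find . -type f -name '*.md' -exec cat {} \+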
Generate the list of words by replacing every run of non-alphabetic characters with a single newline and converting everything to lowercase. Pipe the above into the following:
... | tr -cs A-Za-z '\012' | tr A-Z a-z
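To see what this step produces, here’s a quick illustration on a made-up fragment (note how the apostrophe splits the contraction, which is why fragments like don show up in the results further down):
$ echo "Hello, World! Isn't this neat?" | tr -cs A-Za-z '\012' | tr A-Z a-z
hello
world
isn
t
this
neat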
Include only words of a minimum length, say, four characters.
... | egrep "\<.{4,}\>"
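Since the stream now holds exactly one word per line, an equivalent length filter can also be written with awk, should you prefer it over the word-boundary regex:
... | awk 'length >= 4'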
Count the occurrences of each word (uniq -c counts only adjacent duplicates, hence the initial sort), order them by count in descending order, and display only the top 30 results.
... | sort | uniq -c | sort -nr | head -30
And the complete procedure, combining the above steps:
$ find . -type f -exec cat {} \+ |
tr -cs A-Za-z '\012' | tr A-Z a-z |
egrep "\<.{4,}\>" |
sort | uniq -c | sort -nr | head -30
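(An aside: on a large corpus the two sorts dominate the runtime; forcing the C locale usually speeds them up noticeably, since it falls back to plain byte comparison instead of locale-aware collation.)
... | LC_ALL=C sort | uniq -c | LC_ALL=C sort -nr | head -30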
Executing the above across all my blog posts:
1406 with
1240 that
987 this
665 more
650 time
572 your
503 from
486 have
473 some
407 language
387 much
384 what
376 such
363 even
357 other
332 which
304 also
295 only
277 would
262 these
254 into
247 than
241 most
235 find
233 title
233 over
229 many
223 content
223 category
215 status
And for words of at least five characters:
406 language
356 other
331 which
276 would
261 these
232 title
222 content
222 category
214 status
212 there
207 published
200 however
197 those
195 certain
188 could
181 having
176 rather
176 https
175 experience
165 among
164 first
152 without
148 point
145 entirely
143 years
142 about
140 problem
136 still
132 means
132 further
Pairs of words
Here’s the slightly longer, modified procedure in full:
$ find . -type f -exec cat {} \+ |
tr -cs A-Za-z '\012' | tr A-Z a-z |
sed -n '1{h;d};{x;G;s/\n/ /;p}' |
egrep -v "\<.{,2}\>" |
sort | uniq -c | sort -nr | head -30
The above sed invocation is key to generating the word pairs. I won’t decompose it into individual steps, since that would require explaining too much of the sed syntax.
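For the curious, a roughly equivalent pairing step can be written in awk; it keeps the previous word around and prints it alongside the current one, so you could swap it in for the sed line:
... | awk 'NR > 1 { print prev, $0 } { prev = $0 }'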
The modified egrep, rather than including words of at least three characters, excludes any word of at most two. The exclusion is necessary because we want to drop a pair whenever either of its two words falls below the specified length.
(With the inclusion approach, either word reaching three characters would keep the pair in the resulting list, regardless of the other word’s length.)
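Here’s a quick illustration of the difference, assuming GNU grep. The pair of language slips through the inclusion filter, since language alone satisfies the pattern, whereas the exclusion filter drops it, since of matches the excluded pattern; the second command below prints nothing:
$ echo "of language" | egrep "\<.{3,}\>"
of language
$ echo "of language" | egrep -v "\<.{,2}\>"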
Here’s the result of running it against my writings:
273 with the
248 for the
218 and the
190 status published
156 date category
139 from the
125 you can
109 the above
98 not only
94 the same
93 the more
80 the other
79 the time
78 into the
77 and yet
70 the respective
70 over the
69 rather than
68 but the
64 for example
62 the language
62 the following
62 the first
57 you may
57 the most
57 among the
55 beyond the
53 sub sub
53 all the
49 not the
A lot of the above pairs include the article the, which isn’t very useful in the hunt for overused combinations.
Let’s modify the egrep statement to also filter out pairs with the word the:
egrep -v "\<(.{,2}|the)\>"
And the outcome:
190 status published
156 date category
125 you can
98 not only
77 and yet
69 rather than
64 for example
58 you may
54 sub sub
49 need not
48 and not
46 time track
45 org wiki
44 wikipedia org
44 category lifestyle
42 you will
42 you are
42 time and
42 https www
39 not necessarily
38 you have
38 but also
37 what you
37 modified category
37 date modified
35 you don
35 other hand
34 that you
32 with respect
32 too much
The above still isn’t perfect, as it includes the markup metadata present within every post (status published, date category, and the like), along with other noise. But it does give us a bit more information.
We could keep feeding additional filters into the egrep statement to exclude uninformative elements (such as the overly frequent you), and we could display far more than the top 30 results. But this should suffice for demonstration purposes.
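For instance, excluding you amounts to one more alternative in the same group:
egrep -v "\<(.{,2}|the|you)\>"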
Questions, comments? Connect.