Analyzing the greatest books data

2020-01-21 @Technology

It turns out there’s a greatest book list aggregator that averages data from over 100 different lists.

Both the fiction and non-fiction databases are also available for download as CSV files.

I ran some exploratory analysis on the data using SQL queries. For the CSV-to-SQL interface I used the rows Python library: quick to set up, slow to execute, which hardly matters for a one-off.

To install rows, I also needed to install a series of dependencies that pip did not pull in automatically:

$ pip install rows click requests requests_cache tqdm

I also cleaned up the many trailing spaces in the data, which is required later for proper author grouping (in case anyone repeats the experiment).

$ sed -i 's/\s,/,/g' ...csv

I then added a header rank,title,author,year to the otherwise headerless CSVs. Here’s a small section of each of the resulting fiction and non-fiction files:

$ head tgb_1.csv

1,In Search of Lost Time,Marcel Proust,1913
2,Ulysses,James Joyce,1922
3,Don Quixote,Miguel de Cervantes,1605
4,The Great Gatsby,F. Scott Fitzgerald,1925
5,One Hundred Years of Solitude,Gabriel Garcia Marquez,1967
6,Moby Dick,Herman Melville,1851
7,War and Peace,Leo Tolstoy,1869
8,Lolita,Vladimir Nabokov,1955
9,Hamlet,William Shakespeare,1601

$ head tgb_2.csv

1,Essays,Michel de Montaigne,1580
2,Walden,Henry David Thoreau,1854
4,The Interpretation of Dreams,Sigmund Freud,1899
5,The Prince,Niccolo Machiavelli,1532
6,The Diary of a Young Girl,Anne Frank,1944
7,The Autobiography of Malcolm X,Alex Haley,1965
8,The Confessions of Jean-Jacques Rousseau,Jean-Jacques Rousseau,1782
9,Silent Spring,Rachel Carson,1962
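
For anyone repeating the experiment without sed, the clean-up and header steps above can be sketched in Python with the standard csv module. This is only an illustration, not what I actually ran: the function name clean_csv and the sample rows are made up for the example, and unlike the sed one-liner it strips whitespace on both sides of every field.

```python
import csv
import io

# Column names taken from the header described above.
HEADER = ["rank", "title", "author", "year"]

def clean_csv(text: str) -> str:
    """Strip surrounding whitespace from every field and prepend the header.

    Hypothetical helper: takes the raw, headerless CSV text and returns
    the cleaned CSV text.
    """
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([field.strip() for field in row])
    return out.getvalue()

# Made-up sample with trailing spaces, similar to the raw download.
sample = "1,In Search of Lost Time ,Marcel Proust ,1913\n2,Ulysses,James Joyce  ,1922\n"
print(clean_csv(sample), end="")
```

Going through a CSV parser instead of a regex also keeps quoted fields containing commas intact, should the data contain any.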

Let’s proceed to the analysis.


To ensure rows actually works, I ran a quick warm-up query:

$ rows query "select * from table1 
where author like '%Austen%'" tgb_1.csv

| rank |         title         |    author   | year |
|   18 |   Pride and Prejudice | Jane Austen | 1813 |
|   69 |                  Emma | Jane Austen | 1815 |
|  254 |            Persuasion | Jane Austen | 1818 |
|  652 | Sense and Sensibility | Jane Austen | 1811 |
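
If rows proves too slow, roughly the same query can be run with Python's built-in sqlite3 module. The sketch below is an assumption about an equivalent setup, not how rows works internally: load_table is a hypothetical helper, the table name table1 mirrors the one used in the queries here, and the three sample rows are taken from the listings in this post.

```python
import csv
import io
import sqlite3

def load_table(conn, csv_text, table="table1"):
    """Load CSV text with a rank,title,author,year header into a SQLite table."""
    records = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute(
        f"create table {table} "
        "(rank integer, title text, author text, year integer)"
    )
    # Named placeholders are filled from each row's dictionary keys.
    conn.executemany(
        f"insert into {table} values (:rank, :title, :author, :year)", records
    )

sample = """rank,title,author,year
2,Ulysses,James Joyce,1922
18,Pride and Prejudice,Jane Austen,1813
69,Emma,Jane Austen,1815
"""

conn = sqlite3.connect(":memory:")
load_table(conn, sample)

austen = conn.execute(
    "select * from table1 where author like '%Austen%' order by rank"
).fetchall()
for row in austen:
    print(row)
```

With the real file, the sample string would be replaced by the contents of tgb_1.csv.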

Authors with at least four works.

This partly explains why I find Latin-American used book sections dominated by the works of Mario Vargas Llosa.

$ rows query "select author, count(author) as ct 
from table1 group by author having ct > 3 
order by ct desc" tgb_1.csv

|           author           |  ct |
|        William Shakespeare |  19 |
|            Charles Dickens |  13 |
|           William Faulkner |  12 |
|                    Unknown |  11 |
|                Henry James |  10 |
|                Philip Roth |   9 |
|               Iris Murdoch |   9 |
|             Samuel Beckett |   8 |
|                    Molière |   8 |
|           Ernest Hemingway |   8 |
|                  Sophocles |   7 |
|         Mario Vargas Llosa |   7 |
|                John Updike |   7 |
|                Jack London |   7 |
|            Haruki Murakami |   7 |
|            Cormac McCarthy |   7 |
|            Margaret Atwood |   6 |
|              Joseph Conrad |   6 |
|             John Steinbeck |   6 |
|               Henrik Ibsen |   6 |
|              Graham Greene |   6 |
|                 Gene Wolfe |   6 |
|               Evelyn Waugh |   6 |
|             D. H. Lawrence |   6 |
|                C. S. Lewis |   6 |
|           Anthony Trollope |   6 |
|                Alice Munro |   6 |
|          Yasunari Kawabata |   5 |
|       William S. Burroughs |   5 |
|               Willa Cather |   5 |
|           Vladimir Nabokov |   5 |
|             Virginia Woolf |   5 |
|              Toni Morrison |   5 |
|                Thomas Mann |   5 |
|               Thomas Hardy |   5 |
|                Saul Bellow |   5 |
|             Salman Rushdie |   5 |
|           Robertson Davies |   5 |
|                 Roald Dahl |   5 |
|             Raymond Carver |   5 |
|             Philip K. Dick |   5 |
|                Leo Tolstoy |   5 |
|            John Galsworthy |   5 |
|            John Dos Passos |   5 |
|           Honoré de Balzac |   5 |
|                H. G. Wells |   5 |
|         Fyodor Dostoyevsky |   5 |
|                Franz Kafka |   5 |
|        F. Scott Fitzgerald |   5 |
|            Bernard Malamud |   5 |
|              Anton Chekhov |   5 |
|                  Anonymous |   5 |
|                 Anne Tyler |   5 |
|             William Trevor |   4 |
|                Victor Hugo |   4 |
|          Ursula K. Le Guin |   4 |
|             Thomas Pynchon |   4 |
|               Stephen King |   4 |
|     Robert Louis Stevenson |   4 |
|             Rikki Ducornet |   4 |
|           Raymond Chandler |   4 |
|             Philip Pullman |   4 |
|              Norman Mailer |   4 |
|        Nathaniel Hawthorne |   4 |
|             Naguib Mahfouz |   4 |
|              Milan Kundera |   4 |
|                Martin Amis |   4 |
|              Kurt Vonnegut |   4 |
|             Kazuo Ishiguro |   4 |
| Johann Wolfgang von Goethe |   4 |
|         Jeanette Winterson |   4 |
|                Jane Austen |   4 |
|                James Joyce |   4 |
|              James Baldwin |   4 |
|              Italo Calvino |   4 |
|             Isabel Allende |   4 |
|      Isaac Bashevis Singer |   4 |
|              Hermann Hesse |   4 |
|            Herman Melville |   4 |
|                Henry Green |   4 |
|               George Eliot |   4 |
|          Flannery O'Connor |   4 |
|                  Euripides |   4 |
|              Edith Wharton |   4 |
|                Don DeLillo |   4 |
|             Bertolt Brecht |   4 |
|         Barbara Kingsolver |   4 |
|          August Strindberg |   4 |
|             Arthur Rimbaud |   4 |
|              Angela Carter |   4 |
|               Albert Camus |   4 |
|                  Aeschylus |   4 |

Authors of at least two works published before 100AD.

$ rows query "select author, count(author) as ct 
from table1 where year < 100 
group by author having ct > 1 order by ct desc" tgb_1.csv

|    author    | ct |
|    Sophocles |  7 |
|      Unknown |  4 |
|    Euripides |  4 |
|    Aeschylus |  4 |
| Aristophanes |  3 |
|        Homer |  2 |
|       Hesiod |  2 |

Same for the period between 100 and 1800AD.

|           author           | ct |
|        William Shakespeare | 19 |
|                    Molière |  7 |
|                    Unknown |  6 |
|                  Anonymous |  5 |
|                Jean Racine |  3 |
|           Pierre Corneille |  2 |
|               Matsuo Bashō |  2 |
|                John Milton |  2 |
| Johann Wolfgang von Goethe |  2 |
|             Henry Fielding |  2 |
|          Fernando de Rojas |  2 |
|               Daniel Defoe |  2 |
|             Alexander Pope |  2 |

Authors of at least four works published between 1800 and 1900AD.

|         author         | ct |
|        Charles Dickens | 12 |
|            Henry James |  6 |
|       Honoré de Balzac |  5 |
|           Henrik Ibsen |  5 |
|     Fyodor Dostoyevsky |  5 |
|       Anthony Trollope |  5 |
|            Victor Hugo |  4 |
|           Thomas Hardy |  4 |
| Robert Louis Stevenson |  4 |
|    Nathaniel Hawthorne |  4 |
|            Leo Tolstoy |  4 |
|            Jane Austen |  4 |
|           George Eliot |  4 |
|      August Strindberg |  4 |
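
The queries for the last two periods are not shown above. A parameterized version of the grouping query might look like the following sketch, written against Python's sqlite3 module rather than rows. The helper name prolific_authors is hypothetical, and the half-open range (start inclusive, end exclusive) is an assumption about how the period boundaries were drawn; the sample rows come from the listings in this post.

```python
import sqlite3

def prolific_authors(conn, start, end, min_works):
    """Authors with at least min_works titles where start <= year < end."""
    return conn.execute(
        "select author, count(author) as ct from table1 "
        "where year >= ? and year < ? "
        "group by author having ct >= ? "
        "order by ct desc, author",
        (start, end, min_works),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table1 (rank integer, title text, author text, year integer)"
)
conn.executemany(
    "insert into table1 values (?, ?, ?, ?)",
    [
        (18, "Pride and Prejudice", "Jane Austen", 1813),
        (69, "Emma", "Jane Austen", 1815),
        (7, "War and Peace", "Leo Tolstoy", 1869),
        (6, "Moby Dick", "Herman Melville", 1851),
        (9, "Hamlet", "William Shakespeare", 1601),
    ],
)

# Authors with at least two works in the sample dated 1800 to 1899.
print(prolific_authors(conn, 1800, 1900, 2))
```

Passing the bounds as parameters keeps one query for every period instead of a hand-edited copy per table.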

List of all works published before 200AD.

Homer’s and Virgil’s epics tend to dominate these rankings. It would behoove me to read some of the playwrights.

$ rows query "select rank, substr(title, 0, 35) as title, 
author, year from table1 
where year < 200 order by author, rank" tgb_1.csv

| rank |               title                |     author    | year |
|  109 |                           Oresteia |     Aeschylus | -458 |
|  249 |                   Prometheus Bound |     Aeschylus | -415 |
|  266 |                     The Suppliants |     Aeschylus | -470 |
|  267 |                       The Persians |     Aeschylus | -472 |
|  268 |               Seven Against Thebes |     Aeschylus | -467 |
|  589 |                     Aesop's Fables |         Aesop | -560 |
|  751 | The Golden Ass (Metamorphoses): Or |      Apuleius |  180 |
|  326 |                         Lysistrata |  Aristophanes | -411 |
|  370 |                          The Birds |  Aristophanes | -414 |
|  385 |                         The Clouds |  Aristophanes | -423 |
|  221 |                              Medea |     Euripides | -431 |
|  418 |                        The Bacchae |     Euripides | -405 |
|  514 |                       Trojan Women |     Euripides | -415 |
|  537 |                         Hippolytus |     Euripides | -428 |
| 1591 |                           Alcestis |     Euripedes | -438 |
| 1995 |                            Gateway | Frederik Pohl |   84 |
|  737 |                          Fragments |    Heraclitus | -475 |
| 2021 |                       The Theogony |        Hesiod | -700 |
| 2213 |                     Works and Days |        Hesiod | -700 |
|   11 |                        The Odyssey |         Homer | -700 |
|   25 |                          The Iliad |         Homer | -700 |
|  535 |                           The Odes |        Horace |  -23 |
|  838 |       The Recognition of Sakuntala |      Kalidasa | -200 |
|  449 |                    De Rerum Natura |     Lucretius |  -55 |
| 2572 |                       The Dyskolos |     Menander  | -316 |
|  148 |                      Metamorphoses |          Ovid |    8 |
|  192 |                       The Republic |         Plato | -380 |
| 1982 |                The Poems of Sappho |        Sappho | -570 |
| 2222 |                           Thyestes |        Seneca |   62 |
|   63 |                   Oedipus the King |     Sophocles | -429 |
|   96 |                           Antigone |     Sophocles | -442 |
|  153 |                 Oedipus at Colonus |     Sophocles | -401 |
|  196 |                            Electra |     Sophocles | -409 |
|  282 |                               Ajax |     Sophocles | -450 |
|  283 |                   Women of Trachis |     Sophocles | -410 |
|  284 |                        Philoctetes |     Sophocles | -409 |
|  232 |                  Epic of Gilgamesh |       Unknown | -600 |
| 2203 |               Orpheus and Eurydice |       Unknown |    8 |
| 2243 |     The Twelve Labours of Hercules |       Unknown | -600 |
| 2255 |    The Quest for the Golden Fleece |       Unknown | -300 |
|   49 |                         The Aeneid |        Virgil |  -19 |
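
A note on the title truncation: the longest title above is cut to 34 characters, not 35. That matches SQLite's substr, which counts positions from 1, so a start of 0 spends one position before the string begins. The sketch below demonstrates this with Python's sqlite3 module, assuming rows evaluates the query with SQLite; sql_substr is a hypothetical helper and the test string is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def sql_substr(text, start, length):
    """Evaluate SQLite's substr(text, start, length) and return the result."""
    return conn.execute(
        "select substr(?, ?, ?)", (text, start, length)
    ).fetchone()[0]

text = "abcdefghij"
print(sql_substr(text, 0, 5))  # start 0: one position is lost, 4 characters
print(sql_substr(text, 1, 5))  # start 1: the full 5 characters
```

Using substr(title, 1, 35) would therefore keep exactly 35 characters.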

List of works for a handful of specific authors.

$ rows query "select rank, substr(title, 0, 35) as title, 
author, year 
from table1 where author in 
    ('Thomas Mann', 'Mario Vargas Llosa', 
    'Ursula K. Le Guin', 'Joseph Conrad') 
order by author, rank" tgb_1.csv

| rank |              title               |       author       | year |
|   26 |                Heart of Darkness |      Joseph Conrad | 1899 |
|  118 |                         Lord Jim |      Joseph Conrad | 1900 |
|  134 |                         Nostromo |      Joseph Conrad | 1904 |
|  450 |                 The Secret Agent |      Joseph Conrad | 1907 |
| 1020 |                  The Shadow Line |      Joseph Conrad | 1917 |
| 1818 |               Under Western Eyes |      Joseph Conrad | 1911 |
|  335 |  The War of the End of the World | Mario Vargas Llosa | 1981 |
|  572 |   The Feast of the Goat: A Novel | Mario Vargas Llosa | 2000 |
|  606 |             The Time of the Hero | Mario Vargas Llosa | 1963 |
|  961 |    Conversation in the Cathedral | Mario Vargas Llosa | 1969 |
| 1012 |  Aunt Julia and the Scriptwriter | Mario Vargas Llosa | 1977 |
| 1941 | The Real Life of Alejandro Mayta | Mario Vargas Llosa | 1986 |
| 1990 |                  The Storyteller | Mario Vargas Llosa | 1987 |
|   53 |               The Magic Mountain |        Thomas Mann | 1924 |
|  129 |                   Doctor Faustus |        Thomas Mann | 1947 |
|  138 |                     Buddenbrooks |        Thomas Mann | 1901 |
|  357 |                  Death in Venice |        Thomas Mann | 1912 |
| 2200 |                   The Black Swan |        Thomas Mann | 1954 |
|  611 |             A Wizard of Earthsea |  Ursula K. Le Guin | 1968 |
|  730 |        The Left Hand Of Darkness |  Ursula K. Le Guin | 1969 |
|  855 |                 The Dispossessed |  Ursula K. Le Guin | 1974 |
| 2406 |               Always Coming Home |  Ursula K. Le Guin | 1985 |


Turning to the non-fiction list (tgb_2.csv): authors of at least three works.

|        author        | ct |
|        Sigmund Freud | 19 |
|                Plato |  6 |
|        William James |  4 |
|        George Orwell |  4 |
|  Friedrich Nietzsche |  4 |
|    Winston Churchill |  3 |
|         Tracy Kidder |  3 |
|            Tom Wolfe |  3 |
|         Thomas Paine |  3 |
|         Studs Terkel |  3 |
| Samuel Eliot Morison |  3 |
|          Robert Caro |  3 |
|   Richard Hofstadter |  3 |
|           Mark Twain |  3 |
|            Karl Marx |  3 |
|          John McPhee |  3 |
|           John Dewey |  3 |
|          Joan Didion |  3 |
|  Henry David Thoreau |  3 |
|     David McCullough |  3 |
|    Cornelius Tacitus |  3 |
|     C. Vann Woodward |  3 |
|          C. S. Lewis |  3 |
|         Barack Obama |  3 |
|            Aristotle |  3 |

All authors published before 100AD.

|         author        | ct |
|                 Plato |  6 |
|             Aristotle |  3 |
|            Thucydides |  1 |
|                Sun Zi |  1 |
|             Sima Qian |  1 |
|               Sallust |  1 |
|     Pliny (the Elder) |  1 |
|   Philo of Alexandria |  1 |
|               Mencius |  1 |
| Marcus Tullius Cicero |  1 |
|                  Livy |  1 |
|               Lao Tsu |  1 |
|           Hippocrates |  1 |
|             Herodotus |  1 |
|                Euclid |  1 |
|     Cornelius Tacitus |  1 |
|             Confucius |  1 |
|                 China |  1 |
|            Archimedes |  1 |

Authors of at least two works published between 100 and 1800AD.

|         author        | ct |
|          Thomas Paine |  3 |
|               Ptolemy |  2 |
|            John Locke |  2 |
|       Johannes Kepler |  2 |
| Jean-Jacques Rousseau |  2 |
|            David Hume |  2 |
|     Cornelius Tacitus |  2 |
|         Blaise Pascal |  2 |
|     Benjamin Franklin |  2 |

All works published before 200AD.

There is plenty of material here for a worthwhile reading list.

$ rows query "select rank, substr(title, 0, 25) as title, 
author, year from table1 where year < 200 
order by author, rank" tgb_2.csv

| rank |          title           |         author        |  year |
| 1297 |  The Works of Archimedes |            Archimedes |  -212 |
|   81 |     Corpus Aristotelicum |             Aristotle |  -400 |
|  140 |                  Poetics |             Aristotle |  -335 |
|  517 |   The Nicomachean Ethics |             Aristotle |  -400 |
|  185 |                  I Ching |                 China | -1400 |
|   89 |                 Analects |             Confucius |  -350 |
|   52 |                   Annals |     Cornelius Tacitus |   120 |
|  172 |                Histories |     Cornelius Tacitus |   110 |
|  524 |                 Germania |     Cornelius Tacitus |    98 |
|  235 |        Euclid's Elements |                Euclid |  -280 |
|   34 | The Histories of Herodot |             Herodotus |  -420 |
|  154 |       Hippocratic Corpus |           Hippocrates |  -300 |
|   92 |             Tao Te Ching |               Lao Tsu |  -300 |
|  333 | Titi Livi Ab urbe condit |                  Livy |    -9 |
|   41 |              Meditations |       Marcus Aurelius |   167 |
|  918 |        Catiline Orations | Marcus Tullius Cicero |   -63 |
|  427 |                  Mencius |               Mencius |  -400 |
| 1149 | Allegorical Expositions  |   Philo of Alexandria |    50 |
|   10 | The Complete Works of Pl |                 Plato |  -347 |
|  114 |                  Apology |                 Plato |  -399 |
|  628 |                    Crito |                 Plato |  -399 |
|  629 |                Euthyphro |                 Plato |  -399 |
|  630 |                   Phaedo |                 Plato |  -399 |
| 1255 |                Symposium |                 Plato |  -370 |
| 1279 |          Natural History |     Pliny (the Elder) |    79 |
|   61 |           Parallel Lives |              Plutarch |   120 |
|  391 |                 Almagest |               Ptolemy |   150 |
|  593 |                Geography |               Ptolemy |   150 |
| 1281 | Catiline's War, The Jugu |               Sallust |   -35 |
| 1104 |   Outlines of Pyrrhonism |      Sextus Empiricus |   175 |
|  422 | Records of the Grand His |             Sima Qian |   -91 |
|  102 |     Lives of the Caesars |             Suetonius |   121 |
|   38 |           The Art of War |                Sun Zi |  -200 |
|   57 | The History of the Pelop |            Thucydides |  -450 |

