Books and Time: When do authors write their masterpieces?
Published:
Authors, their books and chronological patterns
Hey everyone!
A question popped into my mind recently and I decided to use some data visualizations to try and understand it: how old are authors when they publish their best books? Intuitively, and from what stories people tell about raw talent, it would seem the answer is “very young”. For instance, Einstein made his most important discoveries in his ‘Miracle Year’ at 26, and Mary Shelley wrote “Frankenstein” at 18, but are these examples simply outliers? Or is it true that, at least in literature, the best work is written by very young authors? Let’s dive in!
Data and Tools
In this story I will be using Jupyter Notebooks, Python, Seaborn and Pandas to analyze and visualize the data. I will also be using two datasets:
- A ‘Books’ dataset, containing about 52000 records of the world’s masterpiece books, along with the date when they were first published and author name. The original data can be found here: https://zenodo.org/record/4265096#.YuyCqC-B10t.
- An ‘Authors’ dataset from the GoodReads API containing information about authors, including names and birth dates. You can find the original dataset here: https://www.kaggle.com/datasets/choobani/goodread-authors
Preparing the Data
Since the Books dataset does not contain the authors’ birth dates, we need to do an inner join of the two datasets using the author names column as a key. Here is the result using a few of the columns:
In addition, to calculate the authors’ age when their books were published, we need to convert the current values of ‘firstPublishDate’ into a YYYY format. This can be done using a string “split” method. Additionally, the “born” column also needs to be converted to a YYYY format.
Now, we can calculate the AgeWrote column and move on to the visualization parts.
Visualizing
Now, on to the visualizations! Let’s first look at a histogram of the ages when authors wrote their books:
As you can see, the median age is about 47, with a standard deviation of 13.594206849987206. Interestingly, this seems to suggest that most books are written by people in their middle age. Who are the youngest and oldest authors in the dataset?
- Youngest: Akiane Kramarik who shares her paintings and her story in her book “Akiane: her life, her art, her poetry” written in 2006 when she was 12.
- Oldest: Bertrand Russell, who published his autobiography at 96 (in 1968).
Since the data also has a geographical dimension to it, let’s check how the mean age is distributed based on different countries. Here is a bar chart of the top 20 youngest countries in terms of publishing age:
Interestingly, there doesn’t seem to be a pattern of where the countries are located. On the contrary, there is countries in Africa, Asia, the Caribbean and also my home country of Albania.
And here is a bar chart of the 20 oldest:
In this case, I notice many more European countries in the list, such as Estonia, Greece, North Macedonia, Latvia and Austria. Again, I can’t find an intuitive explanation for this, but if you have an idea, please feel free to comment below.
Another aspect of the books I am interested in is the mean age throughout the years. Have authors been writing their books later in life, or has there been an influx of younger people taking up writing? Here is a visualization:
Interestingly, it does seem that the mean age of the authors goes up as we get closer to recent years. There is a sharp increase of the mean age right before the 1920s, which might be somehow related to World War I. There is also a fluctuation in the 1930s, with the average age suddenly dropping down.
The dataset also contains a column describing the number of pages a book has. Is there some relation between a book’s length and the age of the author when publishing it? Since there are many data points, there would be significant overlap in the scatter plot, so I am instead using a KDE plot for this purpose:
From a first inspection of this plot, it does not seem like there is much of a relationship between the age in which a book was written and the number of pages it contains. In fact, it seems that most of the data points are concentrated around the 300-page range.
Conclusion
To end this story, I think there is at least some evidence that the people writing the best books are in fact in their 40s, with some differences between countries and years when the books were written. I was a bit surprised by the results, as I was expecting a younger mean age, and it would be interesting to dive in the oscillations in the mean writer age across years. Let me know what you think and thanks for reading!