Code used to develop this analysis can be found on my GitHub.
Intro and Data Processing
Recently, I came across an editorial on Medium.com entitled “What I learned from Jeff Bezos after reading every Amazon shareholder letter”. As the title suggests, the editorial centers on the author’s takeaways after reading the letters Jeff Bezos wrote to Amazon shareholders each year from 1997 to 2016.
In the same spirit, I thought it would be interesting to detect the thematic elements of Jeff Bezos’ shareholder letters using topic modelling. To perform this analysis I opted for the Latent Dirichlet Allocation (LDA) topic model, with coherence as the primary evaluation measure.
Preparing the data in a format appropriate for LDA was a little difficult because all of the letters were contained within a single file. However, I was able to use ‘To our share’ as an effective split point. I couldn’t use the full word ‘shareholders’ because in several of the letters Bezos uses the term ‘shareowners’. This split allowed each document to be tokenized, stemmed and stripped of stop words independently.
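The split described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the repository; the function name `split_letters` and the sample text are made up for the example.

```python
import re

# Marker shared by "To our shareholders" and "To our shareowners"
SPLIT_MARKER = "To our share"


def split_letters(raw_text):
    """Split one combined text into individual letters (hypothetical helper)."""
    # A zero-width lookahead splits *before* each marker, so the
    # salutation stays attached to the letter it opens.
    parts = re.split(r"(?=" + re.escape(SPLIT_MARKER) + r")", raw_text)
    # Drop anything that precedes the first salutation (titles, blank space).
    return [p.strip() for p in parts if p.strip().startswith(SPLIT_MARKER)]


# Made-up sample standing in for the single file of letters
sample = (
    "1997 LETTER\n"
    "To our shareholders:\nFirst letter body.\n"
    "To our shareowners:\nSecond letter body.\n"
)
letters = split_letters(sample)
```

Each element of `letters` can then be tokenized, stemmed and filtered on its own.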
By and large, I used the default tokenization provided by spaCy, although I did amend it to handle contractions such as “we’re”, “you’ll”, etc.
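One simple way to deal with contractions is to expand them before tokenizing. The sketch below is a standard-library stand-in for that idea and is not the spaCy customization used in the analysis; the suffix table and the name `expand_contractions` are assumptions made for illustration, and the table is deliberately incomplete.

```python
import re

# Illustrative, incomplete mapping from contraction suffix to expansion
CONTRACTION_SUFFIXES = {
    "'re": " are",
    "'ll": " will",
    "'ve": " have",
}

# Match a known suffix only when it directly follows a word character
_SUFFIX_PATTERN = re.compile(
    r"(?<=\w)(" + "|".join(re.escape(s) for s in CONTRACTION_SUFFIXES) + r")\b"
)


def expand_contractions(text):
    """Expand a few common contractions (hypothetical helper)."""
    # Normalize the typographic apostrophe so one pattern covers both forms
    text = text.replace("\u2019", "'")
    return _SUFFIX_PATTERN.sub(lambda m: CONTRACTION_SUFFIXES[m.group(1)], text)


expanded = expand_contractions("we\u2019re sure you\u2019ll agree")
```

After this step a default tokenizer sees ordinary words, and the expansions are typically removed later as stop words.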
The first step I took after processing the text was constructing a frequency distribution of all the unique words in all the documents.
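A frequency distribution of this kind can be built with `collections.Counter`. The sketch below assumes each document has already been reduced to a list of processed tokens; the function name and the toy tokens are made up for the example.

```python
from collections import Counter


def frequency_distribution(tokenized_docs):
    """Count every token across all documents (hypothetical helper)."""
    counts = Counter()
    for tokens in tokenized_docs:
        counts.update(tokens)
    return counts


# Toy stand-in for the processed letters (stemmed tokens, stop words removed)
docs = [
    ["customer", "servic", "customer"],
    ["customer", "deliveri"],
]
freq = frequency_distribution(docs)
top_words = freq.most_common(2)
```

Inspecting `freq.most_common(n)` for a modest `n` is usually enough to spot both the dominant subjects and any domain-specific stop words.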
This frequency distribution served two functions. First, it gave me a first look at the subject matter of the documents. Second, it helped to highlight domain-specific stop words. In some instances it is prudent to remove the words in the top/bottom n percent from each document before running the LDA, because they can cause the modelled topics to be unusually similar. In this instance, however, customers are integral to every facet of the company, so it seemed natural to allow many of the topics to share the word. As the frequency distribution makes clear, customers, along with services and timeliness, play a big role in the company’s operations.
Topic Modelling and Analysis
In performing the LDA it is necessary to decide on the number of topics to return. To set this parameter, I used gensim’s default coherence measure to evaluate several prospective models, each with a different number of topics.
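A sweep of this kind might look like the sketch below. It is written under several assumptions rather than taken from the repository: the function names, the `passes` and `seed` values, and the choice of `"c_v"` (the default of gensim's `CoherenceModel`) are all illustrative. The gensim imports sit inside the function so that the small selection helper works on its own.

```python
def coherence_by_num_topics(tokenized_docs, candidate_counts, passes=10, seed=0):
    """Fit one LDA model per candidate topic count and score each (sketch)."""
    # Imported here so best_num_topics() below is usable without gensim
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

    scores = {}
    for k in candidate_counts:
        lda = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=k,
            passes=passes,      # assumed value, tune for the real corpus
            random_state=seed,  # fixed seed so runs are comparable
        )
        coherence = CoherenceModel(
            model=lda,
            texts=tokenized_docs,
            dictionary=dictionary,
            coherence="c_v",    # assumed: gensim's default measure
        )
        scores[k] = coherence.get_coherence()
    return scores


def best_num_topics(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)
```

With the tokenized letters in hand, `best_num_topics(coherence_by_num_topics(docs, range(2, 11)))` would return the candidate with the highest score, although it is worth eyeballing the topics of the runners-up as well, since coherence differences between neighbouring counts can be small.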