2.2 - Manipulating text data from dreams

By Christian Ryan

January 14, 2020

In the previous post, ‘Pulling text data from the internet’, I experimented with extracting the dream text from a sample of dreams on the website “DreamBank” at: http://www.dreambank.net/random_sample.cgi.

In this follow-up post, I will demonstrate some of the methods presented in Julia Silge and David Robinson’s book ‘Text Mining with R’ for processing text data, as applied to 400 dreams sampled from 4 collections in the DreamBank. I used the methods described in the last post to pull out a random sample of 100 dreams from each of the following 4 groups:

  • college_women (this was the sample used last time)
  • hall_female
  • hall_male
  • vietnam_vet

The first set of dreams was collected by Calvin Hall in 1947 and 1948 from college women - undergraduates taking a course on personality at Western Reserve University.

The second and third samples are dreams collected by Calvin Hall and Robert L. Van de Castle, on which they based the female and male norms in their book The Content Analysis of Dreams.

The sample listed as vietnam_vet comes from the dreams of an American veteran of the Vietnam war who suffered from PTSD. The website has over 400 of his dreams, which he donated from records he kept starting not long after his return from Vietnam.

Let’s begin by loading the three packages we are likely to use.

library(tidyverse)
library(tidytext)
library(stringr)

If you want to follow along with this post, the dataset I am about to load is “dreams_df.csv”, which can be found on my github page: https://github.com/Christian-Ryan/netsite/tree/master/public/post

df <- read_csv("dreams_df.csv")
df <- df[, 2:3]                    # keep just the sample and dream columns
df$sample <- as.factor(df$sample)  # make sample a categorical variable

After sampling the four dream sets using the techniques described in the last post, we now have a dataframe called df with two variables - sample and dream. We will use the custom_view() function we created last time to display snippets of dreams neatly formatted. We can also use the some() function from the car package to take a quick look at a selection of dreams across the dataframe. The some() function is much like head() and tail(), but has the advantage of returning a selection of rows from across the whole dataset, which allows us to see examples from each of the samples simultaneously.

custom_view <- function(x) data.frame(lapply(x, substr, 1, 56))
car::some(df) %>%
  custom_view()
##           sample                                                    dream
## 1  college_women I dreamed that I was in one of my classes and we had jus
## 2  college_women I dreamed I was in N__ Y__ with my family. We were out a
## 3  college_women I dreamt that our cleanlng woman, about 50, who has come
## 4    vietnam_vet I'm on the second floor of a mail fulfillment center, wa
## 5      hall_male I was in a big building in which a lot of people lived. 
## 6    hall_female I was working in a jewelry store. A man whom I knew in h
## 7    hall_female I was skating on the outdoor ice pond that used to be ac
## 8    hall_female I dreamed that I was driving along the street and the tr
## 9    hall_female Several people were telling me of somebody's death. Evid
## 10   hall_female I was at a summer resort. As I was going down the stairs

Julia Silge and David Robinson’s book Text Mining with R: A Tidy Approach sets off at a cracking pace, at least for relative newbies to R such as myself. They assume a degree of familiarity with tidyverse concepts, and when they introduce ideas such as the tidytext format, they sometimes cover three or four steps in a single example. I will unpack some of these as individual steps to illustrate what is going on, while using our dream data as the material for processing.

At the moment our df only contains the sample name (a categorical variable with four values) and the text of the dream. It might be helpful to index the dreams before we tokenise the text in them. So let’s introduce a new variable that we will call dream_number, which will index each dream from 1 to 400 in the dataframe.

df <-  df %>%
  mutate(dream_number = row_number())

Now we have the dream_number variable added, we can unnest the tokens (split the text variable into individual words). The syntax for the unnest_tokens() function is to pipe in the dataframe (df), then supply the name of the variable to be created (word), followed by the variable containing the text we are going to tokenise - in this case “dream”.

df_word <- df %>%
  unnest_tokens(word, dream)
head(df_word)
## # A tibble: 6 × 3
##   sample        dream_number word   
##   <fct>                <int> <chr>  
## 1 college_women            1 i      
## 2 college_women            1 dreamed
## 3 college_women            1 that   
## 4 college_women            1 i      
## 5 college_women            1 was    
## 6 college_women            1 in

See that the word variable has replaced our dream variable and now each word is on a separate row - this is the tidytext format. unnest_tokens() has kept the variables sample and dream_number - it only transforms the input variable (dream) into the output variable (word). Notice also that the function has converted all the words in the word variable to lower case.
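
The lower-casing is just a default. If capitalisation mattered for an analysis - for instance, to help spot proper names - unnest_tokens() takes a to_lower argument. A minimal sketch:

df %>%
  unnest_tokens(word, dream, to_lower = FALSE) %>%  # keep the original capitalisation
  head()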

Tokenisation and N-Grams

It should be noted that when we use unnest_tokens() we are accepting a range of default values. We could have specified something other than single words in our output. The default value of the token argument is ‘words’. We can change this to ‘ngrams’ and use the n argument to specify how many words should be kept together as a group. Let us try a quick run with three-word tokens (trigrams) instead of single words to demonstrate this behaviour.

df_trigrams <- df %>%
  unnest_tokens(trigrams, dream, token = "ngrams", n = 3)
head(df_trigrams)
## # A tibble: 6 × 3
##   sample        dream_number trigrams      
##   <fct>                <int> <chr>         
## 1 college_women            1 i dreamed that
## 2 college_women            1 dreamed that i
## 3 college_women            1 that i was    
## 4 college_women            1 i was in      
## 5 college_women            1 was in the    
## 6 college_women            1 in the office

So here we have set our output variable to ‘trigrams’ and specified the token argument to be equal to ‘ngrams’, and we have saved this as a new dataframe called ‘df_trigrams’. That gives us a better sense of the nature of the text. We can also run a count on this after grouping by sample.

df_trigrams %>%
  group_by(sample) %>%
  count(trigrams, sort = TRUE) %>%
  ungroup()
## # A tibble: 50,935 × 3
##    sample        trigrams              n
##    <fct>         <chr>             <int>
##  1 vietnam_vet   i tell him           33
##  2 hall_female   i was in             29
##  3 college_women i was in             25
##  4 vietnam_vet   the scene changes    23
##  5 college_women and i was            22
##  6 hall_female   seemed to be         19
##  7 college_women that i was           18
##  8 hall_female   that i was           17
##  9 hall_male     seemed to be         17
## 10 hall_female   and i was            16
## # … with 50,925 more rows

Here we can see that in the Vietnam veteran dream sample the most common three-word phrase was “I tell him”, whereas for the hall_female and college_women samples the most common phrase was “I was in”. Using ngrams (units larger than one word) can be useful in exploring the most frequently occurring phrases. It is notable that the Vietnam vet’s phrase is in the present tense, giving a sense of the immediacy and immersion of the dream experience, whereas the most frequent phrases of the other samples are in the past tense.
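
To compare the samples side by side rather than scanning down one long sorted table, we could also keep just the top few trigrams within each group. A sketch, assuming dplyr 1.0 or later for slice_max():

df_trigrams %>%
  count(sample, trigrams, sort = TRUE) %>%
  group_by(sample) %>%
  slice_max(n, n = 3, with_ties = FALSE) %>%  # top three trigrams per sample
  ungroup()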

Single words (Bag of words approach)

We have not removed stop-words yet, as this would have undermined our exploration of ngrams. But it is the next step for our df_word dataset. The anti_join() function takes two dataframes and keeps only the rows of the first that have no match in the second. Since tidytext supplies a stop_words dataframe with a word column, this forms a convenient and easy way to filter out unwanted stopwords.

df_word <- df_word %>%
  anti_join(stop_words)
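
To make the anti_join() logic concrete, here is a toy example with a handful of made-up words:

tibble(word = c("the", "house", "and", "car")) %>%
  anti_join(stop_words, by = "word")
# "the" and "and" have matches in stop_words and are dropped; "house" and "car" survive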

Then we can count the words and sort them into descending order.

df_word %>%
  count(word, sort = TRUE)
## # A tibble: 5,331 × 2
##    word         n
##    <chr>    <int>
##  1 house      133
##  2 dream      132
##  3 remember   125
##  4 car        118
##  5 people     110
##  6 girl       108
##  7 friend     101
##  8 time        95
##  9 woman       93
## 10 mother      85
## # … with 5,321 more rows

But before we create some plots of these words, we should check for any anomalies in the word variable of df_word. The sorted count is likely to give back expected results (high-frequency genuine words), but there can be other text elements that we may want to filter out. These become obvious if we count but don’t sort.

df_word %>%
  count(word)
## # A tibble: 5,331 × 2
##    word       n
##    <chr>  <int>
##  1 ___        1
##  2 ______     1
##  3 00         2
##  4 1          4
##  5 1,500      1
##  6 10        13
##  7 100        3
##  8 105        1
##  9 107th      1
## 10 109        1
## # … with 5,321 more rows

The word variable contains some text elements that we would not regard as words. Let’s check where the underscores came from. To do this we must go back to our original (untokenised) dataset df, as we want to see the underscores in the context of the dream. We can use the str_which() function to identify which dreams contain underscores, matched to the pattern '___'. Then we can use this as an index into the df$dream variable, so that it returns only the dreams containing underscores. As there are three dreams with underscores, we will store them and then take a look at the first one.

underscores <- df$dream[str_which(df$dream, pattern = "___")]
underscores[1]
## [1] "I dreamed about a young married couple whom I have known for a long time. They came to see us at our home. Although the home was ours, it resembled my Uncle's home in C___ and yet the dream seemed to take place in C ___.. They drove up in a Model A Ford & parked it in the front yard. We were in the living room talking when another Model A Ford drove up & in it were my sister & a friend of mine. I went out in the front yard, got in this couple's car, and started to talk to my sister. D___ my sister, asked me if I wanted to go to a play with J. She said that she and her husband weren't going. I realized that I would have to go with him alone, so I refused. Then they drove away and the wife came out in the yard. She seemed perturbed at my getting into their car, so she got into the car and backed it away. The car then suddenly changed into an old-fashioned bicycle. It was at this time that I felt antagonistic towards this couple."

So the pattern here seems to be that underscores are used to disguise the identity of named people in the dreams. We can choose to filter these out as they are not relevant to our analysis. But before we do this filtering, let’s also consider the numbers in the word variable column - again in a bag-of-words approach one could argue that these are not words and so are irrelevant. We want to create a pattern that identifies both digits and underscores, and then use a function to transform our word variable in the df_word dataframe.

Create pattern to remove numbers and underscores

We can use the function str_subset() to identify the elements of the word variable that we wish to remove. Let’s create a pattern that deals initially with the underscores and try str_subset() with it. The ‘+’ is not strictly necessary here, but it illustrates that we can match one or more consecutive underscores with this combination.

str_subset(df_word$word, pattern = '_+')
##  [1] "n__"    "y__"    "c___"   "___"    "d___"   "h___"   "a___"   "a___"  
##  [9] "h__"    "______"

This has found ten underscore-containing tokens in the word variable. Now we want to find all the digits. We could use the regex shorthand [\d] or the POSIX class [:digit:]. Let’s use the latter first with str_subset() to check that it works.

str_subset(df_word$word, pattern = '[:digit:]')
##   [1] "169"    "80"     "90"     "30"     "60"     "40"     "45"     "4"     
##   [9] "20"     "4"      "5"      "2"      "34"     "34"     "309"    "219"   
##  [17] "6"      "5.00"   "8"      "5"      "45"     "20"     "45"     "22"    
##  [25] "50"     "23"     "45"     "22"     "30"     "45"     "22"     "11"    
##  [33] "8"      "8"      "12"     "3rd"    "26"     "20"     "20"     "22"    
##  [41] "60"     "70"     "20"     "25"     "20"     "27"     "23"     "52"    
##  [49] "23"     "7"      "30"     "2nd"    "2nd"    "7"      "10"     "10"    
##  [57] "2"      "1"      "1"      "2"      "999"    "e1"     "10"     "2"     
##  [65] "4"      "2"      "2"      "2"      "25"     "20"     "5"      "22"    
##  [73] "20"     "8"      "30"     "6"      "8"      "30"     "8"      "30"    
##  [81] "50"     "4"      "35"     "4"      "00"     "40"     "20"     "5"     
##  [89] "10"     "2"      "80"     "45"     "48"     "55"     "22"     "40"    
##  [97] "1992"   "200"    "300"    "100"    "20s"    "30s"    "1950s"  "2001"  
## [105] "2012"   "10"     "12"     "1990s"  "50s"    "1972"   "1950s"  "800"   
## [113] "45"     "60s"    "1970"   "45"     "1960s"  "105"    "1st"    "109"   
## [121] "110"    "116"    "121"    "122"    "2001"   "2012"   "138"    "139"   
## [129] "152"    "m16"    "m60"    "59"     "2001"   "2012"   "39"     "244"   
## [137] "1200"   "207"    "208"    "209"    "211"    "214"    "215"    "216"   
## [145] "800"    "411"    "42nd"   "217"    "218"    "219"    "2am"    "123"   
## [153] "220"    "1950s"  "2"      "20"     "20"     "19"     "20"     "22"    
## [161] "8"      "27"     "3"      "1,500"  "50"     "17"     "26"     "30"    
## [169] "10"     "70"     "6"      "3"      "4"      "30"     "33"     "45"    
## [177] "4"      "12"     "12"     "160"    "10"     "11"     "85"     "22"    
## [185] "11"     "10"     "50"     "300"    "30"     "10"     "20"     "440"   
## [193] "880"    "10"     "20"     "30"     "3000"   "3"      "3"      "3"     
## [201] "11"     "12"     "12"     "2"      "13"     "26"     "8"      "30"    
## [209] "11"     "30"     "19"     "7"      "30"     "8"      "30"     "28"    
## [217] "50"     "30"     "18"     "18"     "15"     "20"     "21"     "20"    
## [225] "6"      "30"     "19"     "16"     "2"      "23"     "25"     "35"    
## [233] "40"     "25"     "5"      "3"      "2"      "23"     "50"     "3"     
## [241] "3"      "1"      "2"      "2"      "23"     "40"     "35"     "8"     
## [249] "8"      "2"      "107th"  "16"     "27"     "10"     "8"      "60"    
## [257] "21"     "20"     "50"     "2"      "11"     "15"     "11"     "17"    
## [265] "17"     "2"      "34"     "45"     "49"     "52"     "55"     "3"     
## [273] "5"      "20"     "26"     "75.00"  "2"      "6"      "27"     "3"     
## [281] "4"      "00"     "1st"    "2nd"    "3rd"    "10"     "20"     "10"    
## [289] "30"     "50"     "50"     "11,000" "1"      "48th"   "4"      "6"     
## [297] "25th"   "100"    "100"

This works very nicely as well. However, to use these patterns within a tidyverse pipe it is easier to use the filter() function with str_detect() rather than str_subset(), and since it is convenient to chain steps in the pipe, we can use two calls to filter(): first for underscores and secondly for digits. As we don’t want either of these in our dataset, we set the negate argument to TRUE in both cases. An alternative way to remove the digits would be to use the capital “D” (non-digit) shorthand in the regex, but this way keeps our filters uniform, both with a negate = TRUE argument. Note that this time we assign the result back to df_word, so the tidied version is the one we carry forward.

df_word <- df_word %>%
  filter(str_detect(word, pattern = "_", negate = TRUE)) %>%
  filter(str_detect(word, pattern = '[\\d]', negate = TRUE))
df_word
## # A tibble: 17,953 × 3
##    sample        dream_number word      
##    <fct>                <int> <chr>     
##  1 college_women            1 dreamed   
##  2 college_women            1 office    
##  3 college_women            1 directress
##  4 college_women            1 nurses    
##  5 college_women            1 nursing   
##  6 college_women            1 school    
##  7 college_women            1 forty     
##  8 college_women            1 told      
##  9 college_women            1 results   
## 10 college_women            1 i.q       
## # … with 17,943 more rows
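
Incidentally, the two filters could be collapsed into one by putting both characters inside a single character class - an equivalent sketch:

df_word %>%
  filter(str_detect(word, pattern = "[_\\d]", negate = TRUE))  # drop any word containing an underscore or a digit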

Plot word frequencies

Now that we have done some tidying on the dataset, we can plot the word frequencies. A simple approach is to pass them through a filter so we only retain those words with a frequency greater than, say, n = 60. Notice we use mutate() to reorder the word variable by frequency, then filter by frequency, and pass the two variables to the ggplot() function. We also have to switch syntax at this point from the pipe ( %>% ) to the + sign between the layers of the ggplot() call. Finally we flip the coordinates, as this keeps the words in the horizontal aspect and makes the plot easier to read.

df_word %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  filter(n > 60) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col()+
  coord_flip()

This gives us an overview of the most commonly used words in dreams recalled across all four samples. But it would be more interesting to see how word use differs between the samples. However, we should be prepared for the possibility that the length of dreams may vary between samples. To control for this, we might want to convert our raw counts of words to proportions of the dream text. Let’s check the variety of dream lengths by using the str_count() function on our original dataset df - that is, before we removed our stopwords. We will count the words in each dream and store the result in a vector called dream_lengths. By default, str_count() counts characters if no pattern is given to match. However, if we pass it a second argument specifying a regex for sequences of non-space characters, it will count words instead. The regex combines ‘\S’, the code for any non-whitespace character, with a ‘+’ to indicate one or more of them; and because ‘\S’ on its own is not a valid escape sequence in an R string, the backslash itself must be escaped, giving the pattern “\\S+”.

dream_lengths <- str_count(df$dream, "\\S+")
plot(dream_lengths, xlab = "Dream Number")
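
As a quick sanity check of the pattern, we can try it on a made-up sentence:

str_count("I was flying over the sea", "\\S+")  # 6 - six runs of non-space characters, i.e. six words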

This is a good example of the use of the plot() function with a single vector in R. The default behaviour is to plot the values of the vector on the y-axis - dream_lengths in this case - against the index number (i.e. the order in which each value occurs in the vector) on the x-axis. So our x-axis simply represents the order of the dreams, or as we have named it, the dream number. We can see here the range of dream lengths, with the minimum being just under 40 words and the maximum nearly 290. We can take the min, max, mean and SD if we want to be more specific.

min(dream_lengths); max(dream_lengths); mean(dream_lengths); sd(dream_lengths)
## [1] 38
## [1] 288
## [1] 141.0325
## [1] 45.09413

There is a great deal of variability in the dream lengths, so proportions will be better than raw counts to represent the frequency of each word.

Calculating word frequencies as proportions

We will want to calculate the proportions after stopwords are removed. We have a choice here: we can express the frequency of individual words as a proportion of a dream or as a proportion of a sample, and these would have different interpretations. If the texts (in our case dreams) were much longer, proportion by text might be the better way to represent the data, but with texts this short I suspect proportion by dream may not be very informative. Let’s try it and see what the results look like. We will group_by() dream_number and word so as to create proportions by dream. Then we use summarise() to create a word count, and mutate() to convert this to a percentage. A second mutate() rounds this to two decimal places with the round() function. Finally, we use the tidyverse equivalent of sort(), which is the arrange() function - but because we want this to be largest-to-smallest, we also include the desc() descending function.

df_word %>%
  group_by(dream_number, word) %>%
  summarise(n = n()) %>%
  mutate(percent = (n / sum(n))*100) %>%
  mutate(percent = round(percent, 2)) %>%
  arrange(desc(percent)) %>%
  ungroup()
## `summarise()` has grouped output by 'dream_number'. You can override using the `.groups` argument.
## # A tibble: 15,500 × 4
##    dream_number word         n percent
##           <int> <chr>    <int>   <dbl>
##  1            6 remember     3    21.4
##  2          358 office       5    20  
##  3          355 dog          7    19.4
##  4          381 bus          6    18.2
##  5            2 hair         3    16.7
##  6           20 bed          3    16.7
##  7           83 store        5    16.7
##  8          260 car          6    16.7
##  9          399 test         6    15.8
## 10           13 dream        2    15.4
## # … with 15,490 more rows

So in dream number 6 the word ‘remember’ accounted for 21% of the non-stopwords used. That seems like a high proportion. It might be more useful to look at the data aggregated across samples. We can change the code to group_by sample instead of dream_number, then recalculate the most frequently occurring words as a proportion of words by sample.

df_word %>%
  group_by(sample, word) %>%
  summarise(n = n()) %>%
  mutate(percent = (n / sum(n))*100) %>%
  mutate(percent = round(percent, 2)) %>%
  arrange(desc(percent)) %>%
  ungroup()
## `summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
## # A tibble: 8,118 × 4
##    sample        word         n percent
##    <fct>         <chr>    <int>   <dbl>
##  1 college_women remember    55    1.63
##  2 hall_male     dream       57    1.31
##  3 hall_male     car         51    1.17
##  4 hall_male     house       49    1.13
##  5 vietnam_vet   woman       64    1   
##  6 hall_female   remember    40    0.96
##  7 college_women car         32    0.95
##  8 college_women dream       32    0.95
##  9 hall_female   dream       38    0.92
## 10 hall_female   house       36    0.87
## # … with 8,108 more rows

We can see that for the college women, the word ‘remember’ features the most frequently across the whole sample of 100 dreams and makes up roughly 1.6% of the non-stopwords in the dreams recorded.
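
A natural next step would be to plot the top words within each sample as a faceted chart. Here is a sketch using tidytext’s reorder_within() helper, which orders the words separately inside each facet (again assuming dplyr 1.0+ for slice_max()):

df_word %>%
  count(sample, word) %>%
  group_by(sample) %>%
  slice_max(n, n = 10, with_ties = FALSE) %>%    # ten most frequent words per sample
  ungroup() %>%
  mutate(word = reorder_within(word, n, sample)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  scale_x_reordered() +                          # strip the within-facet ordering suffix from the labels
  facet_wrap(~ sample, scales = "free") +
  coord_flip()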

Conclusion

We have explored how to tokenise texts, carried out some basic text cleaning, created counts and proportions, and finally graphed the simple word counts. In the next post in this series, I will explore the dream data using a clever technique from Julia Silge and David Robinson’s book that involves the spread() and gather() functions.
