Looking at Stop Words: Why You Shouldn’t Blindly Trust Model Defaults

class: center, middle, title-slide

# Looking at Stop Words: Why You Shouldn’t Blindly Trust Model Defaults
## SLC RUG December 2020
### Emil Hvitfeldt

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 20%

.footnote[
Photo by <a href="https://unsplash.com/@hellocolor?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Pawel Nolbert</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]


---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 30%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 40%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 50%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 60%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 70%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 80%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 90%

---

background-image: url("fish/sea.jpg")
background-size: 100%
background-position: 0% 100%

---

background-image: linear-gradient( rgba(0, 0, 0, 0.2), rgba(0, 0, 0, 0.2) ), url("fish/sea.jpg")
background-size: 100%
background-position: 0% 100%

---

background-image: linear-gradient( rgba(0, 0, 0, 0.4), rgba(0, 0, 0, 0.4) ), url("fish/sea.jpg")
background-size: 100%
background-position: 0% 100%

---

background-image: linear-gradient( rgba(0, 0, 0, 0.6), rgba(0, 0, 0, 0.6) ), url("fish/sea.jpg")
background-size: 100%
background-position: 0% 100%

---

background-image: linear-gradient( rgba(0, 0, 0, 0.8), rgba(0, 0, 0, 0.8) ), url("fish/sea.jpg")
background-size: 100%
background-position: 0% 100%

---

.center[
# What are stop words?
]

---

background-image: url("fish/fish05.png")
background-size: 60%
background-position: 35% 90%

.center[
# What are stop words?
]

.footnote[
Photo by <a href="https://unsplash.com/@davidclode?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">David Clode</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

background-image: url("fish/fish12.png")
background-size: 40%
background-position: 0% 80%

.center[
# History
]

Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept.

.footnote[
Photo by <a href="https://unsplash.com/@slinger?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Paco Joss</a> on <a href="https://unsplash.com/?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

# Definitions from the Web

> "In natural language processing, useless words (data), are referred to as stop words."

--
 
> "In computing, stop words are words that are filtered out before or after the natural language data (text) are processed."

> "Stopwords are the words in any language which does not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence"

---
 
.center[
<span style='font-size:200px;'>🤔</span>
]

## This gives the illusion that stop words are easy to work with and free of problems

---

background-image: url("fish/fish14.png")
background-size: 50%
background-position: 100% 20%

# This is not the case!

.footnote[
Photo by <a href="https://unsplash.com/@tangzhengtao?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">zhengtao tang</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

background-image: url("fish/fish14.png")
background-size: 50%
background-position: 100% 20%

# What are stop words, really?

.pull-left[
Low-information words that contribute little value to the task at hand

The information of words lives on a continuum
]

---

.pull-left[
## Word information

Each rectangle represents a word in one document

Color illustrates how much information each word carries.

<span style='color:#3E049CFF;'>low information words</span>

<span style='color:#FCCD25FF;'>high information words</span>

]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

Uniform information

If this were true, removing any word would hurt

# 👎
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

Random information

No way to figure out which words to remove

# 👎
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

Random information

No way to figure out which words to remove

# 👎
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

High variance information
(diamonds in the rough)

Few words have a lot of information

Most words have no information

# 👍
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

High variance information
(diamonds in the rough)

Few words have a lot of information

Most words have no information

# 👍
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

Low variance information

Smooth transition between low and high information words

# 👍
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.pull-left[
## Word information

Low variance information

Smooth transition between low and high information words

# 👍
]

.pull-right[
<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="80%" style="display: block; margin: auto;" />
]

---

.center[
# Information distribution
]

---

background-image: url("fish/fish06.png")
background-size: 60%
background-position: 100% 60%

# We need to strike a balance between .orange[speed] and .orange[performance]

---

background-image: url("fish/fish01.png")
background-size: 40%
background-position: 100% 0%

# How can we handle this?

- Premade lists
- Homemade lists
- Super secret master method???

.footnote[
Photo by <a href="https://unsplash.com/@antoinepeltier?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Antoine Peltier</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

background-image: url("fish/fish03.png")
background-size: 40%
background-position: 10% 40%

.pull-right[
# Premade list

I have talked about stop words as if there were only a handful of lists out there

And as if each list were well constructed
]

---

.center[
# This is WRONG
]

.center[
# And I'll tell you why
]

---

background-image: url("fish/fish11.png")
background-size: 35%
background-position: 100% 0%

.pull-left[

## Why would you choose a premade stop word list?

### Pro
- Fast
- Easy

### Con
- General
]

.footnote[
Photo by <a href="https://unsplash.com/@gaspanik?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Masaaki Komori</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

# English stop word lists

.pull-left[
- Galago (forumstop)
- EBSCOhost
- CoreNLP (Hardcoded)
- Ranks NL (Google)
- Lucene, Solr, Elasticsearch
- MySQL (InnoDB)
- Ovid (Medical information services)
]

.pull-right[
- Bow (libbow, rainbow, arrow, crossbow)
- LingPipe
- Vowpal Wabbit (doc2lda)
- Text Analytics 101
- LexisNexis®
- Okapi (gsl.cacm)
- TextFixer
- DKPro
]

---

# English stop word lists

.pull-left[
- Postgres
- CoreNLP (Acronym)
- NLTK
- Spark MLlib
- MongoDB
- Quanteda
- Ranks NL (Default)
- Snowball (Original)
]

.pull-right[
- Xapian
- 99webTools
- Reuters Web of Science™
- Function Words (Cook 1988)
- Okapi (gsl.sample)
- Snowball (Expanded)
- Galago (stopStructure)
- DataScienceDojo
]

---

# English stop word lists

.pull-left[
- CoreNLP (stopwords.txt)
- OkapiFramework
- ATIRE (NCBI Medline)
- scikit-learn
- Glasgow IR
- Function Words (Gilner, Morales 2005)
- Gensim
]

.pull-right[
- Okapi (Expanded gsl.cacm)
- spaCy
- C99 and TextTiling
- Galago (inquery)
- Indri
- Onix, Lextek
- GATE (Keyphrase Extraction)
]

---

## He got candy. He shouldn't have, but he did.

### snowball (175)

<div style="display: flex; justify-content: space-between;">
"he"
"got"
"candy"
"he"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

### SMART (571)

<div style="display: flex; justify-content: space-between;">
"he"
"got"
"candy"
"he"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

### NLTK (179)

---

## He got candy. He shouldn't have, but he did.

### ISO (1298)

### CoreNLP (29)

<div style="display: flex; justify-content: space-between;">
"he"
"got"
"candy"
"he"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

### Galago (15)

<div style="display: flex; justify-content: space-between;">
"he"
"got"
"candy"
"he"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

---

background-image: url("fish/fish02.png")
background-size: 50%
background-position: 50% 90%

## But wait, what about capitalization?

This is normally done during tokenization

The way we tokenize the text will matter a lot for certain stop word lists
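
A minimal sketch of why case matters, using the example sentence from the earlier slides. The stop word list here is a tiny hypothetical one (all lowercase by assumption), not any of the published lists.

```r
# Hypothetical, tiny stop word list; assumed to be all lowercase.
stop_list <- c("he", "have", "but", "did")

tokens <- c("He", "got", "candy", "He", "shouldn't", "have", "but", "he", "did")

# Without lowercasing, the capitalized "He" slips through the filter.
kept_as_is <- tokens[!tokens %in% stop_list]

# Lowercasing first lets "He" match "he".
lowered <- tolower(tokens)
kept_lowered <- lowered[!lowered %in% stop_list]

print(kept_as_is)
print(kept_lowered)
```

Exact string matching is all that `%in%` does, so any mismatch in case between tokens and list silently keeps the word.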

---

# No conversion to lowercase

### snowball (175)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy"
"He"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

### SMART (571)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy"
"He"
"shouldn't"
"have"
"but"
"he"
"did"
</div>

### NLTK (179)

---

# Space separated tokenizer

### snowball (175)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy."
"He"
"shouldn't"
"have,"
"but"
"he"
"did."
</div>

### SMART (571)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy."
"He"
"shouldn't"
"have,"
"but"
"he"
"did."
</div>

### NLTK (179)

---

# Split on non-word characters

### snowball (175)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy"
""
"He"
"shouldn"
"t"
"have"
""
"but"
"he"
"did"
</div>

### SMART (571)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy"
""
"He"
"shouldn"
"t"
"have"
""
"but"
"he"
"did"
</div>

### NLTK (179)

<div style="display: flex; justify-content: space-between;">
"He"
"got"
"candy"
""
"He"
"shouldn"
"t"
"have"
""
"but"
"he"
"did"
</div>
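
The two tokenizers above can be sketched in base R. The stop word list is a small hypothetical one that includes a contraction; `strsplit()` stands in for a real tokenizer.

```r
text <- "He got candy. He shouldn't have, but he did."

# Hypothetical, tiny stop word list that contains a contraction.
stop_list <- c("he", "have", "but", "did", "shouldn't")

# Tokenizer 1: split on spaces only, so punctuation stays attached.
space_tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]

# Tokenizer 2: split on every non-word character, so contractions break
# apart and empty strings appear where two separators are adjacent.
nonword_tokens <- strsplit(tolower(text), "\\W")[[1]]

kept_space <- space_tokens[!space_tokens %in% stop_list]
kept_nonword <- nonword_tokens[!nonword_tokens %in% stop_list]

print(kept_space)
print(kept_nonword)
```

With the space tokenizer, "have," and "did." survive because of the attached punctuation. With the non-word tokenizer, "shouldn't" can never match because it no longer exists as a single token.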

---

background-image: url("fish/fish07.png")
background-size: 40%
background-position: 80% 90%

.pull-left[
## Your tokenizer and stop word list should be .orange[compatible]

Stemming and lemmatization also change the tokens

Know the order in which you should remove stop words and perform stemming
]
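
A rough sketch of why the order matters. `toy_stem()` is a made-up suffix stripper for illustration only, not a real stemming algorithm, and the stop word list is hypothetical.

```r
# Toy stemmer for illustration only: strips a trailing "ing" or "s".
toy_stem <- function(x) sub("(ing|s)$", "", x)

# Hypothetical stop word list with unstemmed entries.
stop_list <- c("his", "was", "during")

tokens <- c("his", "was", "swimming", "during", "lessons")

# Order 1: stem first, then remove stop words.
# The stemmed tokens no longer match the unstemmed list.
stemmed <- toy_stem(tokens)
stem_then_remove <- stemmed[!stemmed %in% stop_list]

# Order 2: remove stop words first, then stem what is left.
remove_then_stem <- toy_stem(tokens[!tokens %in% stop_list])

print(stem_then_remove)
print(remove_then_stem)
```

If you do want to stem first, the stop word list needs to go through the same stemmer.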

.footnote[
Photo by <a href="https://unsplash.com/@kyawthutun?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Kyaw Tun</a> on <a href="https://unsplash.com/@emilhvitfeldt/likes?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>
]

---

background-image: url("fish/fish08.png")
background-size: 70%
background-position: 50% 90%

.center[
# LOOK AT YOUR STOP WORD LIST
]

---

# Non-English stop word lists

- Make sure that your list works in the target language
- Directly translating an English stop word list will not be sufficient
- Know the target language or
- Hire a consultant who knows the language

---

# funky stop words quiz #1

.pull-left[
- he's
- she's
- himself
- herself
]

---

# funky stop words quiz #1

.pull-left[
- he's
- .orange[she's]
- himself
- herself
]

.pull-right[
.orange[she's] doesn't appear in the SMART list
]

---

# funky stop words quiz #2

.pull-left[
- owl
- bee
- fify
- system1
]

---

# funky stop words quiz #2

.pull-left[
- owl
- bee
- .orange[fify]
- system1
]

.pull-right[
.orange[fify] was left undetected for 3 years (2012 to 2015) in scikit-learn
]

---

# funky stop words quiz #3

.pull-left[
- substantially
- successfully
- sufficiently
- statistically
]

---

# funky stop words quiz #3

.pull-left[
- substantially
- successfully
- sufficiently
- .orange[statistically]
]

.pull-right[
.orange[statistically] doesn't appear in the Stopwords ISO list
]

---

background-image: url("fish/fish04.png")
background-size: 70%
background-position: 80% 90%

# How did these words appear in this list?

Generation process

Grammar-based or frequency-based

These lists are meant to be GENERAL*

---

background-image: url("fish/fish08.png")
background-size: 70%
background-position: 50% 90%

.center[
# LOOK AT YOUR STOP WORD LIST
]

---

.center[
# Stop word lists

## General

### vs

## domain specific
]

---

### Most stop word lists are made to work with "general language"

### Domain-specific stop words will work differently

---

# Home-made lists

Use a count-based approach to construct a stop word list

Use domain knowledge to add words you know won't be important

Here you have a chance to weed out a little of the bias

Make sure to filter by the minimum number of times a word is seen
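
One possible count-based sketch in base R. The corpus, the whitespace tokenizer, the "appears in every document" cutoff, and the added domain word are all assumptions chosen to keep the example small.

```r
# Hypothetical mini corpus standing in for a domain-specific collection.
docs <- c(
  "the patient was given the drug",
  "the patient reported no pain",
  "no drug was given to the control group"
)

# Tokenize: lowercase and split on whitespace (enough for this toy text).
tokens_per_doc <- strsplit(tolower(docs), "\\s+")

# Document frequency: in how many documents does each word appear?
doc_freq <- table(unlist(lapply(tokens_per_doc, unique)))

# Candidate stop words: words that appear in every document.
candidates <- names(doc_freq)[as.vector(doc_freq) == length(docs)]

# Domain knowledge: add words you have reviewed and know carry no signal.
domain_words <- c("patient")
stop_list <- sort(union(candidates, domain_words))

print(doc_freq)
print(stop_list)
```

The cutoff is a judgment call. Print the candidates and read them before removing anything.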

---

.center[
# Secret method??
]

---

background-image: url("fish/fish09.png")
background-size: 40%
background-position: 0% 90%

.pull-right[
# Combine the two

Start with a conservative list and add words you find appropriate

Best of both worlds
]

---

# Default stop words

What are the default stop words in tidytext, textrecipes, and quanteda?

Re-exports `stopwords::stopwords()`. Defaults to snowball

```r
quanteda::stopwords()
```

Tibble with onix, SMART and snowball

```r
tidytext::stop_words
```

Defaults to snowball

```r
textrecipes::step_stopwords()
```
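
A sketch of how you might look at the lists before trusting a default, assuming the stopwords package is installed. The source names are the ones that package uses. The extra word `"candy"` is a made-up domain addition.

```r
# Assumes the stopwords package is installed.
sources <- c("snowball", "smart", "stopwords-iso")

lists <- lapply(sources, function(src) {
  stopwords::stopwords(language = "en", source = src)
})
names(lists) <- sources

# How long is each list?
print(lengths(lists))

# Which lists contain a given word?
print(vapply(lists, function(words) "she's" %in% words, logical(1)))

# Combine the two: start from a conservative list and add your own words.
my_stop_words <- union(lists$snowball, c("candy"))
```

Running the membership check for the quiz words is a quick way to see whether the lists behave the way you expect.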

---

# Defaults in modeling

Every modeling software/library claims to have "good defaults"

They might be different from what you want

And they might be different from what you THINK they are

---

# Function arguments

We have .orange[required arguments] and .orange[additional arguments]

Having more required arguments makes your software less likely to surprise the user

But it also frustrates users who have to fill in the same arguments the same way every time
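
A small sketch for counting both kinds of arguments with base R's `formals()`. `split_args()` and `fit_model()` are hypothetical names made up for this example. "Required" here only means "has no default in the signature", and a function can still handle such an argument internally.

```r
# Split a function's formal arguments into those without a default
# and those with one. `...` is ignored.
split_args <- function(f) {
  fmls <- as.list(formals(f))
  fmls <- fmls[names(fmls) != "..."]
  # An argument without a default is stored as the empty symbol.
  has_default <- !vapply(
    fmls,
    function(a) identical(a, quote(expr = )),
    logical(1)
  )
  list(
    required = names(fmls)[!has_default],
    additional = names(fmls)[has_default]
  )
}

# Hypothetical modeling function: two required arguments, three with defaults.
fit_model <- function(x, y, penalty = 1, standardize = TRUE, max_iter = 100, ...) {
  NULL
}

arg_split <- split_args(fit_model)
print(lengths(arg_split))
```

The same helper can be pointed at any closure, for example a modeling function from a package you use, to see how many decisions are being made for you.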

---

# glmnet

`glmnet::glmnet()` has 26 arguments, with 2 of them being required

## Only filling in required

```r
glmnet::glmnet(x_mat, y_mat)
```

---

# glmnet

`glmnet::glmnet()` has 26 arguments, with 2 of them being required

## Filling in all arguments

```r
glmnet::glmnet(
  x_mat, y_mat, family = "gaussian", weights = NULL, offset = NULL, 
  alpha = 1, nlambda = 100, lambda = NULL, standardize = TRUE, 
  intercept = TRUE, thresh = 1e-07, exclude = NULL, lower.limits = -Inf, 
  upper.limits = Inf, maxit = 1e+05, standardize.response = FALSE, 
  type.multinomial = c("ungrouped"), relax = FALSE, trace.it = 0)
```

---

## In my experience most software leans towards

.center[
<span style="font-size:45px;">
`length(required) << length(additional)`
</span>
]

---

# recipes defaults

What does `step_pca()` do?

PCA yes, but what does it return?

```r
step_pca
```

```
## function (recipe, ..., role = "predictor", trained = FALSE, num_comp = 5, 
##     threshold = NA, options = list(), res = NULL, prefix = "PC", 
##     skip = FALSE, id = rand_id("pca")) 
## {
##     if (!is_tune(threshold) & !is_varying(threshold)) {
##         if (!is.na(threshold) && (threshold > 1 | threshold <= 
##             0)) {
##             rlang::abort("`threshold` should be on (0, 1].")
##         }
##     }
##     add_step(recipe, step_pca_new(terms = ellipse_check(...), 
##         role = role, trained = trained, num_comp = num_comp, 
##         threshold = threshold, options = options, res = res, 
##         prefix = prefix, skip = skip, id = id))
## }
## <bytecode: 0x7fd012d4fb60>
## <environment: namespace:recipes>
```

---

# recipes defaults

Most if not all steps will work out of the box without having to set any arguments

Does this mean that the arguments are perfect? NO! But they are a good stepping stone and building block

Once you get something that runs, you can adjust the values

---

# parsnip defaults

```r
library(parsnip)

linear_reg() %>%
  fit(mpg ~ ., data = mtcars)
```

```
## Warning: Engine set to `lm`.
```

```
## parsnip model object
## 
## Fit time:  5ms 
## 
## Call:
## stats::lm(formula = mpg ~ ., data = data)
## 
## Coefficients:
## (Intercept)          cyl         disp           hp         drat           wt  
##    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
##        qsec           vs           am         gear         carb  
##     0.82104      0.31776      2.52023      0.65541     -0.19942
```

---

# parsnip defaults

```r
library(parsnip)

nearest_neighbor() %>%
  fit(mpg ~ ., data = mtcars)
```

```
## Error: Please set the mode in the model specification.
```

Some of the modeling functions will error if a mode is not set.

This is another balancing act
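
A sketch of spelling out the choices instead of relying on defaults, assuming parsnip is installed. The engine `"kknn"` and `neighbors = 5` are choices made for this example. The specification is only built here, not fitted, since fitting would also need the engine package.

```r
library(parsnip)

# State the mode, the engine, and the main parameter explicitly.
knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_mode("regression") %>%
  set_engine("kknn")

print(knn_spec)
```

Writing the mode and engine down costs two lines and removes any doubt about what was actually run.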

---

background-image: url("fish/fish10.png")
background-size: 40%
background-position: 100% 90%

# Naming conventions
.pull-left[
]

---

background-image: url("fish/fish10.png")
background-size: 40%
background-position: 100% 90%

# Naming conventions
.pull-left[

How do we name well-established functions and arguments?
]

---

background-image: url("fish/fish10.png")
background-size: 40%
background-position: 100% 90%

# Naming conventions
.pull-left[

How do we name well-established functions and arguments?

This seems like it should be a simple problem
]

---

background-image: url("fish/fish10.png")
background-size: 40%
background-position: 100% 90%

# Naming conventions
.pull-left[

How do we name well-established functions and arguments?

This seems like it should be a simple problem

but it isn't 😢
]

---

### Argument names for boosted trees in R

|parsnip        |xgboost              |C5.0         |spark                               |
|:--------------|:--------------------|:------------|:-----------------------------------|
|tree_depth     |max_depth (6)        |NA           |max_depth (5)                       |
|trees          |nrounds (15)         |trials (15)  |max_iter (20)                       |
|learn_rate     |eta (0.3)            |NA           |step_size (0.1)                     |
|mtry           |colsample_bytree (1) |NA           |feature_subset_strategy (see below) |
|min_n          |min_child_weight (1) |minCases (2) |min_instances_per_node (1)          |
|loss_reduction |gamma (0)            |NA           |min_info_gain (0)                   |
|sample_size    |subsample (1)        |sample (0)   |subsampling_rate (1)                |
|stop_iter      |early_stop           |NA           |NA                                  |

---

# Mismatch between names

<blockquote class="twitter-tweet">By default, logistic regression in scikit-learn runs w L2 regularization on and defaulting to magic number C=1.0. How many millions of ML/stats/data-mining papers have been written by authors who didn&#39;t report (&amp; honestly didn&#39;t think they were) using regularization?&mdash; Zachary Lipton (@zacharylipton) <a href="https://twitter.com/zacharylipton/status/1167298276686589953?ref_src=twsrc%5Etfw">August 30, 2019</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>