# Introduction to randomization, Part 2 {#randomization2}
<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->
::: {.summary}
### Functions introduced in this chapter {-}
`sample`, `specify`, `hypothesize`, `generate`, `calculate`, `visualize`, `shade_p_value`, `get_p_value`
:::
## Introduction {#randomization2-intro}
In this chapter, we'll learn more about randomization and simulation. Instead of flipping coins, though, we'll randomly shuffle data around in order to explore the effects of randomizing a predictor variable.
### Install new packages {#randomization2-install}
If you are using RStudio Workbench, you do not need to install any packages. (Any packages you need should already be installed by the server administrators.)
If you are using R and RStudio on your own machine instead of accessing RStudio Workbench through a browser, you'll need to type the following commands at the Console:
```
install.packages("openintro")
install.packages("infer")
```
### Download the R notebook file {#randomization2-download}
Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/09-intro_to_randomization_2.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/09-intro_to_randomization_2.Rmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks {#randomization2-restart}
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
## Load packages {#randomization2-load}
We'll load `tidyverse` as usual along with the `janitor` package to make tables (with `tabyl`). The `openintro` package has a data set called `sex_discrimination` that we will explore. Finally, the `infer` package will provide tools that we will use in nearly every chapter for the remainder of the book.
```{r}
library(tidyverse)
library(janitor)
library(openintro)
library(infer)
```
## Our research question {#randomization2-question}
An interesting study was conducted in the 1970s that investigated gender discrimination in hiring.^[Rosen B and Jerdee T. 1974. Influence of sex role stereotypes on personnel decisions. *Journal of Applied Psychology* 59(1):9-14.] The researchers brought in 48 male bank supervisors and asked them to evaluate personnel files. Based on their review, they were to determine if the person was qualified for promotion to branch manager. The trick is that all the files were identical, but half listed the candidate as male and half listed the candidate as female. The files were randomly assigned to the 48 supervisors.
The research question is whether the files supposedly belonging to males were recommended for promotion more than the files supposedly belonging to females.
##### Exercise 1 {-}
Is the study described above an observational study or an experiment? How do you know?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 2(a) {-}
Identify the sample in the study. In other words, how many people were in the sample, and what are the important characteristics common to those people?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 2(b) {-}
Identify the population of interest in the study. In other words, who is the sample supposed to represent? That is, what group of people is this study trying to learn about?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 2(c) {-}
In your opinion, does the sample from this study truly represent the population you identified above?
::: {.answer}
Please write up your answer here.
:::
## Exploratory data analysis {#randomization2-eda}
Here is the data:
```{r}
sex_discrimination
```
```{r}
glimpse(sex_discrimination)
```
##### Exercise 3 {-}
Which variable is the response variable and which variable is the predictor variable?
::: {.answer}
Please write up your answer here.
:::
*****
Here is a contingency table with `decision` as the row variable and `sex` as the column variable. (Recall that we always list the response variable first. That way, the column sums will show us how many are in each of the predictor groups.)
```{r}
tabyl(sex_discrimination, decision, sex) %>%
adorn_totals()
```
##### Exercise 4 {-}
Create another contingency table of `decision` and `sex`, this time with percentages (*not* proportions) instead of counts. You'll probably have to go back to the "Categorical data" chapter to review the syntax. (Hint: you should have three separate `adorn` functions on the lines following the `tabyl` command.)
::: {.answer}
```{r}
# Add code here to create a contingency table of percentages
```
:::
*****
Although we can read off the percentages in the contingency table, we need to do computations using the proportions. (Remember that we use percentages to communicate with other human beings, but we do math with proportions.) Fortunately, the output of `tabyl` is a tibble! So we can manipulate and grab the elements we need.
Let's create and store the `tabyl` output with proportions. We don't need the marginal distribution, so we can dispense with `adorn_totals`.
```{r}
decision_sex_tabyl <- tabyl(sex_discrimination, decision, sex) %>%
adorn_percentages("col")
decision_sex_tabyl
```
##### Exercise 5 {-}
Interpret these proportions in the context of the data. In other words, what do these proportions say about the male files that were recommended for promotion versus the female files recommended for promotion?
::: {.answer}
Please write up your answer here.
:::
*****
The real statistic of interest to us is the difference between these proportions. We can use the `mutate` command from `dplyr` to compute the difference for us.
```{r}
decision_sex_tabyl %>%
mutate(diff = male - female)
```
As a matter of fact, once we know the difference in promotion rates, we don't really need the individual proportions anymore. The `transmute` verb is a version of `mutate` that gives us exactly what we want. It will create a new column just like `mutate`, but then it keeps only that new column. We'll call the resulting output `decision_sex_diff`.
```{r}
decision_sex_diff <- decision_sex_tabyl %>%
transmute(diff = male - female)
decision_sex_diff
```
Notice the order of subtraction: we're doing the men's rates minus the women's rates.
This computes both the difference in promotion rates (in the first row) and the difference in not-promoted rates (in the second row). Let's just keep the first row, since we care more about promotion rates. (That's our success category.) We can use `slice` to grab the first row:
```{r}
decision_sex_diff %>%
slice(1)
```
This means that there is a 29% difference between the proportion of male files recommended for promotion and the proportion of female files recommended for promotion. The difference was computed as males minus females, so the fact that the number is positive means that male files were *more* likely to be recommended for promotion.
## Permuting {#randomization2-permuting}
One way to see if there is evidence of an association between promotion decisions and sex is to assume, temporarily, that there is no association. If there were truly no association, then the difference in promotion rates between the male files and the female files should be 0%. Of course, the number of people promoted in the data was 35, an odd number, so the number of male files promoted and the number of female files promoted cannot be the same. Therefore, the difference in proportions can't be exactly 0 in this data. Nevertheless, we would expect---under the assumption of no association---the number of male files promoted to be *close* to the number of female files promoted, giving a difference around 0%.
Now, we saw a difference of about 29% between the two groups in the data. Then again, non-zero differences---sometimes even large ones---can come about by pure chance alone. We may have accidentally sampled more bank managers who just happened to prefer the male candidates. This could happen for sexist reasons; it's possible our sample of bank managers is, by chance, more sexist than bank managers in the general population during the 1970s. Or it might be for more benign reasons; perhaps the male applications got randomly steered to bank managers who were more likely to be impressed with any application and, therefore, more likely to promote anyone regardless of the gender listed. We have to consider the possibility that our observed difference seems large even though there may have been no association between promotion and sex in the general population.
So how do we test the range of values that could arise from just chance alone? In other words, how do we explore sampling variability?
One way to force the variables to be independent is to "permute"---in other words, shuffle---the values of `sex` in our data. If we ignore the sex listed in the file and give it a random label (independent of the *actual* sex listed in the file), we know for sure that such an assignment is random and not due to any actual evidence of sexism. In that case, promotion is equally likely to occur in both groups.
Let's see how permuting works in R. To begin with, look at the actual values of `sex` in our data:
```{r}
sex_discrimination$sex
```
All the males happen to be listed first, followed by all the females.
Now we permute all the values around (using the `sample` command). As explained in an earlier chapter, we will set the seed so that our results are reproducible.
```{r}
set.seed(3141593)
sample(sex_discrimination$sex)
```
Do it again without the seed, just to make sure it's truly random:
```{r}
sample(sex_discrimination$sex)
```
## Randomization {#randomization2-randomization}
The idea here is to keep the promotion status the same for each file, but randomly permute the sex labels. There will still be the same number of male and female files, but now they will be randomly matched with promoted files and not promoted files. Since this new grouping into "males" and "females" is completely random and arbitrary, we expect the likelihood of promotion to be equal for both groups.
A more precise way of saying this is that the expected difference under the assumption of independent variables is 0%. If there were truly no association, then the percentage of people promoted would be independent of sex. However, sampling variability means that we are not likely to see an exact difference of 0%. (Also, as we mentioned earlier, the odd number of promotions means the difference will never be exactly 0% anyway in this data.) The real question, then, is how different could the difference be from 0% and still be reasonably possible due to random chance.
Let's perform a few random simulations. We'll walk through the steps one line at a time. The first thing we do is permute the `sex` column:
```{r}
set.seed(3141593)
sex_discrimination %>%
mutate(sex = sample(sex))
```
Then we follow the steps from earlier, generating a contingency table with proportions. This is accomplished by simply adding two lines of code to the previous code:
```{r}
set.seed(3141593)
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col")
```
Note that the proportions in this table are different from the ones in the real data.
Then we calculate the difference between the male and female columns by adding a line with `transmute`:
```{r}
set.seed(3141593)
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female)
```
In this case, the first row happens to be negative, but that's okay. This particular random shuffling had more females promoted than males. (Remember, though, that the permuted sex labels are now meaningless.)
Finally, we grab the entry in the first row with `slice`:
```{r}
set.seed(3141593)
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female) %>%
slice(1)
```
We'll repeat this code a few more times, but without the seed, to get new random observations.
```{r}
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female) %>%
slice(1)
```
```{r}
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female) %>%
slice(1)
```
```{r}
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female) %>%
slice(1)
```
```{r}
sex_discrimination %>%
mutate(sex = sample(sex)) %>%
tabyl(decision, sex) %>%
adorn_percentages("col") %>%
transmute(diff = male - female) %>%
slice(1)
```
Think carefully about what these random numbers mean. Each time we randomize, we get a simulated difference in the proportion of promotions between male files and female files. The `sample` part ensures that there is no actual relationship between promotion and sex among these randomized values. We expect each simulated difference to be close to zero, but we also expect deviations from zero due to randomness and chance.
## The `infer` package {#randomization2-infer}
The above code examples show the nuts and bolts of permuting data around to break any association that might exist between two variables. However, to do a proper randomization, we need to repeat this process many, many times (just like how we flipped thousands of "coins" in the last chapter).
Here we introduce some code from the `infer` package that will help us automate this procedure. The added benefit of introducing `infer` now is that we will continue to use it in nearly every chapter of the book that follows.
Here is the code template, starting with setting the seed:
```{r}
set.seed(3141593)
sims <- sex_discrimination %>%
specify(decision ~ sex, success = "promoted") %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in props", order = c("male", "female"))
sims
```
We will learn more about all these lines of code in future chapters. By the end of the course, running this type of analysis will be second nature. For now, you can copy and paste the code chunk above and make minor changes as needed. Here are the three things you will need to change when applying this to different data sets in the future:
1. The second line (after setting the seed) will be your new data set.
2. In the `specify` line, you will have a different response variable, predictor variable, and success condition that will depend on the context of your new data.
3. In the `calculate` line, you will have two different levels that you want to compare. Be careful to list them in the order in which you want to subtract them.
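Putting those three changes together, here is the same template as a generic sketch. All the names below (`my_data`, `response`, `predictor`, `"success_level"`, `"level_1"`, `"level_2"`) are hypothetical placeholders that you would replace with the variables and levels from your own data:

```
set.seed(3141593)
sims_new <- my_data %>%
    specify(response ~ predictor, success = "success_level") %>%
    hypothesize(null = "independence") %>%
    generate(reps = 1000, type = "permute") %>%
    calculate(stat = "diff in props", order = c("level_1", "level_2"))
```

The `order` argument determines the direction of subtraction: the proportion for `"level_1"` minus the proportion for `"level_2"`.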
## Plot results {#randomization2-plot}
A histogram will show us the range of possible values under the assumption of independence of the two variables. We can get one from our `infer` output using `visualize`. (This is a lot easier than building a histogram with `ggplot`!)
```{r}
sims %>%
visualize()
```
The bins aren't great in the picture above. There is currently no way to set the binwidth or boundary as we've done before, but we can experiment with the total number of bins. Nine seems to be a good number.
```{r}
sims %>%
visualize(bins = 9)
```
##### Exercise 6 {-}
Why is the mode of the graph above at 0? This has been explained several different times in this chapter, but put it into your own words to make sure you understand the logic behind the randomization.
::: {.answer}
Please write up your answer here.
:::
*****
Let's compare these simulated values to the observed difference in the real data. We've computed the latter already, but let's use `infer` tools to find it. We'll give the answer a name, `obs_diff`.
```{r}
obs_diff <- sex_discrimination %>%
observe(decision ~ sex, success = "promoted",
stat = "diff in props", order = c("male", "female"))
obs_diff
```
Now we can graph the observed difference in the data alongside the simulated values under the assumption of independent variables. The name of the function `shade_p_value` is a little cryptic for now, but it will become clear within a few chapters.
```{r}
sims %>%
visualize(bins = 9) +
shade_p_value(obs_stat = obs_diff, direction = "greater")
```
## By chance? {#randomization2-chance}
How likely is it that the observed difference (or a difference even more extreme) could have resulted from chance alone? Because `sims` contains simulated results after permuting, the values in the `stat` column assume that promotion is independent of sex. To assess how plausible our observed difference is under that assumption, we want to find out how many of the simulated values are at least as big as the observed difference, 0.292.
Look at the randomized differences sorted in decreasing order:
```{r}
sims %>%
arrange(desc(stat))
```
Of the 1000 simulations, the most extreme difference of 37.5% occurred four times, just by chance. That seems like a pretty extreme value when we expect a value of 0%, but the laws of probability tell us that extreme values will be observed from time to time, even if rarely. Also recall that the observed difference in the actual data was 29.2%. This specific value came up quite a bit in our simulated data. In fact, the 31st entry of the sorted data above is the last occurrence of the value 0.292. After that, the next value down is 0.208.
So let's return to the original question. How many simulated values are as large---if not larger---than the observed difference? Apparently, 31 out of 1000, which is 0.031. In other words, about 3% of the simulated data is as extreme as or more extreme than the actual difference in promotion rates between male files and female files in the real data. That's not very large. A difference like 29.2% could occur just by chance---like flipping 10 heads out of 10---but it doesn't happen very often.
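One way to sketch this count directly (assuming `sims` and `obs_diff` from above) uses the fact that taking the `mean` of a logical vector gives the proportion of `TRUE` values:

```{r}
# Proportion of simulated differences at least as large as the observed one.
# pull extracts the single number from the obs_diff tibble.
sims %>%
    summarize(p = mean(stat >= pull(obs_diff)))
```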
We can automate this calculation using the function `get_p_value` (similar to `shade_p_value` above) even though we don't yet know what "p value" means.
```{r}
sims %>%
get_p_value(obs_stat = obs_diff, direction = "greater")
```
**COPY/PASTE WARNING**: If the observed difference were negative, then extreme values of interest would be *less* than, say, -0.292, not greater than 0.292. You must note if the observed difference is positive or negative and then use "greater" or "less" as appropriate!
Again, 0.031 is a small number. This shows us that if there were truly no association between promotion and sex, then our data is a rare event. (An observed difference this extreme or more extreme would only occur about 3% of the time by chance.)
Because the probability above is so small, it seems unlikely that our variables are independent. Therefore, it seems more likely that there is an association between promotion and sex. We have evidence of a statistically significant difference between the chance of getting recommended for promotion if the file indicates male versus female.
Because this is an experiment, it's possible that a causal claim could be made. If everything in the application files was identical except the indication of gender, then it stands to reason that gender *explains* why male files were recommended for promotion more often than female files. But all of that depends on the experiment being well designed.
##### Exercise 7 {-}
Although we are not experts in experimental design, what concerns do you have about generalizing the results of this experiment to broad conclusions about sexism in the 1970s?
(To be clear, I'm not saying that sexism wasn't a broad problem in the 1970s. It surely was---and still is. I'm only asking you to opine as to why the results of this one study might not be conclusive in making an overly broad statement.)
::: {.answer}
Please write up your answer here.
:::
## Your turn {#randomization2-your-turn}
In this section, you'll explore another famous data set related to the topic of gender discrimination. (Also from the 1970s!)
The following code will download admissions data from the six largest graduate departments at the University of California, Berkeley in 1973. We've seen the `read_csv` command before, but we've added some extra stuff in there to make sure all the columns get imported as factor variables (rather than having to convert them ourselves later).
```{r}
ucb_admit <- read_csv("https://vectorposse.github.io/intro_stats/data/ucb_admit.csv",
col_types = list(
Admit = col_factor(),
Gender = col_factor(),
Dept = col_factor()))
```
```{r}
ucb_admit
```
```{r}
glimpse(ucb_admit)
```
As you go through the exercises below, you should carefully copy and paste commands from earlier in the chapter, making the necessary changes.
**Remember that R is case sensitive! In the `sex_discrimination` data, all the variables and levels started with lowercase letters. In the `ucb_admit` data, they all start with uppercase letters, so you'll need to be careful to change that after you copy and paste code examples from above.**
##### Exercise 8(a) {-}
Is this data observational or experimental? How do you know?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 8(b) {-}
Exploratory data analysis: make two contingency tables with `Admit` as the response variable and `Gender` as the explanatory variable. One table should have counts and the other table should have percentages. (Both tables should include the marginal distribution at the bottom.)
::: {.answer}
```{r}
# Add code here to make a contingency table with counts.
```
```{r}
# Add code here to make a contingency table with percentages.
```
:::
##### Exercise 8(c) {-}
Use `observe` from the `infer` package to calculate the observed difference in proportions between males who were admitted and females who were admitted. Do the subtraction in that order: males minus females. Store your output as `obs_diff2` so that it doesn't overwrite the variable `obs_diff` we created earlier.
::: {.answer}
```{r}
# Add code here to calculate the observed difference.
# Store this as obs_diff2.
```
:::
##### Exercise 8(d) {-}
Simulate 1000 outcomes under the assumption that admission is independent of gender. Use the `specify`, `hypothesize`, `generate`, and `calculate` sequence from the `infer` package as above. Call the simulated data frame `sims2` so that it doesn't conflict with the earlier `sims`. Don't touch the `set.seed` command. That will ensure that all students get the same randomization.
::: {.answer}
```{r}
set.seed(10101)
# Add code here to simulate 1000 outcomes
# under the independence assumption
# and store the simulations in a data frame called sims2.
```
:::
##### Exercise 8(e) {-}
Plot the simulated values in a histogram using the `visualize` verb from `infer`. When you first run the code, remove the `bins = 9` we had earlier and let `visualize` choose the number of bins. If you are satisfied with the graph, you don't need to specify a number of bins. If you are not satisfied, you can experiment with the number of bins until you find a number that seems reasonable.
Be sure to include a vertical line at the value of the observed difference using the `shade_p_value` command. Don't forget that the location of that line is `obs_diff2` now.
::: {.answer}
```{r}
# Add code here to plot the results.
```
:::
##### Exercise 8(f) {-}
Finally, comment on what you see. Based on the histogram above, is the observed difference in the data rare? In other words, under the assumption that admission and gender are independent, are we likely to see an observed difference as far away from zero as we actually see in the data? So what is your conclusion then? Do you believe there was an association between admission and gender in the UC Berkeley admissions process in 1973?
::: {.answer}
Please write up your answer here.
:::
## Simpson's paradox {#randomization2-simpson}
The example above from UC Berkeley seems like an open and shut case. Male applicants were clearly admitted at a greater rate than female applicants. While we never expect the admission rates to be *exactly* equal---even under the assumption that admission and gender are independent---the randomization exercise showed us that the observed data was *way* outside the range of possible differences that could have occurred just by chance.
But we also know this is observational data. Association is not causation.
##### Exercise 9 {-}
Note that we didn't say "correlation is not causation". The latter is also true, but why does it not apply in this case? (Think about the conditions for correlation.)
::: {.answer}
Please write up your answer here.
:::
*****
Since we don't have data from a carefully controlled experiment, we always have to be worried about lurking variables. Could there be a third variable apart from admission and gender that could be driving the association between them? In other words, the fact that males were admitted at a higher rate than females might be sexism, or it might be spurious.
Since we have access to a third variable, `Dept`, let's analyze it as well. The `tabyl` command will happily take a third variable and create a *set* of contingency tables, one for each department.
Here are the tables with counts:
```{r}
tabyl(ucb_admit, Admit, Gender, Dept) %>%
adorn_totals()
```
And here are the tables with percentages:
```{r}
tabyl(ucb_admit, Admit, Gender, Dept) %>%
adorn_totals() %>%
adorn_percentages("col") %>%
adorn_pct_formatting()
```
##### Exercise 10 {-}
Look at the contingency tables with percentages. Examine each department individually. What do you notice about the admit rates (as percentages) between males and females for most of the departments listed? Identify the four departments where female admission rates were higher than male admission rates.
::: {.answer}
Please write up your answer here.
:::
*****
This is completely counterintuitive. How can males be admitted at a higher rate overall, and yet in most departments, females were admitted at a higher rate?
This phenomenon is often called *Simpson's Paradox*. Like almost everything in statistics, this is named after a person (Edward H. Simpson) who got the popular credit for writing about the phenomenon, despite not being the person who actually discovered it. (There does not appear to be a definitive original reference for the first person to have studied it; similar observations had appeared in various sources more than 50 years before Simpson wrote his paper.)
##### Exercise 11 {-}
Look at the contingency tables with counts. Focus on the four departments you identified above. What is true of the total number of male and female applicants for those four departments (and not for the other two departments)?
::: {.answer}
Please write up your answer here.
:::
##### Exercise 12(a) {-}
Now create a contingency table with percentages that uses `Admit` for the row variable and `Dept` as the column variable.
::: {.answer}
```{r}
# Add code here to create a contingency table with percentages
# for Dept and Admit
```
:::
##### Exercise 12(b) {-}
According to the contingency table above, which two departments were (by far) the least selective? (In other words, which two departments admitted a vast majority of their applicants?)
::: {.answer}
Please write up your answer here.
:::
##### Exercise 12(c) {-}
Earlier, you identified four departments where male applicants outnumbered female applicants. (These were the same departments that had higher admission rates for females.) But for which two departments was the difference between the number of male and female applicants the largest?
::: {.answer}
Please write up your answer here.
:::
*****
Your work in the previous exercises begins to paint a picture that explains what's going on with this "paradox". Males applied in much greater numbers to a few departments with high acceptance rates. As a result, more male students overall got into graduate school. Females applied in greater numbers to departments that were more selective. Overall, then, fewer females got into graduate school. But on a department-by-department basis, female applicants were usually more likely to get accepted.
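A toy calculation (with hypothetical numbers, not the Berkeley data) shows how this mechanism can produce a large aggregate gap even when each department treats both groups identically:

```{r}
# Hypothetical example: Dept A admits 80% of all applicants, Dept B admits 20%.
# Suppose 90 males and 10 females apply to A, while 10 males and 90 females apply to B.
male_admitted   <- 0.8 * 90 + 0.2 * 10   # 74 of 100 males admitted overall
female_admitted <- 0.8 * 10 + 0.2 * 90   # 26 of 100 females admitted overall
c(male = male_admitted / 100, female = female_admitted / 100)
```

Even with identical per-department admission rates, the aggregate rates differ dramatically simply because of where each group applied.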
None of this suggests that sexism fails to exist. It doesn't even prove that sexism wasn't a factor in some departmental admission procedures. What it does suggest is that when we don't take into account possible lurking variables, we run the risk of oversimplifying issues that are potentially complex.
In our analysis of the UC Berkeley data, we've exhausted all the variables available to us in the data set. There remains the potential for *unmeasured confounders*: variables that could still act as lurking variables but that we know nothing about because they aren't in our data. This is an unavoidable peril of working with observational data. If we can't "control" for a reasonable set of possible lurking variables, we must be cautious when trying to draw broad conclusions.
## Conclusion {#randomization2-conclusion}
Here we used randomization to explore the idea of two variables being independent or associated. When we assume they are independent, we can explore the sampling variability of the differences that could occur by pure chance alone. We expect the difference to be zero, but we know that randomness will cause the simulated differences to have a range of values. If the difference in the observed data is far away from zero, we have evidence that the variables are not independent; in other words, it is more likely that our variables are associated.
### Preparing and submitting your assignment {#randomization2-prep}
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1--2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.