# Data
Data can be defined broadly as any set of values, facts, or statistics that can be used for reference, analysis, and drawing inferences. In research, data drives the process of understanding phenomena, testing hypotheses, and formulating evidence-based conclusions. Choosing the right type of data (and understanding its strengths and limitations) is critical for the validity and reliability of findings.
## Data Types
### Qualitative vs. Quantitative Data
A foundational way to categorize data is by whether it is **qualitative** (non-numerical) or **quantitative** (numerical). These distinctions often guide research designs, data collection methods, and analytical techniques.
| **Qualitative** | **Quantitative** |
|-----------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| **Examples**: In-depth interviews, focus groups, case studies, ethnographies, open-ended questions, field notes | **Examples**: Surveys with closed-ended questions, experiments, numerical observations, structured interviews |
| **Nature**: Text-based, often descriptive, subjective interpretations | **Nature**: Numeric, more standardized, objective measures |
| **Analysis**: Thematic coding, content analysis, discourse analysis | **Analysis**: Statistical tests, regression, hypothesis testing, descriptive statistics |
| **Outcome**: Rich context, detailed understanding of phenomena | **Outcome**: Measurable facts, generalizable findings (with appropriate sampling and design) |
#### Uses and Advantages of Qualitative Data
- **Deep Understanding**: Captures context, motivations, and perceptions in depth.
- **Flexibility**: Elicits new insights through open-ended inquiry.
- **Inductive Approaches**: Often used to build new theories or conceptual frameworks.
#### Uses and Advantages of Quantitative Data
- **Measurement and Comparison**: Facilitates measuring variables and comparing across groups or over time.
- **Generalizability**: With proper sampling, findings can often be generalized to broader populations.
- **Hypothesis Testing**: Permits the use of statistical methods to test specific predictions or relationships.
#### Limitations of Qualitative and Quantitative Data
- **Qualitative**:
- Findings may be difficult to generalize if samples are small or non-representative.
- Analysis can be time-consuming due to coding and interpreting text.
- Potential for researcher bias in interpretation.
- **Quantitative**:
- May oversimplify complex human behaviors or contextual factors by reducing them to numbers.
- Validity depends heavily on how well constructs are operationalized.
- Can miss underlying meanings or nuances not captured in numeric measures.
#### Levels of Measurement
Even within **quantitative** data, there are further distinctions based on the *level of measurement*. This classification is crucial for determining which statistical techniques are appropriate:
1. **Nominal**: Categorical data with no inherent order (e.g., gender, blood type, eye color).
2. **Ordinal**: Categorical data with a specific order or ranking but without consistent intervals between ranks (e.g., Likert scale responses: "strongly disagree," "disagree," "neutral," "agree," "strongly agree").
3. **Interval**: Numeric data with equal intervals but no true zero (e.g., temperature in Celsius or Fahrenheit).
4. **Ratio**: Numeric data with equal intervals and a meaningful zero (e.g., height, weight, income).
The level of measurement affects which statistical tests (like t-tests, ANOVA, correlations, regressions) are valid and how you can interpret differences or ratios in the data.
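As a minimal illustration (with made-up values), the four levels map naturally onto R's data types: unordered factors for nominal data, ordered factors for ordinal data, and plain numeric vectors for interval and ratio data.
```{r}
# Illustrative encodings of the four levels of measurement (hypothetical values)
nominal  <- factor(c("A", "B", "O", "AB"))                   # blood type: no inherent order
ordinal  <- factor(c("disagree", "neutral", "agree"),
                   levels  = c("disagree", "neutral", "agree"),
                   ordered = TRUE)                           # ranked, but intervals undefined
interval <- c(20.5, 23.1, 18.9)                              # temperature in Celsius: no true zero
ratio    <- c(55.2, 80.4, 67.8)                              # weight in kg: meaningful zero
str(nominal)
str(ordinal)
```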
------------------------------------------------------------------------
### Other Ways to Classify Data
Beyond the qualitative/quantitative distinction, there are several other dimensions along which data can be classified:
#### Primary vs. Secondary Data
- **Primary Data**: Collected directly by the researcher for a specific purpose (e.g., firsthand surveys, experiments, direct measurements).
- **Secondary Data**: Originally gathered by someone else for a different purpose (e.g., government census data, administrative records, previously published datasets).
#### Structured, Semi-Structured, and Unstructured Data
- **Structured Data**: Organized in a predefined manner, typically in rows and columns (e.g., spreadsheets, relational databases).
- **Semi-Structured Data**: Contains organizational markers but not strictly tabular (e.g., JSON, XML logs, HTML).
- **Unstructured Data**: Lacks a clear, consistent format (e.g., raw text, images, videos, audio files).
- Often analyzed using natural language processing (NLP), image recognition, or other advanced techniques.
#### Big Data
- Characterized by the "3 Vs": **Volume** (large amounts), **Variety** (diverse forms), and **Velocity** (high-speed generation).
- Requires specialized computational tools (e.g., Hadoop, Spark) and often cloud-based infrastructure for storage and processing.
- Can be structured or unstructured (e.g., social media feeds, sensor data, clickstream data).
#### Internal vs. External Data (in Organizational Contexts)
- **Internal Data**: Generated within an organization (e.g., sales records, HR data, production metrics).
- **External Data**: Sourced from outside (e.g., macroeconomic indicators, market research reports, social media analytics).
#### Proprietary vs. Public Data
- **Proprietary Data**: Owned by an organization or entity, not freely available for public use.
- **Public/Open Data**: Freely accessible data provided by governments, NGOs, or other institutions (e.g., data.gov, World Bank Open Data).
### Data by Observational Structure Over Time
Another primary way to categorize data is by *how* observations are collected over time. This classification shapes research design, analytic methods, and the types of inferences we can make. The four major types are:
1. [Cross-Sectional Data](#sec-cross-sectional-data)
2. [Time Series Data](#sec-time-series-data)
3. [Repeated Cross-Sectional Data](#sec-repeated-cross-sectional-data)
4. [Panel (Longitudinal) Data](#sec-panel-data)
| Type | Advantages | Limitations |
|---------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------|
| [Cross-Sectional Data](#sec-cross-sectional-data) | Simple, cost-effective, good for studying distributions or correlations at a single time point. | Lacks temporal information, can only infer associations, not causal links. |
| [Time Series Data](#sec-time-series-data) | Enables trend analysis, seasonality detection, and forecasting. | Requires handling autocorrelation, stationarity issues, and structural breaks. |
| [Repeated Cross-Sectional Data](#sec-repeated-cross-sectional-data) | Tracks shifts in population-level parameters over time; simpler than panel data. | Cannot track individual changes; comparability depends on consistent methodology. |
| [Panel (Longitudinal) Data](#sec-panel-data) | Allows causal inference, controls for unobserved heterogeneity, tracks individual trajectories. | Expensive, prone to attrition, requires complex statistical methods. |
------------------------------------------------------------------------
## Cross-Sectional Data {#sec-cross-sectional-data}
Cross-sectional data consists of observations on **multiple entities** (e.g., individuals, firms, regions, or countries) at **a single point in time** or over a very short period, where **time is not a primary dimension** of variation.
- Each observation represents a **different entity**, rather than the same entity tracked over time.
- Unlike time series data, the order of observations does not carry temporal meaning.
Examples
- **Labor Economics**: Wage and demographic data for 1,000 workers in 2024.
- **Marketing Analytics**: Customer satisfaction ratings and purchasing behavior for 500 online shoppers surveyed in Q1 of a year.
- **Corporate Finance**: Financial statements of 1,000 firms for the fiscal year 2023.
Key Characteristics
- **Observations are independent (in an ideal setting)**: Each unit is drawn from a population with no intrinsic dependence on others.
- **No natural ordering**: Unlike time series data, the sequence of observations does not affect analysis.
- **Variation occurs across entities, not over time**: Differences in observed outcomes arise from differences between individuals, firms, or regions.
Advantages
- **Straightforward Interpretation**: Since time effects are not present, the focus remains on relationships between variables at a single point.
- **Easier to Collect and Analyze**: Compared to time series or panel data, cross-sectional data is often simpler to collect and model.
- **Suitable for causal inference** (if exogeneity conditions hold).
Challenges
- **Omitted Variable Bias**: Unobserved confounders may drive both the dependent and independent variables.
- **Endogeneity**: Reverse causality or measurement error can introduce bias.
- **Heteroskedasticity**: Variance of errors may differ across entities, requiring robust standard errors.
------------------------------------------------------------------------
A typical cross-sectional regression model:
$$
y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{i(k-1)}\beta_{k-1} + \epsilon_i
$$
where:
- $y_i$ is the outcome variable for entity $i$,
- $x_{ij}$ are explanatory variables,
- $\epsilon_i$ is an error term capturing unobserved factors.
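A small simulated sketch of estimating such a model with `lm()`; the coefficient values used to generate the data are arbitrary.
```{r}
set.seed(1)
n  <- 1000                                   # number of entities
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)     # assumed data-generating process

cross_section_fit <- lm(y ~ x1 + x2)
summary(cross_section_fit)
```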
------------------------------------------------------------------------
## Time Series Data {#sec-time-series-data}
Time series data consists of observations on the *same variable(s)* recorded over multiple time periods for a single entity (or aggregated entity). These data points are typically collected at consistent intervals---hourly, daily, monthly, quarterly, or annually---allowing for the analysis of trends, patterns, and forecasting.
Examples
- **Stock Market**: Daily closing prices of a company's stock over five years.
- **Economics**: Monthly unemployment rates in a country over a decade.
- **Macroeconomics**: Annual GDP of a country from 1960 to 2020.
Key Characteristics
- The primary goal is to analyze trends, seasonality, cyclic patterns, and forecast future values.
- Time series data requires specialized statistical methods, such as:
- **Autoregressive Integrated Moving Average (ARIMA)**
- **Seasonal ARIMA (SARIMA)**
- **Exponential Smoothing**
- **Vector Autoregression (VAR)**
Advantages
- Captures temporal patterns such as trends, seasonal fluctuations, and economic cycles.
- Essential for forecasting and policy-making, such as setting interest rates based on economic indicators.
Challenges
- **Autocorrelation**: Observations close in time are often correlated.
- **Structural Breaks**: Sudden changes due to policy shifts or economic crises can distort analysis.
- **Seasonality**: Must be accounted for to avoid misleading conclusions.
------------------------------------------------------------------------
A time series typically consists of four key components:
1. **Trend**: Long-term directional movement in the data over time.
2. **Seasonality**: Regular, periodic fluctuations (e.g., increased retail sales in December).
3. **Cyclical Patterns**: Long-term economic cycles that are irregular but recurrent.
4. **Irregular (Random) Component**: Unpredictable variations not explained by trend, seasonality, or cycles.
------------------------------------------------------------------------
A general linear time series model can be expressed as:
$$
y_t = \beta_0 + x_{t1}\beta_1 + x_{t2}\beta_2 + \dots + x_{t(k-1)}\beta_{k-1} + \epsilon_t
$$
Some Common Model Types
1. **Static Model**
A simple time series regression:
$$
y_t = \beta_0 + x_{t1}\beta_1 + x_{t2}\beta_2 + x_{t3}\beta_3 + \epsilon_t
$$
2. **Finite Distributed Lag Model**
Captures the effect of past values of an explanatory variable:
$$
y_t = \beta_0 + pe_t\delta_0 + pe_{t-1}\delta_1 + pe_{t-2}\delta_2 + \epsilon_t
$$
- **Long-Run Propensity**: Measures the cumulative effect of explanatory variables over time:
$$
LRP = \delta_0 + \delta_1 + \delta_2
$$
3. **Dynamic Model**
A model incorporating lagged dependent variables:
$$
GDP_t = \beta_0 + \beta_1 GDP_{t-1} + \epsilon_t
$$
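As a rough sketch of the dynamic model above, the chunk below simulates a persistent series and regresses it on its own first lag; the intercept and autoregressive coefficient are arbitrary choices.
```{r}
set.seed(2)
n_obs  <- 200
gdp    <- numeric(n_obs)
gdp[1] <- 100
for (t in 2:n_obs) {
  gdp[t] <- 5 + 0.7 * gdp[t - 1] + rnorm(1)  # assumed AR(1)-type process
}

# Dynamic model: regress the series on its first lag
dynamic_fit <- lm(gdp[-1] ~ gdp[-n_obs])
summary(dynamic_fit)
```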
------------------------------------------------------------------------
### Statistical Properties of Time Series Models
For time series regression, standard OLS assumptions must be carefully examined. The following conditions affect estimation:
Finite Sample Properties
- **A1-A3**: OLS remains unbiased.
- **A1-A4**: Standard errors are consistent, and the [Gauss-Markov Theorem] holds (OLS is BLUE).
- **A1-A6**: Finite sample [Wald tests](#sec-wald-test) (e.g., t-tests and F-tests) remain valid.
However, in time series settings, [A3](#a3-exogeneity-of-independent-variables) often fails due to:
- **Spurious Time Trends** (fixable by including a time trend)
- **Strict vs. Contemporaneous Exogeneity** (sometimes unavoidable)
------------------------------------------------------------------------
### Common Time Series Processes
Several key models describe different time series behaviors:
- **Autoregressive Model (AR(p))**: A process where current values depend on past values.
- **Moving Average Model (MA(q))**: A process where past error terms influence current values.
- **Autoregressive Moving Average (ARMA(p, q))**: A combination of AR and MA processes.
- **Autoregressive Conditional Heteroskedasticity (ARCH(p))**: Models time-varying volatility.
- **Generalized ARCH (GARCH(p, q))**: Extends ARCH by including past conditional variances.
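The base-R `arima.sim()` and `arima()` functions can simulate and fit several of these processes (ARCH/GARCH models require dedicated packages and are not shown); the parameter values below are arbitrary.
```{r}
set.seed(3)
ar1_sim  <- arima.sim(n = 200, model = list(ar = 0.6))            # AR(1)
ma1_sim  <- arima.sim(n = 200, model = list(ma = 0.4))            # MA(1)
arma_sim <- arima.sim(n = 200, model = list(ar = 0.6, ma = 0.4))  # ARMA(1,1)

# Fit an ARMA(1,1) back to the simulated ARMA series
arima(arma_sim, order = c(1, 0, 1))
```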
------------------------------------------------------------------------
### Deterministic Time Trends
When both the dependent and independent variables exhibit trending behavior, a regression may produce **spurious** results.
**Spurious Regression Example**
A simple regression with trending variables:
$$
y_t = \alpha_0 + t\alpha_1 + v_t
$$
$$
x_t = \lambda_0 + t\lambda_1 + u_t
$$
where
- $\alpha_1 \neq 0$ and $\lambda_1 \neq 0$
- $v_t$ and $u_t$ are independent.
Despite no true relationship between $x_t$ and $y_t$, estimating:
$$
y_t = \beta_0 + x_t\beta_1 + \epsilon_t
$$
results in:
- **Inconsistency**: $plim(\hat{\beta}_1) = \frac{\alpha_1}{\lambda_1}$
- **Invalid Inference**: $|t| \overset{d}{\to} \infty$ for $H_0: \beta_1=0$, leading to rejection of the null hypothesis as $n \to \infty$.
- **Misleading** $R^2$: $plim(R^2) = 1$, falsely implying perfect predictive power.
We can also rewrite the equation as:
$$
\begin{aligned}
y_t &=\beta_0 + \beta_1 x_t + \epsilon_t \\
\epsilon_t &= \alpha_1 t + v_t
\end{aligned}
$$
where $\beta_0 = \alpha_0$ and $\beta_1 = 0$. Since $x_t$ is a deterministic function of time, $\epsilon_t$ is correlated with $x_t$, leading to the usual omitted variable bias.
**Solutions to Spurious Trends**
1. **Include a Time Trend** ($t$) as a Control Variable
- Provides consistent parameter estimates and valid inference.
2. **Detrend Variables**
- Regress both $y_t$ and $x_t$ on time, then use residuals in a second regression.
- Equivalent to applying the [Frisch-Waugh-Lovell Theorem](#Frisch–Waugh–Lovell%20Theorem).
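A quick simulation of the spurious-trend problem and the time-trend fix; all parameter values are arbitrary.
```{r}
set.seed(4)
n_t     <- 200
t_trend <- 1:n_t
y_trend <- 1 + 0.05 * t_trend + rnorm(n_t)   # trending y, unrelated to x
x_trend <- 2 + 0.03 * t_trend + rnorm(n_t)   # trending x

# Spurious regression: x appears "significant" purely because both series trend
summary(lm(y_trend ~ x_trend))

# Including the time trend as a control removes the spurious association
summary(lm(y_trend ~ x_trend + t_trend))
```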
### Violations of Exogeneity in Time Series Models
The **exogeneity assumption** ([A3](#a3-exogeneity-of-independent-variables)) plays a crucial role in ensuring unbiased and consistent estimation in time series models. However, in many cases, the assumption is **violated** due to the inherent nature of time-dependent processes.
In a standard regression framework, we assume:
$$
E(\epsilon_t | x_1, x_2, ..., x_T) = 0
$$
which requires that the error term is uncorrelated with all past, present, and future values of the independent variables.
**Common Violations of Exogeneity**
1. [Feedback Effect](#sec-feedback-effect)
- The error term $\epsilon_t$ **influences future values** of the independent variables.
- A classic example occurs in economic models where past shocks affect future decisions.
2. [Dynamic Specification](#sec-dynamic-specification)
- The dependent variable includes a **lagged version of itself** as an explanatory variable, introducing correlation between $\epsilon_t$ and past $y_{t-1}$.
3. [Dynamic Completeness](#sec-dynamic-completeness-and-omitted-lags)
- In finite distributed lag (FDL) models, failing to include the **correct number of lags** leads to omitted variable bias and correlation between regressors and errors.
------------------------------------------------------------------------
#### Feedback Effect {#sec-feedback-effect}
In a simple regression model:
$$
y_t = \beta_0 + x_t \beta_1 + \epsilon_t
$$
the standard exogeneity assumption ([A3](#a3-exogeneity-of-independent-variables)) requires:
$$
E(\epsilon_t | x_1, x_2, ..., x_t, x_{t+1}, ..., x_T) = 0
$$
However, in the presence of feedback, past errors affect future values of $x_t$, leading to:
$$
E(\epsilon_t | x_{t+1}, ..., x_T) \neq 0
$$
- This occurs when **current shocks** (e.g., economic downturns) influence **future decisions** (e.g., government spending, firm investments).
- Strict exogeneity is **violated**, as we now have dependence across time.
**Implication**:
- Standard OLS estimators become **biased and inconsistent**.
- One common solution is using [Instrumental Variables] to isolate exogenous variation in $x_t$.
------------------------------------------------------------------------
#### Dynamic Specification {#sec-dynamic-specification}
A **dynamically specified model** includes lagged dependent variables:
$$
y_t = \beta_0 + y_{t-1} \beta_1 + \epsilon_t
$$
Exogeneity ([A3](#a3-exogeneity-of-independent-variables)) would require:
$$
E(\epsilon_t | y_1, y_2, ..., y_t, y_{t+1}, ..., y_T) = 0
$$
However, since $y_{t-1}$ depends on $\epsilon_{t-1}$ from the previous period, we obtain:
$$
Cov(y_{t-1}, \epsilon_t) \neq 0
$$
**Implication**:
- Strict exogeneity ([A3](#a3-exogeneity-of-independent-variables)) fails, as $y_{t-1}$ and $\epsilon_t$ are correlated.
- OLS estimates are biased and inconsistent.
- Standard **autoregressive models (AR)** require alternative estimation techniques like **Generalized Method of Moments** or [Maximum Likelihood] Estimation.
------------------------------------------------------------------------
#### Dynamic Completeness and Omitted Lags {#sec-dynamic-completeness-and-omitted-lags}
A finite distributed lag (FDL) model:
$$
y_t = \beta_0 + x_t \delta_0 + x_{t-1} \delta_1 + \epsilon_t
$$
assumes that the included lags **fully capture** the relationship between $y_t$ and past values of $x_t$. However, if we omit relevant lags, the exogeneity assumption ([A3](#a3-exogeneity-of-independent-variables)):
$$
E(\epsilon_t | x_1, x_2, ..., x_t, x_{t+1}, ..., x_T) = 0
$$
**fails**, as unmodeled lag effects create correlation between $x_{t-2}$ and $\epsilon_t$.
**Implication**:
- The regression suffers from **omitted variable bias**, making OLS estimates unreliable.
- **Solution**:
- Include **additional lags** of $x_t$.
- Use **lag selection criteria** (e.g., AIC, BIC) to determine the appropriate lag structure.
------------------------------------------------------------------------
### Consequences of Exogeneity Violations
If strict exogeneity ([A3](#a3-exogeneity-of-independent-variables)) fails, standard OLS assumptions no longer hold:
- OLS is biased.
- [Gauss-Markov Theorem] no longer applies.
- [Finite Sample Properties] (such as unbiasedness) are invalid.
To address these issues, we can:
1. Rely on [Large Sample Properties]: Under certain conditions, consistency may still hold.
2. Use Weaker Forms of Exogeneity: Shift from **strict exogeneity** ([A3](#a3-exogeneity-of-independent-variables)) to **contemporaneous exogeneity** ([A3a](#a3a-weak-exogeneity)).
------------------------------------------------------------------------
If strict exogeneity does not hold, we can instead assume [A3a](#a3a-weak-exogeneity) (Contemporaneous Exogeneity):
$$
E(\mathbf{x}_t' \epsilon_t) = 0
$$
This weaker assumption only requires that $x_t$ is uncorrelated with the error in the same time period.
**Key Differences from Strict Exogeneity**
| Exogeneity Type | Requirement | Allows Dynamic Models? |
|----------------------------|------------------------------------------|------------------------|
| Strict Exogeneity | $E(\epsilon_t \mid x_1, x_2, ..., x_T) = 0$ | **No** |
| Contemporaneous Exogeneity | $E(\mathbf{x}_t' \epsilon_t) = 0$ | **Yes** |
- With contemporaneous exogeneity, $\epsilon_t$ can be correlated with past and future values of $x_t$.
- This allows for dynamic specifications such as:
$$
y_t = \beta_0 + y_{t-1} \beta_1 + \epsilon_t
$$
while still maintaining **consistency** under certain assumptions.
------------------------------------------------------------------------
**Deriving [Large Sample Properties] for [Time Series](#sec-time-series-data)**
To establish **consistency** and **asymptotic normality**, we rely on the following assumptions:
- [A1](#a1-linearity): Linearity
- [A2](#a2-full-rank): Full Rank (No Perfect Multicollinearity)
- [A3a](#a3a-weak-exogeneity): Contemporaneous Exogeneity
However, the standard [Weak Law of Large Numbers] and [Central Limit Theorem] in OLS depend on [A5](#a5-data-generation-random-sampling) (Random Sampling), which does **not** hold in time series settings.
Since time series data exhibits dependence over time, we replace [A5](#a5-data-generation-random-sampling) (Random Sampling) with a weaker assumption:
- [A5a](#A5a-stationarity-and-weak-dependence-in-time-series): Weak Dependence (Stationarity)
**Asymptotic Variance and Serial Correlation**
- The derivation of **asymptotic variance** depends on [A4](#a4-homoskedasticity) (Homoskedasticity).
- However, in time series settings, we often encounter **serial correlation**:
$$
Cov(\epsilon_t, \epsilon_s) \neq 0 \quad \text{for} \quad |t - s| > 0
$$
- To ensure valid inference, standard errors must be corrected using methods such as [Newey-West HAC](#sec-newey-west-standard-errors) estimators.
------------------------------------------------------------------------
### Highly Persistent Data
In time series analysis, a key assumption for OLS consistency is that the data-generating process exhibits [A5a](#A5a-stationarity-and-weak-dependence-in-time-series) weak dependence (i.e., observations are not too strongly correlated over time). However, when $y_t$ and $x_t$ are highly persistent, standard OLS assumptions break down.
If a time series is **not weakly dependent**, it means:
- $y_t$ and $y_{t-h}$ remain strongly correlated even for large lags ($h \to \infty$).
- [A5a](#A5a-stationarity-and-weak-dependence-in-time-series) (Weak Dependence Assumption) fails, leading to:
- **OLS inconsistency**.
- **No valid limiting distribution** (asymptotic normality does not hold).
**Example:** A classic example of a highly persistent process is a random walk:
$$
y_t = y_{t-1} + u_t
$$
or with drift:
$$
y_t = \alpha + y_{t-1} + u_t
$$
where $u_t$ is a white noise error term.
- $y_t$ does not revert to a mean---its variance grows without bound as $t \to \infty$.
- **Shocks accumulate**, making standard regression analysis unreliable.
------------------------------------------------------------------------
#### Solution: First Differencing
A common way to transform non-stationary series into stationary ones is through first differencing:
$$
\Delta y_t = y_t - y_{t-1} = u_t
$$
- If $u_t$ is a weakly dependent process (i.e., $I(0)$, stationary), then $y_t$ is said to be **difference-stationary** or **integrated of order 1,** $I(1)$.
- If both $y_t$ and $x_t$ follow a random walk ($I(1)$), we estimate:
$$
\begin{aligned}
\Delta y_t &= (\Delta \mathbf{x}_t \beta) + (\epsilon_t - \epsilon_{t-1}) \\
\Delta y_t &= \Delta \mathbf{x}_t \beta + \Delta \epsilon_t
\end{aligned}
$$
This ensures **OLS estimation remains valid**.
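A brief sketch: simulate a random walk with `cumsum()` and difference it with `diff()` to recover a stationary series.
```{r}
set.seed(5)
shocks      <- rnorm(300)
random_walk <- cumsum(shocks)        # y_t = y_{t-1} + u_t, an I(1) process
delta_y     <- diff(random_walk)     # first difference: recovers the I(0) shocks

par(mfrow = c(1, 2))
plot(random_walk, type = "l", main = "Random walk: I(1)")
plot(delta_y,     type = "l", main = "First difference: I(0)")
```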
------------------------------------------------------------------------
### Unit Root Testing {#sec-unit-root-testing}
To formally determine whether a time series contains a **unit root** (i.e., is non-stationary), we test:
$$
y_t = \alpha + \rho y_{t-1} + u_t
$$
**Hypothesis Testing**
- $H_0: \rho = 1$ (unit root, non-stationary)
- OLS is **not consistent or asymptotically normal**.
- $H_a: \rho < 1$ (stationary process)
- OLS is **consistent and asymptotically normal**.
**Key Issues**
- The usual **t-test is not valid** because OLS under $H_0$ does not have a standard distribution.
- Instead, specialized tests such as [Dickey-Fuller](#sec-dickey-fuller-test-for-unit-roots) and [Augmented Dickey-Fuller](#sec-augmented-dickey-fuller-test) tests are required.
------------------------------------------------------------------------
#### Dickey-Fuller Test for Unit Roots {#sec-dickey-fuller-test-for-unit-roots}
The **Dickey-Fuller test** transforms the original equation by subtracting $y_{t-1}$ from both sides:
$$
\Delta y_t = \alpha + \theta y_{t-1} + v_t
$$
where:
$$
\theta = \rho - 1
$$
- **Null Hypothesis** ($H_0: \theta = 0$) → Implies $\rho = 1$ (unit root, non-stationary).
- **Alternative** ($H_a: \theta < 0$) → Implies $\rho < 1$ (stationary).
Since $y_t$ follows a non-standard asymptotic distribution under $H_0$, Dickey and Fuller derived specialized critical values.
**Decision Rule**
- If the test statistic is more negative than the critical value, reject $H_0$ → $y_t$ is stationary.
- Otherwise, fail to reject $H_0$ → $y_t$ has a unit root (non-stationary).
------------------------------------------------------------------------
The standard [DF](#sec-dickey-fuller-test-for-unit-roots) test may fail due to two key limitations:
1. **Simplistic Dynamic Relationship**
- The DF test assumes only one lag in the autoregressive structure.
- However, in reality, higher-order lags of $\Delta y_t$ may be needed.
**Solution:**\
Use the [Augmented Dickey-Fuller](#sec-augmented-dickey-fuller-test) test, which includes extra lags:
$$
\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + \dots + \gamma_p \Delta y_{t-p} + v_t
$$
- Under $H_0$, $\Delta y_t$ follows an AR(1) process.
- Under $H_a$, $y_t$ follows an AR(2) or higher process.
Including lags of $\Delta y_t$ ensures a better-specified model.
2. **Ignoring Deterministic Time Trends**
If a series exhibits a deterministic trend, failing to include it biases the unit root test.
**Example:** If $y_t$ grows over time, a test without a trend component will falsely detect a unit root.
**Solution:** Include a deterministic time trend ($t$) in the regression:
$$
\Delta y_t = \alpha + \theta y_{t-1} + \delta t + v_t
$$
- Allows for quadratic relationships with time.
- Changes the critical values, requiring an adjusted statistical test.
------------------------------------------------------------------------
#### Augmented Dickey-Fuller Test {#sec-augmented-dickey-fuller-test}
The **ADF test** generalizes the DF test by allowing for:
1. **Lags of** $\Delta y_t$ (to correct for serial correlation).
2. **Time trends** (to handle deterministic trends).
**Regression Equation**
$$
\Delta y_t = \alpha + \theta y_{t-1} + \delta t + \gamma_1 \Delta y_{t-1} + \dots + \gamma_p \Delta y_{t-p} + v_t
$$
where $\theta = \rho - 1$.
**Hypotheses**
- $H_0: \theta = 0$ (Unit root: non-stationary)
- $H_a: \theta < 0$ (Stationary)
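In practice, the ADF test can be run with `adf.test()` from the **tseries** package (assumed installed); the lag order is chosen automatically by the function.
```{r}
library(tseries)

set.seed(6)
rw_series   <- cumsum(rnorm(300))    # non-stationary random walk
diff_series <- diff(rw_series)       # stationary after first differencing

adf.test(rw_series)      # typically fails to reject the unit-root null
adf.test(diff_series)    # typically rejects the null in favor of stationarity
```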
------------------------------------------------------------------------
### Newey-West Standard Errors {#sec-newey-west-standard-errors}
Newey-West standard errors, also known as **Heteroskedasticity and Autocorrelation Consistent (HAC) estimators**, provide valid inference when errors exhibit both heteroskedasticity (i.e., [A4](#a4-homoskedasticity) Homoskedasticity assumption is violated) and serial correlation. These standard errors adjust for dependence in the error structure, ensuring that hypothesis tests remain valid.
**Key Features**
- **Accounts for autocorrelation**: Handles time dependence in error terms.
- **Accounts for heteroskedasticity**: Allows for non-constant variance across observations.
- **Ensures positive semi-definiteness**: Downweights longer-lagged covariances to maintain mathematical validity.
The estimator is computed as:
$$
\hat{B} = T^{-1} \sum_{t=1}^{T} e_t^2 \mathbf{x'_t x_t} + \sum_{h=1}^{g} \left(1 - \frac{h}{g+1} \right) T^{-1} \sum_{t=h+1}^{T} e_t e_{t-h} (\mathbf{x_t' x_{t-h}} + \mathbf{x_{t-h}' x_t})
$$
where:
- $T$ is the sample size,
- $g$ is the chosen **lag truncation parameter** (bandwidth),
- $e_t$ are the residuals from the OLS regression,
- $\mathbf{x}_t$ are the explanatory variables.
------------------------------------------------------------------------
**Choosing the Lag Length** ($g$)
Selecting an appropriate lag truncation parameter ($g$) is crucial for balancing **efficiency** and **bias**. Common guidelines include:
- **Yearly data**: $g = 1$ or $2$ usually suffices.
- **Quarterly data**: $g = 4$ or $8$ accounts for seasonal dependencies.
- **Monthly data**: $g = 12$ or $14$ captures typical cyclical effects.
Alternatively, data-driven methods can be used:
- **Newey-West Rule**: $g = \lfloor 4(T/100)^{2/9} \rfloor$
- **Alternative Heuristic**: $g = \lfloor T^{1/4} \rfloor$
```{r}
# Load necessary libraries
library(sandwich)
library(lmtest)
# Simulate data
set.seed(42)
T <- 100 # Sample size
time <- 1:T
x <- rnorm(T)
epsilon <- arima.sim(n = T, list(ar = 0.5)) # Autocorrelated errors
y <- 2 + 3 * x + epsilon # True model
# Estimate OLS model
model <- lm(y ~ x)
# Compute Newey-West standard errors
lag_length <- floor(4 * (T / 100) ^ (2 / 9)) # Newey-West rule
nw_se <- NeweyWest(model, lag = lag_length, prewhite = FALSE)
# Display robust standard errors
coeftest(model, vcov = nw_se)
```
------------------------------------------------------------------------
#### Testing for Serial Correlation
Serial correlation (also known as **autocorrelation**) occurs when error terms are correlated across time:
$$
E(\epsilon_t \epsilon_{t-h}) \neq 0 \quad \text{for some } h \neq 0
$$
**Steps for Detecting Serial Correlation**
1. **Estimate an OLS regression**:
- Run the regression of $y_t$ on $\mathbf{x}_t$ and obtain residuals $e_t$.
2. **Test for autocorrelation in residuals**:
- Regress $e_t$ on $\mathbf{x}_t$ and its lagged residual $e_{t-1}$:
$$
e_t = \gamma_0 + \mathbf{x}_t' \gamma + \rho e_{t-1} + v_t
$$
- Test whether $\rho$ is significantly different from zero.
3. **Decision Rule**:
- If $\rho$ is statistically significant at the 5% level, reject the null hypothesis of no serial correlation.
**Higher-Order Serial Correlation**
To test for **higher-order autocorrelation**, extend the previous regression:
$$
e_t = \gamma_0 + \mathbf{x}_t' \gamma + \rho_1 e_{t-1} + \rho_2 e_{t-2} + \dots + \rho_p e_{t-p} + v_t
$$
- **Jointly test** $\rho_1 = \rho_2 = \dots = \rho_p = 0$ using an **F-test**.
- If the null is rejected, autocorrelation of order $p$ is present.
Step 1: Estimate an OLS Regression and Obtain Residuals
```{r}
# Load necessary libraries
library(lmtest)
library(sandwich)
# Generate some example data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n) # True model: y = 1 + 0.5*x + e
# Estimate the OLS regression
model <- lm(y ~ x)
# Obtain residuals
residuals <- resid(model)
```
Step 2: Test for Autocorrelation in Residuals
```{r}
# Create lagged residuals
lagged_residuals <- c(NA, residuals[-length(residuals)])
# Regress residuals on x and lagged residuals
autocorr_test_model <- lm(residuals ~ x + lagged_residuals)
# Summary of the regression
summary(autocorr_test_model)
# Test if the coefficient of lagged_residuals is significant
rho <- coef(autocorr_test_model)["lagged_residuals"]
rho_p_value <-
summary(autocorr_test_model)$coefficients["lagged_residuals", "Pr(>|t|)"]
# Decision Rule
if (rho_p_value < 0.05) {
cat("Reject the null hypothesis: There is evidence of serial correlation.\n")
} else {
cat("Fail to reject the null hypothesis: No evidence of serial correlation.\n")
}
```
Step 3: Testing for Higher-Order Serial Correlation
```{r}
# Number of lags to test
p <- 2 # Example: testing for 2nd order autocorrelation
# Create a matrix of lagged residuals
lagged_residuals_matrix <- sapply(1:p, function(i) c(rep(NA, i), residuals[1:(n - i)]))
# Regress residuals on x and lagged residuals
higher_order_autocorr_test_model <- lm(residuals ~ x + lagged_residuals_matrix)
# Summary of the regression
summary(higher_order_autocorr_test_model)
# Joint F-test for the significance of lagged residuals
f_test <- car::linearHypothesis(higher_order_autocorr_test_model,
paste0("lagged_residuals_matrix", 1:p, " = 0"))
# Print the F-test results
print(f_test)
# Decision Rule
if (f_test$`Pr(>F)`[2] < 0.05) {
cat("Reject the null hypothesis: There is evidence of higher-order serial correlation.\n")
} else {
cat("Fail to reject the null hypothesis: No evidence of higher-order serial correlation.\n")
}
```
------------------------------------------------------------------------
**Corrections for Serial Correlation**
If serial correlation is detected, the following adjustments should be made:
| **Problem** | **Solution** |
|----------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| **Mild serial correlation** | Use [Newey-West standard errors](#sec-newey-west-standard-errors) |
| **Severe serial correlation** | Use [Generalized Least Squares] or Prais-Winsten transformation |
| **Autoregressive structure in errors** | Model as an **ARMA** process |
| **Higher-order serial correlation** | Include lags of dependent variable or use [HAC](#sec-newey-west-standard-errors) estimators with higher lag orders |
------------------------------------------------------------------------
## Repeated Cross-Sectional Data {#sec-repeated-cross-sectional-data}
Repeated cross-sectional data consists of **multiple independent cross-sections** collected at different points in time. Unlike panel data, where the same individuals are tracked over time, repeated cross-sections **draw a fresh sample in each wave**.
This approach allows researchers to analyze **aggregate trends over time**, but it does **not track individual-level changes**.
**Examples**
- **General Social Survey (GSS) (U.S.)** -- Conducted every two years with a new sample of respondents.
- **Political Opinion Polls** -- Monthly voter surveys to track shifts in public sentiment.
- **National Health Surveys** -- Annual studies with fresh samples to monitor **population-wide** health trends.
- **Educational Surveys** -- Sampling different groups of students each year to assess learning outcomes.
------------------------------------------------------------------------
### Key Characteristics
1. **Fresh Sample in Each Wave**
- Each survey represents an **independent** cross-section.
- No respondent is tracked across waves.
2. **Population-Level Trends Over Time**
- Researchers can study **how the distribution of characteristics (e.g., income, attitudes, behaviors) changes over time**.
- However, individual trajectories **cannot** be observed.
3. **Sample Design Consistency**
- To ensure comparability across waves, researchers must maintain **consistent**:
- Sampling methods
- Questionnaire design
- Definitions of key variables
------------------------------------------------------------------------
### Statistical Modeling for Repeated Cross-Sections
Since repeated cross-sections do not track the same individuals, specific regression methods are used to **analyze changes over time**.
1. **Pooled Cross-Sectional Regression (Time Fixed Effects)**
Combines multiple survey waves into a single dataset while controlling for time effects:
$$
y_i = \mathbf{x}_i \beta + \delta_1 y_1 + ... + \delta_T y_T + \epsilon_i
$$
where:
- $y_i$ is the outcome for individual $i$,
- $\mathbf{x}_i$ are explanatory variables,
- $y_t$ are time period dummies,
- $\delta_t$ captures the **average change** in outcomes across time periods.
**Key Features:**
- Allows for different intercepts across time periods, capturing shifts in baseline outcomes.
- Tracks overall population trends without assuming a constant effect of $\mathbf{x}_i$ over time.
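A minimal sketch with simulated repeated cross-sections, using `factor(year)` dummies as time fixed effects; variable names and effect sizes are purely illustrative.
```{r}
set.seed(7)
n_per_wave <- 500
years      <- c(2018, 2020, 2022)

pooled <- do.call(rbind, lapply(years, function(yr) {
  x <- rnorm(n_per_wave)
  # Outcome drifts upward across waves (illustrative time effect)
  y <- 1 + 0.5 * x + 0.1 * (yr - 2018) + rnorm(n_per_wave)
  data.frame(year = yr, x = x, y = y)
}))

# Pooled cross-sectional regression with time fixed effects
summary(lm(y ~ x + factor(year), data = pooled))
```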
------------------------------------------------------------------------
2. **Allowing for Structural Change in Pooled Cross-Sections (Time-Dependent Effects)**
To test whether relationships between variables change over time (**structural breaks**), interactions between time dummies and explanatory variables can be introduced:
$$
y_i = \mathbf{x}_i \beta + \mathbf{x}_i y_1 \gamma_1 + ... + \mathbf{x}_i y_T \gamma_T + \delta_1 y_1 + ...+ \delta_T y_T + \epsilon_i
$$
- **Interacting** $x_i$ with time period dummies allows for:
- **Different slopes** for each time period.
- **Time-dependent effects** of explanatory variables.
**Practical Application:**
- If $\mathbf{x}_i$ represents education level and $y_t$ represents survey year, an interaction term can test whether the effect of education on income has changed over time.
- Structural break tests help determine whether such time-varying effects are statistically significant.
- Useful for **policy analysis**, where a policy might impact certain subgroups differently across time.
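Building on the simulated `pooled` data above, interacting the regressor with the year dummies lets its slope vary by wave; a joint test of the interaction terms checks for structural change.
```{r}
# Time-varying slopes via interactions with the year dummies
interaction_fit <- lm(y ~ x * factor(year), data = pooled)
summary(interaction_fit)

# Joint F-test of the interaction terms (structural change in the slope of x)
anova(lm(y ~ x + factor(year), data = pooled), interaction_fit)
```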
------------------------------------------------------------------------
3. **Difference-in-Means Over Time**
A simple approach to comparing **aggregate trends**:
$$ \bar{y}_t - \bar{y}_{t-1} $$
- Measures whether the average outcome has changed over time.
- Common in policy evaluations (e.g., assessing the effect of minimum wage increases on average income).
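With the simulated `pooled` data from above, the wave-to-wave change in means is a one-liner:
```{r}
wave_means <- tapply(pooled$y, pooled$year, mean)  # average outcome per wave
diff(wave_means)                                   # change between consecutive waves
```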
------------------------------------------------------------------------
4. **Synthetic Cohort Analysis**
Since repeated cross-sections do not track individuals, a **synthetic cohort** can be created by grouping observations based on shared characteristics:
- Example: If education levels are collected over multiple waves, we can track **average income changes** within education groups to approximate trends.
------------------------------------------------------------------------
### Advantages of Repeated Cross-Sectional Data
| **Advantage** | **Explanation** |
|--------------------------------|--------------------------------------------------------------------------------------------------------------------|
| **Tracks population trends** | Useful for studying shifts in demographics, attitudes, and economic conditions over time. |
| **Lower cost than panel data** | Tracking individuals across multiple waves (as in panel studies) is expensive and prone to attrition. |
| **No attrition bias** | Unlike panel surveys, where respondents drop out over time, each wave draws a new representative sample. |
| **Easier implementation** | Organizations can design a single survey protocol and repeat it at set intervals without managing panel retention. |
------------------------------------------------------------------------
### Disadvantages of Repeated Cross-Sectional Data
| **Disadvantage** | **Explanation** |
|-------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| **No individual-level transitions** | Cannot track **how specific individuals change** over time (e.g., income mobility, changes in attitudes). |
| **Limited causal inference** | Since we observe different people in each wave, we cannot directly infer individual **cause-and-effect** relationships. |
| **Comparability issues** | Small differences in survey design (e.g., question wording or sampling frame) can make it difficult to compare across waves. |
------------------------------------------------------------------------
To ensure valid comparisons across time:
- **Consistent Sampling**: Each wave should use the same **sampling frame and methodology**.
- **Standardized Questions**: Small variations in **question wording** can introduce inconsistencies.
- **Weighting Adjustments**: If sampling strategies change, apply **survey weights** to maintain representativeness.
- **Accounting for Structural Changes**: Economic, demographic, or social changes may impact comparability.
------------------------------------------------------------------------