The Data Deluge:
- “All models are wrong, but some are useful”
- View data mathematically and create a story for it later
- "Google used analytical tools and better data to reach the top of their competitors"
- "We don't know why this page is better than that one, but the data says it's true"
- Once a metric is created, assume it's being gamed (original)
- Who knows why people do what they do? They do it, and we can measure it
- Advocating for the obsolescence of the scientific method
- Petabytes say "Correlation is enough"
- Maybe for most cases, but what's the cost of getting it wrong?
- Can it be more damaging to get it wrong than getting it right most of the time?
- "Doesn't know what they look like or how they live, but the genome is different
so it must be a new species"
- Why is that important?
- Correlation supersedes causation; science can advance without coherent models
- Some things are beyond the capability of people to understand
- Being able to describe them and discuss may be a good approximation
- But what's useful coming out of it? Why is it important
A/B Testing:
- "I am a believer in reason and facts and evidence and science and feedback"
- Double blind study, with populations being shown two different things
- Behavior is compared against certain metrics
- Again with the gaming of metrics
- Temporary versus sustainable advantage: is this useful against a longer-term plan?
- Flying blind based on reactionary decisions
- Hyperpersonalization and its advantages
- Choosing carefully the metrics which define the success of a choice
- If more people interacted, but your message was diluted, is this still a win?
- GoFundMe example is a good example of it working
- More money raised
- Highest-paid person's opinion makes the call (HiPPO)
- Experienced similar in my career
- Objective facts backing up a decision or proposal from a younger employee
- Aside from raw numbers, what else is the decision based on?
- Experimentation is good, but some higher level conclusions should be drawn
- Risks in a mistake, risks in tiny improvements
- A/B reduced the number of big, dramatic changes to products
- Wholesale revisions have become too risky
- Risks of plodding incrementalism
- Leaving susceptible to takeover from a better competitor
- The psychology behind small wins
- gamification, constant feedback, and lack of big motivations
- "Disruptive innovation"
- Number 4: experience teaches lessons; data can make the idea of lessons obsolete
- "Web businesses have products too dynamic to sit still for days or weeks"
- "Big data is not enough, we need real-time data that we can act on during the course of a day"
- No time to learn and apply lessons, no lessons no rules to extract
- A/B to guide them, no need to worry about why users behaved in certain ways
- Automated changeovers of traffic to the better performing option
- Most important decisions are immune to focus-grouping, let alone A/B testing
- "Not the consumers job to know what they want"
- Pushing all decisions to software misses out on the big opportunities
- Refining over creating
- '3, 4, or 5 pixels wide for border, make your case?'
- Ultimately, who cares? Is that important in the end?
- Safety, confidence in decision making
- Atrophying of decision and planning
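The double-blind comparison described above, with two populations shown different things and behavior compared against a metric, is usually decided with a simple significance test. A minimal sketch (all numbers hypothetical) using a two-proportion z-test on conversion rates:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    conv_*: conversions observed; n_*: visitors assigned to each variant.
    Returns (z, p); a small p suggests the difference is unlikely to be
    chance alone.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p = math.erfc(abs(z) / math.sqrt(2))                # two-sided p-value
    return z, p

# Hypothetical numbers: variant B converts 5.5% vs A's 5.0%, 20k visitors each.
z, p = two_proportion_ztest(1000, 20000, 1100, 20000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note that the test only says which variant won on the chosen metric, which is exactly the gaming/metric-selection worry in the notes above; it says nothing about why, or whether the metric was the right one.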
"Towards a more useful definition of Data Science"
- Missed the swine flu pandemic in 2009 (underestimated); overestimated in 2013 by 50%
- Actually overestimated prevalence in 100 out of 108 weeks since August
- Projecting two week lag of CDC data would perform just as well
- Search terms from the published paper didn't line up with GFT results or the CDC data
- "Big Data studies" difference between big data and previous data sets
- collected through observation, without prior design
- for a purpose other than the study (adapted)
- Merged, making the lack of definition worse
- No controls on the data
- "seemingly complete" assuming perfect information is not the same as having complete information
- A size-based definition of big data is unreasonable
- Computing power makes larger files easier as it improves
- GFT
- Overpredicted by quite a bit, underpredicted by quite a bit
- Does knowledge of the model change the model?
- Difficulties in accounting for changes in behavior?
- Don't throw the baby out with the bathwater
- Overhyped and how it's presented
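The "two-week lag" baseline mentioned above is trivial to state in code: predict this week's flu prevalence as the CDC figure from two weeks ago. A sketch with invented numbers (real CDC ILINet values would replace them):

```python
# Naive lagged baseline: forecast each point as the value `lag` steps earlier.
def lagged_forecast(series, lag=2):
    return [series[i - lag] for i in range(lag, len(series))]

def mean_abs_error(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

cdc = [1.2, 1.4, 1.9, 2.6, 3.4, 4.1, 4.4, 4.0, 3.1, 2.3]  # hypothetical %ILI
pred = lagged_forecast(cdc, lag=2)
print("MAE of 2-week-lag baseline:", round(mean_abs_error(pred, cdc[2:]), 3))
```

The point of the critique is that any fancier model, GFT included, has to beat this near-zero-cost baseline to justify itself.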
Big Data: A Revolution Chapters 1 - 4
-
Numbers Rule Your World Chapters 2, 3
- Modelers make general statements about the world at large from snapshots of data
- Two ways to tell the spinach story
- Triumph of the agency for a speedy response
- Criticizing that the spinach was likely gone by the time it was pinpointed
- Damage done to the spinach industry
- Second narrative is closer to the real truth
- Bills restricting credit scoring
- Credit scoring as redlining, and irresponsibly overcharging insurance premiums
- Supporters praise efficiency, reduced borrowing costs
- How long to wait before declaring an outbreak?
- CDC declared late in 2006 (close to the peak)
- But correctly identified no outbreak in 2005 and 2004
- Early in the investigation is difficult to call due to lack of information
- Shotgun approach yielded the spinach conclusion
- The cost of wrong predictions
- Severe loss to spinach industry
- Increased illness and loss of lives
- Broad street pump
- Launch of epidemiology
- Epidemic Intelligence Service
- Consumer warning for all spinach
- Tracking down the source while cases continued to rise
- Finding the batch code on the offending bags
- Matched to tainted river water and animal feces
- We don't know how many cases if any were avoided due to the recall
- Drawing attention may have prevented further contamination
- But, spinach sales dropped incredibly
- Further effects on other bagged salads
- Innocent farms roped into the same category
- The difficulty is in connecting correlations to the cause
- People got sick, people ate spinach, therefore spinach made them sick
- One truth, but many many wrong answers
- Epidemiologists look beyond statistics only to find corroborative evidence
- Educated guesses and lab work combine to test theories
- Combined tools and evidence from across the country
- Bringing new connections that weren't possible before
- Problems with less common foods are easier to catch than common foods
- Comparing the rate of spinach consumption statewide to the 80% of cases that ate spinach
- Hill's nine viewpoints
- 6 of 9 in the spinach case
- "Impossible to capture the truth, but creating useful models for understanding
and controlling disease is the goal"
- Credit modelers have much lower stakes than epidemiologists
- However, their work is scrutinized heavily by the same people that embrace epidemiology
- "The miracle of instant credit"
- Some borrowers pose more risks than others
- Credit modelers allow them to differentiate
- High scrutiny yields tight credit which yields slow growth
- FICO as the first statistical model for credit
- Make more nuanced comparisons of customers
- Consistency across the population
- Consumer spending accounts for 2/3 of the US economy
- Default rates are much, much lower when compared with traditional scoring methods
- Gave lower incomes access to credit
- Is this inherently good?
- Credit scores as proxy to insurance claim risk
-
The parable of google flu
- Nature reported in 2013 that GFT had predicted double the number of confirmed cases
- Paper posits that the issues are not isolated here, but are endemic to data usage lately
- Big data hubris and algorithm dynamics
- Hubris: the assumption that big data are a substitute for, rather than a supplement to, traditional collection and analysis
- The assumption that sheer quantity of data allows one to ignore foundational issues of measurement, reliability, and validity
- Mostly data used not for its intended purposes (not for scientific analysis)
- Hypothesize, then confirm
- Use to supplement intuition, or to launch further analyses
- Narrow down regions of study
- Methodology
- Find best fit in 50 million search terms to match 1152 data points
- Weeded out seasonal search terms (high school basketball, Christmas-related, etc.)
- Warning of overfit, first version was part winter detector, part flu detector
- Made adjustments after 2009 failure to predict H1N1 outbreak
- Has been persistently overestimating since then
- As model performance degraded, projections based on two-week-lagging data started
showing higher accuracy than GFT
- Combining GFT with CDC data provided the highest accuracy
- Correlation alone, without connecting to hard evidence
- Algorithm Dynamics
- Is the instrument capturing the target? Is it stable and comparable? Are errors systematic?
- GFT model may have simply been unstable by the algorithm dynamics
- Changes made to plain google searches likely affected user behavior
- Google's algorithm is not static
- Original 45 search terms were never documented or released
- Lack of ability for peer review
- Possible spikes due to trying to differentiate cold symptoms and flu symptoms
- Changes made in 2011 to include additional search terms, and potential diagnoses for physical symptoms in 2012
- Suggested searching for flu treatments if symptoms are searched (Double entries)
- Recommended searches increase the magnitude of certain searches
- GFT model uses relative magnitude, can be heavily biased by similar searches
- Health based add-ons to search results
- Assumption that relative search volume is statically related to external events
- In reality, externally motivated but also cultivated by the provider
- Google wants you to search more to show more ads or display a certain product
- Changes in algorithms make replicating old studies harder or impossible
- "Red team" attacks for economic or political gain
- Campaigns aim to become trending on social platforms and search engines
- Monitoring behavior of people on open information sources makes it tempting to manipulate signals
- Transparency, Granularity, and All-Data
- Transparency and replication
- GFT papers did not meet standards, neither did the search terms provided
- Lack of the ability to replicate their findings
- Standard science requires cooperation and building on previous findings
- The core goal and initial results of GFT were promising, but it was a first draft
- A simple lagged model performs well
- A modest performance gain on a national level
- Local level study would be much more useful
- GFT could provide such a thing
- Local propagation connects with a larger outbreak and could truly be transformational
- Ability to replicate (Rev history on algorithms)
- Study the effects of the algorithm on user behavior
- Effects of search results on worldview
- Using innovative analytics in combination with "big data"
- Provides a much more powerful tool
- New analytics on traditional data for a deeper, clearer view of the world
- The wealth of information comes with responsibility
- Responsibility to use for public interest
- But also responsibility to be realistic and open
- Google has now quietly begun offering the GFT data to the CDC and research groups
- Is the methodology being changed?
- Update fit to actual flu prevalence
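The "big data hubris" point above, fitting the best of 50 million candidate search terms to only 1,152 data points, can be demonstrated with a toy simulation (entirely invented, not GFT's actual method): among enough pure-noise series, some will correlate strongly in-sample and then fail on fresh data.

```python
import random

random.seed(0)

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

target = [random.gauss(0, 1) for _ in range(50)]              # stand-in "CDC" series
candidates = [[random.gauss(0, 1) for _ in range(100)] for _ in range(5000)]

# Pick the noise series that best matches the first 50 points...
best = max(candidates, key=lambda c: abs(corr(c[:50], target)))
in_sample = corr(best[:50], target)

# ...then score its later values against 50 fresh points it never saw.
fresh = [random.gauss(0, 1) for _ in range(50)]
out_of_sample = corr(best[50:], fresh)
print(f"in-sample r = {in_sample:.2f}, out-of-sample r = {out_of_sample:.2f}")
```

With 5,000 candidates the winner looks impressively correlated in-sample and collapses out-of-sample; with 50 million candidates the effect is far stronger, which is why the notes flag the "part winter detector, part flu detector" overfitting warning.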
Google Flu Trends is a signature Big Data project, and the project that leads off the Mayer-Schönberger book
What is innovative about GFT? Is it a success or a failure? What is the future direction of
the project? What can data scientists learn from this effort?
"Where small data deals in exactitudes, Big Data settles for a direction"
- GFT ultimately failed in trying to capture both
- It did do a good job at capturing direction though
- Tied to the commercial uses of its algorithm; it allowed more targeted advertising to people likely to have the flu
- Looking for ways to further refine the algorithm
1) Describe the two methods of predicting flu trends, the one used by the CDC and the one used by Google
- Correlating relative frequency of key search terms with previous CDC data
- CDC collects confirmed cases and makes projections based on trend analysis
2) Comment on the pros and cons of the two methods
- CDC method lags real time by two weeks, backed by hard evidence
- Costly to collect, not real time, difficult to analyze
- Using data collected for a different reason
- GFT method broken by changes to algorithm dynamics
- Not directly related to physical measurements
- Affected by decisions made for commercial reasons
3) GFT relies purely on correlations to make predictions. Why should the frequency of some search terms
be positively correlated with flu cases? Give examples of search terms you think might be
correlated with flu
- Parents searching to help their kids
- People with the flu can behave similarly
- Taking in fluids, eating bland foods, fever medication, stomach medicine
- Immune system boosting goods
4) What are some reasons why someone searching for influenza may not have the flu?
Why might someone who has the flu not show up in Google's search logs?
- Media coverage, frenzy
- May have a cold and looking to differentiate
- People concerned about catching the flu
- Completely floored with the flu
- Not doing anything
- They know they have the flu
5) Suggest ways in which Google Flu Trends can be improved
- Differential scoring (higher prevalence in specific areas) for early detection
- Narrow down the breadth of CDC data collection and monitoring
- Slow the spread from hot spots
- Location and activity tracking (lower activity -> higher prevalence confirmation)