<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>Software Carpentry: Working With Data on the Web</title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<link rel="alternate" type="application/rss+xml" title="Software Carpentry Blog" href="http://software-carpentry.org/feed.xml"/>
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner">
<a href="http://software-carpentry.org" title="Software Carpentry">
<img alt="Software Carpentry banner" src="img/software-carpentry-banner.png" />
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title">Working With Data on the Web</h1>
<h2 class="subtitle">Getting Data</h2>
<div id="learning-objectives" class="objectives panel panel-warning">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-certificate"></span>Learning Objectives</h2>
</div>
<div class="panel-body">
<ul>
<li>Write Python programs to download data sets using simple REST APIs.</li>
</ul>
</div>
</div>
<p>A growing number of organizations make data sets available on the web in a style called <a href="reference.html#rest">REST</a>, which stands for REpresentational State Transfer. The details (and ideology) aren’t important; what matters is that when REST is used, every data set is identified by a URL.</p>
<p>For this example we’ll use data generated by 15 global circulation models that is provided through the World Bank’s <a href="http://data.worldbank.org/developers/climate-data-api">Climate Data API</a>. According to the API’s home page, the data sets containing yearly averages for various values are identified by URLs of the form:</p>
<p><code>http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/<em>var</em>/year/<em>iso3</em>.<em>ext</em></code></p>
<p>where:</p>
<ul>
<li><em>var</em> is either <code>pr</code> (for precipitation) or <code>tas</code> (for “temperature at surface”);</li>
<li><em>iso3</em> is the <a href="http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3">International Organization for Standardization (ISO) 3-letter code for a specific country</a>, such as “CAN” for Canada or “BRA” for Brazil; and</li>
<li><em>ext</em> (short for “extension”) specifies the format we want the data in. There are several choices for format, but the simplest is <a href="reference.html#csv">comma-separated values</a> (CSV), in which each record is a row, and the values in each row are separated by commas. (CSV is frequently used for spreadsheet data.)</li>
</ul>
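<p>As a minimal sketch of how the three parts fit together, the URL pattern can be filled in with ordinary string formatting. The names <code>URL_TEMPLATE</code> and <code>build_url</code> are our own, chosen for illustration; they are not part of the API.</p>

```python
# Template copied from the API's URL pattern; the three placeholders are
# the variable, the ISO 3-letter country code, and the file extension.
URL_TEMPLATE = ('http://climatedataapi.worldbank.org/climateweb/rest/v1/'
                'country/cru/{var}/year/{iso3}.{ext}')


def build_url(var, iso3, ext='csv'):
    """Return the URL for one yearly-average data set (hypothetical helper)."""
    if var not in ('pr', 'tas'):
        raise ValueError('var must be "pr" or "tas", not {!r}'.format(var))
    return URL_TEMPLATE.format(var=var, iso3=iso3, ext=ext)


print(build_url('tas', 'CAN'))
print(build_url('pr', 'BRA'))
```

<p>Building URLs in one place like this means a typo in the pattern only has to be fixed once.</p>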
<p>For example, if we want the average annual temperature in Canada as a CSV file, the URL is:</p>
<pre><code>http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv</code></pre>
<p>If we paste that URL into a browser, it displays:</p>
<pre><code>year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
...
2007,-6.819293975830078
2008,-7.2008957862854
2009,-6.997011661529541</code></pre>
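<p>Because each row is just a year and a value separated by a comma, text in this layout is easy to turn into numbers. The sketch below assumes Python 3 and uses the standard <code>csv</code> module on the first three rows shown above, pasted in as a string; <code>parse_yearly</code> is a name we made up for this example.</p>

```python
import csv
import io

# The first few rows shown above, pasted in as a string for illustration.
sample = '''year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
'''


def parse_yearly(text):
    """Convert CSV text with a 'year,data' header into (year, value) pairs."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [(int(year), float(value)) for year, value in reader]


for year, value in parse_yearly(sample):
    print(year, value)
```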
<div id="behind-the-scenes" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Behind the Scenes</h2>
</div>
<div class="panel-body">
<p>This particular data set might be stored in a file on the server, or the server might do this:</p>
<ol style="list-style-type: decimal">
<li>Receive our URL.</li>
<li>Break it into pieces.</li>
<li>Extract the three key fields (the variable, the country code, and the desired format).</li>
<li>Fetch the desired data from a database.</li>
<li>Format the data as CSV.</li>
<li>Send that to our browser.</li>
</ol>
<p>As long as the World Bank doesn’t change its URLs, it can switch back and forth between these approaches without breaking our programs.</p>
</div>
</div>
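<p>We don’t know how the World Bank’s server is actually written, but steps 2 and 3 of the list above could look something like the following sketch. It assumes Python 3 and that the URL follows the pattern described earlier; <code>extract_fields</code> is a hypothetical name.</p>

```python
from urllib.parse import urlsplit


def extract_fields(url):
    """Pull the variable, country code, and format out of a request URL."""
    path = urlsplit(url).path             # drop the scheme and host name
    pieces = path.strip('/').split('/')   # break the path into pieces
    var = pieces[-3]                      # e.g. 'tas'
    iso3, ext = pieces[-1].split('.')     # e.g. 'CAN' and 'csv'
    return var, iso3, ext


url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'
print(extract_fields(url))
```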
<p>If we only wanted to look at data for two or three countries, we could just download those files one by one. But we want to compare data for many different pairs of countries, which means we should write a program.</p>
<p>Python’s standard library includes a module for working with URLs (<code>urllib2</code> in Python 2, <code>urllib.request</code> in Python 3). It is clumsy to use, though, so many people (including us) prefer a third-party library called <a href="http://docs.python-requests.org">Requests</a>. To install it, run the command:</p>
<pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">pip</span> install requests</code></pre>
<pre class="output"><code>Requirement already satisfied (use --upgrade to upgrade): requests in /Users/gwilson/anaconda/lib/python2.7/site-packages
Cleaning up...</code></pre>
<p>We get this message because Requests is already installed on our machine; if it isn’t on yours, you’ll see a different message. We can now get the data we want like this:</p>
<pre class="sourceCode python"><code class="sourceCode python"><span class="ch">import</span> requests
url = <span class="st">'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'</span>
response = requests.get(url)
<span class="kw">if</span> response.status_code != <span class="dv">200</span>:
    <span class="dt">print</span>(<span class="st">'Failed to get data:'</span>, response.status_code)
<span class="kw">else</span>:
    <span class="dt">print</span>(<span class="st">'First 100 characters of data are'</span>)
    <span class="dt">print</span>(response.text[:<span class="dv">100</span>])</code></pre>
<pre class="output"><code>First 100 characters of data are
year,data
1901,-7.67241907119751
1902,-7.862711429595947
1903,-7.910782814025879
1904,-8.15572929382</code></pre>
<p>The first line imports the <code>requests</code> library. The second defines the URL for the data we want; we could just pass this URL as an argument to the <code>requests.get</code> call on the third line, but assigning it to a variable makes it easier to find.</p>
<p><code>requests.get</code> actually gets our data. More specifically, it:</p>
<ul>
<li>creates a connection to the <code>climatedataapi.worldbank.org</code> server;</li>
<li>sends it the URL <code>/climateweb/rest/v1/country/cru/tas/year/CAN.csv</code>;</li>
<li>creates an object in memory on our computer to hold the response;</li>
<li>assigns a number to the object’s <code>status_code</code> member variable to tell us whether the request succeeded or not; and</li>
<li>assigns the data sent back by the web server to the object’s <code>text</code> member variable.</li>
</ul>
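<p>The first two steps rely on splitting the URL into the name of the server and the part that is sent to it. Requests does this internally; the sketch below only illustrates the split, using the standard library’s <code>urlsplit</code> function and assuming Python 3.</p>

```python
from urllib.parse import urlsplit

url = 'http://climatedataapi.worldbank.org/climateweb/rest/v1/country/cru/tas/year/CAN.csv'
parts = urlsplit(url)

print(parts.scheme)   # protocol used to talk to the server
print(parts.netloc)   # host name the connection is made to
print(parts.path)     # what is sent to that server once connected
```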
<p>The server can return many different <a href="reference.html#http-status-code">status codes</a>; the most common are:</p>
<table>
<thead>
<tr class="header">
<th align="left">Code</th>
<th align="left">Name</th>
<th align="left">Meaning</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td align="left">200</td>
<td align="left">OK</td>
<td align="left">The request has succeeded.</td>
</tr>
<tr class="even">
<td align="left">204</td>
<td align="left">No Content</td>
<td align="left">The server has completed the request, but doesn’t need to return any data.</td>
</tr>
<tr class="odd">
<td align="left">400</td>
<td align="left">Bad Request</td>
<td align="left">The request is badly formatted.</td>
</tr>
<tr class="even">
<td align="left">401</td>
<td align="left">Unauthorized</td>
<td align="left">The request requires authentication.</td>
</tr>
<tr class="odd">
<td align="left">404</td>
<td align="left">Not Found</td>
<td align="left">The requested resource could not be found.</td>
</tr>
<tr class="even">
<td align="left">408</td>
<td align="left">Request Timeout</td>
<td align="left">The server gave up waiting for the client.</td>
</tr>
<tr class="odd">
<td align="left">418</td>
<td align="left">I’m a teapot</td>
<td align="left">No, really…</td>
</tr>
<tr class="even">
<td align="left">500</td>
<td align="left">Internal Server Error</td>
<td align="left">An error occurred in the server.</td>
</tr>
</tbody>
</table>
<p>Of these, 200 is the one we really care about: if we get anything else, the response probably doesn’t contain actual data (though it might contain an error message).</p>
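<p>We don’t have to memorize these names. Assuming Python 3, the standard library’s <code>http.HTTPStatus</code> can translate a numeric code into its standard name; <code>describe</code> is a helper we made up for this sketch.</p>

```python
from http import HTTPStatus


def describe(code):
    """Return the standard name for an HTTP status code, if Python knows it."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        return 'Unknown status code'


for code in (200, 204, 400, 401, 404, 500):
    print(code, describe(code))
```

<p>A program can print this name next to the number when a request fails, which makes error messages easier to read.</p>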
<div id="some-people-dont-follow-the-rules" class="callout panel panel-info">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pushpin"></span>Some People Don’t Follow the Rules</h2>
</div>
<div class="panel-body">
<p>Unfortunately, some sites don’t return a meaningful status code. Instead, they return 200 for <em>everything</em>, then put an error message (if appropriate) in the text of the response. This works when the result is being displayed to a human being, but fails miserably when the “reader” is a program that can’t actually read.</p>
</div>
</div>
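<p>One way to guard against such sites is to check the text of the response as well as the status code. The sketch below is only a heuristic, and it assumes the data we expect always starts with the header row <code>year,data</code>; the function name and the error text in the second call are invented for illustration.</p>

```python
EXPECTED_HEADER = 'year,data'   # header row of the CSV shown earlier


def looks_like_data(status_code, text):
    """Heuristic check: status 200 and a body that starts with the CSV header."""
    if status_code != 200:
        return False
    lines = text.splitlines()
    return bool(lines) and lines[0].strip() == EXPECTED_HEADER


print(looks_like_data(200, 'year,data\n1901,-7.67241907119751\n'))
print(looks_like_data(200, 'Invalid country code'))   # made-up error message
print(looks_like_data(404, ''))
```

<p>With Requests, this could be called as <code>looks_like_data(response.status_code, response.text)</code>.</p>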
<div id="how-hot-is-afghanistan" class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>How Hot is Afghanistan?</h2>
</div>
<div class="panel-body">
<p>Read the <a href="http://data.worldbank.org/developers/climate-data-api">documentation</a> for the Climate Data API, and then write URLs to find the annual average temperature for Afghanistan between 1980 and 1999.</p>
</div>
</div>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="http://software-carpentry.org">Software Carpentry</a>
<a class="label swc-blue-bg" href="https://github.com/swcarpentry/lesson-template">Source</a>
<a class="label swc-blue-bg" href="mailto:admin@software-carpentry.org">Contact</a>
<a class="label swc-blue-bg" href="LICENSE.html">License</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://software-carpentry.org/v5/js/jquery-1.9.1.min.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>