We want to find all Data.gov datasets that match a specific type of data (e.g. sea_water_temperature), in a specified geospatial extent and time window, and that have a specific type of data endpoint (e.g. OPeNDAP). Since data.gov uses CKAN, while waiting for a CSW interface, here we try using the CKAN API with the requests package and with the ckanclient package.
import requests
import json
from pprint import pprint
Let's try doing a request first with the requests package:
f = requests.get("http://catalog.data.gov/api/3/search/dataset?q=%22sea_water_temperature%22").json()
print f.keys()
print f['count']
pprint(f['results'][0:5])
Now let's try the ckanclient package:
import ckanclient
ckan = ckanclient.CkanClient('http://catalog.data.gov/api/3')
Try the same simple search we did with requests above:
search_params = { 'q': 'tags:"sea_water_temperature" '}
d = ckan.action('package_search', **search_params)
print d['count']
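Hmm... Interesting. With ckanclient we got 51 results back, whereas with requests we got 52. Could be a 0- vs 1-based index thing? Note, though, that the two queries are not identical: the requests call does a free-text search for "sea_water_temperature", while the ckanclient call restricts the match to the tags field, which could also account for the difference.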
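One way to rule out a query mismatch would be to send exactly the same parameters to the package_search action from both clients. Here is a minimal sketch of building that URL for requests. It assumes the standard CKAN action API path (/api/3/action/package_search) and that the response wraps its payload under a 'result' key; build_search_url is a hypothetical helper, not part of either package.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Assumed endpoint: the standard CKAN action API path on catalog.data.gov
CKAN_ACTION_URL = 'http://catalog.data.gov/api/3/action/package_search'

def build_search_url(q, fq=None, rows=None, base=CKAN_ACTION_URL):
    """Build a package_search URL from the same parameters used with ckanclient."""
    params = [('q', q)]
    if fq is not None:
        params.append(('fq', fq))
    if rows is not None:
        params.append(('rows', rows))
    return base + '?' + urlencode(params)

url = build_search_url('tags:"sea_water_temperature"', rows=1)
print(url)
# With network access, the count could then be read with something like:
# requests.get(url).json()['result']['count']
```

If both clients report the same count for the same query string, the 51 vs 52 difference comes from the queries and not from the clients.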
Now let's try a more complex search, asking for only 3 results back. Below we ask for res_format:PDF, but what are the other possible res_format values we can query on?
search_params = {
    'q': 'tags:"sea_water_temperature" AND metadata_modified:[2012-06-01T00:00:00.000Z TO NOW]',
    'fq': 'res_format:PDF',
    'extras': {"ext_bbox": "-121,45,-120,46"},
    'rows': 3
}
d = ckan.action('package_search', **search_params)
print d['count']
search_params = {
    'q': 'tags:"temperature"',
    'fq': 'res_format:PDF',
    'extras': {"ext_bbox": "-125,38,-124,39"},
    'rows': 10
}
d = ckan.action('package_search', **search_params)
print d['count']
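On the question of which res_format values exist: one rough way to get a feel for it is to tally the formats that show up in whatever results come back. The sketch below assumes each resource dict carries a 'format' key (the usual CKAN resource field); count_resource_formats is a hypothetical helper, and the sample records are made up for illustration. It only sees formats present in the returned page of results, so it is a sample, not the full list.

```python
from collections import Counter

def count_resource_formats(results):
    """Tally the 'format' field over all resources in a list of result dicts."""
    counts = Counter()
    for item in results:
        for member in item.get('resources', []):
            # Assumption: resources expose their format under 'format'
            counts[member.get('format') or 'unknown'] += 1
    return counts

# Made-up records shaped like d['results']
sample = [
    {'resources': [{'format': 'PDF'}, {'format': 'NetCDF'}]},
    {'resources': [{'format': 'PDF'}, {}]},
]
format_counts = count_resource_formats(sample)
print(format_counts.most_common())
```

With a live response this would be called as count_resource_formats(d['results']), ideally after a search without the res_format filter.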
So what does one of these results look like? Let's take a look at the keys
print d['results'][0].keys()
pprint(d['results'][0]['resources'])
print d['results'][0]['resources'][0]['url']
Now let's see what the URLs look like for all the resources:
for item in d['results']:
    for member in item['resources']:
        print 'url:',member['url']
        print 'protocol:',member['resource_locator_protocol']
        print 'resource_type:',member['resource_type']
So there are multiple resources for each record. Let's check out some specific resource URLs for all datasets:
resource_number=0
for item in d['results']:
    print item['resources'][resource_number]['url']
resource_number=1
for item in d['results']:
    print item['resources'][resource_number]['url']
Hmm... These are not data granules, but data collections. We need the actual OPeNDAP dataset URLs. How do we get from this catalog information to the actual dataset service endpoints? Let's try looking for NetCDF:
search_params = {
    'q': 'tags:"sea_water_temperature"',
    'fq': 'res_format:NetCDF',
    'rows': 5
}
d = ckan.action('package_search', **search_params)
print d['count']
Hmmm.. We got skunked. Maybe try just temperature instead of sea_water_temperature?
search_params = {
    'q': 'tags:"temperature"',
    'fq': 'res_format:NetCDF',
    'rows': 5
}
d = ckan.action('package_search', **search_params)
print d['count']
Well, that's better!
for item in d['results']:
    for member in item['resources']:
        print member['url']
resource_number=0
for item in d['results']:
    print item['resources'][resource_number]['url']
resource_number=1
for item in d['results']:
    print item['resources'][resource_number]['url']
resource_number=2
for item in d['results']:
    print item['resources'][resource_number]['url']
Let's get a list of the OPeNDAP Data URLs
resource_number=3
dap_url=[]
for item in d['results']:
    dap_url.append(item['resources'][resource_number]['url'].split('.html')[0])
pprint(dap_url)
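Picking resource number 3 works for these records, but it relies on every dataset listing its resources in the same order. A sketch of a less position-dependent approach is to scan all resources for URLs that look like OPeNDAP endpoints. The '/dodsC/' marker is an assumption (it is the default OPeNDAP path on THREDDS servers and may not hold for other servers); find_dap_urls is a hypothetical helper and the sample URLs are made up.

```python
def find_dap_urls(results, marker='/dodsC/'):
    """Return resource URLs containing `marker`, with a trailing '.html' removed."""
    urls = []
    for item in results:
        for member in item.get('resources', []):
            url = member.get('url') or ''
            if marker in url:
                # The catalog links to the OPeNDAP HTML form; drop the suffix
                if url.endswith('.html'):
                    url = url[:-len('.html')]
                urls.append(url)
    return urls

# Made-up records shaped like d['results']
sample = [
    {'resources': [
        {'url': 'http://example.org/metadata/a.xml'},
        {'url': 'http://example.org/thredds/dodsC/a.nc.html'},
    ]},
    {'resources': [
        {'url': 'http://example.org/thredds/dodsC/b.nc'},
    ]},
]
found = find_dap_urls(sample)
print(found)
```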
Now let's try to read one of the OPeNDAP Data URLs
import netCDF4
nc = netCDF4.Dataset(dap_url[0])
nc.Conventions
print nc.variables.keys()
ncvar=nc.variables
print ncvar['yearday']
print ncvar['Temperature']
Uh, this says it's CF 1.4, but no way is it CF 1.4. There are not even units on time! The long_name of time is just "Day of the Year", but there doesn't even seem to be information in the file to say what year:
for name in nc.ncattrs():
    print name, '=', getattr(nc,name)
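The missing time units can be checked for programmatically. The sketch below tests whether a variable's attributes include a units string of the CF form "<unit> since <reference date>". It is a simplified check that only covers the common spelled-out units and is not a full CF validator; has_cf_time_units is a hypothetical helper that works on a plain dict of attributes.

```python
import re

# Simplified pattern: "<unit> since YYYY-MM-DD..." with common spelled-out units
_CF_TIME_UNITS = re.compile(
    r'^\s*(seconds?|minutes?|hours?|days?)\s+since\s+\d{1,4}-\d{1,2}-\d{1,2}',
    re.IGNORECASE)

def has_cf_time_units(attrs):
    """Return True if attrs['units'] looks like a CF time units string."""
    units = attrs.get('units')
    return units is not None and bool(_CF_TIME_UNITS.match(str(units)))

print(has_cf_time_units({'units': 'days since 2012-01-01 00:00:00'}))
print(has_cf_time_units({'long_name': 'Day of the Year'}))
```

For a netCDF4 variable, the attribute dict could be built with dict((k, var.getncattr(k)) for k in var.ncattrs()).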
So if we are going to plot this, it's going to be with a human trying to figure out what is what:
n=10000  # number of samples to plot (avoids shadowing the built-in len)
yd=ncvar['yearday'][0:n].flatten()
t=ncvar['Temperature'][0:n].flatten()
figsize(12,4)
plot(yd,t)
Still don't know what year this is. Guess 2012 since that's when the conversion script was run by NCDDC?
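If we go with a guessed year, the yearday values can be turned into real timestamps for plotting. This sketch assumes that yearday 1.0 means January 1 at 00:00 and that the year is 2012. Both are guesses, since the file itself does not say; yearday_to_datetime is a hypothetical helper.

```python
from datetime import datetime, timedelta

def yearday_to_datetime(yearday, year):
    """Convert a fractional day-of-year to a datetime.

    Assumption: yearday 1.0 corresponds to January 1 at 00:00 of `year`.
    """
    return datetime(year, 1, 1) + timedelta(days=float(yearday) - 1.0)

guessed_year = 2012  # a guess; the file does not record the year
print(yearday_to_datetime(1.0, guessed_year))
print(yearday_to_datetime(32.5, guessed_year))
```

If the data turn out to use a 0-based yearday, the "- 1.0" offset would need to be dropped.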