Sunday, August 09 2009
visualizations.labelcloud.refresh needs to run in a particular environment:
env PYTHONPATH=~/builds/Apollo/src/:~/builds/ DJANGO_SETTINGS_MODULE=djangotutorial.settings ./refresh.py -u <username> -p <password>
Sunday, July 26 2009
Flow for the gadget
Log-in
Use credentials (in another thread) to log-in and refresh Tag Cloud
Show current Tag Cloud
If database is empty then show the spinner and wait for Refresh Thread to complete
There should be a way to signal the client that the database has been refreshed
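A minimal sketch of the flow above, assuming a background thread and a threading.Event as the "refreshed" signal. TagCloudStore, refresh_tag_cloud, show_tag_cloud and fake_fetch are hypothetical names, not part of the repo, and the in-memory dict only stands in for the database:

```python
import threading


class TagCloudStore:
    """Hypothetical in-memory stand-in for the database."""

    def __init__(self):
        self.cloud = {}                     # term -> count
        self.refreshed = threading.Event()  # set once a refresh completes


def refresh_tag_cloud(store, fetch_terms):
    """Runs in the refresh thread: fetch terms, store them, then signal."""
    store.cloud = fetch_terms()
    store.refreshed.set()


def show_tag_cloud(store, timeout=5.0):
    """Return the cloud; if it is empty, wait for the refresh thread.

    The wait is where a client would show the spinner.
    """
    if not store.cloud:
        store.refreshed.wait(timeout)
    return store.cloud


def fake_fetch():
    # Placeholder for the real log-in and Reader fetch.
    return {"python": 3, "nltk": 2}


store = TagCloudStore()
worker = threading.Thread(target=refresh_tag_cloud, args=(store, fake_fetch))
worker.start()
cloud = show_tag_cloud(store)
worker.join()
```

A web client cannot block on an Event, so in practice the signal would probably have to be exposed through polling or a similar mechanism; the Event only shows the server-side half.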
Wednesday, July 15 2009
Need some refactoring in the nlp and reader packages
Should really be using JSON output from reader
Took a parser with Django ('python-simplejson') but there are others
Module variables limit the tool to one user so classes will be required in most modules with "global" state
The plan now is to Label Reader Items with "top words" and redirect from a Term Cloud or Collocation entry to the Reader UI
Like the look of that UI and have no desire to duplicate the Reader back-end
Should be able to construct a data model based on a User Cookie
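On the module-variable point above, a minimal sketch of moving "global" state into a per-user object. ReaderSession and its attributes are hypothetical names chosen for illustration, not the actual nlp or reader classes:

```python
class ReaderSession:
    """Hypothetical per-user container replacing module-level variables."""

    def __init__(self, username):
        self.username = username
        self.feeds = []      # previously a module variable shared by everyone
        self.top_words = {}  # word -> frequency for this user only

    def add_feed(self, title):
        self.feeds.append(title)


# Two users no longer share state.
alice = ReaderSession("alice")
bob = ReaderSession("bob")
alice.add_feed("Programming Reddit")
```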
Saturday, July 11 2009
Going to re-locate some of the reader_pipeline functionality
Thought stopwords used in getcontent() would be updated but decided that will be up to Level 3 processing
topcontent() will return 'words' used directly by Level 3
Wednesday, July 08 2009
Looks like AppEngine is not going to work out
Need to customize the Python run-time with NLTK, which does not seem to be zipimport-able
Thursday, July 02 2009
Will eventually make use of FirePython (http://firepython.binaryage.com/)
Hope to avoid a lot of webapp development right now: focus should be on NLTK
Wednesday, July 01 2009
Need to think about how the UI should be laid out
Should incorporate Tag Cloud and all of the related articles
Would like to order tagged entries according to PostRank
Haven't read about the API yet but suspect that the underlying link in say, a reddit post, would need to be used rather than the post itself
Want to (1) avoid duplicating Reader data and (2) run frequency and rank analysis in the background
That will mean having to query Reader on-demand comparing return values with stored (feed_name, item_id)
This might hurt performance: parsing is time-consuming
Proposed Model:
owner = UserProperty()
feed_list = StringListProperty("feed_list", "")
item_list = StringListProperty("item_list", "")
Should be able to relate Items and Feeds, but this might not even be necessary
If reading-list state is used then Item Model could be:
owner = UserProperty()
title = StringProperty()
id = StringProperty()
and Frequency Model could be:
owner = UserProperty()
category = CategoryProperty()
word = StringProperty()
frequency = IntegerProperty()
This way a query can be done on (owner, category) then (owner, title)
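A plain-Python sketch of the (owner, category) query on the Frequency Model. This is not the AppEngine datastore API: Frequency and FrequencyIndex are hypothetical stand-ins that only show the lookup shape:

```python
from collections import defaultdict, namedtuple

# Mirrors the proposed Frequency Model fields.
Frequency = namedtuple("Frequency", "owner category word frequency")


class FrequencyIndex:
    """Hypothetical in-memory index keyed on (owner, category)."""

    def __init__(self):
        self._rows = defaultdict(list)

    def add(self, row):
        self._rows[(row.owner, row.category)].append(row)

    def top_words(self, owner, category, n=10):
        """Return up to n words for (owner, category), most frequent first."""
        rows = self._rows.get((owner, category), [])
        ranked = sorted(rows, key=lambda r: r.frequency, reverse=True)
        return [r.word for r in ranked[:n]]


index = FrequencyIndex()
index.add(Frequency("alice", "programming", "python", 12))
index.add(Frequency("alice", "programming", "nltk", 30))
index.add(Frequency("bob", "programming", "java", 7))
alice_words = index.top_words("alice", "programming")
```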
Probably going to need the continuation syntax for adequate performance:
# Pull in current infrastructure
from repo.google.reader import *
# Pull in user feed layout
reader_feeds = atom.feeds(username, password)
# Pull each feed in Model.feed_list
reader_feeds[0].refresh()
# Parse the return values
# (might need to parameterize refresh() and do this in a yield())
parsed_feed = reader_feeds[0].parse()
# Compare with each Item in Model.item_list
parsed_feed.entries[0].id
parsed_feed.entries[0].source.id # and this (although feed url is carried outside parsed_feed)
parsed_feed.feed.gr_continuation # use this in subsequent requests e.g. http://www.google.com/reader/atom/feed/http://xkcd.com/rss.xml?ck=1169900000&xt=user/-/state/com.google/read&c=CIu4qN33pJsC
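The continuation idea, sketched as a generator. fetch_page is a hypothetical callable (not a Reader API function) that takes a continuation token and returns (entries, next_token); in the real flow it would wrap the request and parse step and read gr_continuation from the parsed feed:

```python
def iter_entries(fetch_page):
    """Yield entries from every page, following continuation tokens.

    fetch_page(continuation) is assumed to return a tuple of
    (entries, next_continuation), with next_continuation None on the last page.
    """
    continuation = None
    while True:
        entries, continuation = fetch_page(continuation)
        for entry in entries:
            yield entry
        if not continuation:
            break


# Fake pages standing in for Reader responses.
PAGES = {
    None: (["entry-1", "entry-2"], "token-a"),
    "token-a": (["entry-3"], None),
}


def fake_fetch(continuation):
    return PAGES[continuation]


all_entries = list(iter_entries(fake_fetch))
```

Because it is a generator, the caller can stop as soon as it reaches an item that is already stored, which avoids parsing the rest of the feed.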
There is no escaping a bit of UI design
It would be nice to simply "lens" Reader but that is not going to cut it
Thursday, June 25 2009
Could publish any time
Just want to store (title (summary?), reader_id) locally, where reader_id is used to view the item in Reader
Would love to keep it all in Reader but the title/summary is required to match the high frequency words
Infrastructure is in place now but latency makes iteration difficult
Need a data structure that will support quick look-up of the "title/summaries" that contain the term(s)
This sounds like a matrix of some kind; I need the particulars now
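One candidate for that look-up is an inverted index (term -> set of reader_ids) rather than a full matrix. A minimal sketch, assuming items arrive as (reader_id, title) pairs; build_index and lookup are hypothetical names and the tokenizer is deliberately crude:

```python
import re
from collections import defaultdict


def build_index(items):
    """Map each lower-cased term to the set of reader_ids whose title has it."""
    index = defaultdict(set)
    for reader_id, title in items:
        for term in re.findall(r"[a-z0-9']+", title.lower()):
            index[term].add(reader_id)
    return index


def lookup(index, terms):
    """Return the reader_ids whose title contains every given term."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


items = [
    ("id-1", "Python generators explained"),
    ("id-2", "NLTK and Python"),
    ("id-3", "Expat parsing tips"),
]
index = build_index(items)
python_ids = lookup(index, ["python"])
both_ids = lookup(index, ["Python", "NLTK"])
```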
Monday, June 22 2009
Time to publish. Gating items:
Present word count of the top 50
Commenting
Logging
atom test cases
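For the top-50 word count, a stdlib-only sketch using collections.Counter. The project uses NLTK for this; Counter is swapped in here only to keep the example self-contained, and top_words is a hypothetical name:

```python
import re
from collections import Counter


def top_words(titles, n=50):
    """Return the n most common lower-cased words as (word, count) pairs."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return counts.most_common(n)


titles = ["Python tips", "More Python", "python and NLTK"]
most_common = top_words(titles, n=1)
```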
Sunday, June 21 2009
Label "version_1"
Pretty interesting feature set now:
1. Expat integration
2. FeedParse integration
3. NLTK integration
4. Google Reader API integration
Pulling data from Reader Repository and sampling a pretty large (>2000) corpus of Feed Titles.
Should probably clean-up and profile the code a little
Title Corpus alone is not interesting: NLTK analysis needs to be a little more sophisticated
Probably need to inform with a UI
Performance isn't great
Thinking of moving to Google AppEngine.
Should profile the code a little: might just be network latency but algorithms might need tweaking
Thinking of a push to GitHub
utils.fsm does some pretty cool Expat integration (IMHO).
Would like some feedback on that code, maybe some better ideas.
Could probably use some database integration
No need to go out for the Feeds every time the Titles need to be analyzed: another reason to use AppEngine
Been thinking of some real-time Twitter feed analysis
Weight words according to time: if 'Iran' occurs in 10 different tweets within 15 minutes of each other then it becomes interesting for the analysis and triggers a deeper pass (further back in time)
I'm sure there is some analysis in there that would help me spot trends
Not restricted to Twitter. A similar analysis could be applied to any bookmarking feed
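A rough sketch of the time-weighting idea as a sliding window per word. BurstDetector is a hypothetical name; timestamps are assumed to be seconds and to arrive in non-decreasing order, and the 10-posts-in-15-minutes defaults come from the note above:

```python
from collections import defaultdict, deque


class BurstDetector:
    """Flag words seen in at least `threshold` posts within `window_seconds`."""

    def __init__(self, threshold=10, window_seconds=15 * 60):
        self.threshold = threshold
        self.window = window_seconds
        self._seen = defaultdict(deque)  # word -> timestamps of recent posts

    def observe(self, timestamp, words):
        """Record one post and return the words that are currently 'hot'."""
        hot = []
        for word in sorted(set(words)):  # count each word once per post
            times = self._seen[word]
            times.append(timestamp)
            # Drop timestamps that have fallen out of the window.
            while times and timestamp - times[0] > self.window:
                times.popleft()
            if len(times) >= self.threshold:
                hot.append(word)
        return hot


detector = BurstDetector(threshold=3, window_seconds=900)
first = detector.observe(0, ["iran", "news"])
second = detector.observe(60, ["iran"])
third = detector.observe(120, ["iran", "iran"])
```

A word returned by observe() would be the trigger for the deeper pass; the same detector works for any feed that carries timestamps.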
Monday, June 08 2009
List API could be similar:
subscriptions = feeds.google.reader.subscriptions
title = subscriptions[i].title
Friday, June 05 2009
We'll get a feeds.google.reader.Feed(title)
e.g. prog_reddit = feeds.google.reader.Feed('Programming Reddit')
feeds.google.reader.Feed attributes: title, identifier, unread_count
Have a flow like this:
# basic init: no i/o
reader_feeds = feeds.google.reader.atom.feeds('username', 'password')
prog_reddit = reader_feeds[i]
# pick-up: identifier, unread_count, and reading_list
prog_reddit.refresh()
# might want to parameterize Feed.reading_list()
unread_count = prog_reddit.unread_count
# reading_list is raw rss
reading_list = prog_reddit.reading_list
prog_reddit.reading_list() is rss: pass directly to feedparser.parse()
might be handy to integrate feedparser:
prog_reddit.parse()
Saturday, May 23 2009
Should use Google Reader API so there is no need for local feed store
Using NLTK to deliver the top words
top phrase might be interesting but interaction would be more productive
a tag cloud or heat map
So it goes like this now:
1. Source Reader for certain feeds
2. Produce a cloud of terms
3. Click on term in cloud to view feed entries
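For step 2, one simple way to turn term counts into a cloud is to scale each count linearly to a font size. A sketch under that assumption; cloud_sizes and the 10 to 32 size range are hypothetical choices, not taken from the code:

```python
def cloud_sizes(counts, min_size=10, max_size=32):
    """Map term counts to integer font sizes by linear interpolation."""
    if not counts:
        return {}
    lowest = min(counts.values())
    highest = max(counts.values())
    span = highest - lowest
    sizes = {}
    for term, count in counts.items():
        if span == 0:
            # All terms are equally frequent: give them the same size.
            sizes[term] = max_size
        else:
            sizes[term] = min_size + (count - lowest) * (max_size - min_size) // span
    return sizes


sizes = cloud_sizes({"python": 5, "nltk": 3, "expat": 1})
```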
Monday, May 18 2009
Going to take the oldest feed file and update it
Use a python list of objects corresponding to the most primitive RSS 2.0 feed:
[((channel:title)([(item:title, item:link)]))]
or hierarchically:
channel:title
item
title
link
item
title
link
...
Could be a little more general and use sets:
Oldest feed will be a subset of newer so we just pickle the union
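A minimal sketch of the set-based variant, assuming each feed is held as a dict of channel title -> set of (item title, item link) tuples. merge_feeds and the sample data are hypothetical:

```python
import pickle


def merge_feeds(older, newer):
    """Return the per-channel union of two feeds.

    Each feed is a dict: channel title -> set of (item title, item link).
    """
    merged = {}
    for feed in (older, newer):
        for channel, items in feed.items():
            merged.setdefault(channel, set()).update(items)
    return merged


older = {"xkcd": {("Comic 1", "http://example.com/1")}}
newer = {
    "xkcd": {
        ("Comic 1", "http://example.com/1"),
        ("Comic 2", "http://example.com/2"),
    }
}
merged = merge_feeds(older, newer)

# Round-trip through pickle, as the stored feed file would be.
restored = pickle.loads(pickle.dumps(merged))
```

Tuples are used for the items because set members must be hashable; the union also covers the case where the older feed is not a strict subset of the newer one.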