reader_pipeline/Notes at master · colgur/reader_pipeline · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
Sunday, August 09 2009
visualizations.labelcloud.refresh needs to run in a particular environment:
   env PYTHONPATH=~/builds/Apollo/src/:~/builds/ DJANGO_SETTINGS_MODULE=djangotutorial.settings ./refresh.py -u <username> -p <password>

Sunday, July 26 2009
Flow for the gadget
   Log-in
      Use credentials (in another thread) to log-in and refresh Tag Cloud
   Show current Tag Cloud
      If database is empty then show the spinner and wait for Refresh Thread to complete

There should be a way to signal the client that the database has been refreshed

Wednesday, July 15 2009
Need some refactoring in the nlp and reader packages
   Should really be using JSON output from reader
      Took a parser with Django ('python-simplejson') but there are others
   Module variables limit the tool to one user so classes will be required in most modules with "global" state

The plan now is to Label Reader Items with "top words" and redirect from a Term Cloud or Collocation entry to the Reader UI
   Like the look of that UI and have no desire to duplicate the Reader back-end
   Should be able to construct a data model based on a User Cookie

Saturday, July 11 2009
Going to re-locate some of the reader_pipeline functionality
   Thought stopwords used in getcontent() would be updated but decided that will be up to Level 3 processing
   topcontent() will return 'words' used directly by Level 3

Wednesday, July 08 2009
Looks like AppEngine is not going to workout
   Need to customize Python run-time with NLTK which cannot (does seem to be) zimport-able

Thursday, July 02 2009
Will eventually make use of FirePython (http://firepython.binaryage.com/)
   Hope to avoid a lot of webapp development right now: focus should be on NLTK

Wednesday, July 01 2009
Need to think about how the UI should be laid out
   Should incorporate Tag Cloud and all of the related articles

Would like to order tagged entries according to PostRank
   Haven't read about the API yet but suspect that the underlying link in say, a reddit post, would need to be used rather than the post itself

Want to (1) avoid duplicating Reader data and (2) run frequency and rank analysis in the background
   That will mean having to query Reader on-demand comparing return values with stored (feed_name, item_id)
      This might suck the performance: parse is time consuming
   Proposed Model:
      owner = UserProperty()
      feed_list = StringListProperty("feed_list", "")
      item_list = StringListProperty("item_list", "")
         Should be able to relate Items and Feeds, but this might not even be necessary
            If reading-list state is used then Item Model could be:
               owner = UserProperty()
               title = StringProperty()
               id = StringProperty()
            and Frequency Model could be:
               owner = UserProperty()
               category = CategoryProperty()
               word = StringProperty()
               frequency = IntegerProperty()
            This way a query can be done on (owner, category) then (owner, title)

Probably going to need the continuation syntax for adequate performance:
   # Pull in current infrastructure
   from repo.google.reader import *

   # Pull in user feed layout
   reader_feeds = atom.feeds(username, password)

   # Pull each feed in Model.feed_list
   reader_feeds[0].refresh()

   # Parse the return values
   # (might need to parameterize refresh() and do this in a yield())
   parsed_feed = reader_feeds[0].parse()

   # Compare with each Item in Model.item_list
   parsed_feed.entries[0].id
   parsed_feed.entries[0].source.id # and this (although feed url is carried outside parsed_feed)
   parsed_feed.feed.gr_continuation # use this in subsequent requests e.g. http://www.google.com/reader/atom/feed/http://xkcd.com/rss.xml?ck=1169900000&xt=user/-/state/com.google/read&c=CIu4qN33pJsC

There is no escaping a bit of UI design
   It would be nice to simply "lens" Reader but that is not going to cut it

Thursday, June 25 2009
Could publish any time

Just want to store (title (summary?), reader_id) locally where reader_id is used to view
   Would love to keep it all in Reader but the title/summary is required to match the high frequency words

Infrastructure is in place now but latency makes iteration difficult
   Need a data structure that will support quick look-up of "title/summaries" contain the term(s)
      This sounds like a matrix of some kind I need the particulars now

Monday, June 22 2009
Time to publish. Gating items:
   Present word count of the top 50
   Commenting
   Logging
   atom test cases

Sunday, June 21 2009
Label "version_1"
   Pretty interesting feature set now:
      1. Expat integration
      2. FeedParse integration
      3. NLTK integration
      4. Google Reader API integration
   Pulling data from Reader Repository and sampling a pretty large (>2000) corpus of Feed Titles.
   Should probably clean-up and profile the code a little

Title Corpus alone is not interesting: NLTK analysis needs to be a little more sophisticated
   Probably need to inform with a UI

Performance isn't great
   Thinking of moving to Google AppEngine.
   Should profile the code a little: might just be network latency but algorithms might need tweaking

Thinking of a push to GitHub
   utils.fsm does some pretty cool Expat integration (IMHO).
   Would like some feedback on that code, maybe some better ideas.

Could probably use some database integration
   No need to go out for the Feeds every time the Titles need to analyzed: another reason to use AppEngine

Been thinking of some real-time Twitter feed analysis
   Weight words according to time: if 'Iran' occurs in 10 different tweets within 15 minutes of each other then it becomes interesting for the analysis and starts a process of running deeper (further back in time)
      I'm sure there is some analysis in there that would help me spot trends
   Not restricted to Twitter. A similar analysis could be applied to any bookmarking feed

Monday, June 08 2009
List API could be similar:
   subscriptions = feeds.google.reader.subscriptions
   title = subscriptions[i].title

Friday, June 05 2009
We'll get a feeds.google.reader.Feed(title)
   e.g. prog_reddit = feeds.google.reader.Feed('Programming Reddit')

feeds.google.reader.Feed attributes: title, identifier, unread_count

Have a flow like this:
   # basic init: no i/o
   reader_feeds = feeds.google.reader.atom.feeds('username', 'password')
   prog_reddit = reader_feeds[i]

   # pick-up: identifier, unread_count, and reading_list
   prog_reddit.refresh

   # might want to parameterize Feed.reading_list()
   unread_count = prog_reddit.unread_count

   # reading_list is raw rss
   reading_list = prog_reddit.reading_list

   prog_reddit.reading_list() is rss: pass directly to feedparser.parse()
      might be handy to integrate feedparser:
         prog_reddit.parse()

Saturday, May 23 2009
Should use Google Reader API so there is no need for local feed store

Using NLTK to deliver the top words
   top phrase might be interesting but interaction would be more productive
   a tag cloud or heat map

So it goes like this now:
   1. Source Reader for certain feeds
   2. Produce a cloud of terms
   3. Click on term in cloud to view feed entries

Monday, May 18 2009
Going to take the oldest feed file and update it
   Use a python list of objects corresponding to the most primitive RSS 2.0 feed:
      [((channel:title)([(item:title, item:link)]))]
   or hierarchally:
      channel:title
         item
            title
            link
         item
            title
            link
         ...

Could be a little more general and use sets:
   Oldest feed will be a subset of newer so we just pickle the union