Fixed adress parsing and page extraction by TheRadialActive · Pull Request #3 · CodeforKarlsruhe/mietmap-scraper

TheRadialActive · 2017-04-02T11:54:51Z

No description provided.

torfsen

Nice work! There are some minor things that I'd like to see fixed or that I don't understand; otherwise this is great. However, I still find some error messages in my logs when I run this:

Traceback (most recent call last):
  File "./scrape.py", line 409, in <module>
    get_new_listings(db)
  File "./scrape.py", line 352, in get_new_listings
    listings = extract_listings(page)
  File "./scrape.py", line 159, in extract_listings
    street_span = entry.find('div', class_='result-list-entry__address').find('span').contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'

Looks like we either need to check whether that element we're looking for is really there (in case it's optional) or we need to fix the way we're looking for it (in case it's always there but our selector sometimes fails).
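If the element turns out to be optional, a guarded lookup along the following lines might be enough. This is only a sketch: `extract_street` is a hypothetical helper name, and it assumes `entry` is the BeautifulSoup tag that `extract_listings` already iterates over. The traceback suggests that it is the inner `find('span')` that returns `None`, so both lookups are checked.

```python
def extract_street(entry):
    """Return the first child of the address span, or None if it is missing.

    Hypothetical helper; `entry` is assumed to be a BeautifulSoup tag
    for one search result, as used in extract_listings().
    """
    address_div = entry.find('div', class_='result-list-entry__address')
    if address_div is None:
        # The address block itself is absent from this entry.
        return None
    span = address_div.find('span')
    if span is None or not span.contents:
        # The block exists but has no (non-empty) span inside it.
        return None
    return span.contents[0]
```

The caller could then skip or log entries for which this returns `None` instead of aborting the whole run.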

torfsen · 2017-04-06T15:42:24Z

@@ -47,7 +47,7 @@
 # Immobilienscout24 URLs for listings in Karlsruhe
 BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe'
 PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true'


I guess these URLs need to be changed as well? Ideally one could set just the city using a single variable. Perhaps we can use a generic search URL, if one exists?
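One way to derive both URLs from a single setting is sketched below. It rests on an unverified assumption: that other cities use the same path layout as the two Karlsruhe URLs above, which also means the federal state has to be supplied. `SEARCH_URL`, `STATE`, `base_url` and `page_url` are hypothetical names.

```python
# Assumption: the path layout of the Karlsruhe URLs also works for other
# cities. This has not been checked against the site.
STATE = 'Baden-Wuerttemberg'
CITY = 'Karlsruhe'

SEARCH_URL = ('http://www.immobilienscout24.de/Suche/S-T/'
              '{page}Wohnung-Miete/{state}/{city}')


def base_url(city=CITY, state=STATE):
    """URL of the first result page for the given city."""
    return SEARCH_URL.format(page='', state=state, city=city)


def page_url(page_index, city=CITY, state=STATE):
    """URL of result page number `page_index` for the given city."""
    url = SEARCH_URL.format(page='P-%d/' % page_index, state=state, city=city)
    return url + '?pagerReporting=true'
```

With the defaults, `base_url()` and `page_url(n)` reproduce the current `BASE_URL` and `PAGE_URL % n`.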

torfsen · 2017-04-06T15:43:17Z

+            dd = dl.find('dd')
            content = unicode(dd.string).strip()
-            if content.endswith('€'):
+            if content.endswith(' €'):


The new check covers fewer cases than the old one (it requires an additional space). Why is that necessary?

torfsen · 2017-04-06T15:43:27Z

+            if content.endswith(' €'):
                rent = parse_german_float(content.split()[0])
-            elif content.endswith('m²'):
+            elif content.endswith(' m²'):


Same as above.
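For reference, a check that accepts the unit with or without a preceding space could look like the sketch below. `split_unit` is a hypothetical helper, not something in scrape.py. Note that the existing `content.split()[0]` only isolates the number when whitespace separates it from the unit, which may or may not be the reason for the stricter check.

```python
def split_unit(content, unit):
    """Return the text before `unit` if `content` ends with it, else None.

    Hypothetical helper: whitespace between the number and the unit
    is optional.
    """
    content = content.strip()
    if not content.endswith(unit):
        return None
    return content[:-len(unit)].strip()
```

The result could then be passed to `parse_german_float` in place of `content.split()[0]`.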

torfsen · 2017-04-06T15:44:27Z

            'rent': rent,
            'area': area,
        }
+        print(listings)


Right now the scraper doesn't output anything to STDOUT; all output goes into the database or into the log file. I'd like to keep it that way for the moment.
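If the output is useful for debugging, it could go through the logger instead of `print`. A minimal sketch, assuming a standard `logging` logger like the one the scraper already calls via `logger.info`; `log_listings` is a hypothetical name:

```python
import logging

logger = logging.getLogger(__name__)


def log_listings(listings):
    """Write the extracted listings to the log at DEBUG level."""
    # Lazy %-style arguments: the message is only formatted if DEBUG is enabled.
    logger.debug("Extracted %d listings: %r", len(listings), listings)
```

This keeps STDOUT clean and lets the log level decide whether the details are recorded.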

torfsen · 2017-04-06T15:45:53Z

            logger.info("Fetching page %d" % page_index)
            page = get_page(page_index)
-            num_pages = num_pages or extract_number_of_pages(page)
+            num_pages = extract_number_of_pages(page)


Why is this change necessary?

torfsen · 2017-04-06T15:58:55Z

 BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe'
 PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true'
-
+CITY = 'Karlsruhe'


And can we please make this a command line parameter while we're at it?
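Something along these lines would do, using `argparse` from the standard library. The `--city` flag name and the `parse_args` helper are only suggestions, and the default keeps the current behaviour:

```python
import argparse


def parse_args(argv=None):
    """Parse command line options; `argv=None` means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(description='Scrape rental listings.')
    # Hypothetical flag name; defaults to the city that is hardcoded today.
    parser.add_argument('--city', default='Karlsruhe',
                        help='city to search listings in (default: %(default)s)')
    return parser.parse_args(argv)
```

The script could then call `args = parse_args()` at startup and use `args.city` wherever `CITY` is used now, for example `./scrape.py --city Karlsruhe`.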

d2ns added 3 commits April 2, 2017 13:13

extract page list working again

1625175

address parsing fixed

506711a

added variable for hardcoded city

73a9463

torfsen requested changes Apr 6, 2017

torfsen self-assigned this Apr 6, 2017

torfsen reviewed Apr 6, 2017

TheRadialActive wants to merge 3 commits into CodeforKarlsruhe:master from TheRadialActive:master
