Fixed address parsing and page extraction #3

Open
TheRadialActive wants to merge 3 commits into CodeforKarlsruhe:master from TheRadialActive:master

@TheRadialActive

No description provided.

@torfsen (Member) left a comment

Nice work! There are some minor things that I'd like to see fixed or that I don't understand; otherwise this is great. However, I still find error messages like the following in my logs when I run this:

Traceback (most recent call last):
  File "./scrape.py", line 409, in <module>
    get_new_listings(db)
  File "./scrape.py", line 352, in get_new_listings
    listings = extract_listings(page)
  File "./scrape.py", line 159, in extract_listings
    street_span = entry.find('div', class_='result-list-entry__address').find('span').contents[0]
AttributeError: 'NoneType' object has no attribute 'contents'

Looks like we either need to check whether that element we're looking for is really there (in case it's optional) or we need to fix the way we're looking for it (in case it's always there but our selector sometimes fails).
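For the first option (the element is optional), a minimal sketch of a defensive lookup could look like this. The helper name `extract_street_span` is hypothetical and not part of scrape.py; it assumes `entry` is a BeautifulSoup tag, as in the traceback above.

```python
def extract_street_span(entry):
    """Return the first child of the address <span>, or None if it is missing.

    Hypothetical helper: `entry` is assumed to be a BeautifulSoup tag for
    one search result, as used in extract_listings().
    """
    address_div = entry.find('div', class_='result-list-entry__address')
    if address_div is None:
        # The whole address block is absent for this entry.
        return None
    span = address_div.find('span')
    if span is None or not span.contents:
        # The block exists but has no (non-empty) <span> inside.
        return None
    return span.contents[0]
```

The caller would then skip the entry (or log it) when the helper returns `None`, instead of crashing the whole run.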

Comment thread scrape.py
@@ -47,7 +47,7 @@
# Immobilienscout24 URLs for listings in Karlsruhe
BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe'
PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true'

I guess these URLs need to be changed as well? Ideally one could set just the city using a single variable. Perhaps we can use a generic search URL (if one exists)?
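One way to derive both URLs from a single city setting, as a rough sketch. It only reuses the URL pattern from the two constants above; the names `SEARCH_ROOT`, `STATE`, `base_url` and `page_url` are assumptions, and the pattern still needs the federal state in addition to the city. Whether a state-independent search URL exists is not something this sketch answers.

```python
# Assumed names; the URL layout is copied from BASE_URL and PAGE_URL above.
SEARCH_ROOT = 'http://www.immobilienscout24.de/Suche/S-T'
STATE = 'Baden-Wuerttemberg'
CITY = 'Karlsruhe'


def base_url(city=CITY, state=STATE):
    """URL of the first result page for the given city."""
    return '%s/Wohnung-Miete/%s/%s' % (SEARCH_ROOT, state, city)


def page_url(page_index, city=CITY, state=STATE):
    """URL of result page `page_index` for the given city."""
    return '%s/P-%d/Wohnung-Miete/%s/%s?pagerReporting=true' % (
        SEARCH_ROOT, page_index, state, city)
```

With the defaults, `base_url()` and `page_url(n)` reproduce the current hard-coded values.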

Comment thread scrape.py
dd = dl.find('dd')
content = unicode(dd.string).strip()
-if content.endswith('€'):
+if content.endswith(' €'):

The new check covers fewer cases than the old one (it requires an additional space). Why is that necessary?

Comment thread scrape.py
if content.endswith(' €'):
rent = parse_german_float(content.split()[0])
-elif content.endswith('m²'):
+elif content.endswith(' m²'):

Same as above.
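To illustrate the point for both suffix checks: a sketch that accepts the unit with or without a preceding space by stripping the suffix instead of splitting on whitespace. Both functions here are hypothetical; in particular, this `parse_german_float` is only a stand-in for the real one in scrape.py, whose implementation is not shown in this diff.

```python
# -*- coding: utf-8 -*-

def parse_german_float(s):
    """Stand-in for the real helper: '1.234,56' -> 1234.56."""
    return float(s.replace('.', '').replace(',', '.'))


def parse_value(content):
    """Return ('rent', value) or ('area', value), or None for other text.

    Works for '650 €' as well as '650€', because the unit is cut off
    the end and the remainder is stripped.
    """
    content = content.strip()
    for unit, key in ((u'€', 'rent'), (u'm²', 'area')):
        if content.endswith(unit):
            number = content[:-len(unit)].strip()
            return key, parse_german_float(number)
    return None
```

Note that the existing `content.split()[0]` only isolates the number when there is whitespace before the unit, so a check without the space would need a change like this one to be safe.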

Comment thread scrape.py
'rent': rent,
'area': area,
}
+print(listings)

Right now the scraper doesn't output anything to STDOUT; all output goes into the database or into the log file. I'd like to keep it that way for the moment.
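If the output is useful while debugging, it could go through the logger instead of `print`. A small sketch, assuming a module-level `logger` like the one already used for "Fetching page" messages; the function name `log_listings` is made up for illustration.

```python
import logging

# Assumption: scrape.py configures a logger like this elsewhere.
logger = logging.getLogger(__name__)


def log_listings(listings):
    """Write the extracted listings to the log at DEBUG level."""
    logger.debug("Extracted %d listings: %r", len(listings), listings)
```

At the default log level nothing is emitted, so STDOUT stays clean and the log file only grows when DEBUG is enabled.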

Comment thread scrape.py
logger.info("Fetching page %d" % page_index)
page = get_page(page_index)
-num_pages = num_pages or extract_number_of_pages(page)
+num_pages = extract_number_of_pages(page)

Why is this change necessary?
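For context, the practical difference between the two lines is how often the page count gets extracted. The sketch below uses stand-ins only: the loop shape and the stub `extract_number_of_pages` are assumptions made for illustration, not the actual code in scrape.py.

```python
calls = []


def extract_number_of_pages(page):
    """Stand-in that records each call and reports three pages."""
    calls.append(page)
    return 3


def count_extractions(cache):
    """Walk all pages and return how often the page count was extracted."""
    del calls[:]
    num_pages = None
    page_index = 1
    while num_pages is None or page_index <= num_pages:
        page = 'page-%d' % page_index  # stand-in for get_page(page_index)
        if cache:
            # Old line: `or` short-circuits once num_pages is set.
            num_pages = num_pages or extract_number_of_pages(page)
        else:
            # New line: re-extract on every page.
            num_pages = extract_number_of_pages(page)
        page_index += 1
    return len(calls)
```

In this toy setup `count_extractions(True)` extracts once and `count_extractions(False)` extracts on every page, so the new form only pays off if the page count can change (or be wrong) on the first page.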

@torfsen self-assigned this Apr 6, 2017
Comment thread scrape.py
BASE_URL = 'http://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe'
PAGE_URL = 'http://www.immobilienscout24.de/Suche/S-T/P-%d/Wohnung-Miete/Baden-Wuerttemberg/Karlsruhe?pagerReporting=true'

+CITY = 'Karlsruhe'

And can we please make this a command line parameter while we're at it?
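A possible shape for that, sketched with `argparse` from the standard library. The `--city` flag name and the `parse_args` wrapper are assumptions; the default keeps today's behaviour.

```python
import argparse


def parse_args(argv=None):
    """Parse command line options; `argv=None` means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Scrape rental listings for one city.')
    parser.add_argument(
        '--city', default='Karlsruhe',
        help='city name as it appears in the search URL '
             '(default: %(default)s)')
    return parser.parse_args(argv)
```

The main block could then call `args = parse_args()` and use `args.city` wherever `CITY` is used now.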

