Use JavaScript to download webpages

Wed 27 February 2013
By balrok

This article is about how I used JavaScript to download pages from real estate websites, parse the data and put it into a table. The whole process runs on the client side, mostly to have some fun with JavaScript.

The webpage

I've uploaded the page to balrok.com/wohnung. Wohnung is the German word for apartment, in case you wonder :). I have only tested it with Firefox 8 and Opera 12. It works like this: you enter the address of a page on a real estate website (currently only immobilienscout24 and wohnungsmarkt24.de are supported) and click the button next to it. The script then downloads the page and fills in the form fields below. If the website isn't supported, you can still fill in the fields by hand. Once you click save, the data is stored in a table where it can be sorted or edited.

The downloading

For me this was the most interesting part. As you may know, for security reasons Ajax requests only work against the server the page itself came from (the same-origin policy). The classic way to get data from other servers anyway is JSONP. Compared to a normal Ajax request, the difference is that you add a ?callback=javascriptFunction parameter to the URL. The server then answers with a script that calls javascriptFunction with the data as its argument, so the data has to be in JSON format (anything else would not be valid JavaScript and never reaches your code). To get a foreign website in JSON format there are two possibilities. You can write your own server-side script which downloads a given URL (passed as a parameter) and returns the page's content in JSON format (with all the escaping). Or you can use Yahoo's query language (YQL), which lets you run SQL-like queries on webpages and specify the return format. The wohnung script uses the latter and looks like this: [gist]https://gist.github.com/balrok/5047151[/gist] With jQuery you can set the dataType property to "jsonp", which takes care of the callback function automatically.
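To make the JSONP mechanism a bit more concrete, here is a minimal sketch of both halves: building the request URL with the callback parameter, and the kind of response a JSONP-capable server sends back. The names buildJsonpUrl, toJsonpResponse and handleData as well as the example.com endpoint are made up for illustration; this is not the code from the gist.

```javascript
// Client side: append the parameters plus callback=<name> to the URL.
function buildJsonpUrl(baseUrl, params, callbackName) {
  var parts = [];
  for (var key in params) {
    if (Object.prototype.hasOwnProperty.call(params, key)) {
      parts.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
    }
  }
  parts.push("callback=" + encodeURIComponent(callbackName));
  var separator = baseUrl.indexOf("?") === -1 ? "?" : "&";
  return baseUrl + separator + parts.join("&");
}

// Server side (hypothetical): wrap the JSON payload in a call to the callback.
function toJsonpResponse(callbackName, payload) {
  return callbackName + "(" + JSON.stringify(payload) + ");";
}

var url = buildJsonpUrl(
  "http://example.com/fetch",          // assumed endpoint, not a real service
  { url: "http://example.org/" },
  "handleData"
);
// url: http://example.com/fetch?url=http%3A%2F%2Fexample.org%2F&callback=handleData

var body = toJsonpResponse("handleData", { html: "<p>hi</p>" });
// body: handleData({"html":"<p>hi</p>"});
```

In the browser the URL is loaded through a <script> tag, so the response is executed as JavaScript and handleData receives the already parsed object.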

Extracting the data

After downloading, I look at the hostname and decide which parser to use. [gist]https://gist.github.com/balrok/5047168[/gist] The parsing is mostly simple jQuery selector usage plus some formatting of the data (for example, from "ca. 50m²" the 50 has to be extracted, and images should be rescaled to maximum size). One peculiarity is that you select with jQuery("#testdiv", html), where the second parameter is the HTML of the downloaded website. I first experimented with temporarily inserting the website into my own page, but that brings big security issues with it, since all of its <script> tags would then be executed.
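The two ideas from this section, choosing a parser by hostname and cleaning up a value like "ca. 50m²", can be sketched roughly as follows. This is only an illustration and not the code from the gist: the function names, the hostnames used as keys and the dummy parser bodies are assumptions, and parseSize assumes a decimal comma without thousands separators.

```javascript
// Hypothetical parsers: the real ones would run jQuery selectors on the HTML.
var parsers = {
  "immobilienscout24.de": function (html) { return { source: "immobilienscout24" }; },
  "wohnungsmarkt24.de": function (html) { return { source: "wohnungsmarkt24" }; }
};

// Pick a parser by hostname, ignoring a leading "www.".
// Returns null when the site is not supported.
function pickParser(pageUrl) {
  var hostname = new URL(pageUrl).hostname.replace(/^www\./, "");
  return Object.prototype.hasOwnProperty.call(parsers, hostname) ? parsers[hostname] : null;
}

// Extract the number from strings like "ca. 50m²" or "62,5 m²".
// Returns null when no size is found.
function parseSize(text) {
  var match = /(\d+(?:[.,]\d+)?)\s*m/.exec(text);
  return match ? parseFloat(match[1].replace(",", ".")) : null;
}
```

Returning null for an unknown hostname matches the behaviour described earlier: when the website isn't supported, the form simply stays empty and can be filled in by hand.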

Persistence

I used this project to experiment a bit with localStorage, but there's not much to write about it. The only thing perhaps: for editing and deleting entries I needed a unique id. For simplicity's sake I just used the array index as the id. So when deleting I had to set the value at this index to "undefined", and when loading from localStorage I have to strip all undefined values first. I also had to use a JSON library for loading and saving.
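A minimal sketch of this scheme could look like the following. It is not the actual wohnung-data.js: the storage key and the function names are invented, and the storage object is passed in so that the sketch also runs outside the browser (in the page you would pass window.localStorage). One detail worth knowing is that JSON.stringify writes undefined array slots as null, so the loader strips both.

```javascript
var STORAGE_KEY = "wohnung-entries"; // assumed key name

// In-memory stand-in offering the two localStorage methods used below.
function createMemoryStorage() {
  var data = {};
  return {
    getItem: function (key) {
      return Object.prototype.hasOwnProperty.call(data, key) ? data[key] : null;
    },
    setItem: function (key, value) { data[key] = String(value); }
  };
}

// Deleting keeps the other ids stable by leaving a hole at that index.
function deleteEntry(entries, id) {
  entries[id] = undefined;
}

// localStorage only stores strings, so the array is serialized as JSON.
function saveEntries(storage, entries) {
  storage.setItem(STORAGE_KEY, JSON.stringify(entries));
}

// Load the array and strip the holes (null after the JSON round trip).
function loadEntries(storage) {
  var raw = storage.getItem(STORAGE_KEY);
  var entries = raw ? JSON.parse(raw) : [];
  return entries.filter(function (entry) { return entry != null; });
}
```

Stripping the holes on load means the ids are only stable within one page visit; after a reload the remaining entries are renumbered from 0.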

Code Architecture

This is still one of my biggest problems with JavaScript. While I like the language, I often end up with one big, hard-to-maintain JavaScript file. In this project I tried to reduce the mental load by finding good ways to structure the code:

  • wohnung-data.js: persistence, mainly loading and saving (theoretically also performing actions on the data and id management)
  • wohnung.js: first global helpers, then html-interaction, then site-specific functions

To explain "global helpers": those are functions which could theoretically be used in other projects too, since they are not specific to this page. "html-interaction" means event listeners (for example click events). And "site-specific functions" are project-specific functions which only make sense for this page.

Some future scenarios I have already thought about:

  • Adding more parsers for different real estate sites: probably make it OOP and create an extra file for all kinds of parsers
  • Making attributes configurable (like removing the "size" attribute or adding a "street" attribute, also formulas). The configurable part would be a set of HTML events which could be split into another file, and then I would need to find a generic way inside the parsers and the table creation

Used libraries

Besides the downloading and architecture I also found it great to use different libraries. So here is a (hopefully) complete list of libraries used:

  • Yii - only used, because I wanted to integrate it into my balrok.com site
  • Twitter Bootstrap - for a visually appealing design
  • jQuery - the library which made using js fun!
  • colorbox - a nice, simple lightbox for jQuery
  • stupidtable - a table sorter that is not, as the name suggests, stupid. The author explicitly wants to keep it minimal by not providing many features
  • JSON - for persistence inside localStorage, where only strings are allowed

The code

Currently the code is only available by viewing the source of the page. Nothing is minified, so this shouldn't be a big problem.
