Most effective way to parse a website
Posted by Steven F, 11-12-2011, 11:35 AM |
I'm looking for the most effective way and language to do the following (it's basically a crawler):
-Visit URL
-Parse URL
-Gather some basic information
-Save that information to database
The only thing is that this is going to be on a large scale. I want to run it once every 1 - 4 hours, and it will be visiting 3,000 - 5,000 URLs. I'm currently using Java (JSoup), but it returns errors often and is very slow (1.5 sites per second on average). Any ideas? My goal is to be able to do around 6 sites per second.
Preferred languages:
-Java
-PHP
-C
-Python
-Perl
Last edited by Steven F; 11-12-2011 at 11:42 AM.
|
Posted by webstartavenue, 11-12-2011, 09:43 PM |
Generally speaking, code written in Java or C will be faster and consume less memory than code written in PHP, Python, or Perl (see the Programming Language Shootout for examples).
Have you benchmarked your code to determine that the greatest bottleneck is indeed the HTML parsing? It could also be the database (large number of writes) or the webpage retrieval (HTTP roundtrip + download time).
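One way to check this is to time each stage separately and see which one dominates. The following is only a minimal sketch using the Java standard library: `StageTimer` is a hypothetical helper, and the `sleep` calls are stand-ins for the real fetch, parse, and database code, which is not shown in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper (not from any library): accumulates wall-clock time per named stage.
class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    // Runs the work, adds its elapsed time to the stage's running total, and returns the result.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds recorded for a stage, or 0 if the stage was never timed.
    long nanosFor(String stage) {
        return totalNanos.getOrDefault(stage, 0L);
    }

    // Name of the stage with the largest accumulated time, or null if nothing was timed.
    String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : totalNanos.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    // Stand-in for real work; the durations used below are made up for illustration.
    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        for (int i = 0; i < 3; i++) {
            // Replace each lambda body with the real fetch, parse, and save calls.
            String html = timer.time("fetch", () -> { sleep(20); return "<html></html>"; });
            Integer length = timer.time("parse", () -> { sleep(2); return html.length(); });
            timer.time("save", () -> { sleep(5); return length; });
        }
        for (String stage : new String[] {"fetch", "parse", "save"}) {
            System.out.println(stage + ": " + timer.nanosFor(stage) / 1_000_000 + " ms");
        }
        System.out.println("slowest stage: " + timer.slowestStage());
    }
}
```

If the fetch stage turns out to dominate, swapping the HTML parser is unlikely to help much.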
If the HTML parsing is indeed the bottleneck, the most performant XML/HTML parsers typically wrap the libxml2 library. For Java, that looks like it might be available via libxmlj in the Classpath Project.
Given that you are already working in Java you might also try looking at other HTML parsing libraries to see if you can get some more performance:
-NekoHTML
-jTidy
-HtmlCleaner
|
Posted by breton, 11-13-2011, 04:19 AM |
Try Python with BeautifulSoup for your case.
|
Posted by zahid_r_i, 11-15-2011, 12:48 AM |
I used the HTMLAgilityPack (C#) on a project for a previous employer and it was pretty effective.
|
Posted by TailoredVPS, 11-18-2011, 12:13 PM |
I would recommend using Java or C and running your code on multiple threads rather than a single thread.
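As a rough illustration of the multi-threaded approach in Java, here is a minimal sketch built on a fixed-size thread pool from `java.util.concurrent`. The class name `ParallelCrawl`, the `visit` function, and the pool size are assumptions made for the example. The real fetch-and-parse code (for instance a JSoup call) would go where the stand-in lambda is. Whether this gets from 1.5 to around 6 sites per second depends on how much of each visit is spent waiting on the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: visits many URLs concurrently on a fixed-size thread pool.
class ParallelCrawl {

    // Applies visit to every URL using the given number of threads.
    // Results come back in input order; a visit that fails is recorded as null.
    static List<String> crawl(List<String> urls, int threads, Function<String, String> visit) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> visit.apply(url);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(null); // one bad site should not abort the whole run
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    results.add(null);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        // Stand-in visit function: sleeps instead of doing a real HTTP request and parse.
        List<String> results = crawl(urls, 8, url -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "visited " + url;
        });
        results.forEach(System.out::println);
    }
}
```

Recording failures as null keeps one unreachable site from stopping the batch. A real crawler would probably also want per-request timeouts and some logging of what failed.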
|
Posted by Preetam, 11-18-2011, 03:37 PM |
I'll suggest Node.js and CouchDB if you're feeling adventurous.
|
Posted by Ersan, 11-18-2011, 04:05 PM |
It really depends on which language your strengths lie in. I tend to use PHP and DOM to parse websites in my programs if I can get away with it because I know how to write pretty efficient PHP code.
C would have the least overhead if you are a capable C programmer.
|