The task of the crawler is to keep feeding information from the internet into the search engine's database. It literally crawls the internet from page to page, link by link, downloading all the information to the database. A search engine is made up of basically four parts:
- Web Crawler
- Database
- Search Algorithm
- A search system that binds all of the above together
For more information on crawlers, visit the Wikipedia page on web crawlers.

Crawler development can be planned out in phases, as we will be doing:
- To begin with, we will develop a very trivial crawler that crawls only the URL spoon-fed to it.
- Then we will give the crawler the capability to extract URLs from the downloaded web page.
- Next we will add a queue system to the crawler that tracks the number of URLs still to be downloaded.
- We will then add the capability to extract only the user-visible text from the web page.
- After that we will build a multi-threaded downloader that utilizes our network bandwidth to the maximum.
- Finally, we will add some kind of front end to it, probably in PHP.
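The queue system from the third step above can be sketched ahead of time. The class and method names here are my own placeholders, not part of the final crawler: a queue of URLs still to be fetched, plus a set of URLs already seen so nothing is downloaded twice.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// A minimal crawl frontier sketch: tracks pending URLs and avoids duplicates.
public class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    // Enqueue a URL only if it has not been seen before.
    public boolean add(String url) {
        if (seen.add(url)) {
            queue.add(url);
            return true;
        }
        return false;
    }

    // Next URL to download, or null when the queue is empty.
    public String next() {
        return queue.poll();
    }

    // Number of URLs still to be downloaded.
    public int pending() {
        return queue.size();
    }

    public static void main(String[] args) {
        Frontier f = new Frontier();
        f.add("http://example.com/");
        f.add("http://example.com/");      // duplicate, ignored
        f.add("http://example.com/about");
        System.out.println(f.pending());   // prints 2
    }
}
```

A real crawler would also normalize URLs (trailing slashes, fragments) before adding them, otherwise the same page slips past the duplicate check under different spellings.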
In this part of the article we will build a simple Java crawler that crawls a single page on the internet. NetBeans is used for the crawler development, and the database will be implemented in MySQL. Create a new project in NetBeans and save it under a name like "WebC" or "w1". By default there will be a class called Main.java in the default package of the project. Write the following code in its main() method. We will build on this class and add new classes once we get going.
package net.viralpatel.java.webcrawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Main {

    public static void main(String[] args) {
        try {
            // Open a stream to the page and print its HTML line by line.
            URL my_url = new URL("http://www.vimalkumarpatel.blogspot.com/");
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(my_url.openStream()));
            String strTemp;
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
            br.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Voilà, there is your first baby crawler :) Watch the output when you first run it; on a successful run it will show you the HTML code for the web page 'www.vimalkumarpatel.blogspot.com'.
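With the page HTML in hand, the second phase of our plan, extracting URLs from the downloaded page, can already be sketched. This is only a rough illustration using a regular expression on href attributes (the class name is my own); a proper HTML parser is the better tool, as we will see later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of link extraction; a real crawler should use an HTML parser.
public class LinkExtractor {

    // Matches double-quoted href="..." attributes, case-insensitively.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the URL inside the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/\">example</a>";
        System.out.println(extractLinks(page)); // prints [http://example.com/]
    }
}
```

Note this naive pattern misses single-quoted and unquoted attributes and does not resolve relative URLs against the page's base URL; both are jobs for the parser-based version.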
Troubleshooting the Web Crawler
It may give some hiccups or stumble upon errors, most probably network errors related to the proxy settings of NetBeans and the JVM. In that case you can change the proxy IP and port for NetBeans at Tools >> Options >> General >> Proxy Settings. You may also need to pass the same settings to the JVM on the command line; in NetBeans this is done at File >> 'w1' Properties >> Run >> VM Options. Write the following in the text box there:
-Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<port for the same>

For example:

-Dhttp.proxyHost=172.16.3.1 -Dhttp.proxyPort=3128

Future Work
Keep visiting this site for the next article, following soon, in which we will discuss improvements to our crawler along the plan we chalked out earlier. Also watch out for an article on how to integrate the Eclipse IDE with Google Android's ADT for Android application development.
Comments
Cool, I'm looking forward to seeing the thread implementation.
This is the first step for the crawler, but you still need to add an HTML parser to extract the meaningful, "readable" data from the "code" data... From there, you can play with search algorithms, indexing, etc. That's what Google's bots do: they save the entire output, like your code snippet does, as a cache and then apply the other techniques... Others also use this technique for web scraping or data harvesting, which may be illegal depending on the terms and conditions of a given website...
Marcello de Sales
Nice starter guide. In the next part you could perhaps add libxml for HTML parsing and URL rebuilding, and also robots.txt exclusion rules.
Very interesting! But I can't find the next part... Is it already written?
Hi, I'm working on a similar project; my aim is to build a high-capacity web crawler. I just wanted to ask what the average speed in links checked per second would be for a fast crawler. What I built is a MySQL-based crawler, and the maximum I reached is 10 checked links per second with an ArrayList-based loop in the Java code; with a MySQL retrieving loop the speed is 2 checked links per second.
How do you parse content that is rendered by JavaScript after load? Thanks, Arnab
Please provide the next tutorial on this topic.
Hey, I just wanted to ask if anyone could give me a good starting point for developing a web crawler to crawl the 'deep web', basically to perform deep searching. I was just wondering if anyone had any idea about this aspect of web crawling. If so, any guidance on a good starting point for deep web crawler development, a good website to refer to for more information, or a good book to read for this task would be highly appreciated!
This is something I have been looking for for days. Thanks a lot.
Is there any way we can develop the web crawler to visit all websites that contain a given item, maybe a book or any sporting goods? Please post a tutorial if we can build that in less than 100 lines of code.
Thanks again
have you already posted the second part?