How to Write a Web Crawler in Java, Part 1

A web crawler's task is to continuously pull information from the internet into the search engine's database. It literally crawls over the internet from page to page, link by link, downloading the information it finds. A search engine is made up of basically four parts:
  1. Web Crawler
  2. Database
  3. Search Algorithm
  4. Search system that binds all the above together
For more information on crawlers, visit the Wikipedia page on web crawlers. Development of the crawler can be planned out in phases, as we will be doing:
  1. To begin with, we will develop a very trivial crawler that will just crawl the URL spoon-fed to it.
  2. Then we will give the crawler the ability to extract URLs from the downloaded web page (see the sketch right after this list).
  3. Next we will add a queue system to the crawler that tracks the number of URLs still to be downloaded.
  4. We will then add the ability to extract only the user-visible text from a web page.
  5. After that we will build a multi-threaded downloader that utilizes our network bandwidth to the maximum.
  6. Finally, we will add some kind of front end to it, probably in PHP.
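As a preview of phase 2, here is a minimal sketch of pulling URLs out of a downloaded page with a regular expression. The class name and the pattern are my own illustration, not part of this article's code, and a regex is only a stopgap until a real HTML parser is wired in:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Naive pattern: grabs absolute http(s) URLs inside href="..." attributes.
    private static final Pattern HREF =
            Pattern.compile("href=[\"'](https?://[^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    public static List<String> extractUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.add(m.group(1)); // group(1) is the URL between the quotes
        }
        return urls;
    }
}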
In this part of the article we will build a simple Java crawler that crawls a single page on the internet. NetBeans is used for development, and the database will later be implemented in MySQL. Create a new project in NetBeans and name it something like “WebC” or “w1”. By default the project's default package contains a class called Main.java. Write the following code in its main() method; we will extend this class and add new classes as we go.
package net.viralpatel.java.webcrawler;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

/**
 * @author vimal
 */
public class Main {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            // Open a stream to the page we want to crawl.
            URL my_url = new URL("http://www.vimalkumarpatel.blogspot.com/");
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(my_url.openStream()));

            // Print the raw HTML of the page, line by line.
            String strTemp;
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Voila, there is your first baby crawler :) Watch the output when you run it: on a successful run it will print the HTML source of the page ‘www.vimalkumarpatel.blogspot.com’.

Troubleshooting the Web Crawler

The crawler may hiccup or stumble over some errors, most probably network errors related to the proxy settings of NetBeans and the JVM. In that case, set the proxy IP and port for NetBeans under Tools >> Options >> General >> Proxy Settings. You may also need to pass the same settings to the JVM on the command line: in NetBeans go to File >> ‘w1’ Properties >> Run >> VM Options and enter the following in the text box there. -Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<port for the same> Example: -Dhttp.proxyHost=172.16.3.1 -Dhttp.proxyPort=3128
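If you would rather not depend on IDE settings, the same two properties can be set from code at the top of main(), before openStream() is called. This is the standard JVM proxy mechanism, equivalent to the -D flags above; the host and port below are just the example values from this article:

// Equivalent to -Dhttp.proxyHost / -Dhttp.proxyPort on the command line.
// Must be set before the first HTTP connection is opened.
System.setProperty("http.proxyHost", "172.16.3.1"); // example proxy IP
System.setProperty("http.proxyPort", "3128");       // example proxy port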

Future Work

Keep visiting this site for the next article, coming soon, in which we will discuss improvements to our crawler along the plan we chalked out earlier. Also watch out for an article on how to integrate the Eclipse IDE with Google Android's ADT for Android application development.
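To give a taste of where that plan is headed, here is a minimal sketch of phases 2 and 4 using the open-source jsoup HTML parser. Choosing jsoup is my own assumption; this article has not committed to a specific parser:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupPreview {

    public static void main(String[] args) throws Exception {
        // Fetch and parse the page; jsoup tolerates malformed HTML.
        Document doc = Jsoup.connect("http://www.vimalkumarpatel.blogspot.com/").get();

        // Phase 4: only the user-visible text, with all tags stripped.
        System.out.println(doc.text());

        // Phase 2: the absolute URL of every link on the page.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href"));
        }
    }
}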


21 Comments

  1. Coolll, i\’m looking forward to see the thread implementation.

  2. This is the first step for the crawler, but you still need to add an HTML parser to extract the meaningful, “readable” data from the “code” data… From there, you can play with search algorithms, indexing, etc… Google's bots do the same: they save the entire output, like your code snippet does, as a cache and then apply the other techniques… Others also use this technique for web scraping or data harvesting, which may be illegal depending on the Terms and Conditions of a given website…

    Marcello de Sales

  3. spiderwick says:

    Nice starter guide. In the next part you could probably add libxml for HTML parsing and URL rebuilding, and also robots.txt exclusion rules.
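
    The robots.txt idea is worth sketching early. A rough cut, which honors only the Disallow lines in the User-agent: * section and ignores Allow rules, wildcards, and crawl delays (the class name and helpers are my own invention, not part of the article's code):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsRules {

        // Collect the Disallow path prefixes from the "User-agent: *" section.
        public static List<String> disallowedPaths(String host) throws Exception {
            List<String> rules = new ArrayList<>();
            URL robots = new URL("http://" + host + "/robots.txt");
            BufferedReader br = new BufferedReader(new InputStreamReader(robots.openStream()));
            String line;
            boolean inStarSection = false;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.regionMatches(true, 0, "User-agent:", 0, 11)) {
                    inStarSection = line.substring(11).trim().equals("*");
                } else if (inStarSection && line.regionMatches(true, 0, "Disallow:", 0, 9)) {
                    String path = line.substring(9).trim();
                    if (!path.isEmpty()) rules.add(path);
                }
            }
            br.close();
            return rules;
        }

        // A URL may be fetched only if its path starts with none of the prefixes.
        public static boolean isAllowed(URL url, List<String> rules) {
            for (String prefix : rules) {
                if (url.getPath().startsWith(prefix)) return false;
            }
            return true;
        }
    }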

  4. Sheppounet says:

    Very interesting!
    But I can't find the next part… Is it already written?

  5. justonefix says:

    Hi, I'm working on a similar project; my aim is to build a high-capacity web crawler. I just wanted to ask: what would be the average number of links checked per second for a fast crawler? What I built is a MySQL-based crawler. The maximum I reached is 10 checked links per second with an ArrayList-based loop in the Java code; with a MySQL retrieval loop the speed is 2 checked links per second.

  6. Arnab says:

    How do you parse content that is rendered by JavaScript after page load? Thanks, Arnab

  7. amit says:

    Please provide the next tutorial on this topic.

    • Nikhil says:

      Hey, I just wanted to ask if anyone could give me a good starting point for developing a web crawler to crawl the ‘deep web’, basically to perform deep searching. Does anyone have any experience with this aspect of web crawling? If so, any guidance on a good starting point for deep-web crawler development, a good website to refer to for more information, or a good book to read for this task would be highly appreciated!

  8. darren says:

    This is something I have been looking for for days. Thanks a lot.

  9. darren says:

    Is there any way we can develop the web crawler to visit all websites that contain a given item, say a book or some sporting goods? Please post a tutorial if we can build that in less than 100 lines of code.
    Thanks again

  10. adcha says:

    Have you already posted the second part?

  11. ravi says:

    Hi, I am Ravi Sankar.
    I have to write a Java application which gives me all the web pages available as part of a website. Example: if I give http://www.google.com/, I should be able to get all the web pages on the Google site. Can anyone help me? Thanks in advance.

  12. saleh najafzadeh says:

    Java web crawler source code?

  13. Asya_K says:

    Is it possible to combine Java and PHP code to make a web crawler? Please help.

  14. Jeremy says:

    Where can I find the other parts of this guide?

  15. Where is part 2?
    I want to auto-download all images from this site: http://gaga.vn

  16. Dhillip kumar says:

    Hi friends, I want to build a web crawler into my web page but I don't have any idea how. Please tell me how to build the web crawler.

  17. I’m assuming you aren’t going to make any more parts to this? I was really into this.

  18. benglish says:

    Hi,
    I just ran the code, but I receive the following error. I have set the proxy to the system proxy. Could you please help me?

    java.net.ConnectException: Connection timed out: connect
    at java.net.DualStackPlainSocketImpl.connect0(Native Method)
    at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:69)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:157)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
    at java.net.Socket.connect(Socket.java:578)
    at java.net.Socket.connect(Socket.java:527)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:422)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:517)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:204)
    at sun.net.www.http.HttpClient.New(HttpClient.java:301)
    at sun.net.www.http.HttpClient.New(HttpClient.java:319)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:998)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:934)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:852)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1302)
    at java.net.URL.openStream(URL.java:1038)
    at webcrawlertest.WebcrawlerTest.main(WebcrawlerTest.java:29)

    • wonderq says:

      Hi, I have encountered the same problem. Has your problem been solved?

  19. sunny says:

    Nice article. I have been searching for this on the internet for a long time. Thanks a lot for sharing such a valuable post.
