Programming an HTML Downloader in C++

Written By: Alwyn Malachi Berkeley

- 25 Sep 2006 -
















Description: This program requests the source code (HTML) of a webpage and then gives the user multiple options for viewing the web server's response. We will use the Winsock API to manipulate sockets on the system, and create reusable code along the way.

  1. Creating the Project
  2. main.cpp Headers and the Internet Namespace
  3. The Webpage Class
  4. The Internet Namespace's Functions
  5. GetDomain
  6. Custom_itoa
  7. GetWebpage
  8. SaveFile
  9. Internet.hpp and Writing main.cpp
  10. Compile and Conclusion

Defining the Webpage class

The class inside the namespace is called BasicWebpage. It is a simple class that encapsulates information related to a webpage. It is called "BasicWebpage" rather than just "Webpage" because we are striving to create reusable code. Although we have no need to inherit from the class here, naming it BasicWebpage and giving it a virtual destructor lets us reuse it in future projects where we might want to derive from it. This kind of abstraction is one of the things C++ is renowned for, and it is one of the main focuses of this article.
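
To make the point about the virtual destructor concrete, here is a minimal sketch that is separate from the project code. The names PageBase, CachedPage and liveObjectsAfterDelete are hypothetical and exist only for this illustration. It shows that when an object is deleted through a pointer to its base class, the derived destructor still runs because the base destructor is virtual.

```cpp
// Hypothetical stand-in for BasicWebpage: a base class meant to be inherited.
class PageBase {
public:
    virtual ~PageBase() {}          // virtual, so derived destructors also run
};

// Hypothetical derived class that tracks how many instances are alive.
class CachedPage : public PageBase {
public:
    explicit CachedPage(int &counterParam) : counter(counterParam) { ++counter; }
    ~CachedPage() { --counter; }    // runs even when deleted through a PageBase*
private:
    int &counter;                   // counts live CachedPage objects
};

// Creates a CachedPage, deletes it through a base pointer, and returns
// how many CachedPage objects are still alive afterwards.
int liveObjectsAfterDelete() {
    int liveObjects = 0;
    PageBase *page = new CachedPage(liveObjects);
    delete page;                    // well-defined only because ~PageBase is virtual
    return liveObjects;
}
```

Without the virtual keyword on the base destructor, the delete in this sketch would have undefined behavior, which is why a class intended for inheritance declares its destructor virtual up front.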

To create the header, click File > New > Source File on the menu. The IDE will ask whether you want to add the new source file to the project; click the Yes button. The code inside the header file that defines the BasicWebpage class is as follows:

#ifndef BASIC_WEBPAGE_HPP
#define BASIC_WEBPAGE_HPP
 
#include <string>
 
namespace Internet {
        class BasicWebpage {
                public:
                        BasicWebpage();
                        BasicWebpage(const BasicWebpage &WebpageParam);
                        explicit BasicWebpage(const std::string &strURLParam);
                        BasicWebpage(const std::string &strURLParam, const std::string &strServerResponseParam);
                        virtual ~BasicWebpage();
                
                        // accessor methods
                        const char* getURL() const;
                        void setURL(const char* strURLParam);
                        const char* getHeader() const;
                        void setHeader(const char* strHeaderParam);
                        const char* getHTML() const;
                        void setHTML(const char* strHTMLParam);
                        const char* getResponse() const;
                        void setResponse(const char* strResponseParam);
                        
                        // operator methods
                        const BasicWebpage& operator= (const BasicWebpage &Rhs);
                        const bool operator== (const BasicWebpage &Rhs) const;
                
                protected:
                        void splitResponse();
                        void clear();
                    
                private:
                        std::string strURL;
                        std::string strHeader;
                        std::string strHTML;
                        std::string strServerResponse;
        };
}
 
#endif 

Once you have copied that into the new source file, go to File > Save As… and save the file under the name "BasicWebpage.hpp".

Implementing the Webpage class

[Figure: HTML Downloader]

Now that you understand how the BasicWebpage class looks, let us write the implementation. Create another new source file. Add the proper header directives and declare the Internet namespace. If you are unsure what that entails, simply follow the lead of the picture above. The implementations of the constructors and destructor that go in the Internet namespace are as follows:

        BasicWebpage::BasicWebpage():strURL(""), strHeader(""), strHTML(""), strServerResponse("") {
                
        }
 
        BasicWebpage::BasicWebpage(const BasicWebpage &WebpageParam) {
                // copy all parts of the Webpage object passed
                strURL = WebpageParam.strURL;
                strHeader = WebpageParam.strHeader;
                strHTML = WebpageParam.strHTML;
                strServerResponse = WebpageParam.strServerResponse;
        }
 
        BasicWebpage::BasicWebpage(const string &strURLParam):strURL(strURLParam), strHeader(""), strHTML(""), strServerResponse("") {
                
        }
 
        BasicWebpage::BasicWebpage(const string &strURLParam, const string &strServerResponseParam):strURL(strURLParam), strHeader(""), strHTML(""), strServerResponse(strServerResponseParam) {
                // split the response
                splitResponse();
        }
        
        BasicWebpage::~BasicWebpage() {
                // nothing to do
        }
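
The constructors above rely on initialization lists. As a small illustration outside the project, the following sketch uses a hypothetical Bookmark class (not part of the HTML Downloader) to show one case where an initialization list is more than shorthand: a const member cannot be assigned in the constructor body, so it has to be initialized in the list.

```cpp
#include <string>

// Hypothetical class, used only to illustrate initialization lists.
class Bookmark {
public:
    // Members are constructed directly from these values, in declaration order.
    Bookmark(const std::string &strTitleParam, const std::string &strURLParam)
        : strTitle(strTitleParam), strURL(strURLParam), intVisits(0) {
    }

    const std::string& getTitle() const { return strTitle; }
    const std::string& getURL() const { return strURL; }
    int getVisits() const { return intVisits; }

private:
    const std::string strTitle;   // const: can only be set in an initialization list
    std::string strURL;
    int intVisits;
};
```

A call such as Bookmark("Home", "http://example.com/") builds each member once, directly from its argument, instead of default-constructing it and then assigning to it.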

As you can see, I am a fan of initialization lists; they are a nice shorthand for initializing member variables. The second constructor in the series is the copy constructor. Although we won't use the copy constructor in this project, we define it as a matter of good programming practice. The destructor is self-explanatory. I will explain the splitResponse() method in due time; for now, let's take a look at the implementation of the accessor methods:

        const char* BasicWebpage::getURL() const {
                // returns the URL being used by the object
                return strURL.c_str();
        }
 
        void BasicWebpage::setURL(const char* cstrURLParam) {
                // sets the URL
                strURL = cstrURLParam;
        }
 
        const char* BasicWebpage::getHeader() const {
                // returns the header that was sent by the server
                return strHeader.c_str();
        }
 
        void BasicWebpage::setHeader(const char* cstrHeaderParam) {
                // sets the header
                strHeader = cstrHeaderParam;
        }
 
        const char* BasicWebpage::getHTML() const {
                // returns the HTML that was sent by the server
                return strHTML.c_str();
        }
 
        void BasicWebpage::setHTML(const char* cstrHTMLParam) {
                // sets the HTML
                strHTML = cstrHTMLParam;
        }
        
        const char* BasicWebpage::getResponse() const {
                // returns the server response
                return strServerResponse.c_str();
        }
 
        void BasicWebpage::setResponse(const char* cstrResponseParam) {
                // sets the server response
                strServerResponse = cstrResponseParam;
        
                // splits the server response
                splitResponse();
        }

The accessor methods are pretty straightforward. All they do is get and set the internal variables that BasicWebpage holds. The only exception is the setResponse() method, which also calls splitResponse() after storing the response. I will explain splitResponse() shortly. For now, take a peek at the class's operators:

        const BasicWebpage& BasicWebpage::operator= (const BasicWebpage &Rhs) {
                // if they are the same object just return this, no assignment needed
                if (this == &Rhs) return *this;
 
                // assignment
                strURL = Rhs.strURL;
                strHeader = Rhs.strHeader;
                strHTML = Rhs.strHTML;
                strServerResponse = Rhs.strServerResponse;
                
                return *this;
        }
        
        const bool BasicWebpage::operator== (const BasicWebpage &Rhs) const {
                return ((strURL == Rhs.strURL) && (strHeader == Rhs.strHeader) && (strHTML == Rhs.strHTML) && (strServerResponse == Rhs.strServerResponse));
        }
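
The same operator pattern can be tried out on its own with a cut-down class. The sketch below uses a hypothetical Link class with only two members. It detects self-assignment by comparing addresses and returns a non-const reference from operator=, which is the more common convention; both are choices of this sketch rather than a description of the project code.

```cpp
#include <string>

// Hypothetical two-member class, used only to show the operator pattern.
class Link {
public:
    Link(const std::string &strURLParam, const std::string &strTextParam)
        : strURL(strURLParam), strText(strTextParam) {
    }

    // assignment: copy every member from Rhs, skipping the work on self-assignment
    Link& operator= (const Link &Rhs) {
        if (this == &Rhs) return *this;   // same object (address comparison)
        strURL = Rhs.strURL;
        strText = Rhs.strText;
        return *this;
    }

    // equality: true only when every member matches
    bool operator== (const Link &Rhs) const {
        return (strURL == Rhs.strURL) && (strText == Rhs.strText);
    }

private:
    std::string strURL;
    std::string strText;
};
```

With two Link objects holding different values, the == comparison is false before an assignment between them and true afterwards.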

These two methods define two operators that can be used with BasicWebpage. The first operator is "=", the assignment operator. It sets the members of a BasicWebpage to the values of the members of the BasicWebpage object it is being assigned from. In layman's terms, the expression "Object1 = Object2" means that Object1's members are now set to the values of the members in Object2. The second operator is "==", the equality operator. It returns a Boolean telling whether or not two BasicWebpage objects contain exactly the same values. So, for example, "Object1 == Object2" returns true if all the member variables in both match. Now for the remaining methods, which are not accessible to the public:

        void BasicWebpage::splitResponse() {
                // separate the response's header from the body (HTML)
        
                // find where the header and body are separated (I use "\r\n\r\n" as the delimiter)
                const string::size_type posDelimiter = strServerResponse.find("\r\n\r\n", 0);
            
                // no delimiter found: treat the whole response as the header
                if (posDelimiter == string::npos) {
                        strHeader = strServerResponse;
                        strHTML = "";
                        return;
                }
            
                // extract the header from the response (the delimiter stays with the header)
                strHeader = strServerResponse.substr(0, posDelimiter + 4);
            
                // extract the HTML from the response
                strHTML = strServerResponse.substr(posDelimiter + 4);
        }
        
        void BasicWebpage::clear() {
        // clears all member variables
        
            strURL = "";
            strHeader = "";
            strHTML = "";
            strServerResponse = "";
        }

These methods are not public because they are actions that happen behind the scenes, inside the object. They are used to regulate the class's behavior. The clear() method simply clears the member variables of BasicWebpage. We don't even use that method in this project, but as I have been stressing, we are aiming to create code that can be recycled. Some future project that inherits this class may well need it, which is why we prudently place the clear() method in the class now.
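
As a rough illustration of how a future project might reach a protected method, consider the following sketch. PageData and ReloadablePage are hypothetical names, and the base class here is a cut-down stand-in rather than the real BasicWebpage. A protected member is hidden from outside callers but remains available to derived classes.

```cpp
#include <string>

// Hypothetical base class with a protected helper, mirroring clear().
class PageData {
public:
    PageData() : strURL("") {}
    virtual ~PageData() {}
    const char* getURL() const { return strURL.c_str(); }
    void setURL(const char* cstrURLParam) { strURL = cstrURLParam; }
protected:
    void clear() { strURL = ""; }   // not callable from outside the class hierarchy
private:
    std::string strURL;
};

// Hypothetical derived class that builds a public operation on top of clear().
class ReloadablePage : public PageData {
public:
    void reset() { clear(); }       // allowed: derived classes can call protected members
};
```

Code outside the hierarchy can call reset() on a ReloadablePage, but a direct call to clear() from that code would not compile.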

Now for the splitResponse() method! When a web server responds to a request, it sends back a header followed by the character series "\r\n\r\n" and then the body. Since the BasicWebpage class needs to be able to return only the header or only the body (HTML), we use the splitResponse() method to separate the two. Using the "\r\n\r\n" character series as a delimiter, we split the web server's response into the header and the body's actual HTML source code. Then we save those two values so that BasicWebpage can return either the header or the HTML on its own.
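
The splitting step can also be sketched as a free function, which makes it easy to try on a sample response. The function name splitHttpResponse is hypothetical, and returning the whole input as the header when no delimiter is found is an assumption of this sketch.

```cpp
#include <string>
#include <utility>

// Splits a raw HTTP response into (header, body) at the first "\r\n\r\n".
// If the delimiter is missing, the whole response is returned as the header
// and the body is empty; that fallback is an assumption of this sketch.
std::pair<std::string, std::string> splitHttpResponse(const std::string &strResponse) {
    const std::string strDelimiter = "\r\n\r\n";
    const std::string::size_type pos = strResponse.find(strDelimiter);

    if (pos == std::string::npos) {
        return std::make_pair(strResponse, std::string());
    }

    // keep the delimiter with the header, as BasicWebpage::splitResponse() does
    const std::string strHeader = strResponse.substr(0, pos + strDelimiter.length());
    const std::string strBody = strResponse.substr(pos + strDelimiter.length());
    return std::make_pair(strHeader, strBody);
}
```

Given a response such as "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>", the first element of the pair holds the status line and header fields up to and including the blank line, and the second element holds the HTML.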

This source file should be saved under the name "BasicWebpage.cpp". Before we leave this topic, notice that the code is neatly encapsulated. The members of the class have a "has a" relationship with the class they are within. For example, every webpage "has a" URL, so we made a member variable called strURL. Keep that in mind when developing classes.
