Programming an HTML Downloader in C++

Written By: Alwyn Malachi Berkeley

- 25 Sep 2006 -

Description: This program requests the source code (HTML) of a webpage and then gives the user multiple options for viewing the web server's response. We will use the Winsock API to manipulate sockets on the system, and create reusable code along the way.

  1. Creating the Project
  2. main.cpp Headers and the Internet Namespace
  3. The Webpage Class
  4. The Internet Namespace's Functions
  5. GetDomain
  6. Custom_itoa
  7. GetWebpage
  8. SaveFile
  9. Internet.hpp and Writing main.cpp
  10. Compile and Conclusion

Implementing GetWebpage

We have finally arrived at the most important function of them all. The GetWebpage procedure is basically the foundation of our entire project. This function uses the HTTP GET command to request a webpage from a web server and returns the information in a BasicWebpage object. Furthermore, the webpage object that it returns is held within auto_ptr, a smart pointer defined in the std namespace. This is done to ensure that the BasicWebpage object gets properly deleted and no memory leaks occur. Once again, this function should be added to the "InternetHelperFunctions.cpp" file. Read the code below so we can begin "tippy-toeing" through it:

    auto_ptr<BasicWebpage> GetWebpage(const string &strWebpageURLParam) {
        // downloads the source code of the webpage URL passed
        // using a "GET" request
        using std::memset;
        using std::auto_ptr;
        using std::runtime_error;

        // declaring constants
        // (WSAGetLastError() is evaluated once, right here, so this string
        // holds the error code as it stands at this point in the function)
        const char* WINSOCK_ERROR_STRING = Custom_itoa(WSAGetLastError());

        // initializing winsock
        WSAData wsaData; // structure contains information about Win32 Socket API
        const WORD theVersionWord = MAKEWORD(1,1); // requests version 1.1
        int intReturnValue;
        intReturnValue = WSAStartup(theVersionWord, &wsaData);

        // handle a possible error when loading
        if (intReturnValue != NO_ERROR) {
            WSACleanup();
            throw runtime_error("Could not load Winsock DLL.");
        }

        // confirm that we are using v1.1 of winsock API
        if (LOBYTE(wsaData.wVersion) != 1 || HIBYTE(wsaData.wVersion) != 1) {
            WSACleanup();
            throw runtime_error("Incorrect Winsock Control loaded.");
        }

This code is simple despite how it looks. We started off by declaring the function and pulling in three names from the std namespace (memset, auto_ptr and runtime_error) that we will be using within it. I assume you are familiar with those parts of the std namespace.

Next we initialize Winsock itself by declaring a variable of type WSAData and passing it to the WSAStartup function. The WORD variable holds the version number 1.1, packed into a single value by the MAKEWORD macro. We use it to specify the version of Winsock that we would like to initialize.
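To make the version word less mysterious, here is a minimal portable sketch of how two bytes are packed into one 16-bit value and read back. The names makeWord, loByte and hiByte are hypothetical stand-ins written for this illustration; they mimic what the Windows macros MAKEWORD, LOBYTE and HIBYTE do but are not the macros themselves.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MAKEWORD: the first argument becomes the
// low-order byte and the second becomes the high-order byte.
inline std::uint16_t makeWord(std::uint8_t low, std::uint8_t high) {
    return static_cast<std::uint16_t>(low | (high << 8));
}

// Hypothetical stand-in for LOBYTE: extracts the low-order byte.
inline std::uint8_t loByte(std::uint16_t word) {
    return static_cast<std::uint8_t>(word & 0xFF);
}

// Hypothetical stand-in for HIBYTE: extracts the high-order byte.
inline std::uint8_t hiByte(std::uint16_t word) {
    return static_cast<std::uint8_t>((word >> 8) & 0xFF);
}
```

With these helpers, makeWord(1, 1) yields 0x0101, which is why the version check in GetWebpage compares both the low byte and the high byte against 1.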

The two blocks of code following the initialization of Winsock are there simply for error-checking purposes. If an error occurred, then a runtime_error is thrown with a description of what went wrong. The caller can catch the exception in a try-catch block. You should take note of the WSACleanup() call that precedes the throw statements, however. The WSACleanup() function cleans up the memory associated with the Winsock library, unregisters Winsock components, etc. When an exception is thrown, the rest of the function never runs. Nonetheless, the Winsock library that was initialized still needs to be uninitialized. Normally WSACleanup() is called at the end of the function, but if the function exits early with an exception, that final call is never reached. Therefore we clean up after the Winsock library by calling WSACleanup() before our throw statements. If we didn't do this, then the program would leak resources or end up depending on the Windows OS to clean up after our faulty program.
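The "clean up before every throw" rule can also be handled automatically. The sketch below is an alternative to the article's approach, not part of it: it assumes a C++11 compiler, and CleanupGuard and runAndCountCleanups are hypothetical names. A small object runs a cleanup function in its destructor, so the cleanup happens whether the scope ends normally or through an exception. A counter stands in for WSACleanup() so the example runs on any platform.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical guard: runs the supplied cleanup function when it goes
// out of scope, including during stack unwinding after a throw.
class CleanupGuard {
public:
    explicit CleanupGuard(std::function<void()> cleanup)
        : cleanup_(std::move(cleanup)) {}
    ~CleanupGuard() { if (cleanup_) cleanup_(); }

    // Copying is disabled so the cleanup cannot run twice.
    CleanupGuard(const CleanupGuard&) = delete;
    CleanupGuard& operator=(const CleanupGuard&) = delete;

private:
    std::function<void()> cleanup_;
};

// Hypothetical demonstration: returns how many times the cleanup ran.
inline int runAndCountCleanups(bool shouldFail) {
    int cleanupCalls = 0;
    try {
        // The lambda stands in for a call such as WSACleanup().
        CleanupGuard guard([&cleanupCalls] { ++cleanupCalls; });
        if (shouldFail) {
            throw std::runtime_error("simulated socket error");
        }
    } catch (const std::runtime_error&) {
        // The guard's destructor has already run by the time we get here.
    }
    return cleanupCalls;
}
```

Either way the scope is left, the cleanup runs exactly once, which is the same guarantee the hand-written WSACleanup() calls are trying to provide.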

Now it is time to create a socket that we can use for communication:

        // creates the socket to be used for TCP/IP
        SOCKET sckConnectingSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        // confirm that there was no error while creating the socket
        if (sckConnectingSocket == INVALID_SOCKET) {
            WSACleanup();
            throw runtime_error(WINSOCK_ERROR_STRING);
        }

First we create a socket. To make a long story short, the parameters passed to socket() say that we want to use the socket for TCP/IP. If you really want to get deep into the "nitty-gritty" of what is going on, then I suggest browsing MSDN for every little technical detail. Most of us are probably content with just knowing the fundamentals. So moving on, the second block of code is simply error-checking once again.

Now we need to define two address structures:

        // create a local address structure (for the local PC)
        int intLocalPort = 1869;
        sockaddr_in addrLocalAddress;
        addrLocalAddress.sin_family = AF_INET; // setting connection family
        addrLocalAddress.sin_addr.s_addr = inet_addr("127.0.0.1"); // local IP address
        addrLocalAddress.sin_port = htons(intLocalPort); // local port

        // create a remote address structure (for the remote PC)
        const int intRemotePort = 80;
        sockaddr_in addrRemoteAddress;
        addrRemoteAddress.sin_family = AF_INET; // setting connection family
        // deciphering remote IP address
        const string strWebpageDomain = GetDomain(strWebpageURLParam);
        hostent* ptrRemoteHostInfo = gethostbyname(strWebpageDomain.c_str());
        // gethostbyname() returns a null pointer when the lookup fails
        if (ptrRemoteHostInfo == NULL) {
            WSACleanup();
            throw runtime_error("Couldn't resolve remote host name.");
        }
        hostent RemoteHostInfo = *ptrRemoteHostInfo;
        // copy the four bytes of the IPv4 address (h_addr_list holds plain chars)
        addrRemoteAddress.sin_addr.S_un.S_un_b.s_b1 = static_cast<unsigned char>(RemoteHostInfo.h_addr_list[0][0]);
        addrRemoteAddress.sin_addr.S_un.S_un_b.s_b2 = static_cast<unsigned char>(RemoteHostInfo.h_addr_list[0][1]);
        addrRemoteAddress.sin_addr.S_un.S_un_b.s_b3 = static_cast<unsigned char>(RemoteHostInfo.h_addr_list[0][2]);
        addrRemoteAddress.sin_addr.S_un.S_un_b.s_b4 = static_cast<unsigned char>(RemoteHostInfo.h_addr_list[0][3]);
        addrRemoteAddress.sin_port = htons(intRemotePort); // remote port to connect to

        // confirm that the remote address was set
        if (addrRemoteAddress.sin_addr.S_un.S_addr == INADDR_NONE) {
            WSACleanup();
            throw runtime_error("Couldn't assign remote IP address.");
        }

Address structures provide the basic information needed to make a connection between two computers. For starters we create an address structure for the local computer (the one the program is running on). This holds information about us, such as our local IP address and the port number we want to use for the connection. The port used for the connection should be a port that is free. I figure that port 1869 isn't used for anything special. Therefore I set the local address structure to use local port 1869 for talking to the World Wide Web. By the way, if you are behind a firewall/router you may have to explicitly allow traffic on port 1869 in your firewall/router's settings.

Next we need to create an address structure for the remote computer (the computer being connected to), which is the web server in this particular case. The port we would like to connect to on the web server is port 80, because that is the default port that web pages are served from on the World Wide Web. The gethostbyname() call looks up the IP address that belongs to the DNS name (www.xxxxxxx.com), and the series of expressions using C++'s static_cast operator copies the four bytes of that IP address into the remote address structure.
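The byte handling in the two address structures can be illustrated without any socket calls. The sketch below is only an illustration: portToNetworkBytes and formatIPv4 are hypothetical helpers, not Winsock functions. The first shows the byte layout that htons() produces in memory for a port number, and the second shows why each address byte is cast to an unsigned type before use, since the bytes in h_addr_list are stored as plain char.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the two bytes of a port number in network
// (big-endian) order, i.e. most significant byte first.
inline std::array<std::uint8_t, 2> portToNetworkBytes(std::uint16_t port) {
    return {{ static_cast<std::uint8_t>(port >> 8),
              static_cast<std::uint8_t>(port & 0xFF) }};
}

// Hypothetical helper: formats four raw address bytes as dotted-quad text.
// Each byte is converted to unsigned char first, so values above 127 do
// not turn negative on platforms where plain char is signed.
inline std::string formatIPv4(const char* rawAddress) {
    std::string text;
    for (int i = 0; i < 4; ++i) {
        if (i > 0) text += '.';
        const unsigned char octet = static_cast<unsigned char>(rawAddress[i]);
        text += std::to_string(static_cast<int>(octet));
    }
    return text;
}
```

For port 80 the two bytes are 0 and 80, and four raw bytes holding 192, 168, 1 and 10 format as "192.168.1.10".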

Here is a quick summary. We created two structures that hold information about internet communication. In layman's terms it is as simple as that. Now we progress onward:

        // inform Winsock that the socket created should be used with the local address structure created
        if (bind(sckConnectingSocket, (sockaddr*) &addrLocalAddress, sizeof(sockaddr_in)) == SOCKET_ERROR) {
            WSACleanup();
            throw runtime_error(WINSOCK_ERROR_STRING);
        }

        // connect to the server computer (webserver)
        if (connect(sckConnectingSocket, (sockaddr*) &addrRemoteAddress, sizeof(sockaddr_in)) == SOCKET_ERROR) {
            WSACleanup();
            throw runtime_error(WINSOCK_ERROR_STRING);
        }

This is fairly simple. We use the bind() function to tell the socket to use the local address structure we created. With the socket now tied to the local address, we can go on to connect to the remote computer. To connect we use the connect() function together with the socket and the remote address structure. The connect() function then uses the socket to connect to the web server's port 80 from local port 1869. Strictly speaking, a client socket does not have to be bound first: if bind() is skipped, connect() picks a free local port automatically. Naturally, throughout all of this we are still error-checking to ensure that no errors slip by.

Once we are connected we have to send the web server some particular data. That is where the code below comes in:

        // ensure that there is an "http://" prefix on the URL
        string strURLWithHttp = strWebpageURLParam;
        if (strWebpageURLParam.substr(0, 7) != "http://") {
            strURLWithHttp = "http://" + strWebpageURLParam;
        }

        // send the client request via HTTP protocol
        string strClientRequest = "GET " + strURLWithHttp + " HTTP/1.1\r\nHost: " + strWebpageDomain + "\r\n\r\n";
        if (send(sckConnectingSocket, strClientRequest.c_str(), static_cast<int>(strClientRequest.length()), 0) == SOCKET_ERROR) {
            WSACleanup();
            throw runtime_error(WINSOCK_ERROR_STRING);
        }

First things first: for our purposes we need to have the "http://" prefix on the URL, so we start with a simple piece of code that ensures the URL has it. Then, using that URL, we construct the "particular data" I was referring to just a minute ago. It is a string known as the "request string". The HTTP protocol uses the request string to notify the web server of your request. Through it we send HTTP's "GET" command, the address of the webpage we want to retrieve, and a "Host" header naming the domain. The web server receives the request string and processes it. When it is done, the web server transmits the corresponding web page back to the program, which means we have to receive the data. This is how we do that:
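The request-building step can be tried on its own, without a socket. The following is a minimal sketch that mirrors the request format used in this function; ensureHttpPrefix and buildGetRequest are hypothetical helper names introduced here for illustration, not functions from the project.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns the URL unchanged if it already starts
// with "http://", otherwise returns it with the prefix added.
inline std::string ensureHttpPrefix(const std::string& url) {
    const std::string prefix = "http://";
    if (url.compare(0, prefix.size(), prefix) == 0) {
        return url;
    }
    return prefix + url;
}

// Hypothetical helper: builds the request string. The request line and
// the Host header each end with CRLF, and an empty line ends the request.
inline std::string buildGetRequest(const std::string& url,
                                   const std::string& domain) {
    return "GET " + ensureHttpPrefix(url) + " HTTP/1.1\r\n"
           "Host: " + domain + "\r\n"
           "\r\n";
}
```

For example, buildGetRequest("www.example.com", "www.example.com") produces "GET http://www.example.com HTTP/1.1", then "Host: www.example.com", then an empty line, each terminated by a carriage return and line feed.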

        // receive the response from the server
        const int CHUNK_SIZE = 2048; // 2 kilobyte chunks
        char cstrServerResponse[CHUNK_SIZE + 1]; // holds the incoming data plus a terminating '\0'
        int intBytesReceived;
        string strResponse = ""; // the server's response
        do {
            // initialize (reset) the character array
            memset(cstrServerResponse, '\0', CHUNK_SIZE + 1);

            // put the incoming information in our variable
            intBytesReceived = recv(sckConnectingSocket, cstrServerResponse, CHUNK_SIZE, 0);

            // append exactly the bytes that arrived to the server's response
            if (intBytesReceived > 0) strResponse.append(cstrServerResponse, intBytesReceived);
        } while (intBytesReceived > 0); // recv() returns 0 when the connection closes, SOCKET_ERROR on failure

CHUNK_SIZE is the number of bytes we would like to receive at a time. cstrServerResponse is the character buffer that holds the text transmitted within the current chunk. intBytesReceived is just a variable that remembers how many bytes were received from the web server on the last call. Lastly, strResponse saves all the text transmitted by accumulating the data from cstrServerResponse.

So, for example, if the webpage the web server is sending us is 11 kilobytes, then the loop will do the following: reset its buffer, receive the next chunk and note how many bytes arrived, add those characters to strResponse, and repeat. Each time around, the little chunk of data saved in cstrServerResponse is handed over to strResponse. The loop stops repeating when there are no more bytes to receive or a socket error occurs.
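The accumulate-until-done pattern can be exercised without a network connection. In the sketch below, which is an illustration under stated assumptions rather than the article's code, readAll plays the role of the receive loop and takes any recv()-like callable. ChunkReader, readAll and makeFakeReader are hypothetical names; the fake reader simply hands out a string a few bytes at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

// A recv()-like callable: fills the buffer with up to `capacity` bytes and
// returns how many were written, 0 when finished, or a negative value on error.
using ChunkReader = std::function<int(char* buffer, int capacity)>;

// Hypothetical helper: reads chunks until the reader reports 0 or an error,
// appending exactly the number of bytes received each time.
inline std::string readAll(const ChunkReader& readChunk) {
    const int CHUNK_SIZE = 2048;
    char buffer[CHUNK_SIZE];
    std::string response;
    int bytesReceived = 0;
    while ((bytesReceived = readChunk(buffer, CHUNK_SIZE)) > 0) {
        response.append(buffer, static_cast<std::size_t>(bytesReceived));
    }
    return response;
}

// Hypothetical fake source: hands out `data` at most `bytesPerCall` bytes
// per call, then returns 0 as a closed connection would.
inline ChunkReader makeFakeReader(const std::string& data, int bytesPerCall) {
    std::size_t offset = 0;
    return [data, bytesPerCall, offset](char* buffer, int capacity) mutable -> int {
        std::size_t count = data.size() - offset;
        if (count > static_cast<std::size_t>(bytesPerCall)) count = static_cast<std::size_t>(bytesPerCall);
        if (count > static_cast<std::size_t>(capacity)) count = static_cast<std::size_t>(capacity);
        std::memcpy(buffer, data.data() + offset, count);
        offset += count;
        return static_cast<int>(count);
    };
}
```

Appending with an explicit byte count, rather than treating the buffer as a null-terminated string, keeps every received byte even if the data happens to contain a '\0' character.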

Now we only have a few more pesky little details to worry about and we're done:

        // close the socket
        closesocket(sckConnectingSocket);

        // clean up winsock
        WSACleanup();

        // remove the trailing 0 (only if one is actually present)
        const string::size_type posTrailingZero = strResponse.rfind("0");
        if (posTrailingZero != string::npos) strResponse[posTrailingZero] = '\0';

        // create a BasicWebpage object
        auto_ptr<BasicWebpage> theWebpagePtr(new BasicWebpage(strWebpageURLParam.c_str()));

        // give the webpage object the server's response
        theWebpagePtr->setResponse(strResponse.c_str());

        return theWebpagePtr;
    }

Since all the transmitting has been done, we can now close the open socket. Furthermore, we are done using the Winsock library in this function, so next we call WSACleanup(), which cleans up after Winsock and renders it uninitialized once again.

Now, for some reason, at the end of the response transmitted by web servers I always find that there is a trailing 0 character. I have been unable to track down for certain what produces it (a likely culprit is HTTP/1.1 chunked transfer encoding, which ends a response body with a zero-length chunk written as "0"), but I have found a simple way of dealing with it: remove it. So I do that next with a simple expression that finds the last 0 character in the response and replaces it with a null character instead.
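One possible source of a trailing 0 is HTTP/1.1 chunked transfer encoding, where the body arrives as size-prefixed chunks and ends with a chunk of size 0. That is an assumption here, not something the article confirms. If a response does use that encoding, a decoder along the following lines would remove the chunk markers instead of patching a single character. decodeChunkedBody is a hypothetical helper; it expects only the body (the part after the headers) and ignores trailers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: decodes a chunked body of the form
//   <hex size>\r\n<data>\r\n ... 0\r\n\r\n
// and returns the concatenated data. Decoding stops quietly at the
// terminating zero-size chunk or at the first malformed or truncated chunk.
inline std::string decodeChunkedBody(const std::string& body) {
    std::string decoded;
    std::size_t position = 0;
    while (position < body.size()) {
        // The chunk size is a hexadecimal number on its own line.
        const std::size_t lineEnd = body.find("\r\n", position);
        if (lineEnd == std::string::npos) break;
        const std::string sizeLine = body.substr(position, lineEnd - position);

        char* parseEnd = nullptr;
        const unsigned long chunkSize = std::strtoul(sizeLine.c_str(), &parseEnd, 16);
        if (parseEnd == sizeLine.c_str()) break; // no hex digits found
        if (chunkSize == 0) break;               // terminating chunk

        const std::size_t dataStart = lineEnd + 2;
        if (chunkSize > body.size() - dataStart) break; // truncated chunk

        decoded.append(body, dataStart, chunkSize);
        position = dataStart + chunkSize + 2; // skip the data and its CRLF
    }
    return decoded;
}
```

Given the body "5\r\nHello\r\n6\r\n World\r\n0\r\n\r\n", the decoder returns "Hello World" with the size lines and the final 0 removed.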

Next we create a BasicWebpage object on the heap, held by an auto_ptr, and give that BasicWebpage object the response that was sent back by the web server. Now you see why the setResponse() method in the BasicWebpage class did a little extra work as opposed to the other accessor methods. It separated the server's response into the header and the body right there on the spot, in "one clean sweep" so to speak. That allows the BasicWebpage class to have a simplified interface. Anyhow, the auto_ptr we created that holds the web server's response is then returned, marking the end of the GetWebpage function.
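As a reminder of what separating the header from the body involves, here is a minimal sketch. It is not the BasicWebpage implementation from the earlier section; splitResponse is a hypothetical free function that relies on the HTTP rule that an empty line (two consecutive CRLF pairs) ends the header.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: splits a raw HTTP response into {header, body} at
// the first empty line. If no empty line is found, the whole response is
// treated as the header and the body is left empty.
inline std::pair<std::string, std::string> splitResponse(const std::string& response) {
    const std::string separator = "\r\n\r\n";
    const std::size_t position = response.find(separator);
    if (position == std::string::npos) {
        return std::make_pair(response, std::string());
    }
    return std::make_pair(response.substr(0, position),
                          response.substr(position + separator.size()));
}
```

The first element of the returned pair holds the status line and headers, and the second holds the HTML itself.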

I know it was rather intricate to write this function but you will be glad to know that the rest of the project is pretty much smooth sailing!
