Friday, 25 November 2011

Squid Cache Server



1. An Overview

Squid acts as an agent, accepting requests from clients such as browsers, and passing them on to an appropriate Internet server. It stores a copy of the returned data in an on-disk cache. Squid’s real benefit emerges when the same data is requested multiple times: the on-disk copy is returned to the client, which speeds up Internet access and saves bandwidth. Even a small amount of disk space can noticeably reduce bandwidth usage and improve browsing speed.

Internet firewalls, which are used to protect company networks, often have a proxy component. The Squid proxy is different from a firewall proxy in that most firewall proxies do not store copies of the returned data; they re-fetch requested data from the remote Internet server each time.

Squid differs from firewall proxies in other ways as well:
  • It supports many protocols (firewalls often have a separate proxy for each protocol, because it is difficult to ensure the security of a large program)
  • Hierarchies of proxies, arranged in complex relationships, are possible

A 'cache' actually refers to a 'caching proxy': something that keeps copies of returned data. A 'proxy', on the other hand, is a program which does not cache replies.

The web consists of HTML pages, graphics, sound files, and so on. Since text makes up only a very small portion of the web, referring to all cached data as pages is misleading. Caches store objects, not pages.

Many Internet servers support more than one protocol. A web server uses the HyperText Transfer Protocol (HTTP) to serve data. An older protocol, the File Transfer Protocol (FTP), often runs on web servers as well. Caching an FTP response and returning the same data to the client on a subsequent HTTP request can cause confusion, and should, therefore, be avoided. Squid uses the complete Uniform Resource Locator (URL), which includes the protocol, to uniquely identify everything stored in the cache. Objects must also be expired in order to avoid returning outdated data to clients. Squid allows you to set refresh times for objects, ensuring that this does not happen.
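
The idea of identifying objects by their complete URL, and expiring them after a refresh time, can be illustrated with a minimal Python sketch. This is not Squid's code: the ObjectCache class, its methods and the example.com URLs are hypothetical, and a single refresh time for all objects is a simplifying assumption.

```python
import time


class ObjectCache:
    """Toy in-memory cache keyed by the complete URL (hypothetical, not Squid's code)."""

    def __init__(self, refresh_seconds):
        self.refresh_seconds = refresh_seconds
        self._store = {}  # complete URL -> (body, time stored)

    def put(self, url, body, now=None):
        stored_at = time.time() if now is None else now
        self._store[url] = (body, stored_at)

    def get(self, url, now=None):
        """Return the cached body, or None if it is missing or past its refresh time."""
        now = time.time() if now is None else now
        entry = self._store.get(url)
        if entry is None:
            return None
        body, stored_at = entry
        if now - stored_at > self.refresh_seconds:
            del self._store[url]  # expired: the caller must re-fetch
            return None
        return body


cache = ObjectCache(refresh_seconds=3600)
cache.put("ftp://example.com/pub/readme.txt", b"FTP data", now=0)
# Same host and path, different protocol: a different key, so this is a miss.
print(cache.get("http://example.com/pub/readme.txt", now=10))
# The original FTP URL is a hit until the refresh time has passed.
print(cache.get("ftp://example.com/pub/readme.txt", now=10))
print(cache.get("ftp://example.com/pub/readme.txt", now=7200))
```

Because the protocol is part of the key, the FTP object can never be returned for the HTTP request, which is the confusion described above.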

2. Why cache?

Small Internet Service Providers (ISPs) cache in order to reduce their line costs, since a large portion of their operating costs is infrastructural rather than staff-related.

Large organizations are usually not short of bandwidth, but their customers occasionally experience slow response times. There are many reasons for this:

2.1. Origin Server Load

Raw bandwidth is increasing faster than overall computer performance. Many sites therefore run several back-end servers and load-balance incoming requests between them. Where this is not done, the result is a slow response. A cache helps by answering repeated requests itself, so that they never reach the overloaded origin server.

2.2. Quick Abort

Squid can be configured to continue fetching objects, within certain size limits, even if a user who has initiated a download aborts it midway. Since there is a chance of more than one user requesting the same file, it is useful to have the object’s copy in your cache, even if the first user aborts. Where you have access to plenty of bandwidth, this continuous fetching process ensures that you will have a local copy of the object available, just in case someone else requests it. This can drastically reduce latency at the cost of higher bandwidth usage.
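
The decision can be sketched in Python as follows. This is a simplified illustration, not Squid's implementation: the tag names quick_abort_min, quick_abort_max and quick_abort_pct are assumed from squid.conf, and the default values and the exact order of the checks used here are assumptions that should be verified against the documentation for your Squid version.

```python
def keep_fetching(received_kb, expected_kb,
                  quick_abort_min_kb=16, quick_abort_max_kb=16, quick_abort_pct=95):
    """Decide whether to finish a download after the client has aborted it.

    Simplified sketch; the parameter defaults are assumptions.
    """
    if expected_kb <= 0:
        return False  # size unknown: this sketch simply gives up
    remaining_kb = expected_kb - received_kb
    if remaining_kb < quick_abort_min_kb:
        return True   # almost finished: cheap to complete
    if remaining_kb > quick_abort_max_kb:
        return False  # too much left: not worth the bandwidth
    # In between: continue only if most of the object has already arrived.
    return received_kb * 100 / expected_kb > quick_abort_pct


print(keep_fetching(990, 1000))  # only 10 KB left
print(keep_fetching(100, 1000))  # 900 KB left
```

Raising the maximum trades bandwidth for a better chance of a later hit, which is the trade-off described above.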

2.3. Peer Congestion

Router speed needs to increase at the same rate as bandwidth. Many peering points, where huge volumes of Internet traffic are exchanged, do not have the router horsepower to support their ever-increasing load.

2.4. Traffic spikes

Large sport, television and political events can cause spikes in Internet traffic.

You can plan ahead for sports events, but it is difficult to estimate the load that they will eventually cause. If you are a local ISP, and your local team reaches the finals, you are likely to experience a huge peak in traffic. Companies can also be adversely affected by traffic spikes, with the bulk transfer of large databases or presentations flooding the lines at random intervals. Caching cannot completely solve this problem, but it can reduce its impact.

2.5. Unreachable sites

If Squid attempts to connect to an origin server, only to find that it is down, it will log an error and return the object from disk, even though this risks sending outdated data to the client. This reduces the impact of a large-scale Internet outage, and can help when a backhoe digs up a major segment of your network backbone.
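
The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea: fetch_with_fallback, the dictionary used as a cache and the list used as a log are all hypothetical stand-ins, not Squid internals.

```python
def fetch_with_fallback(url, cache, fetch_from_origin, log):
    """Try the origin server first; on failure, fall back to the cached copy."""
    try:
        body = fetch_from_origin(url)
    except OSError as err:
        if url in cache:
            # Possibly outdated, but better than an error page.
            log.append(f"origin unreachable for {url}: {err}; serving cached copy")
            return cache[url]
        raise  # nothing cached: the error reaches the client
    cache[url] = body  # refresh the stored copy on success
    return body


def origin_down(url):
    """Stand-in for a fetch from an origin server that is unreachable."""
    raise ConnectionError("connection refused")


cache = {"http://example.com/": b"older copy"}
log = []
print(fetch_with_fallback("http://example.com/", cache, origin_down, log))
print(log)
```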

2.6. Costs

Since Internet connectivity is so expensive, ISPs and their customers reduce their bandwidth requirements with caches.

3. Supported Protocols

3.1. Supported Client Protocols

Squid supports the following incoming protocol request types, when the proxy requests are sent in HTTP format:

  • File Transfer Protocol (FTP)
  • Gopher
  • HyperText Transfer Protocol (HTTP)
  • Secure Sockets Layer (SSL)
  • Wide Area Information Server (WAIS)

3.2 Inter-cache and Management Protocols

  • Cache Digests: Used to retrieve an index of objects in another cache's store
  • HyperText Transfer Protocol (HTTP): Used for retrieving copies of objects from other caches
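  • Hypertext Caching Protocol (HTCP): Used, like ICP, to find out if a specific object is in another cache's store, but with the full request headers included in the query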
  • Internet Cache Protocol (ICP): Used to find out if a specific object is in another cache's store
  • Simple Network Management Protocol (SNMP): Can be used to retrieve information about your cache

3.3 Inter-cache Communication Protocols

Squid enables you to share data between caches. Just as there is a benefit to connecting individual PCs to a network, and this network to the Internet, there is also an advantage to linking your cache to other people's networks of caches.

User Base: The larger your user base, the more objects requested, the higher the chances of an object being requested twice. In order to increase your hit rate, add more clients.

Reduced Load: If you have a large network, one cache might be unable to handle all incoming requests. Rather than having to continuously upgrade one machine, it makes sense to divide the load between multiple servers. This reduces an individual server’s load, while increasing the overall number of queries your cache system can handle.

Squid implements inter-cache protocols very efficiently, through ICP multi-cast queries and cache digests, which allow for large networks of caches (hierarchies).

Disk Space: If you load-balance between multiple caches, avoid duplication of data. Duplicated objects reduce the number of distinct objects in the overall store, which reduces your chances of a hit. Using the Cache Array Routing Protocol (CARP) or other inter-cache communication protocols reduces duplication.
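
The reason hash-based routing avoids duplication is that every request for a given URL is sent to the same cache. The Python sketch below illustrates this with a highest-score-wins hash, which is the general idea behind CARP. It is not CARP's actual hash function: SHA-256 is used purely for illustration, and the cache names are hypothetical.

```python
import hashlib


def pick_cache(url, caches):
    """Choose one cache for a URL: score every (cache, URL) pair, highest score wins."""
    def score(cache_name):
        digest = hashlib.sha256(f"{cache_name}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(caches, key=score)


caches = ["cache1.example.com", "cache2.example.com", "cache3.example.com"]
url = "http://www.squid-cache.org/"
# The same URL always maps to the same cache, so only one copy is stored.
print(pick_cache(url, caches) == pick_cache(url, caches))
```

A useful property of this scheme is that removing one cache only re-routes the URLs that were assigned to it; the other URLs stay where they are.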

Raw bandwidth is not the only issue affecting the efficiency and speed of your cache system. Choosing the right hardware and software also presents its own challenges.

4. Squid Configuration

4.1 The Configuration File

All Squid configuration files are kept in the directory /usr/local/squid/etc. Although there is more than one file in this directory, the squid.conf file is of primary importance to most administrators. There are hundreds of option tags in this file, but you need to change only eight of them in order to get Squid up and running. The other options provide amazing flexibility, and you can learn more about them by experimenting once you have Squid functioning smoothly.

Squid assumes that you wish to use the default value if there is no occurrence of a tag in the squid.conf file. Theoretically, you can even run Squid with a zero-length configuration file.
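
The fall-back-to-default behaviour can be pictured with a small Python sketch. It is a hypothetical parser, not Squid's: the DEFAULTS table holds only two tags, with values taken from this article, and treating everything after a "#" as a comment is a simplification.

```python
# Assumed defaults for illustration only; Squid has many more tags.
DEFAULTS = {
    "http_port": ["3128"],
    "cache_dir": ["/usr/local/squid/cache/", "100", "16", "256"],
}


def parse_conf(text):
    """Collect every occurrence of each tag: tag -> list of value lists."""
    seen = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        tag, *values = line.split()
        seen.setdefault(tag, []).append(values)
    return seen


def effective(tag, seen, defaults=DEFAULTS):
    """Return the configured occurrences of a tag, or its default if it never appears."""
    return seen.get(tag, [defaults[tag]])


empty = parse_conf("")  # a zero-length configuration file
print(effective("http_port", empty))
custom = parse_conf("# my settings\nhttp_port 8080\n")
print(effective("http_port", custom))
print(effective("cache_dir", custom))
```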

4.2 Setting Squid's HTTP Port

The first option in the squid.conf file sets the HTTP port(s) on which Squid will listen for incoming requests. Squid's default HTTP port is 3128.

You can also use multiple ports, appending a second port number to the http_port variable. Consider the following example:

http_port 3128 8080

4.3 Storing Cached Data

The cache_dir tag in the squid.conf file is used to configure specific storage areas. If you use more than one disk for the cached data, you may need more than one mount point (/usr/local/squid/cache1 for the first disk, /usr/local/squid/cache2 for the second, for example). Squid allows you to have more than one cache_dir tag in your config file, one per storage area:

                cache_dir /usr/local/squid/cache/ 100 16 256 

The first option for the cache_dir tag sets the directory where the data will be stored. The default directory is simply the installation prefix with /cache/ tagged onto the end.

The second option for the cache_dir tag is a size value. Squid will store up to the specified amount of data in that directory. The value is in megabytes, and the default is 100 megabytes.

The other two options are more complex: they set the number of sub-directories (first and second-tier) to create in this directory.
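
To make the four options concrete, the following Python sketch parses a cache_dir line and works out how many second-tier directories it implies. The parse_cache_dir function is hypothetical; accepting an optional storage type (such as "ufs") before the directory is an assumption based on the form used by some Squid versions.

```python
def parse_cache_dir(line):
    """Split a cache_dir line into its directory, size and sub-directory counts."""
    fields = line.split()
    if not fields or fields[0] != "cache_dir":
        raise ValueError("not a cache_dir line")
    args = fields[1:]
    if len(args) == 5:  # form with a leading storage type, e.g. "ufs"
        args = args[1:]
    if len(args) != 4:
        raise ValueError("expected: cache_dir [type] directory size level1 level2")
    path, size_mb, level1, level2 = args
    return {
        "path": path,
        "size_mb": int(size_mb),
        "level1": int(level1),
        "level2": int(level2),
    }


cfg = parse_cache_dir("cache_dir /usr/local/squid/cache/ 100 16 256")
# Each of the 16 first-tier directories holds 256 second-tier directories.
print(cfg["level1"] * cfg["level2"])  # 4096
```

Spreading objects over this many directories keeps each individual directory small.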

4.4 E-mail for the Cache Administrator

If Squid dies, an e-mail notification is sent to the address specified with the cache_mgr tag. This address is also appended to the end of error pages returned to users if, for example, the remote machine is unreachable.

5. Access Control Lists and Access Control Operators

Squid cannot be used in an ISP environment without a sophisticated access control system. Indeed, Squid should not be used in any environment without some kind of basic access control. It is amazing how fast other Internet users discover that they can relay requests through your cache, and then proceed to do so. Sometimes they do this in order to obfuscate their real identity, and at other times because they have a fast line to you, but a slow line to the remainder of the Internet.

5.1 Simple Access Control

Assume that you have a list of IP addresses that are allowed to have access to your cache. If you want them to be able to access it with both HTTP and ICP, you will have to enter the list of IP addresses twice:

Example 1. Theoretical Access List

http_access deny 10.0.1.0/255.255.255.0
http_access allow 10.0.0.0/255.0.0.0
icp_access allow 10.0.0.0/255.0.0.0

Rule sets like the ones given above are straightforward enough for small organizations. For large organizations, however, it is more convenient to create classes of users. You can then allow or deny classes of users in more complex relationships. Consider the following example, in which you duplicate the above-mentioned example with classes of users:

Example 2. Access lists using Classes

# classes
acl mynet src 10.0.0.0/255.0.0.0
acl servernet src 10.0.1.0/255.255.255.0
# what HTTP access to allow classes
http_access deny servernet
http_access allow mynet
# what ICP access to allow classes
icp_access deny servernet
icp_access allow mynet
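
The order of the rules matters: the rules are checked from top to bottom, and the first one that matches decides the outcome. The Python sketch below imitates this for the HTTP rules of Example 2, using the class names from its http_access lines. It is a hypothetical model rather than Squid's code, and denying a request that matches no rule is an assumption of the sketch; Squid's own fallback behaviour is not modelled.

```python
import ipaddress

# Classes from Example 2, written with the same address/netmask notation.
ACLS = {
    "mynet": ipaddress.ip_network("10.0.0.0/255.0.0.0"),
    "servernet": ipaddress.ip_network("10.0.1.0/255.255.255.0"),
}
# Rules in the order in which they appear in the configuration file.
HTTP_ACCESS = [("deny", "servernet"), ("allow", "mynet")]


def http_allowed(client_ip, rules=HTTP_ACCESS, acls=ACLS):
    """Return True if the first matching rule allows the client."""
    addr = ipaddress.ip_address(client_ip)
    for action, acl_name in rules:
        if addr in acls[acl_name]:
            return action == "allow"
    return False  # assumption of this sketch: no match means deny


print(http_allowed("10.0.1.5"))     # False: matches servernet first
print(http_allowed("10.2.3.4"))     # True: only matches mynet
print(http_allowed("192.168.1.1"))  # False: matches nothing
```

If the two rules were swapped, 10.0.1.5 would match mynet first and be allowed, which is why the narrower deny rule is listed before the broader allow rule.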

The squid.conf file that comes with Squid includes ACLs that deny all HTTP requests. You must explicitly allow incoming requests from an appropriate address range to use your cache. The squid.conf file includes text that reads:

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#

In order to allow access to your client machines, add rules similar to the ones given below. The default access-control rules prevent users from exploiting your cache; it is advisable to leave them in.

Example 3. Example Complete ACL list

#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
# acls for my network addresses
acl my-iplist-1 src 192.168.1.0/24
acl my-iplist-2 src 10.0.0.0/255.255.0.0
# Check that requests are from users on our network
http_access allow my-iplist-1
http_access allow my-iplist-2
icp_access allow my-iplist-1
icp_access allow my-iplist-2
# allow requests from the local machine (for testing and the like)
http_access allow localhost
# End of locally-inserted rules
http_access deny all
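
Example 3 mixes two notations for the same kind of address range: a prefix length (/24) and a dotted netmask (/255.255.0.0). If you want to check which addresses a range covers before adding it to an ACL, Python's standard ipaddress module understands both forms. The to_cidr helper below is a hypothetical convenience function, not part of Squid.

```python
import ipaddress


def to_cidr(spec):
    """Normalise 'address/netmask' or 'address/prefix' to prefix-length notation."""
    return str(ipaddress.ip_network(spec))


print(to_cidr("10.0.0.0/255.255.0.0"))  # 10.0.0.0/16
print(to_cidr("192.168.1.0/24"))        # 192.168.1.0/24
# Number of addresses covered by the second list in Example 3.
print(ipaddress.ip_network("10.0.0.0/255.255.0.0").num_addresses)  # 65536
```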

6. Step-by-step Configuration Guide

Note: The following configuration will enable you to run a cache with base-line configuration. To learn more about advanced options, please visit http://www.squid-cache.org.

Install the Squid rpm from the Red Hat CD. This will install its configuration file (squid.conf) in the directory /etc.

Open the file /etc/squid.conf and follow the steps given below:

  • Search for the line http_port, and change the port if required:

http_port   8080

This is the port that your users will use in order to access the Internet.

  • Look for the line:

cache_mem    8 MB

Set cache_mem to a value, in MB, that suits your available memory.

  • Look for the lines:

cache_swap_low   90
cache_swap_high  95

Uncomment these lines if they are commented, and leave the values unchanged.

  • ipcache_size     1024
ipcache_low       90
ipcache_high      95

Uncomment these lines, and leave their default values.

  • cache_dir  ufs  /var/spool/squid    100   16   256

Uncomment this line, and leave its default values.

  • cache_access_log  /var/log/squid/access.log

access.log records the web requests passing through your proxy server, much as WinProxy does on MS Windows. Uncomment this line.

  • cache_log  /var/log/squid/cache.log

Uncomment it.

  • cache_store_log   /var/log/squid/store.log

Uncomment it.

  • pid_filename   /var/run/squid.pid

Uncomment it.

  • log_fqdn   off

Uncomment it, and switch it on, so that it reads:

log_fqdn   on

  • Search for the line:

acl   localhost   src   127.0.0.1/255.255.255.255

Assume that your network ID is 192.168.1 with a sub-net mask of 255.255.255.0. Give your network a name, “lan” for example, and enter the following line directly below the above-mentioned line:

acl    lan      src    192.168.1.0/255.255.255.0

  • Search for the lines:

http_access  allow   localhost
http_access  deny    all

Allow access for your network, i.e. lan, by adding a line above the final deny rule:

http_access  allow   localhost
http_access  allow   lan
http_access  deny    all

  • Start the proxy server with the following command:

/etc/rc.d/init.d/squid start

Open a browser on any workstation, and in its proxy settings enter the address of the proxy server along with the port you configured, i.e. 8080.

You are now ready to browse the Internet.
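
If you prefer to test the proxy from a script rather than a browser, a minimal Python sketch along the following lines should work. The build_proxy_opener and check_proxy functions are hypothetical helpers, and the address 192.168.1.1 is only a placeholder for your own proxy server; port 8080 is the value set in the http_port step above.

```python
import urllib.request


def build_proxy_opener(proxy_host, proxy_port=8080):
    """Return a URL opener that sends HTTP requests through the given proxy."""
    proxy_url = f"http://{proxy_host}:{proxy_port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def check_proxy(proxy_host, test_url="http://www.squid-cache.org/"):
    """Fetch one page through the proxy and return the HTTP status code."""
    opener = build_proxy_opener(proxy_host)
    with opener.open(test_url, timeout=10) as response:
        return response.status


# Example (needs a running proxy; 192.168.1.1 is a placeholder address):
# print(check_proxy("192.168.1.1"))
```

After a successful request, the fetch should also appear as a new line in /var/log/squid/access.log.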
