Squid Cache Server
1. An Overview
Squid acts as an agent, accepting requests
from clients such as browsers and passing them on to the appropriate Internet
server. It stores a copy of the returned data in an on-disk cache. Squid’s real
benefit emerges when the same data is requested multiple times: a copy of
the on-disk data is returned to the client, which speeds up Internet access and
saves bandwidth. Even a small amount of disk space can have a significant effect on
bandwidth usage and browsing speed.
Internet firewalls, which are used to protect
company networks, often have a proxy component. The Squid proxy is different
from a firewall proxy in that most firewall proxies do not store copies of the
returned data; they re-fetch requested data from the remote Internet server
each time.
Squid differs from firewall proxies in other
ways as well:
- It supports many protocols (firewalls often have specific proxies for specific protocols; it is difficult to ensure the code security of a large program)
- Hierarchies of proxies, arranged in complex relationships, are possible
Strictly speaking, a 'cache' is a 'caching
proxy': something that keeps copies of returned data. A plain 'proxy', on the other
hand, is a program that relays requests without caching the replies.
The web consists of HTML pages, graphics,
sound files, and so on. Since text makes up only a very small portion of the web,
referring to all cached data as pages is misleading. Caches store objects, not
pages.
Many Internet servers support more than one
protocol. A web server uses the HyperText Transfer Protocol (HTTP) to serve data,
and an older protocol, the File Transfer Protocol (FTP), often runs on the same
machine. Returning a cached FTP response to a client that later requests the same
path over HTTP would cause confusion, and must therefore be avoided.
Squid uses the complete Uniform Resource Locator (URL), including the protocol, to
uniquely identify everything stored in the cache. Objects must also be expired in
order to avoid returning outdated data to clients. Squid allows you to set refresh
times for objects to ensure that this does not happen.
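The two ideas above, keying on the complete URL and expiring objects after a refresh time, can be sketched in a few lines of Python. This is only an illustration, not Squid's implementation: the class name UrlCache, its methods and the injectable clock are all hypothetical.

```python
import time


class UrlCache:
    """Toy object cache keyed by the complete URL, with a refresh time per object."""

    def __init__(self, clock=time.time):
        self._store = {}     # complete URL -> (expiry timestamp, object data)
        self._clock = clock  # injectable clock, so expiry is easy to demonstrate

    def put(self, url, data, refresh_seconds):
        """Store data under its complete URL for refresh_seconds seconds."""
        self._store[url] = (self._clock() + refresh_seconds, data)

    def get(self, url):
        """Return the cached object, or None on a miss or after expiry."""
        entry = self._store.get(url)
        if entry is None:
            return None  # nothing was cached under this exact URL
        expires_at, data = entry
        if self._clock() >= expires_at:
            del self._store[url]  # expired: drop it rather than return stale data
            return None
        return data
```

Because the key is the whole URL, an object stored under ftp://example.com/file.txt is never returned for http://example.com/file.txt, even though host and path are the same.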
2. Why cache?
Small
Internet Service Providers (ISPs) cache in order to reduce their line costs,
since a large portion of their operating costs is infrastructural rather than
staff-related.
Large
organizations are usually not short of bandwidth, but their customers
occasionally experience slow response times. There are many reasons for this:
2.1. Origin Server Load
Raw bandwidth is increasing faster than overall computer performance. Many sites
therefore run several back-end servers and load-balance incoming requests across
them. Where this is not done, the result is a slow response. A cache answers repeated
requests itself, which takes load off the origin server and helps avoid slow responses.
2.2. Quick Abort
Squid can be configured to continue fetching
objects, within certain size limits, even if a user who has initiated a
download aborts it midway. Since there is a chance of more than one user
requesting the same file, it is useful to have the object’s copy in your cache,
even if the first user aborts. Where you have access to plenty of bandwidth,
this continuous fetching process ensures that you will have a local copy of the
object available, just in case someone else requests it. This can drastically
reduce latency at the cost of higher bandwidth usage.
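As a rough sketch of how such size limits might drive the decision, the Python function below is modelled on the three limits that Squid exposes as quick_abort_min, quick_abort_max and quick_abort_pct. The function name, its parameters and the exact comparison order are assumptions made for illustration, not Squid's code.

```python
def should_continue_fetch(total_kb, received_kb, min_kb=16, max_kb=16, pct=95):
    """Decide whether to finish a download after the client has aborted.

    min_kb, max_kb and pct play the role of the quick-abort limits;
    this is a simplified model, not Squid's actual logic.
    """
    remaining_kb = total_kb - received_kb
    if remaining_kb < min_kb:
        return True   # almost done: cheap to finish and keep the object
    if remaining_kb > max_kb:
        return False  # too much left: finishing would waste bandwidth
    # In between: finish only if more than pct percent has already arrived
    return received_kb * 100 > pct * total_kb
```

With the sketch's defaults, a 1000 KB object that is 990 KB complete is fetched to the end, while one that is only 100 KB complete is dropped. Raising max_kb trades bandwidth for a better chance of a later hit.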
2.3. Peer Congestion
Router speed needs to increase at the same
rate as bandwidth. Many peering points - where huge volumes of
Internet traffic are exchanged - do not have the router horsepower to
support their ever-increasing load.
2.4. Traffic spikes
Large sport, television and political events
can cause spikes in Internet traffic.
You can plan ahead for sports events, but it is
difficult to estimate the load that they will eventually cause. If you are a
local ISP, and your local team reaches the finals, you are likely to experience
a huge peak in traffic. Companies can also be adversely affected by traffic
spikes, with the bulk transfer of large databases or presentations flooding the
lines at random intervals. Caching cannot completely solve this problem, but it
can reduce its impact.
2.5. Unreachable sites
If Squid attempts to connect to an origin
server, only to find that it is down, it will log an error and return the
object from disk, even if there is a chance of sending outdated data to the
client. This reduces the impact of large-scale Internet outages, and can help
when a backhoe digs up a major segment of your network backbone.
2.6. Costs
Since
Internet connectivity is so expensive, ISPs and their customers reduce their
bandwidth requirements with caches.
3. Supported Protocols
3.1. Supported Client Protocols
Squid
supports the following incoming protocol request types, when the proxy requests
are sent in HTTP format:
- File Transfer Protocol (FTP)
- Gopher
- HyperText Transfer Protocol (HTTP)
- Secure Sockets Layer (SSL)
- Wide Area Information Server (WAIS)
3.2 Inter-cache and Management Protocols
- Cache Digests: Used to retrieve an index of objects in another cache's store
- HyperText Transfer Protocol (HTTP): Used for retrieving copies of objects from other caches
- Hyper Text Caching Protocol (HTCP):
- Internet Cache Protocol (ICP): Used to find out if a specific object is in another cache's store
- Simple Network Management Protocol (SNMP): Can be used to retrieve information about your cache
3.3 Inter-cache Communication Protocols
Squid enables you to share data between
caches. Just as there is a benefit to connecting individual PCs to a network,
and this network to the Internet, there is also an advantage to linking your
cache to other people's networks of caches.
User Base: The larger your user base, the more
objects requested, the higher the chances of an object being requested twice.
In order to increase your hit rate, add more clients.
Reduced Load: If you have a large
network, one cache might be unable to handle all incoming requests. Rather than
having to continuously upgrade one machine, it makes sense to divide the load
between multiple servers. This reduces an individual server’s load, while
increasing the overall number of queries your cache system can handle.
Squid
implements inter-cache protocols very efficiently, through ICP multi-cast
queries and cache digests, which allow for large networks of caches
(hierarchies).
Disk Space: If you load-balance between multiple caches, avoid duplication of data.
Duplicated objects reduce the number of distinct objects in the overall store,
which reduces your chances of a hit. Using the Cache Array Routing Protocol
(CARP) or other inter-cache communication protocols reduces duplication.
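One way to see why hash-based routing avoids duplication: if every URL is deterministically assigned to exactly one cache, no object needs to be stored twice. The Python sketch below illustrates that idea with a generic "highest score wins" hash. It is not the hash function or load-factor scheme that CARP itself defines, and the function name and cache names are hypothetical.

```python
import hashlib


def pick_cache(url, caches):
    """Return the one cache responsible for url (highest hash score wins)."""

    def score(cache_name):
        # Combine cache name and URL so that each pair gets its own score
        digest = hashlib.sha256((cache_name + "|" + url).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(caches, key=score)
```

Every client that uses the same list of caches sends a given URL to the same cache. If some other cache is removed from the list, the URLs that belonged to the remaining caches keep their assignment.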
Raw
bandwidth is not the only issue affecting the efficiency and speed of your
cache system. Choosing the right hardware and software also presents its own
challenges.
4. Squid Configuration
4.1 The Configuration File
All Squid configuration files are kept in the directory /usr/local/squid/etc.
Although there is more than one file in this directory, the squid.conf
file is of primary importance to most administrators. There are
hundreds of option tags in this file, but you need to change only eight options in order to get
Squid up and running. The other options provide great flexibility, and you
can learn more about them by experimenting once you have Squid functioning smoothly.
Squid
assumes that you wish to use the default value if there is no occurrence of a
tag in the squid.conf file. Theoretically, you can even run Squid with a
zero-length configuration file.
4.2 Setting Squid's HTTP Port
The first option in the squid.conf file sets the HTTP port(s) on which Squid
listens for incoming requests. Squid's default HTTP port is 3128.
You can also use multiple ports by appending a second port number to the http_port
tag. Consider the following example:
http_port 3128 8080
4.3 Storing Cached Data
The cache_dir tag in the squid.conf file
is used to configure specific storage areas. If you use more than one disk for
the cached data, you may need more than one mount point
(/usr/local/squid/cache1 for the first disk, /usr/local/squid/cache2 for the
second, for example). Squid allows you to have more than one cache_dir option
in your config file:
cache_dir /usr/local/squid/cache1/ 100 16 256
cache_dir /usr/local/squid/cache2/ 100 16 256
The first option for the cache_dir tag sets
the directory where the data will be stored. By default, /cache/ is appended to
the installation prefix, and the result is used as the default directory.
The second option for the cache_dir tag is a
size value. Squid will store up to the specified amount of data in that directory.
The value is in megabytes. The default is 100
megabytes.
The other two options are more complex: they set the number of first-tier and second-tier sub-directories to create in this directory.
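To make the four options concrete, here is a small Python sketch that splits a cache_dir line into its parts and works out how many second-tier directories result. The function name and the returned dictionary keys are hypothetical. The sketch also assumes that a line with five arguments carries a storage type such as ufs in front of the path.

```python
def parse_cache_dir(line):
    """Split a cache_dir line into path, size and sub-directory counts."""
    fields = line.split()
    if not fields or fields[0] != "cache_dir":
        raise ValueError("not a cache_dir line: %r" % line)
    args = fields[1:]
    if len(args) == 5:
        args = args[1:]  # assumption: the first argument is a storage type, e.g. "ufs"
    if len(args) != 4:
        raise ValueError("expected: cache_dir [type] path size first-tier second-tier")
    path, size_mb, level1, level2 = args
    return {
        "path": path,
        "size_mb": int(size_mb),
        "level1": int(level1),
        "level2": int(level2),
        # every first-tier directory holds level2 second-tier directories
        "leaf_dirs": int(level1) * int(level2),
    }
```

For the values 16 and 256, the result is 16 x 256 = 4096 second-tier directories under the cache directory.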
4.4 E-mail for the Cache Administrator
If Squid dies, an e-mail notification is sent
to the address specified with the cache_mgr tag. This address is also appended
to the end of error pages returned to users if, for example, the remote machine
is unreachable.
5. Access Control Lists and Access Control Operators
Squid cannot be used in an ISP environment
without a sophisticated access control system. Indeed, Squid should not be used
in any environment without at least a basic access control system. It is
amazing how fast other Internet users discover that they can relay requests
through your cache, and then proceed to do so. Sometimes they do this in order
to obfuscate their real identity, and at other times because they have a fast
line to you, but a slow line to the remainder of the Internet.
5.1 Simple Access Control
Assume that you have a list of IP addresses that are allowed to have access to your
cache. If you want them to be able to access it with both HTTP and ICP, you
will have to enter the list of IP addresses twice:
Example 1. Theoretical Access List
http_access deny 10.0.1.0/255.255.255.0
http_access allow 10.0.0.0/255.0.0.0
icp_access allow 10.0.0.0/255.0.0.0
Rule sets like the ones given above are
straightforward enough for small organizations. For large organizations,
however, it is more convenient to create classes of users. You can then allow or
deny classes of users in more complex relationships. Consider the following
example, in which you duplicate the above-mentioned example with classes of
users:
Example 2. Access lists using Classes
# classes
acl mynet src 10.0.0.0/255.0.0.0
acl servernet src 10.0.1.0/255.255.255.0
# what HTTP access to allow classes
http_access deny servernet
http_access allow mynet
# what ICP access to allow classes
icp_access deny servernet
icp_access allow mynet
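With named classes, an easy mistake is to define an ACL under one name and refer to it under another, for example defining mynetwork but allowing mynet. The Python sketch below is a hypothetical helper, not a Squid tool. It scans configuration text and reports any name used in an http_access or icp_access rule that has no matching acl line. The predefined parameter is an assumption, meant for names that your Squid version defines for you.

```python
def undefined_acl_names(conf_text, predefined=()):
    """Return ACL names used in access rules but never defined with 'acl'."""
    defined = set(predefined)
    used = set()
    for raw_line in conf_text.splitlines():
        tokens = raw_line.split("#", 1)[0].split()  # strip comments, then split
        if len(tokens) >= 2 and tokens[0] == "acl":
            defined.add(tokens[1])
        elif len(tokens) >= 3 and tokens[0] in ("http_access", "icp_access"):
            # tokens[1] is allow/deny; the rest are ACL names, "!" negates one
            used.update(name.lstrip("!") for name in tokens[2:])
    return used - defined
```

An empty result means that every rule refers to a class that exists. The sketch ignores the order of the lines, so it does not catch a rule that appears before its acl definition.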
The squid.conf file that comes with Squid
includes ACLs that deny all HTTP requests. You must explicitly allow incoming
requests from an appropriate address range to use your cache. The squid.conf file
includes text that reads:
#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
In order to allow access to your client
machines, add rules similar to the ones given below. The default access-control
rules prevent users from exploiting your cache; it is advisable to leave them
in.
Example 3. Example Complete ACL list
#
# INSERT YOUR OWN RULE(S) HERE TO ALLOW ACCESS FROM YOUR CLIENTS
#
# acls for my network addresses
acl my-iplist-1 src 192.168.1.0/24
acl my-iplist-2 src 10.0.0.0/255.255.0.0
# Check that requests are from users on our network
http_access allow my-iplist-1
http_access allow my-iplist-2
icp_access allow my-iplist-1
icp_access allow my-iplist-2
# allow requests from the local machine (for testing and the like)
http_access allow localhost
# End of locally-inserted rules
http_access deny all
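The order of the rules in these examples matters: Squid checks access rules from top to bottom and acts on the first one that matches, which is why the narrow deny for the server network comes before the broad allow. The Python sketch below models that first-match behaviour with the standard ipaddress module. The function name and data layout are hypothetical, and the sketch returns None when no rule matches rather than modelling what Squid does in that case.

```python
import ipaddress


def first_match(rules, client_ip, acls):
    """Return the action of the first rule whose ACL contains client_ip."""
    address = ipaddress.ip_address(client_ip)
    for action, acl_name in rules:
        if address in acls[acl_name]:
            return action  # first match wins; later rules are never consulted
    return None  # no rule matched (not modelled in this sketch)


# Classes taken from the examples above, plus a catch-all "all" class
ACLS = {
    "servernet": ipaddress.ip_network("10.0.1.0/255.255.255.0"),
    "mynet": ipaddress.ip_network("10.0.0.0/255.0.0.0"),
    "all": ipaddress.ip_network("0.0.0.0/0"),
}
RULES = [("deny", "servernet"), ("allow", "mynet"), ("deny", "all")]
```

With these rules a client at 10.0.1.7 is denied even though it also belongs to mynet. Swapping the first two rules would let it through, because the broad allow would match first.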
6. Step-by-step Configuration Guide
Note: The following configuration will enable you
to run a cache with base-line configuration. To learn more about advanced
options, please visit http://www.squid-cache.org.
Install the Squid rpm from the Red Hat CD.
This will install its configuration file (squid.conf) in the directory /etc.
Open the file
/etc/squid.conf and follow the steps given below:
- Search for the line http_port, and change the port if required:
http_port 8080
This is the port that your users will use in order to access the Internet.
- Look for the line:
cache_mem 8 MB
Set the cache_mem value, in MB, according to your resources.
- Look for the lines:
cache_swap_low 90
cache_swap_high 95
Uncomment these lines if they are commented, and leave the values unchanged.
- Look for the lines:
ipcache_size 1024
ipcache_low 90
ipcache_high 95
Uncomment these lines, and leave their default values.
- Look for the line:
cache_dir ufs /var/spool/squid 100 16 256
Uncomment this line, and leave its default values.
- Look for the line:
cache_access_log /var/log/squid/access.log
The access.log file shows the web requests going through your proxy server, much like winproxy
in MS Windows. Uncomment it.
- Look for the line:
cache_log /var/log/squid/cache.log
Uncomment it.
- Look for the line:
cache_store_log /var/log/squid/store.log
Uncomment it.
- Look for the line:
pid_filename /var/run/squid.pid
Uncomment it.
- Look for the line:
log_fqdn off
Uncomment it, and change it to "on", so that it looks like this:
log_fqdn on
- Search for the line:
acl localhost src 127.0.0.1/255.255.255.255
Assume that your network ID is 192.168.1 with a sub-net mask of 255.255.255.0. Name
your network "lan", for example. Enter the following line below the
above-mentioned line:
acl lan src 192.168.1.0/255.255.255.0
- Search for the lines:
http_access allow localhost
http_access deny all
Add a permission for your group, i.e. lan, before the deny all line:
http_access allow localhost
http_access allow lan
http_access deny all
- Start the proxy server with the following command:
/etc/rc.d/init.d/squid start
Open a browser on any workstation and, in its proxy settings, enter the address of
the proxy server along with the port (8080 in this example).
You are now ready to browse the Internet.
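If you prefer to check the proxy from a script rather than a browser, the following Python sketch builds an HTTP client that sends its requests through the proxy, using only the standard urllib.request module. The function name is hypothetical, and the proxy address you pass in must be the address of your own Squid machine.

```python
import urllib.request


def build_proxy_opener(proxy_host, proxy_port=8080):
    """Return a urllib opener that routes HTTP requests through the proxy."""
    proxy_url = "http://%s:%d" % (proxy_host, proxy_port)
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)
```

For example, assuming the proxy runs at 192.168.1.1, build_proxy_opener("192.168.1.1").open("http://www.squid-cache.org/", timeout=10) should return a response if the lan rule allows your workstation. The request should then also appear in /var/log/squid/access.log.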