urlparse()
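urlparse() splits a URL into six parts and returns them as a named tuple (ParseResult): (scheme, netloc, path, params, query, fragment). The parts can be put back together with urlunparse(). The related functions urlsplit(), urlunsplit(), and urljoin() are described further on.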
def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    splitresult = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = splitresult
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    result = ParseResult(scheme, netloc, url, params, query, fragment)
    return _coerce_result(result)
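Note: the tuple returned by urlparse() can be used to determine the network protocol (HTTP, FTP, etc.), the server address, the file path, and so on.
Example code: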
from urllib.parse import urlparse
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
print(url.netloc)
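To make the six components more concrete, here is a minimal sketch that reads each field by name. The URL (host www.example.com, port 8080, the ;type=a parameter) is a made-up example chosen so that no component is empty; it does not come from the original post.

```python
from urllib.parse import urlparse

# Hypothetical URL, chosen so that all six components are populated.
result = urlparse('http://www.example.com:8080/path/page.html;type=a?x=1&y=2#top')

print(result.scheme)    # http
print(result.netloc)    # www.example.com:8080
print(result.path)      # /path/page.html
print(result.params)    # type=a
print(result.query)     # x=1&y=2
print(result.fragment)  # top

# netloc stays a single string, but hostname and port are exposed separately.
print(result.hostname, result.port)  # www.example.com 8080

# The result is still a tuple, so positional access works as well.
print(result[1] == result.netloc)    # True
```

Named access is usually easier to read than indexing, since the position of each field does not have to be remembered.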
urlunparse() assembles a tuple (scheme, netloc, path, params, query, fragment) into a correctly formatted URL.
def urlunparse(components):
    """Put a parsed URL back together again.  This may result in a
    slightly different, but equivalent URL, if the URL that was parsed
    originally had redundant delimiters, e.g. a ? with an empty query
    (the draft states that these are equivalent)."""
    scheme, netloc, url, params, query, fragment, _coerce_result = (
        _coerce_args(*components))
    if params:
        url = "%s;%s" % (url, params)
    return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))
Example code:
from urllib.parse import urlparse, urlunparse
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
url_join1 = urlunparse(url)
print(url_join1)
url_tuple = ("http", "www.baidu.com", "index.php", "", "username=dgw", "")
url_join2 = urlunparse(url_tuple)
print(url_join2)
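A common reason to parse and then reassemble a URL is to change a single component. The sketch below is one possible way to do that, assuming the result of urlparse() is treated as a named tuple and copied with _replace(). The new scheme and the extra page=2 query value are illustrative assumptions.

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('http://www.baidu.com/index.php?username=dgw')

# ParseResult is a named tuple, so _replace() returns a modified copy
# and leaves the original result untouched.
changed = parts._replace(scheme='https', query='username=dgw&page=2')

rebuilt = urlunparse(changed)
print(rebuilt)        # https://www.baidu.com/index.php?username=dgw&page=2
print(parts.scheme)   # http
```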
urlsplit() is mainly used to parse a URL string. It returns a 5-tuple (scheme, netloc, path, query, fragment). urlsplit() is much like urlparse(), except that it does not split the params out of the path.
def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    ......
Example code:
from urllib.parse import urlparse, urlsplit
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)
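The two results above differ only in the empty params field, so the difference is easier to see with a URL that actually carries a parameter. The following sketch uses a made-up URL containing ;type=a (an assumption for illustration) and compares how the two functions treat it.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ';type=a' parameter attached to the path.
url = 'http://www.example.com/index.php;type=a?username=dgw'

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the parameter from the path.
print(parsed.path, parsed.params)  # /index.php type=a

# urlsplit() leaves it inside the path.
print(split.path)                  # /index.php;type=a

print(len(parsed), len(split))     # 6 5
```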
urlunsplit() combines a 5-tuple (scheme, netloc, path, query, fragment), such as the one returned by urlsplit(), back into a complete URL string.
def urlunsplit(components):
    """Combine the elements of a tuple as returned by urlsplit() into a
    complete URL as a string. The data argument can be any five-item iterable.
    This may result in a slightly different, but equivalent URL, if the URL that
    was parsed originally had unnecessary delimiters (for example, a ? with an
    empty query; the RFC states that these are equivalent)."""
    scheme, netloc, url, query, fragment, _coerce_result = (
        _coerce_args(*components))
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
    ......
Example code:
from urllib.parse import urlparse, urlsplit, urlunsplit
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)
url3 = urlunsplit(url2)
print(url3)
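The docstring mentions two behaviors that the example above does not exercise: any five-item iterable is accepted, and the rebuilt URL may differ slightly from the original when that one had an unnecessary delimiter. A minimal sketch of both follows. The list of components and the trailing ? are illustrative assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Any five-item iterable works; a list is used here instead of a tuple.
# The path has no leading '/', so one is added because a netloc is present.
rebuilt = urlunsplit(['http', 'www.baidu.com', 'index.php', 'username=dgw', ''])
print(rebuilt)     # http://www.baidu.com/index.php?username=dgw

# A '?' followed by an empty query is dropped on the round trip.
original = 'http://www.baidu.com/index.php?'
round_trip = urlunsplit(urlsplit(original))
print(round_trip)  # http://www.baidu.com/index.php
```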
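urljoin() joins a base URL and a possibly relative URL to form an absolute version of the latter.
Note: if the base URL does not end with the character /, the rightmost segment of the base URL is replaced by the relative path.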
def urljoin(base, url, allow_fragments=True):
    """Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter."""
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    ......
Example code:
from urllib.parse import urljoin
url = urljoin('http://www.baidu.com/test/', 'index.php?username=dgw')
print(url)
url2 = urljoin('http://www.baidu.com/test', 'index.php?username=dgw')
print(url2)
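Besides the trailing-slash rule, the form of the second argument also affects the result. The sketch below shows a few further cases. The base URL with page.html and the www.example.com address are assumptions made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base URL whose last path segment is a file name.
base = 'http://www.baidu.com/test/page.html'

# A plain relative path replaces the last segment of the base.
same_dir = urljoin(base, 'index.php')
print(same_dir)   # http://www.baidu.com/test/index.php

# A path starting with '/' replaces the whole path of the base.
from_root = urljoin(base, '/index.php')
print(from_root)  # http://www.baidu.com/index.php

# '..' moves up one directory level before the path is appended.
parent = urljoin(base, '../index.php')
print(parent)     # http://www.baidu.com/index.php

# An absolute URL as the second argument ignores the base entirely.
absolute = urljoin(base, 'http://www.example.com/a')
print(absolute)   # http://www.example.com/a
```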