ó õùPc@sdZddlZddlZejdƒZejdƒZejdƒZejdƒZejdƒZejdƒZ ejd ƒZ ejd ƒZ ejd ƒZ ejd ƒZ ejd ejƒZejd ƒZejdƒZdefd„ƒYZdejfd„ƒYZdS(sA parser for HTML and XHTML.iÿÿÿÿNs[&<]s<(/|\Z)s &[a-zA-Z#]s%&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]s)&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]s <[a-zA-Z]t>s--\s*>s[a-zA-Z][-.a-zA-Z0-9:_]*s_\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?sê <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name (?:\s+ # whitespace before attribute name (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name (?:\s*=\s* # value indicator (?:'[^']*' # LITA-enclosed value |\"[^\"]*\" # LIT-enclosed value |[^'\">\s]+ # bare value ) )? ) )* \s* # trailing whitespace s#tHTMLParseErrorcBs#eZdZdd„Zd„ZRS(s&Exception raised for all parse errors.cCs3|s t‚||_|d|_|d|_dS(Nii(tAssertionErrortmsgtlinenotoffset(tselfRtposition((s..\python\lib\HTMLParser.pyt__init__4s   cCsW|j}|jdk r,|d|j}n|jdk rS|d|jd}n|S(Ns , at line %ds , column %di(RRtNoneR(Rtresult((s..\python\lib\HTMLParser.pyt__str__:s  N(NN(t__name__t __module__t__doc__R RR (((s..\python\lib\HTMLParser.pyR1s t HTMLParsercBsøeZdZdZd„Zd„Zd„Zd„Zd„ZdZ d„Z d „Z d „Z d „Zd „Zd „Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„Zd„ZdZd„ZRS(sÇFind tags and other markup and call handler functions. Usage: p = HTMLParser() p.feed(data) ... p.close() Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). Entity references are passed by calling self.handle_entityref() with the entity reference as the argument. Numeric character references are passed to self.handle_charref() with the string containing the reference as the argument. tscripttstylecCs|jƒdS(s#Initialize and reset this instance.N(treset(R((s..\python\lib\HTMLParser.pyRZscCs/d|_d|_t|_tjj|ƒdS(s1Reset this instance. Loses all unprocessed data.ts???N(trawdatatlasttagtinteresting_normalt interestingt markupbaset ParserBaseR(R((s..\python\lib\HTMLParser.pyR^s   cCs!|j||_|jdƒdS(sFeed data to the parser. Call this as often as you want, with as little or as much text as you want (may include ' '). iN(Rtgoahead(Rtdata((s..\python\lib\HTMLParser.pytfeedescCs|jdƒdS(sHandle any buffered data.iN(R(R((s..\python\lib\HTMLParser.pytclosenscCst||jƒƒ‚dS(N(Rtgetpos(Rtmessage((s..\python\lib\HTMLParser.pyterrorrscCs|jS(s)Return full source of start tag: '<...>'.(t_HTMLParser__starttag_text(R((s..\python\lib\HTMLParser.pytget_starttag_textwscCs t|_dS(N(tinteresting_cdataR(R((s..\python\lib\HTMLParser.pytset_cdata_mode{scCs t|_dS(N(RR(R((s..\python\lib\HTMLParser.pytclear_cdata_mode~sc Csø|j}d}t|ƒ}xŽ||kr«|jj||ƒ}|rT|jƒ}n|}||kr}|j|||!ƒn|j||ƒ}||krŸPn|j}|d|ƒrÅtj ||ƒrÛ|j |ƒ}n¯|d|ƒrü|j |ƒ}nŽ|d|ƒr|j |ƒ}nm|d|ƒr>|j |ƒ}nL|d|ƒr_|j|ƒ}n+|d|kr‰|jdƒ|d}nP|dkr°|r¬|jdƒnPn|j||ƒ}q|d |ƒrtj ||ƒ}|rP|jƒd d !} |j| ƒ|jƒ}|d |dƒs8|d}n|j||ƒ}qq¨d ||kr‰|j|dd !ƒ|j|d ƒ}nPq|d |ƒr–tj ||ƒ}|r|jdƒ} |j| ƒ|jƒ}|d |dƒsü|d}n|j||ƒ}qntj ||ƒ}|r\|rX|jƒ||krX|jdƒnPq¨|d|kr’|jd ƒ|j||dƒ}q¨Pqdstdƒ‚qW|rç||krç|j|||!ƒ|j||ƒ}n|||_dS(Nits s junk characters in start tag: %ri(Rs/>(R R!tcheck_for_whole_start_tagRttagfindR0RR9tlowerRtattrfindR7tunescapetappendtstripRtcountR)trfindR tendswiththandle_startendtagthandle_starttagtCDATA_CONTENT_ELEMENTSR$(RR=tendposRtattrsR0R@ttagtmtattrnametrestt attrvalueR9RR((s..\python\lib\HTMLParser.pyR1ãsP     $$     cCs|j}tj||ƒ}|rò|jƒ}|||d!}|dkrR|dS|dkr²|jd|ƒrx|dS|jd|ƒrŽdS|j||dƒ|jdƒn|dkrÂdS|d krÒdS|j||ƒ|jd ƒntd ƒ‚dS( NiRt/s/>iiÿÿÿÿsmalformed empty start tagRs6abcdefghijklmnopqrstuvwxyz=/ABCDEFGHIJKLMNOPQRSTUVWXYZsmalformed start tagswe should not get here!(RtlocatestarttagendR0R9R.R-R R(RR=RRUR?tnext((s..\python\lib\HTMLParser.pyREs*      cCs¾|j}|||d!dks,tdƒ‚tj||dƒ}|sLdS|jƒ}tj||ƒ}|sŽ|jd|||!fƒn|jdƒ}|j |j ƒƒ|j ƒ|S(Niss&