CYWORLD

래브's Cyworld home

Regular expression to catch HTML tags

First of all, catching HTML tags with nothing but a regular expression is not a good approach.
A better way to parse HTML is to use a well-made module such as HTML::Parser in Perl.
But I need a *SINGLE-LINE* regular expression for my work,
so I am working on one.
The format of an HTML tag is,
B : beginning of the tag, '<' plus the tag element (e.g. '<img', '<!DOCTYPE', '<?xml')
A : middle of the tag, an attribute and its value (e.g. 'src=foo', 'alt=bar')
E : end of the tag, i.e. '>'
The regular expression for an HTML tag is then,
BA*E
Adding the symbol 's' as a whitespace separator for more precision,
B(s|sA)*E
And, assuming that A consists of,
n : any character except ['">\s]
q : any characters between single quotation marks
w : any characters between double quotation marks
a more plausible regular expression is,
B(s|(s(n|q|w)*))*E
In the real world, these symbols become,
B : <[\/@!?#]?[^\W_]+
s : \s
n : [^'">\s]
q : '[^']*'
w : "[^"]*"
E : >
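As a rough sketch of how the symbols above fit together, the pattern can be assembled piece by piece with qr//. The variable names ($B, $s, $n, $q, $w, $E, $tag) and the sample strings are only illustrative and do not come from the test script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Building blocks, named after the symbols in the text (names are illustrative).
my $B = qr/<[\/@!?#]?[^\W_]+/;   # '<', an optional marker, then alphanumerics
my $s = qr/\s/;                  # one whitespace character
my $n = qr/[^'">\s]/;            # one unquoted attribute character
my $q = qr/'[^']*'/;             # a single-quoted value
my $w = qr/"[^"]*"/;             # a double-quoted value
my $E = qr/>/;                   # the closing '>'

# B(s|(s(n|q|w)*))*E
my $tag = qr/$B(?:$s|(?:$s(?:$n|$q|$w)*))*$E/;

# Made-up samples: three tags and one piece of plain text.
for my $sample ('<img src=foo alt="a > b">', '<br />', '</div>', 'plain text') {
    printf "%-28s %s\n", $sample, ($sample =~ /\A$tag\z/ ? 'tag' : 'not a tag');
}
```

Because quoted values are consumed by q and w as a unit, a '>' inside a quoted attribute does not end the tag early.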
Finally, the regular expression for HTML tags is,
(<([\/@!?#]?[^\W_]+)(?:\s|(?:\s(?:[^'">\s]|'[^']*'|"[^"]*")*))*>)|(<\!--[^-]*-->)
This regexp captures the complete HTML tag in $1, the tag element in $2, and a comment in $3.
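To make the three capture groups concrete, here is a minimal sketch that runs the pattern over a short, made-up HTML string (the $html value and the @found array are assumptions for illustration only).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern from the text: $1 = whole tag, $2 = tag element, $3 = comment.
my $tagrex = qr/(<([\/@!?#]?[^\W_]+)(?:\s|(?:\s(?:[^'">\s]|'[^']*'|"[^"]*")*))*>)|(<\!--[^-]*-->)/;

# Hypothetical sample input, made up for this illustration.
my $html = '<!-- header --><a href="#" title=\'x > y\'>link</a><br />';

my @found;
while ($html =~ /$tagrex/g) {
    if (defined $3) {
        # The second alternative matched: only $3 is set.
        push @found, "comment: $3";
    }
    else {
        # The first alternative matched: $1 and $2 are set.
        push @found, "tag: $1 (element: $2)";
    }
}
print "$_\n" for @found;
```

Note that $2 does not include the leading '<', so a closing tag such as '</a>' yields the element '/a'.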
I wrote a simple Perl script to test this pattern,
#!/usr/bin/perl
use strict;
#no strict 'refs';
use warnings;
use Carp qw(confess);
$SIG{__DIE__}  = \&confess;
$SIG{__WARN__} = \&confess;
#use Carp::Always;
use diagnostics;
use Time::HiRes qw(gettimeofday tv_interval);

# Fetch the page source and compile the tag pattern.
my $input  = `lynx --source http://www.cyworld.com`;
my $tagrex = qr/(<([\/@!?#]?[^\W_]+)(?:\s|(?:\s(?:[^'">\s]|'[^']*'|"[^"]*")*))*>)|(<\!--[^-]*-->)/;

# Sanity check: make sure the pattern can be used in a match.
eval { "not_match_string" =~ /$tagrex/ };
die "invalid pattern : $@\n" if $@;

my ($t0, $t1, $t02t1);
my (@taglists, @cmmtlists);

# Collect tags ($1) and comments ($3), timing the whole scan.
$t0 = [gettimeofday];
while ($input =~ m/$tagrex/ig) {
    push @taglists,  $1 if defined $1;
    push @cmmtlists, $3 if defined $3;
}
$t1 = [gettimeofday];
$t02t1 = tv_interval $t0, $t1;

print "Elapsed time : " . sprintf("%f", $t02t1) . " seconds for " . length($input) . " length string.\n";
print "\n";
print '=' x 50 . "\n";
print "HTML TAG LISTS : " . scalar(@taglists) . " items\n";
print '=' x 50 . "\n";
print join "\n", @taglists;
print "\n";
print '=' x 50 . "\n";
print "HTML COMMENT LISTS : " . scalar(@cmmtlists) . " items\n";
print '=' x 50 . "\n";
print join "\n", @cmmtlists;
exit;
Result,
Elapsed time : 0.026232 seconds for 96155 length string.
==================================================
HTML TAG LISTS : 2046 items
==================================================
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko" lang="ko"><head><meta http-equiv="Content-Type" content="text/html; charset=euc-kr" /><meta name="robots" content="nofollow" /><title></title><link rel="stylesheet" type="text/css" href="/css/main/cymain8.css?ver=200901071000" /><link rel="shortcut icon" href="http://c1img.cyworld.co.kr/img/favicon.ico" /><script type="text/javascript"></script><script type="text/javascript" src="http://www.cyworld.com/MainNET/Service/Main/Interlock/NewsRank.aspx"></script><script type="text/javascript" src="http://www.cyworld.com/MainNET/Service/Main/Interlock/jsonEmpasRealKeyword.aspx"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_SearchKeyword.js?ver=200901110505"></script><script type="text/javascript" src="/cymain/v8/main/data/json/DiffService_EmpasRank.js?ver=200901110450"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_SvcDic.js?ver=200901081048"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_eyefocus.js?ver=200901110502"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_lifestyle.js?ver=200901110502"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_main_market_auction.js?ver=200901110430"></script><script type="text/javascript" src="/cymain/v8/main/data/json/json_main_market_11st.js?ver=200901110431"></script></head><body oncontextmenu="return false" ondragstart="return false" onselectstart="return false"><script type="text/javascript" src="/js/cy_prototype_light.js"></script><script type="text/javascript" src="/js/cyworld_script_7.0_plus_mini.js"></script><script type="text/javascript" src="http://www.cyworld.com/js/NameUI/CyworldScript.aspx?t=M"></script><script type="text/javascript" 
src="/cymain/v8/main/js/cyworld_main_v8.0.1.js?ver=200811101033"></script><script type="text/javascript" src="/js/pcid_cy.asp?w=main"></script><script type="text/javascript" src="/css/main/mainui.js?ver=200809181350"></script><div id="wrap"><div id="header"><h1><a href="#" title="사이좋은 사람들 싸이월드" onmousedown="setNDRClick('RBI01');"></a></h1><div id="flashBI"><script type="text/javascript"></script></div><script type="text/javascript"></script><ul class="navigation"><li id="gnb_minihompy" class="minihompy"><a href="http://www.cyworld.com/pims/mhsection/mh_index.asp" onmousedown="setNDRClick('RGN01');" title="미니홈피">
(... omitted ...)
</a><a href="#" class="cancel" onclick="showCloseLayer(null, false);"></a></p><form id="" name=""><fieldset><legend></legend><input type="checkbox" id="alert_check" name="alert_check" value="" onclick="alertCheck();" /><label for="alert_check"></label></fieldset></form><div class="close"><a href="#" onclick="showCloseLayer(null, false);"></a></div></div></div><iframe width="0" height="0" scrolling="no" frameborder="0" src="/event/ppl/ppl_info.html"></iframe><iframe src="/pims/Tsection/townmall/itemgift/item_gift_receive_iframe.asp" style="width: 0px; height: 0px; position: absolute; visibility: hidden;"></iframe><img src="http://stat.cyworld.com/stat/stat.tiff?cp_url=[cyworld_ndr.nate.com/main/start_a/]" width="0" height="0" border="0" alt="" /><br /><script type="text/javascript"></script></body></html>
==================================================
HTML COMMENT LISTS : 97 items
==================================================
<!-- JSON Data --><!-- ***** 헤더영역 : S ***** --><!-- BI and GNB : S --><!-- 기본 링크 --><!-- class : blog / blog_up / blog_new / 뮤직의 경우는 onmouseout의 속성도 변경해주어야합니다--><!-- BI and GNB : E --><!-- 통합검색 : S --><!-- 자동완성 키패드 iframe --><!-- 자동완성 버튼 --><!-- 통합검색 : E --><!-- 화제의 미니홈피 --><!-- 화제의 미니홈피 : E --><!-- 왜떴을까? --><!-- 왜떴을까? : E --><!-- ***** 헤더영역 : E ***** --><!-- ***** Left 컨텐츠 : S ***** --><!-- 로그인후 개인화 영역 --><!-- 로그인전 --><!-- 로그인전 : E --><!-- Xecure Key --><!-- 키보드 보안 기능을 사용할 수 없습니다 --><!-- 키보드 보안 서비스를 이용하시려면 프로그램을 설치해 주세요 --><!-- 보안수준 3단계를 이용하시려면 프로그램을 설치해 주세요 --><!--
<script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/base64.js"></script><script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/jsbn.js"></script><script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/prng4.js"></script><script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/rng.js"></script><script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/rsa.js"></script><script type="text/javascript" src="http://cyxso.cyworld.com/cryptlogin/enclogin.js"></script>
--><!-- 동영상 : S --><!-- 동영상 : E --><!-- 배너광고 : S --><!-- 배너광고 : E -->
(... omitted ...)
<!-- ***** 광고 : S ***** --><!-- 메인중앙 --><!-- 메인좌측 --><!-- 메인우측 --><!-- ***** 광고 : E ***** --><!-- 마이커버스토리 레이어 팝업 --><!-- 마이커버스토리 레이어 팝업 : E --><!-- Session --><!-- PPL 시작 --><!-- 타운알림 --><!-- Stat --><!-- mybase 처리 --><!-- mybase 처리 : E -->
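One caveat about the comment part of the pattern: `<\!--[^-]*-->` only matches a comment whose body contains no '-' at all, so a comment with a hyphen inside it is skipped rather than collected in $3. The sketch below shows this on a made-up string and compares it with a hypothetical variant that lets the first '-->' close the comment. The names $strict_cmmt and $loose_cmmt and the sample text are assumptions, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Comment alternative from the text, and a hypothetical hyphen-tolerant variant.
my $strict_cmmt = qr/(<\!--[^-]*-->)/;
my $loose_cmmt  = qr/(<!--(?s:.*?)-->)/;   # assumption: the first '-->' closes the comment

# Made-up sample: the second comment contains a hyphen and a newline in its body.
my $html = "<!-- plain --><p>text</p><!-- left-side\nmenu -->";

# In list context, a /g match returns every captured comment.
my @strict = $html =~ /$strict_cmmt/g;
my @loose  = $html =~ /$loose_cmmt/g;

print "original pattern : ", scalar(@strict), " comment(s)\n";   # 1
print "variant pattern  : ", scalar(@loose),  " comment(s)\n";   # 2
```

Whether the looser variant is acceptable depends on the input. It is only one possible trade-off, and it was not used for the result shown above.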
