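I've long been interested in regular expressions, so I just ran a small experiment: use a regular expression to pick out the hyperlinks contained in a website's page and save them to a txt file. The code is as follows: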
import java.io.BufferedReader;
import java.io.File;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
    public static void main(String[] args) {
        String url = "http://www.baidu.com"; // target website
        // Compile the pattern once, outside the loop; group 1 is the URL inside href="..."
        Pattern p = Pattern.compile("href=\"(http://.*?)\"");
        // The hyperlinks are written to links.txt on the desktop
        File f = new File("C:\\Users\\Administrator\\Desktop\\links.txt");
        try {
            // Connect to the website
            URL u = new URL(url);
            HttpURLConnection conn = (HttpURLConnection) u.openConnection();
            conn.connect();
            // try-with-resources closes the streams and the writer even if an exception is thrown
            try (InputStream is = conn.getInputStream();
                 BufferedReader br = new BufferedReader(new InputStreamReader(is));
                 FileWriter fw = new FileWriter(f)) {
                String str;
                while ((str = br.readLine()) != null) { // match the hyperlinks line by line
                    Matcher m = p.matcher(str);
                    while (m.find()) {
                        String link = m.group(1);
                        fw.write(link + "\r\n");
                        System.out.println(link); // for testing, also print each link to the console
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
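Running the program prints the hyperlinks on the Baidu home page (those written as absolute http:// URLs in a double-quoted href) and saves them to the txt file: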
http://www.baidu.com/gaoji/preferences.html
http://news.baidu.com/ns?cl=2&rn=20&tn=news&word=
http://tieba.baidu.com/f?kw=&fr=wwwt
http://zhidao.baidu.com/q?ct=17&pn=0&tn=ikaslist&rn=10&word=&fr=wwwt
http://music.baidu.com/search?fr=ps&key=
http://image.baidu.com/i?tn=baiduimage&ct=201326592&lm=-1&cl=2&nc=1&word=
http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&word=
http://map.baidu.com/m?word=&fr=ps01000
http://wenku.baidu.com/search?word=&lm=0&od=0
http://www.baidu.com/more/
http://shouji.baidu.com/baidusearch/mobisearch.html?ref=pcjg&from=1000139w
http://www.baidu.com/gaoji/preferences.html
http://news.baidu.com
http://tieba.baidu.com
http://zhidao.baidu.com
http://music.baidu.com
http://image.baidu.com
http://v.baidu.com
http://map.baidu.com
http://baike.baidu.com
http://wenku.baidu.com
http://www.hao123.com
http://www.baidu.com/more/
http://www.baidu.com/cache/sethelp/index.html
http://e.baidu.com/?refer=888
http://top.baidu.com
http://home.baidu.com
http://ir.baidu.com
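
The output above lists some links more than once (for example, http://www.baidu.com/gaoji/preferences.html and http://www.baidu.com/more/ each appear twice), because the page links to them from more than one place. If each link should be saved only once, one option is to collect the matches into a LinkedHashSet, which drops repeats while keeping first-seen order. The following is only a minimal sketch under that assumption: the class name LinkExtractor, the method extractLinks, and the sample HTML string are hypothetical and made up for illustration, and it works on an in-memory string so that it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {
    // Same idea as the pattern in the program above:
    // capture an absolute http:// URL inside a double-quoted href
    private static final Pattern HREF = Pattern.compile("href=\"(http://.*?)\"");

    // Returns each matched URL once, in the order it first appears
    public static List<String> extractLinks(String html) {
        Set<String> seen = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            seen.add(m.group(1)); // add() ignores a URL that is already in the set
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up sample input, not the real Baidu page
        String html = "<a href=\"http://news.baidu.com\">news</a>"
                + "<a href=\"http://www.baidu.com/more/\">more</a>"
                + "<a href=\"http://news.baidu.com\">news again</a>"
                + "<a href=\"/relative/path\">skipped</a>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

With this sample input, the repeated news link is kept once and the relative href is skipped, since the pattern only accepts URLs that start with http://. Links using https:// or single-quoted attributes would be missed as well, so the pattern would need widening to catch those.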