Technical exchange community of 传智教育 (A-share listed company, stock code 003032), Beijing Changping campus

© 梦缠绕的时候 (黑马粉丝团) / 2018-12-14 11:09 / 422 views / 1 reply / 0 favorites. Reposts must follow the CC license; commercial use of this article is prohibited.

Example 3:
Data extraction
Task: from a piece of HTML code, extract all email addresses and the link addresses inside <a href...> tags.
// HtmlTest.java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlTest {
    public static void main(String[] args) {
        // Sample HTML with quoted, single-quoted and unquoted href values.
        String htmlText = "<html>"
                + "<a href=\"testone@163.com\">163test</a>\n"
                + "<a href='www.163.com@163-com.com'>163news</a>\n"
                + "<a href=http://www.163.com>163lady</a>\n"
                + "<a href = http://sports.163.com>NetEase Sports</a>\n"
                + "<a href = \"http://gz.house.163.com\">NetEase Real Estate</a>\n"
                + ".leemaster@163" + "luckdog.com" + "</html>";
        System.out.println("Checking for emails");
        for (String email : extractEmail(htmlText)) {
            System.out.println("Email: " + email);
        }
        System.out.println("Checking for hyperlinks");
        for (String link : extractLink(htmlText)) {
            System.out.println("Link: " + link);
        }
    }

    private static List<String> extractLink(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.HREF_LINK_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            // m.group() is the whole match, including the leading "<a href="
            // and the closing quote or '>'; m.group(1) would be the href value only.
            result.add(m.group());
        }
        return result;
    }

    private static List<String> extractEmail(String htmlText) {
        List<String> result = new ArrayList<String>();
        Pattern p = Pattern.compile(Regexes.EMAIL_REGEX);
        Matcher m = p.matcher(htmlText);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }
}

// Regexes.java
public class Regexes {
    // Case-insensitive; local part, '@', one or more dot-terminated domain labels,
    // then a 2-4 letter top-level domain. (?<=\b) and (?=\b) behave like a plain \b.
    public static final String EMAIL_REGEX =
            "(?i)(?<=\\b)[a-z0-9][-a-z0-9_.]+[a-z0-9]@([a-z0-9][-a-z0-9]+\\.)+[a-z]{2,4}(?=\\b)";
    // Case-insensitive; optional quote after '=', href value captured in group 1.
    public static final String HREF_LINK_REGEX
            = "(?i)<a\\s+href\\s*=\\s*['\"]?([^'\"\\s>]+)['\"\\s>]";
}

Output:
Checking for emails
Email: testone@163.com
Email: www.163.com@163-com.com
Email: leemaster@163luckdog.com
Checking for hyperlinks
Link: <a href="testone@163.com"
Link: <a href='www.163.com@163-com.com'
Link: <a href=http://www.163.com>
Link: <a href = http://sports.163.com>
Link: <a href = "http://gz.house.163.com"
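
The hyperlink results above include the tag prefix and the closing delimiter, because the whole match is collected. If only the address is wanted, capture group 1 of the same pattern already holds it. The following is a minimal sketch under that assumption; the class name HrefValueExtractor and the method extractHrefValues are hypothetical and not part of the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original post.
class HrefValueExtractor {
    // Same pattern text as HREF_LINK_REGEX in the post; group 1 is the href value.
    static final Pattern HREF_LINK = Pattern.compile(
            "(?i)<a\\s+href\\s*=\\s*['\"]?([^'\"\\s>]+)['\"\\s>]");

    static List<String> extractHrefValues(String htmlText) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF_LINK.matcher(htmlText);
        while (m.find()) {
            // group(1) drops the "<a href=" prefix and the trailing quote or '>'.
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<a href=\"testone@163.com\">163test</a>\n"
                + "<a href = http://sports.163.com>sports</a>\n";
        for (String link : extractHrefValues(html)) {
            System.out.println(link);
        }
    }
}
```

Note that this pattern only recognizes tags where href is the first attribute after <a, so it is a teaching pattern rather than a general HTML parser.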
Example 4:
Finding duplicate words
Task: check whether a piece of text contains a repeated word and, if it does, remove the repetition.
// FindWord.java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindWord {
    public static void main(String[] args) {
        String[] sentences = new String[] { "this is a normal sentence",
                "Oh,my god!Duplicate word word",
                "This sentence contain no duplicate word words" };
        for (String sentence : sentences) {
            System.out.println("Checking sentence: " + sentence);
            if (containDupWord(sentence)) {
                System.out.println("Duplicate word found!!");
                System.out.println("After removing duplicate words: " + removeDupWords(sentence));
            }
            System.out.println("");
        }
    }

    private static String removeDupWords(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        // "$1" keeps a single copy of the repeated word.
        return sentence.replaceAll(regex, "$1");
    }

    private static boolean containDupWord(String sentence) {
        String regex = Regexes.DUP_WORD_REGEX;
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(sentence);
        return m.find();
    }
}

// Regexes.java (this constant belongs in the same Regexes class shown earlier)
public class Regexes {
    // A word, whitespace, then the same word again (\1), ending on a word boundary
    // so that "word words" is not treated as a repetition.
    public static final String DUP_WORD_REGEX
            = "(?<=\\b)(\\w+)\\s+\\1(?=\\b)";
}

Output:
Checking sentence: this is a normal sentence

Checking sentence: Oh,my god!Duplicate word word
Duplicate word found!!
After removing duplicate words: Oh,my god!Duplicate word

Checking sentence: This sentence contain no duplicate word words
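
One limitation of the pattern above: each match covers exactly two copies of a word, so a run of three copies is reduced to two rather than one. The following is a minimal sketch of one possible variant that lets a single match cover the whole run. The class name DupRunRemover, the constant DUP_RUN and the method removeDupRuns are hypothetical and not part of the original post.

```java
import java.util.regex.Pattern;

// Hypothetical variant for illustration; not part of the original post.
class DupRunRemover {
    // (\w+) captures a word; (?:\s+\1\b)+ then consumes one or more further
    // copies of it, each ending on a word boundary.
    static final Pattern DUP_RUN = Pattern.compile("\\b(\\w+)(?:\\s+\\1\\b)+");

    static String removeDupRuns(String sentence) {
        // Replace the whole run with a single copy of the captured word.
        return DUP_RUN.matcher(sentence).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(removeDupRuns("word word word"));       // prints "word"
        System.out.println(removeDupRuns("duplicate word words")); // unchanged
    }
}
```

Precompiling the Pattern once, instead of calling String.replaceAll with the regex text each time, also avoids recompiling the expression on every call.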

1 reply

Nice