This is a web crawler I wrote. It searches the Heima (黑马) entrance-test forum page for the keyword "黑马". Why does the crawler find only 11 matches when the page source actually contains 51?

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeiMa {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://bbs.itheima.com/forum-19-1.html");

        // Read the response body line by line, using the platform default charset.
        BufferedReader br = new BufferedReader(new InputStreamReader(url.openConnection().getInputStream()));
        String regex = "[黑][马]";
        Pattern heimaregex = Pattern.compile(regex);
        String line = null;
        int x = 0;
        while ((line = br.readLine()) != null) {
            Matcher heimamacher = heimaregex.matcher(line);
            if (heimamacher.find()) {
                x++;
                System.out.println(heimamacher.group());
            }
        }
        // Print the final count.
        System.out.println(x);
    }
}
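A likely reason for the gap: `if (heimamacher.find())` runs once per line, so a line that contains the keyword several times still adds only 1 to the count. Replacing the `if` with a `while` loop counts every occurrence. The sketch below shows both counts side by side so they can be compared. The class name `HeiMaCount`, the helper methods, and the charset handling are illustrative assumptions, not part of the original program. In particular, the page's real encoding is not known here: the sketch defaults to UTF-8 and lets you pass another charset name (for example GBK) as the first argument, since decoding with the wrong charset could also hide matches.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeiMaCount {
    // Same keyword as the original regex "[黑][马]", written as a plain literal.
    private static final Pattern KEYWORD = Pattern.compile("黑马");

    // Counts every occurrence of the keyword, including repeats on the same line.
    static int countOccurrences(Reader source) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                Matcher m = KEYWORD.matcher(line);
                // while, not if: keep calling find() until the line has no more matches.
                while (m.find()) {
                    count++;
                }
            }
        }
        return count;
    }

    // Counts lines that contain the keyword at least once (what the original code measures).
    static int countMatchingLines(Reader source) throws IOException {
        int count = 0;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                if (KEYWORD.matcher(line).find()) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://bbs.itheima.com/forum-19-1.html");
        // Assumption: UTF-8 unless a charset name is given; check the page's
        // Content-Type header or <meta> tag for the real encoding.
        Charset charset = Charset.forName(args.length > 0 ? args[0] : "UTF-8");
        try (Reader in = new InputStreamReader(url.openConnection().getInputStream(), charset)) {
            System.out.println(countOccurrences(in));
        }
    }
}
```

If `countOccurrences` gets much closer to the number seen in the page source while `countMatchingLines` stays near the original result, the per-line `if` was the cause. If both stay low, the charset, or a difference between what the server sends to a browser and to a plain Java client, would be the next things to check.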