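Out of curiosity, I wanted to build something that logs in to a certain website automatically and scrapes some data. Everything else was finished, but the captcha still had to be typed in by hand, so I looked into Tesseract-OCR to try recognizing it automatically.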

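Preparation

Download Tesseract: https://digi.bib.uni-mannheim.de/tesseract/ Note that the 64-bit build may fail when training a model, complaining that icuuc63.dll is missing. If you don't want the hassle, just download the 32-bit build.
Add the installation directory to the Path environment variable. For example, the default path for the 32-bit build is C:\Program Files (x86)\Tesseract-OCR
Create a project and add the Tess4j Maven dependency: https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j
Download Tess4j from https://sourceforge.net/projects/tess4j/files/tess4j/ and copy its tessdata folder into the project's src folder

Java code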

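import java.io.File;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.Tesseract;

Tesseract tesseract = new Tesseract();
tesseract.setDatapath("src/tessdata");
// Language data file; eng is already the default and lives under src/tessdata
// More language data: https://github.com/tesseract-ocr/tessdata
tesseract.setLanguage("eng");
// Use the old engine mode and recognize only digits/letters; skip the next two lines if you are not recognizing captchas
// Note: in Tesseract 4.x, mode 0 selects the legacy engine and mode 1 selects LSTM only, so adjust the value for your version
tesseract.setOcrEngineMode(1);
tesseract.setTessVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ");

// doOCR throws TesseractException and ImageIO.read throws IOException, so call this from a method that handles both
String result = tesseract.doOCR(ImageIO.read(new File("path/to/image")));
System.out.println(result);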
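
The string returned by doOCR usually carries whitespace or a trailing newline, and sometimes stray characters, so it may be worth normalizing before submitting it as a captcha. Below is a minimal sketch, assuming the captcha contains only the characters from the whitelist above. The names CaptchaText and clean are made up for illustration and are not part of Tess4j.

```java
final class CaptchaText {
    // Same character set as the tessedit_char_whitelist value above
    private static final String ALLOWED = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Drops every character of the raw OCR output that is not in ALLOWED
    static String clean(String raw) {
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (ALLOWED.indexOf(c) >= 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Spaces and the trailing newline are removed
        System.out.println(clean(" 7K 9\n"));
    }
}
```

In the snippet above this would be applied as CaptchaText.clean(result) before using the value.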

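Binarizing the captcha image

The code above already recognizes text, but for captchas the accuracy is still not very high, so we can binarize the captcha image first.

Here is a utility method; adapt it as needed.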

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Converts the image to pure black and white and overwrites the original file as PNG
public static void toBinary(File imageFile) throws IOException {
    BufferedImage image = ImageIO.read(imageFile);
    int w = image.getWidth();
    int h = image.getHeight();
    BufferedImage bi = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
    // Brightness threshold: pixels at or below it become black, all others white
    double threshold = 170;
    int black = new Color(0, 0, 0).getRGB();
    int white = new Color(255, 255, 255).getRGB();
    for (int x = 0; x < w; x++) {
        for (int y = 0; y < h; y++) {
            int pixel = image.getRGB(x, y);
            int r = (pixel & 0xff0000) >> 16;
            int g = (pixel & 0xff00) >> 8;
            int b = pixel & 0xff;
            // Average of the three channels as a simple brightness measure
            float avg = (r + g + b) / 3f;
            bi.setRGB(x, y, avg <= threshold ? black : white);
        }
    }
    ImageIO.write(bi, "png", imageFile);
}
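
The method above overwrites the file on disk. Since doOCR is given a BufferedImage in the Java code earlier, an in-memory variant can skip the extra write. Below is a minimal sketch using the same averaging rule. The class name Binarizer, the method name binarize, and the threshold parameter are illustrative choices, and 170 is simply the value used above, not a threshold that suits every captcha.

```java
import java.awt.image.BufferedImage;

final class Binarizer {
    // Returns a new black-and-white image; pixels whose average channel value
    // is at or below threshold become black, all others white
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int x = 0; x < w; x++) {
            for (int y = 0; y < h; y++) {
                int pixel = src.getRGB(x, y);
                int r = (pixel >> 16) & 0xff;
                int g = (pixel >> 8) & 0xff;
                int b = pixel & 0xff;
                float avg = (r + g + b) / 3f;
                out.setRGB(x, y, avg <= threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0x202020); // dark pixel
        img.setRGB(1, 0, 0xE0E0E0); // light pixel
        BufferedImage bw = binarize(img, 170);
        System.out.printf("%08X %08X%n", bw.getRGB(0, 0), bw.getRGB(1, 0));
    }
}
```

The returned image can be passed directly to tesseract.doOCR(...) in place of ImageIO.read(...).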

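Training the recognition library

I recommend the reference article, which is quite detailed.

Download jTessBoxEditor: https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/
Open the jTessBoxEditor.jar file
Prepare some captcha samples. They must be black text on a white background; the binarized captcha images work for this
Choose Tools -> Merge TIFF, go to the folder containing the training samples, and select the sample images to train on
After you click Open, a save dialog appears. Save in the current folder and name the file zwp.test.exp0.tif . The tif file naming format is [language name].[font name].exp[custom number].tif
Generate the .box file by running tesseract zwp.test.exp0.tif zwp.test.exp0 batch.nochop makebox
Use jTessBoxEditor to correct the errors in the .box file
Choose Box Editor -> Open and open the zwp.test.exp0.tif generated earlier. It is linked to zwp.test.exp0.box automatically, so the two files must be in the same directory. When you finish adjusting, click Save to keep the changes.
Generate the font_properties file by running echo test 0 0 0 0 0 >font_properties . Here test 0 0 0 0 0 describes five properties of the font test (bold, italic, and so on). The name test must match the test in zwp.test.exp0.box.
Finally, generate a pile of files and merge them into the trained data file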

:: Generate the .tr feature file
tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train
:: Generate the character set file (unicharset)
unicharset_extractor zwp.test.exp0.box
:: Generate the shape clustering files (inttemp, pffmtable, shapetable)
mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
:: Generate the character normalization file (normproto)
cntraining zwp.test.exp0.tr
:: Rename the files
rename normproto zwp.normproto
rename inttemp zwp.inttemp
rename pffmtable zwp.pffmtable
rename shapetable zwp.shapetable
:: Merge into the .traineddata file
combine_tessdata zwp.
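
The last command should leave a zwp.traineddata file in the working directory. For Tess4j to find it, it needs to sit in the tessdata folder next to eng.traineddata, after which the language can be selected with tesseract.setLanguage("zwp"). Below is a small sketch of the copy step, assuming training ran in the current directory and the project keeps its data under src/tessdata as above. TrainedDataInstaller and install are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class TrainedDataInstaller {
    // Copies <lang>.traineddata from workDir into tessdataDir and returns the target path
    static Path install(Path workDir, Path tessdataDir, String lang) throws IOException {
        Path source = workDir.resolve(lang + ".traineddata");
        if (!Files.isRegularFile(source)) {
            throw new IOException("Missing trained data file: " + source);
        }
        Files.createDirectories(tessdataDir);
        Path target = tessdataDir.resolve(source.getFileName());
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Assumed layout: training output in ".", project data in "src/tessdata"
        Path installed = install(Paths.get("."), Paths.get("src/tessdata"), "zwp");
        System.out.println("Installed " + installed);
    }
}
```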

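How to use it needs no further explanation, I suppose. Unfortunately, even after I corrected 30 captcha samples, recognition was still not accurate enough, so I'm giving up on this for now.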
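
With only 30 samples, some of the 36 whitelisted characters may appear just a few times in the training data, or not at all, which could be one factor behind the poor accuracy. Below is a small sketch for checking that coverage, assuming the usual .box line layout of symbol, left, bottom, right, top, page. BoxStats and countSymbols are hypothetical names, and this is a diagnostic aid rather than a fix.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BoxStats {
    // Counts how often each symbol occurs in the lines of a .box file.
    // Each non-empty line is expected to look like: <symbol> <left> <bottom> <right> <top> <page>
    static Map<String, Integer> countSymbols(List<String> boxLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : boxLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String symbol = trimmed.split("\\s+")[0];
            counts.merge(symbol, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("zwp.test.exp0.box"));
        countSymbols(lines).forEach((symbol, n) -> System.out.println(symbol + " " + n));
    }
}
```

Characters that are missing from the output, or that show up only once or twice, are candidates for collecting more samples.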
