
A Tutorial on Processing Text with Ruby

2023-05-26


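Like Perl and Python, Ruby has excellent features that make it a powerful text-processing language. This article briefly introduces Ruby's text-processing capabilities and shows how to use Ruby to process text data in different formats effectively, whether the data is CSV or XML.
Ruby strings
Common acronyms

CSV: Comma-separated values
REXML: Ruby Electric XML
XML: Extensible Markup Language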

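In Ruby, String is a powerful way to hold, compare, and manipulate text data. String is a class, and you can instantiate it by calling String::new or by assigning a literal value.

When assigning a value to a String, you can enclose the value in single quotes (') or double quotes ("). The two forms differ in a few ways. Double quotes support escape sequences introduced by a leading backslash (\) and evaluate expressions embedded in the string with the #{} operator. A single-quoted string is taken as plain literal text.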

Listing 1 shows an example.
Listing 1. Working with Ruby strings: defining strings

    message = 'Heal the World…'
    puts message
    message1 = "Take home Rs #{100*3/2} "
    puts message1

    # Output:
    # ./string1.rb
    # Heal the World…
    # Take home Rs 150

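Here, the first string is defined with a pair of single quotes and the second with a pair of double quotes. In the second string, the expression inside #{} is evaluated before the string is displayed.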
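Listing 1 shows interpolation but not escape sequences. The following is a minimal sketch of how the two quoting styles treat the same text; the variable names are made up for illustration.

```ruby
# Single quotes keep backslash sequences and #{} exactly as typed;
# double quotes process both.
single = 'Tab:\t sum: #{1 + 2}'
double = "Tab:\t sum: #{1 + 2}"

puts single   # shows the backslash, the letter t, and the #{...} text as typed
puts double   # shows a real tab character and the computed value

# Inside single quotes, only \' and \\ are treated as escapes.
quoted = 'It\'s a backslash: \\'
puts quoted
```

Single quotes are a reasonable default when a string should be taken literally, for example a regular expression source or a Windows path.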

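Another useful way to define a string, the here document, is typically used for multi-line strings.

From here on, I use the interactive Ruby console, shown with the irb>> prompt, for the examples. Your Ruby installation should include this console. If it does not, get the irb Ruby gem and install it. The Ruby console is a very useful tool for learning Ruby and its modules. Once it is installed, run it with the irb command.
Listing 2. Working with Ruby strings: defining multi-line strings

    irb>> str = <<EOF
    irb>> hello, world
    irb>> how do you feel?
    irb>> how r u?
    irb>> EOF
    "hello, world\nhow do you feel?\nhow r u?\n"
    irb>> puts str
    hello, world
    how do you feel?
    how r u?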

In Listing 2, everything between <<EOF and EOF is treated as part of the string, including the \n (newline) characters.
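A here document behaves like a double-quoted string unless its terminator is written in single quotes. The sketch below illustrates that difference outside irb; the variable names are hypothetical.

```ruby
name = "world"

# A plain terminator: the body is interpolated like a double-quoted string.
interpolated = <<EOF
hello, #{name}
how do you feel?
EOF

# A single-quoted terminator: the body is taken literally.
literal = <<'EOF'
hello, #{name}
EOF

puts interpolated
puts literal
p interpolated.lines.length   # each line keeps its trailing newline
```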

The Ruby String class has a powerful set of methods for manipulating and processing the data stored in a string. The examples in Listings 3, 4, and 5 show some of these methods.
Listing 3. Working with Ruby strings: concatenating strings

    irb>> str = "The world for a horse"   # String initialized with a value
    The world for a horse
    irb>> str*2        # Multiplying by an integer returns a new string
                       # containing that many copies of the old string.
    The world for a horseThe world for a horse
    irb>> str + " Who said it ? "   # Concatenation using the '+' operator
    The world for a horse Who said it ?
    irb>> str<<" is it? "           # Concatenation using the '<<' operator
                                    # (this one modifies str in place)
    The world for a horse is it?

Extracting substrings and manipulating parts of a string
Listing 4. Working with Ruby strings: extracting and manipulating

    irb>> str = "The world for a horse"
    The world for a horse
    irb>> str[0]       # The '[]' operator can be used to extract substrings, just
                       # like accessing entries in an array. The index starts from 0.
    T                  # A single index returns the one-character string at that
                       # position. (Ruby 1.8 returned the character code, 84;
                       # use str[0].ord to get it in current versions.)
    irb>> str[0,5]     # A range can be specified as a pair. The first value is the
                       # starting index, the second is the length of the substring.
    The w
    irb>> str[16,5]="Ferrari"   # The same '[]' operator can be used to replace
                                # substrings by using the assignment form '[]='
    irb>> str
    The world for a Ferrari
    irb>> str[10..22]  # The range can also be specified using [x1..x2]
    for a Ferrari
    irb>> str[" Ferrari"]=" horse"   # A substring can be named and replaced by a new
                                     # string. Ruby strings adjust their size to fit
                                     # the replacement.
    irb>> str
    The world for a horse
    irb>> str.split    # split divides the string on the given delimiter (whitespace
                       # by default) and returns an array of strings.
    ["The", "world", "for", "a", "horse"]
    irb>> str.each_line(' ') { |word| p word.chomp(' ') }
                       # each_line processes the string in a block, splitting it on a
                       # record separator (String#each in Ruby 1.8). chomp(' ') cuts
                       # off the trailing space.
    "The"
    "world"
    "for"
    "a"
    "horse"

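The Ruby String class offers many other practical methods, which change case, return the string length, remove record separators, scan the string, compute a one-way hash (crypt), and so on. Another useful method is freeze, which makes a string unmodifiable. After you call it on a String str (str.freeze), str can no longer be modified.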
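A few of these methods can be tried together. This is only a sketch with made-up variable names; it rescues RuntimeError because FrozenError, raised by current Ruby versions, is a subclass of it.

```ruby
str = "The world for a horse\n"

puts str.length         # number of characters, including the newline
puts str.chomp.length   # chomp returns a copy without the trailing record separator
puts str.swapcase       # a copy with upper and lower case swapped
p str.scan(/o\w/)       # scan collects every match of the pattern into an array

frozen = "fixed".freeze
begin
  frozen << " text"     # any in-place modification of a frozen string fails
rescue RuntimeError => e
  puts "cannot modify: #{e.class}"
end
p frozen.frozen?
```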

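Ruby also has so-called "destructive" methods. A method whose name ends in an exclamation mark (!) modifies the string permanently. Regular methods (with no trailing exclamation mark) modify and return a copy of the string they are called on, whereas methods with an exclamation mark directly modify the string they are called on.
Listing 5. Working with Ruby strings: permanently modifying a string

    irb>> str = "hello, world"
    hello, world
    irb>> str.upcase
    HELLO, WORLD
    irb>> str          # str remains as it was.
    hello, world
    irb>> str.upcase!  # Here, str gets modified because of the '!' at the
                       # end of upcase.
    HELLO, WORLD
    irb>> str
    HELLO, WORLD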

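In Listing 5, the upcase! method modifies the string in str, whereas the upcase method only returns a case-converted copy of the string. The ! methods are useful when you want to change a string in place without creating a copy.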
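One detail worth knowing when using the ! methods: many of them, upcase! included, return nil when they have nothing to change. A short sketch with hypothetical values:

```ruby
str = "HELLO, WORLD"

copy   = str.upcase    # always returns a String (a new copy)
result = str.upcase!   # returns nil here, because str is already upper case

p copy
p result

# Call a ! method for its effect rather than chaining on its return value.
name = "  santhosh  "
name.strip!
name.capitalize!
p name
```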

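Ruby Strings are very powerful. Once data has been captured in Strings, you have many methods at hand to process it easily and effectively.

Processing CSV files

CSV files are a very common way to represent tabular data, and the format is often used for data exported from spreadsheets (such as a contact list with details).

Ruby has a powerful library for working with these files. csv is the Ruby module responsible for CSV files; it has methods for creating, reading, and parsing them.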

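Listing 6 shows how to create a CSV file and parse it with the Ruby csv module.
Listing 6. Processing CSV files: creating and parsing a CSV file

    require 'csv'

    # Open the CSV file for writing; this returns a writer object.
    writer = CSV.open('mycsvfile.csv', 'w')
    begin
      print "Enter Contact Name: "
      name = STDIN.gets.chomp
      print "Enter Contact No: "
      num = STDIN.gets.chomp
      s = name + " " + num
      row1 = s.split   # two strings, provided neither value contains spaces
      writer << row1
      print "Do you want to add more ? (y/n): "
      ans = STDIN.gets.chomp
    end while ans != "n"
    writer.close

    # First way to parse: read the lines, join them into one String, parse it.
    file = File.new('mycsvfile.csv')
    lines = file.readlines
    file.close
    # Array#to_s joined the elements only in Ruby 1.8; join works in every version.
    parsed = CSV.parse(lines.join)
    p parsed

    puts ""
    puts "Details of Contacts stored are as follows..."
    puts ""
    puts "-------------------------------"
    puts "Contact Name | Contact No"
    puts "-------------------------------"
    puts ""

    # Second way: iterate over the rows. CSV.open(..., 'r') with a block yielded
    # each row in Ruby 1.8; CSV.foreach does so in current versions.
    CSV.foreach('mycsvfile.csv') do |row|
      puts row[0] + " | " + row[1]
      puts ""
    end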

Listing 7 shows the output:
Listing 7. Processing CSV files: output of creating and parsing a CSV file

    Enter Contact Name: Santhosh
    Enter Contact No: 989898
    Do you want to add more ? (y/n): y
    Enter Contact Name: Sandy
    Enter Contact No: 98988
    Do you want to add more ? (y/n): n
    [["Santhosh", "989898"], ["Sandy", "98988"]]

    Details of Contacts stored are as follows...

    -------------------------------
    Contact Name | Contact No
    -------------------------------

    Santhosh | 989898

    Sandy | 98988

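Let's walk through this example quickly.

First, load the csv module (require 'csv').

To create a new CSV file, mycsvfile.csv, open it with a CSV.open() call in write mode. This returns a writer object.

This example creates a CSV file containing a simple contact list that stores each contact's name and phone number. Inside the loop, the user is asked to enter a contact name and a phone number. The name and the number are concatenated into one string, which is then split into an array of two strings. This array is passed to the writer object to be written to the CSV file. In this way, a pair of CSV values is stored as one row in the file.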
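Concatenating and then splitting works only while neither value contains whitespace. The sketch below, which uses a hypothetical contact name, compares that approach with passing an array directly, and uses CSV.generate_line to show the text a row would produce.

```ruby
require 'csv'

name = "Santhosh Kumar"   # hypothetical input that contains a space
num  = "989898"

split_row  = (name + " " + num).split   # three fields: the name is broken apart
direct_row = [name, num]                # two fields, as intended
p split_row
p direct_row

# CSV.generate_line returns the line of text that a row would be written as.
line = CSV.generate_line(direct_row)
puts line

# Fields that contain a comma are quoted automatically.
quoted_line = CSV.generate_line(["Kumar, Santhosh", num])
puts quoted_line
```

Writing `writer << [name, num]` in the loop would therefore be a safer variant when names may contain spaces.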

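When the loop ends, the task is complete. Now close the writer so that the data in the file is saved.

The next step is to parse the CSV file that was just created.

One way to open and parse the file is to create a new File object with the name of the new CSV file.

Call the readlines method to read all the lines of the file into an array named lines.

Convert the lines array into a single String object and pass that String to the CSV.parse method, which parses the CSV data and returns its contents as an array of arrays. The original listing converted the array with lines.to_s, which concatenated the lines in Ruby 1.8; in Ruby 1.9 and later, to_s on an array returns its inspect form, so use lines.join instead.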
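To see what CSV.parse returns without creating a file, it can be called on a string directly. This is a minimal sketch; the data mirrors the contacts used in the listing output, and the header names in the second part are made up for illustration.

```ruby
require 'csv'

data = "Santhosh,989898\nSandy,98988\n"

parsed = CSV.parse(data)   # an array of rows, each row an array of strings
p parsed

parsed.each do |name, num|
  puts "#{name} | #{num}"
end

# With headers: true, the first line names the columns and rows can be
# indexed by column name. The names "name" and "num" are assumptions.
table = CSV.parse("name,num\n" + data, headers: true)
puts table[0]["name"]
```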

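There is another way to open and parse the file: read it again row by row. The original listing did this by calling CSV.open in read mode with a block, which yielded each row in Ruby 1.8; in current Ruby versions, CSV.foreach does the same. Print each row in some format to display the contact details. Each row here corresponds to a line in the file.

As you can see, Ruby provides a powerful module for working with CSV files and data.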

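Processing XML files

For XML files, Ruby provides a powerful library named REXML that ships with Ruby. This library can be used to read and parse XML documents.

Look at the following XML file and try to parse it with Ruby and REXML.

Below is a simple XML file that lists the contents of a typical shopping cart at an online shopping center. It has the following elements:

cart: the root element
user: the user making the purchase
item: an item that the user has added to the cart
id, price, and quantity: the children of item (Listing 9 reads these as the code attribute and the price and qty child elements)

Listing 8 shows the structure of this XML:
Listing 8. Processing XML files: sample XML file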
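The body of this listing is not present in this copy. As a stand-in, here is a sketch of a document that is consistent with the code in Listing 9 and the output in Listing 10. It is an assumption, not the author's file: the attribute and element names (id, code, price, qty) are taken from Listing 9, the values from Listing 10, and the user is represented by the id attribute of cart because that is what Listing 9 reads.

```ruby
require 'rexml/document'

# Hypothetical reconstruction of the sample file, inferred from Listings 9 and 10.
SAMPLE_CART = <<XML
<cart id="santhosh">
  <item code="CS001"><price>100</price><qty>2</qty></item>
  <item code="CS002"><price>200</price><qty>5</qty></item>
  <item code="CS003"><price>500</price><qty>3</qty></item>
  <item code="CS004"><price>150</price><qty>5</qty></item>
</cart>
XML

doc = REXML::Document.new(SAMPLE_CART)
puts doc.root.name
puts doc.root.elements.size   # number of child elements of cart

# Check that the values add up to the sum total reported in Listing 10.
sum = 0
doc.root.each_element('//item') do |item|
  sum += item.elements['price'].text.to_i * item.elements['qty'].text.to_i
end
puts sum
```

Saving the string as shoppingcart.xml (for example with File.write) would give Listing 9 a file to read.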

Get this sample XML file from the Downloads section. Now load the XML file and use REXML to parse the file tree.
Listing 9. Processing XML files: parsing the XML file

    require 'rexml/document'
    include REXML

    file = File.new('shoppingcart.xml')
    doc = Document.new(file)
    root = doc.root

    puts ""
    puts "Hello, #{root.attributes['id']}, Find below the bill generated for your purchase..."
    puts ""
    sumtotal = 0
    puts "-----------------------------------------------------------------------"
    puts "Item\t\tQuantity\t\tPrice/unit\t\tTotal"
    puts "-----------------------------------------------------------------------"
    root.each_element('//item') { |item|
      code = item.attributes['code']
      # strip removes surrounding whitespace and keeps the value a String
      qty = item.elements["qty"].text.strip
      price = item.elements["price"].text.strip
      total = price.to_i * qty.to_i
      puts "#{code}\t\t #{qty}\t\t #{price}\t\t #{total}"
      puts ""
      sumtotal += total
    }
    puts "-----------------------------------------------------------------------"
    puts "\t\t\t\t\t\t Sum total : " + sumtotal.to_s
    puts "-----------------------------------------------------------------------"

Listing 10 shows the output.
Listing 10. Processing XML files: output of parsing the XML file

    Hello, santhosh, Find below the bill generated for your purchase...

    -----------------------------------------------------------------------
    Item            Quantity        Price/unit      Total
    -----------------------------------------------------------------------
    CS001           2               100             200

    CS002           5               200             1000

    CS003           3               500             1500

    CS004           5               150             750

    -----------------------------------------------------------------------
                                                    Sum total : 3450
    -----------------------------------------------------------------------

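Listing 9 parses the shopping cart XML file and generates a bill that shows the total for each item and the sum total of the purchase (see Listing 10).

Let's go through the process in detail.

First, load Ruby's REXML module, which has the methods for parsing XML files.

Open the shoppingcart.xml file and create a Document object from it. This object contains the parsed XML file.

Assign the root of the document to the element object root. This points to the cart tag in the XML file.

Each element object has an attributes object, which is a hash of the element's attributes with the attribute names as keys and the attribute values as values. Here, root.attributes['id'] gives the value of the root element's id attribute (in this example, the user id).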
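The attribute access described above can be tried on a small inline document. This is a sketch with a made-up one-line XML string rather than the article's file.

```ruby
require 'rexml/document'

doc  = REXML::Document.new('<cart id="santhosh"><item code="CS001"/></cart>')
root = doc.root

user = root.attributes['id']      # look up an attribute value by name
puts user

item = root.elements['item']
puts item.attributes['code']
p item.attributes['missing']      # a name that is not present gives nil

# Iterate over all attributes of an element as name/value pairs.
root.attributes.each { |name, value| puts "#{name} = #{value}" }
```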

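Next, initialize sumtotal to 0 and print the header.

Each element object also has an elements object, which provides each and [] methods for accessing child elements. The code iterates over all the children of the root element that are named item (specified with the XPath expression //item). Each element also has a text attribute that holds the element's text value.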
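The following sketch shows the elements object and XPath lookups on a small inline document. The XML string and variable names are made up for illustration.

```ruby
require 'rexml/document'

xml = '<cart id="santhosh">' \
      '<item code="CS001"><price>100</price><qty>2</qty></item>' \
      '<item code="CS002"><price>200</price><qty>5</qty></item>' \
      '</cart>'
root = REXML::Document.new(xml).root

first = root.elements[1]                  # integer indexes start at 1, not 0
puts first.attributes['code']
puts root.elements['item/price'].text     # a String index is an XPath; first match wins

# elements.each with an XPath visits every matching child.
totals = []
root.elements.each('item') do |item|
  totals << item.elements['price'].text.to_i * item.elements['qty'].text.to_i
end
p totals

# REXML::XPath.match returns all matches as an array.
codes = REXML::XPath.match(root, '//item').map { |i| i.attributes['code'] }
p codes
```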

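Next, get the code attribute of the item element and the text values of the price and qty elements, then calculate the item total (Total). Print the details to the bill and add the item total to the purchase total (Sum total).

Finally, print the purchase total.

This example shows how simple it is to parse an XML file with REXML and Ruby! Generating an XML file at run time, and adding and removing elements and their attributes, is just as simple.
Listing 11. Processing XML files: generating an XML file

    require 'rexml/document'
    include REXML

    doc = Document.new
    doc.add_element("cart1", {"id" => "user2"})
    cart = doc.root   # the cart1 element that was just added
    item = Element.new("item")
    item.add_element("price")
    item.elements["price"].text = "100"
    item.add_element("qty")
    item.elements["qty"].text = "4"
    cart.elements << item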

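The code in Listing 11 builds the XML structure by creating a cart element, an item element, and the item's child elements, then fills the child elements with values and adds them under the Document root.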
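To check what such code produces, the document can be converted to text and parsed again. The sketch below is a compact variant of the same idea that relies on add_element returning the new element; the code attribute value is hypothetical.

```ruby
require 'rexml/document'

doc  = REXML::Document.new
cart = doc.add_element("cart1", { "id" => "user2" })    # returns the new root element

item = cart.add_element("item", { "code" => "CS009" })  # hypothetical item code
item.add_element("price").text = "100"
item.add_element("qty").text = "4"

xml_text = doc.to_s    # serialize the tree to a String
puts xml_text

# Round trip: parse the generated text and read a value back.
reparsed = REXML::Document.new(xml_text)
puts reparsed.root.elements["item/qty"].text
```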

Similarly, to delete elements and attributes, use the delete_element and delete_attribute methods of the Element object.
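A minimal sketch of both calls on a made-up document; the note attribute and the item codes are assumptions for illustration.

```ruby
require 'rexml/document'

doc  = REXML::Document.new('<cart id="user2" note="temp"><item code="A"/><item code="B"/></cart>')
cart = doc.root

cart.delete_attribute('note')   # remove an attribute by name
cart.delete_element('item')     # a String is treated as an XPath; the first match is removed

p cart.attributes['note']
puts cart.elements.size
puts cart.elements[1].attributes['code']
```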

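The approach used in the examples above is called tree parsing. Another way to parse XML documents is called stream parsing. Stream parsing is faster than tree parsing because it does not build the whole document tree in memory, and it can be used when fast parsing is required. Stream parsing is event-based and uses listeners. When the parse stream encounters a tag, it calls the listener, which performs the processing.

Listing 12 shows an example:
Listing 12. Processing XML files: stream parsing

    require 'rexml/document'
    require 'rexml/streamlistener'
    include REXML
    class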
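The listing breaks off at this point in this copy. As an illustration of the technique only, and not a reconstruction of the missing code, here is a sketch of a listener that totals a cart while the document streams past. The ItemTotals class and the inline XML are hypothetical; tag_start, text, and tag_end are callbacks of REXML::StreamListener, and Document.parse_stream drives them.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects item codes and the purchase total.
class ItemTotals
  include REXML::StreamListener
  attr_reader :codes, :total

  def initialize
    @codes   = []
    @total   = 0
    @fields  = {}
    @current = nil
  end

  # Called for every opening tag; attrs maps attribute names to values.
  def tag_start(name, attrs)
    @current = name
    return unless name == 'item'
    @codes << attrs['code']
    @fields = {}
  end

  # Called for character data; keep it only inside price and qty.
  def text(data)
    @fields[@current] = data.to_i if %w[price qty].include?(@current)
  end

  # Called for every closing tag; an item is complete when it closes.
  def tag_end(name)
    @total += @fields.fetch('price', 0) * @fields.fetch('qty', 0) if name == 'item'
    @current = nil
  end
end

xml = '<cart id="santhosh">' \
      '<item code="CS001"><price>100</price><qty>2</qty></item>' \
      '<item code="CS002"><price>200</price><qty>5</qty></item>' \
      '</cart>'

listener = ItemTotals.new
REXML::Document.parse_stream(xml, listener)
p listener.codes
puts listener.total
```

The listener keeps only the state it needs (the current tag and the fields of the item being read), which is why this style suits large documents.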
