3. Text Filtering: In a corpus of several thousand documents, you will likely have many terms that are irrelevant both to differentiating documents from each other and to summarizing them. You will have to browse through the terms manually to eliminate the irrelevant ones. This is often one of the most time-consuming and subjective tasks in the whole text mining process, and it requires a fair amount of subject matter knowledge (domain expertise). In addition to term filtering, documents irrelevant to the analysis are identified by keyword search. Documents are filtered out if they do not contain certain terms, or they are filtered on one of the other document variables such as date or category. Both term filtering and document filtering alter the term-by-document matrix. As shown in Table 1.1, each cell of the term-by-document matrix holds the frequency with which the term occurs in the document. Alternatively, each cell could hold the log of the frequency, or just a 1 or 0 indicating whether the term is present in the document. From this frequency matrix, a weighted term-by-document matrix is generated using various term-weighting techniques.
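
The filtering steps and the three cell encodings described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by any particular text mining tool: the toy corpus, the list of irrelevant terms, the function names, and the choice of log2(f + 1) as the log transform are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (text, category) pairs invented for illustration.
docs = [
    ("the engine failed and the engine stalled", "auto"),
    ("the brake pedal failed", "auto"),
    ("the soup was cold", "food"),
]

# Terms an analyst has judged irrelevant (assumed list, chosen by hand).
irrelevant_terms = {"the", "and", "was"}


def filter_documents(docs, category):
    """Document filtering: keep texts whose category variable matches."""
    return [text for text, cat in docs if cat == category]


def term_by_document(texts, drop_terms):
    """Term filtering plus counting.

    Returns (terms, matrix), where matrix[i][j] is the raw frequency
    of terms[i] in document j.
    """
    counts = [Counter(w for w in t.split() if w not in drop_terms) for t in texts]
    terms = sorted(set().union(*counts))
    return terms, [[c[t] for c in counts] for t in terms]


def binary(matrix):
    """Replace each frequency with 1 (term present) or 0 (term absent)."""
    return [[1 if f > 0 else 0 for f in row] for row in matrix]


def log_frequency(matrix):
    """Replace each frequency f with log2(f + 1); the +1 keeps zeros at zero."""
    return [[math.log2(f + 1) for f in row] for row in matrix]


texts = filter_documents(docs, "auto")
terms, freq = term_by_document(texts, irrelevant_terms)

for term, row in zip(terms, freq):
    print(f"{term:8s} {row}")
```

Dropping the "food" document removes a column from the matrix, and dropping the irrelevant terms removes rows, which is how both kinds of filtering alter the term-by-document matrix. Calling `binary(freq)` or `log_frequency(freq)` then yields the alternative cell values mentioned in the text.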
