13 Security Lab
Learn about TF-IDF
What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) is a weight used in information retrieval and text mining. It is a statistic that indicates how important a word is to a specific document within a collection of documents.
TF (term frequency) indicates how often a specific word appears in a document; the higher this value, the more important the word is likely to be in that document. However, a word that is used frequently across the whole collection is simply a common word. The number of documents that contain the word is called DF (document frequency), and the reciprocal of this value is called IDF (inverse document frequency). TF-IDF is the product of TF and IDF.
Term Frequency
How often the word appears in the document.
TF (term frequency) indicates how often a specific word appears in a document. The higher this value, the more important the word is likely to be in that document. However, if the word also appears frequently in many other documents, it is less distinctive, and its importance decreases.
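As a minimal sketch of term frequency, the snippet below counts the words in one tokenized document and divides by the document length. The normalization by length is an assumption (one common variant; raw counts are also used), and the function name `term_frequency` and the sample sentence are hypothetical.

```python
from collections import Counter


def term_frequency(tokens):
    """Return each word's relative frequency within one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens)
    # Assumption: divide raw counts by document length so that long and
    # short documents are comparable.
    return {word: count / total for word, count in counts.items()}


# Hypothetical sample document, split on whitespace.
doc = "the cat sat on the mat".split()
tf = term_frequency(doc)
print(tf["the"])  # "the" accounts for 2 of the 6 tokens
```

Note that "the" gets the highest TF here even though it says little about the document, which is why TF alone is not enough.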
Inverse Document Frequency
Down-weighting words that appear frequently across documents (stop words such as "and", "the", and "or" are typically disregarded) and prioritizing distinctive words that appear in only a few documents.
The number of documents in which a word appears is called DF (document frequency), and the reciprocal of this value is called IDF (inverse document frequency).
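The following sketch computes DF and IDF over a small, made-up corpus. It assumes the widely used logarithmic form, log(N / DF), where N is the number of documents; the text above only says "reciprocal", so the logarithm and the scaling by N are assumptions. The function name `inverse_document_frequency` is hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return log(N / DF) for every word in a list of tokenized documents."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        # Count each word once per document, however often it repeats there.
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Assumption: logarithmic IDF; a word found in every document gets 0.
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Hypothetical three-document corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
idf = inverse_document_frequency(corpus)
print(idf["the"], idf["cat"], idf["bird"])
```

With this corpus, "the" appears in all three documents and receives the lowest IDF, while "bird" appears in only one and receives the highest.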
TF-IDF is the product of TF and IDF. A high score means that the word appears frequently in the given document but rarely in other documents.
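To illustrate how the two factors combine, here is a self-contained sketch that scores every word in every document. It assumes length-normalized TF and logarithmic IDF, log(N / DF); other weighting variants exist, and the function name `tf_idf` and the corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(documents)
    # DF: number of documents that contain each word at least once.
    doc_freq = Counter(word for tokens in documents for word in set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            # Assumed weighting: (count / length) * log(N / DF).
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang and the bird flew".split(),
]
scores = tf_idf(corpus)
top_word = max(scores[2], key=scores[2].get)
print(top_word, scores[2][top_word])
```

In the third document, "the" scores zero because it occurs in every document, while "bird" scores highest because it is repeated in that document and absent from the others.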