Elasticsearch

Managing Director at Bettison.org Limited · Shared insights

In 2012 we made the very difficult decision to entirely re-engineer our existing monolithic LAMP application from the ground up, in order to address growing concerns about its long-term viability as a platform.

A full application rewrite is almost never the answer, because of the risks involved. However, the situation warranted drastic action, as it was clear that the existing product was going to face severe scaling issues. We felt it better to address these sooner rather than later, and to take the opportunity to improve the international architecture and refactor the database so that it better matched the changes in core functionality.

PostgreSQL was chosen for its reputation as a solid, ACID-compliant database backend, and it was available as a managed offering via the AWS RDS service, which saved us the overhead of configuring it ourselves. In order to reduce read load on the primary database, we implemented an Elasticsearch layer for fast and scalable search operations. Synchronisation of these indexes was to be achieved through Sidekiq's Redis-based background workers, with Redis hosted on Amazon ElastiCache. Again, the AWS solution here looked to be an easy way to keep our involvement in managing this part of the platform to a minimum, allowing us to focus on our core business.
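
As a rough sketch of that pattern (illustrative only: our workers were Sidekiq/Ruby, so this Python version with the official elasticsearch client just shows the shape; the index and field names are made up):

```python
# Illustrative sketch of offloading search reads to an index; assumes the
# 8.x elasticsearch Python client and a local cluster. Not our actual code.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def sync_record(record: dict) -> None:
    """Background-job body: mirror one primary-DB row into the search index."""
    es.index(index="listings", id=record["id"], document=record)

def search_listings(term: str) -> list:
    """Serve search reads from Elasticsearch instead of the primary database."""
    resp = es.search(index="listings", query={"match": {"title": term}})
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```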

Rails was chosen for its ability to get core functionality up and running quickly, its MVC architecture, and its focus on Test-Driven Development using RSpec and Selenium, with Travis CI providing continuous integration. We also liked Ruby for its terse, clean and elegant syntax. Though YMMV on that one!

Unicorn was chosen for its support for continual deployment and its reputation as a reliable application server, and nginx for its reputation as a fast and stable reverse proxy. We also took advantage of the Amazon CloudFront CDN here to further improve performance by caching static assets globally.

We tried to strike a balance between having control over the management and configuration of our core application and the convenience of being able to leverage AWS hosted services for ancillary functions (Amazon SES, Amazon SQS and Amazon Route 53, all hosted securely inside an Amazon VPC, of course!).

Whilst there is some compromise here, with potential vendor lock-in, the tasks being performed by these ancillary services are not particularly specialised, which should mitigate the risk. Furthermore, we have already containerised the stack in our development environment using Docker, and are looking at how best to bring this into production - potentially using the Amazon EC2 Container Service.

8 upvotes·768.9K views
Needs advice on Elasticsearch, Fauna and MongoDB

I would like to assess search functionality along with some analytical use cases like aggregation, faceting, etc. I would like to know which is the best database to go with among Elasticsearch, MongoDB and FaunaDB.

3 upvotes·105.9K views
Needs advice on Elasticsearch and PostgreSQL

Hi, I need advice on which database tool to use in the following scenario:

I work with Cesium, and I need to save and load CZML snapshot and update objects for a recording program that saves files containing several entities (along with the time of the snapshot or update). I need to be able to easily load the files according to the corresponding timeline point (for example, if the update was recorded at 13:15, I should be able to easily load the update file when I click on the 13:15 point on the timeline). I should also be able to make geo-queries relatively easily.

I am currently thinking about Elasticsearch or PostgreSQL, but I am open to suggestions. I tried looking into time-series databases like TimescaleDB, but found them unnecessarily powerful for my needs, since the update time is a simple variable.

Thanks for your advice in advance!

4 upvotes·130.8K views
Replies (1)
CEO, lead developer at Localazy
Recommends PostgreSQL

In your situation, PostgreSQL seems to be the better option. Why?

1. Saving structured data is possible in both PostgreSQL and Elasticsearch. In PostgreSQL, there is the JSONB column type available, and you can build indexes on top of it.
2. If you are able to specify the time as a primary key, both Elasticsearch and PostgreSQL are great options.
3. PostgreSQL allows you to do a lot more with your data and handle it in a relational way. You haven't said whether that's a benefit for you or not, but let's consider extensibility an advantage.
4. PostgreSQL comes with the PostGIS extension for working with geo data, which may be useful in your situation.
5. PostgreSQL may serve other needs of your app as well. Managing one database is always easier than managing two.

Thanks to the JSONB column type, PostgreSQL is a sweet combination of relational and NoSQL database, but there are also drawbacks arising from ACID compliance and WAL overhead for rapid changes.
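
A minimal sketch of points 1, 2 and 4, assuming a local PostgreSQL with the PostGIS extension available (the DSN, table and column names are hypothetical):

```python
# Hedged sketch: timestamptz primary key, JSONB payload with a GIN index,
# and a PostGIS geography column for geo-queries.
import psycopg2

conn = psycopg2.connect("dbname=czml_recorder")  # placeholder DSN
cur = conn.cursor()

cur.execute("""
    CREATE EXTENSION IF NOT EXISTS postgis;
    CREATE TABLE IF NOT EXISTS snapshots (
        recorded_at timestamptz PRIMARY KEY,  -- the timeline point, e.g. 13:15
        payload     jsonb NOT NULL,           -- the CZML snapshot/update
        location    geography(Point, 4326)    -- optional, for geo-queries
    );
    CREATE INDEX IF NOT EXISTS snapshots_payload_idx
        ON snapshots USING gin (payload);     -- point 1: index into the JSONB
""")
conn.commit()

# Point 2: load the update recorded at the clicked timeline point.
cur.execute("SELECT payload FROM snapshots WHERE recorded_at = %s",
            ("2020-01-01 13:15:00+00",))

# Point 4: everything within 10 km of a lon/lat point (PostGIS ST_DWithin).
cur.execute("""
    SELECT payload FROM snapshots
    WHERE ST_DWithin(location, ST_MakePoint(%s, %s)::geography, 10000)
""", (-73.98, 40.75))
```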

4 upvotes·4.2K views
Technical Architect at Self Employed

We have a Kafka topic with events of type A and type B. We need to perform an inner join on both types of events using a common field (primary-key). The joined events are to be inserted into Elasticsearch.

In usual cases, type A and type B events (with the same key) are observed within up to 15 minutes of each other. But in some cases they may be far apart, let's say 6 hours. Sometimes an event of either type never comes.

In all cases, we should be able to find joined events instantly after they are joined, and not-joined events within 15 minutes.

5 upvotes·525.7K views
Replies (2)
Recommends Elasticsearch

The first solution that came to me is to use upsert to update Elasticsearch:

  1. Use the primary-key as ES document id
  2. Upsert the records to ES as soon as you receive them. As you are using upsert, the 2nd record with the same primary-key will not overwrite the 1st one, but will be merged with it.

Cons: The load on ES will be higher, due to upsert.
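
A minimal sketch of that upsert flow with the official elasticsearch Python client (index and field names are assumptions):

```python
# Hedged sketch; assumes the 8.x elasticsearch client. "doc_as_upsert"
# creates the document if absent, otherwise merges the new fields in.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def on_event(event: dict) -> None:
    # Whichever of type A / type B arrives first creates the document;
    # the second one merges into it, yielding the joined record under one id.
    es.update(
        index="joined-events",          # hypothetical index name
        id=event["primary_key"],
        doc=event,
        doc_as_upsert=True,
    )
```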

To use Flink:

  1. Create a KeyedDataStream by the primary-key
  2. In the ProcessFunction, save the first record in a State. At the same time, create a Timer for 15 minutes in the future
  3. When the 2nd record comes, read the 1st record from the State, merge the two, send out the result, and clear the State and the Timer if it has not fired
  4. When the Timer fires, read the 1st record from the State and send it out as the output record.
  5. Have a 2nd Timer of 6 hours (or more) if you are not using Windowing to clean up the State

Pro: this works well if you already have Flink ingesting this stream. Otherwise, I would just go with the 1st solution.
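
For reference, steps 1-4 might look roughly like this with PyFlink's KeyedProcessFunction (an untested sketch using processing-time timers; the steps above presumably assume the Java API, and the 6-hour cleanup timer from step 5 is omitted for brevity):

```python
# Untested PyFlink sketch of the keyed join above. A stale timer whose
# state was already cleared finds None and emits nothing.
from pyflink.common import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

FIFTEEN_MIN_MS = 15 * 60 * 1000

class JoinAB(KeyedProcessFunction):
    def open(self, ctx: RuntimeContext):
        # 1-2. Keyed state holding the first-seen event for this primary-key.
        self.first = ctx.get_state(
            ValueStateDescriptor("first-event", Types.PICKLED_BYTE_ARRAY()))

    def process_element(self, event, ctx):
        stored = self.first.value()
        if stored is None:
            # 2. First record: stash it and start a 15-minute timer.
            self.first.update(event)
            now = ctx.timer_service().current_processing_time()
            ctx.timer_service().register_processing_time_timer(now + FIFTEEN_MIN_MS)
        else:
            # 3. Second record: merge, emit the joined event, clear the state.
            self.first.clear()
            yield {**stored, **event}

    def on_timer(self, timestamp, ctx):
        stored = self.first.value()
        if stored is not None:
            # 4. No partner within 15 minutes: emit as a not-joined record.
            self.first.clear()
            yield stored

# Usage sketch: events.key_by(lambda e: e["primary_key"]).process(JoinAB())
```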

Averell Huyen Levan – Medium (medium.com)
5 upvotes·2 comments·434.8K views
Nilesh Akhade · July 10th 2020 at 4:04PM

In the Flink approach, we can't query the data while it's being processed (in Flink memory). Consequently, we have to wait 6 hours for the event to become available. Although this can be worked around by maintaining a copy of the data being processed for 15 minutes.

Thank you so much for the detailed solution.

What are your views on preferring Apache Flink over Kafka Streams and Apache Spark for this use case?

Ashwani Agarwal · July 16th 2020 at 9:41AM

What do you think about using MongoDB for the 1st case instead of ES? My point of view is that it's easier to get started with MongoDB.

Team Lead at XYZ
Needs advice on Metricbeat and Prometheus

Hi, we have a situation where we are using Prometheus to get system metrics from the PCF (Pivotal Cloud Foundry) platform. We send that as time-series data to Cortex via a Prometheus server and have built a dashboard using Grafana. There is another pipeline where we need to read metrics (CPU, memory and disk) from a Linux server using Metricbeat. Those will be sent to Elasticsearch, and Grafana will pull and show the data in a dashboard.

Is it OK to use Metricbeat for the Linux server, or can we use Prometheus?

What is the difference in system metrics sent by Metricbeat and Prometheus node exporters?

Regards, Sunil.

2 upvotes·541.1K views
Replies (2)
Recommends Prometheus

If you're already using Prometheus for your system metrics, then standing up Elasticsearch just for Linux host monitoring seems excessive. The node_exporter is probably sufficient if you're looking for standard system metrics.

Another thing to consider is that Metricbeat / ELK use a push model for metrics delivery, whereas Prometheus pulls metrics from each node it is monitoring. Depending on how you manage your network security, opting for one solution over two may make things simpler.

5 upvotes·1 comment·331.6K views
Manish Sharma · July 23rd 2021 at 9:41AM

This is a perfect answer.

Chief Technology Officer at TechAvanza
Needs advice on Algolia, Elasticsearch and Firebase

Hey everybody! (1) I am developing an Android application. I have around 3 million records of data (less than a TB). I want to save that data in the cloud. Which company provides the best cloud database services for my scenario? It should be secure, usable long-term, and well supported. I decided to use the Firebase Realtime Database. Should I stick with Firebase, or are there other companies that provide a better service?

(2) I have search functionality in my app, over the same data (less than a TB). Which search solution should I use in this case? I found Elasticsearch and Algolia. It should be secure and fast. If any other company provides a better service than these, please feel free to suggest it.

Thank you!

6 upvotes·374.8K views
Replies (2)
Co-Founder & CTO at Orbit
Recommends Algolia

Hi Rana, good question! From my Firebase experience, 3 million records is not too big at all, as long as the cost is within reason for you. With Firebase you will be able to access the data from anywhere, including an Android app, and implement fine-grained security with JSON rules. The real-time-ness works perfectly. As a fully managed database, Firebase really takes care of everything. The only thing to watch out for is if you need complex query patterns - Firestore (also in the Firebase family) can be a better fit there.

To answer question 2: the right answer will depend on what's most important to you. Algolia is like Firebase in that it is fully managed, very easy to set up, and has great SDKs for Android. Algolia is really a full-stack search solution in this case, and it is easy to connect with your Firebase data. Bear in mind that Algolia costs money, so you'll want to make sure the cost is okay for you, but you will save a lot of engineering time and never have to worry about scale. The search-as-you-type performance with Algolia is flawless, as that is a primary aspect of its design. Elasticsearch can store tons of data and has all the flexibility, is hosted cheaply by many cloud services, and has many users. If you haven't done a lot with search before, the learning curve is higher than Algolia's for getting the results ranked properly, and there is another learning curve if you want to do the DevOps part yourself. Both are very good platforms for search: Algolia shines when shipping your app quickly matters most and you don't want to spend many engineering hours; Elasticsearch shines when you have a lot of data and don't mind learning how to run and optimise it.
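
To give a feel for the setup effort, pushing records into Algolia and searching them is only a few lines with the Python client (v2-style API; the app ID, API key and index name below are placeholders):

```python
# Hedged sketch; typo tolerance is on by default in Algolia.
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("records")  # placeholder index name

# Push records (e.g. exported from Firebase); each becomes searchable.
index.save_objects(
    [{"title": "Blue mosque photo", "city": "Istanbul"}],
    {"autoGenerateObjectIDIfNotExist": True},
)

# Search-as-you-type style query; a small typo still matches.
hits = index.search("Istnbul")["hits"]
```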

8 upvotes·279.2K views
Founder at Moxit
Recommends Cloud Firestore

Rana - we use Cloud Firestore at our startup. It handles many millions of records without any issues. It provides the same set of features that the Firebase Realtime Database provides, on top of the indexing and security trims. The only thing to watch out for is to make sure your Cloud Functions have proper exception handling and that there are no infinite loops in the code; these can become very costly if not caught quickly.

For search, Algolia is a great option, but cost is a real consideration. Indexing a large number of records can be cost-prohibitive for most projects. Elasticsearch is a solid alternative, but it requires a little additional work to configure and maintain if you want to self-host.

Hope this helps.

5 upvotes·281.4K views
Owner at The Richner Group

We are starting to work on a web-based platform aiming to connect investors/wholesalers (clients) and buyers (service providers). A third service provider, lenders, will be added in the future.

The ability to create profiles of buyers with their buying criteria, and to create saved records of properties for sale (provided by the client) to be cross-referenced against the buyers' criteria, is our core functionality.

In-app, timeline-based, real-time communication between users (and storing it), file transfers, and push notifications are post-MVP features we would like as well.

We are considering using React, Elasticsearch / App Search with their Search UI, and the Realtime Database and other functionality of Firebase.

3 upvotes·27.5K views
Federal University of Rio de Janeiro

Hi, community, I'm planning to build a web service that will perform a text search in a data set of less than 3k well-structured JSON objects containing config data. I'm expecting no more than 20 MB of data. The general traits I need for this search are:

- Typo tolerant (fuzzy query), so it has to match entries even when the query does not match 100% with a word in that JSON
- Allow a strict match mode
- Perform the search through all the JSON values (they can reach 6 nesting levels)
- Ignore all keys of the JSON; I'm interested only in the values.

The only option I'm researching at the moment is Elasticsearch, and since the rest of the stack is on AWS, Amazon Elasticsearch Service is my favourite candidate so far, although the only knowledge I have of it comes from some articles and Q&As that I've read here and there. Is Elasticsearch a good path for this project? I'm also considering Amazon DynamoDB (which I also don't know), but it does not look to cover the requirements of fuzzy search and ignoring the JSON properties. Thank you in advance for your precious advice!

4 upvotes·48.9K views
Replies (3)
Lead Developer at Di-Vision Consultion
Recommends Amazon Athena

Maybe you can do it by storing on S3 and querying via Amazon Athena and AWS Glue. I don't know about the performance, though. Fuzzy search could otherwise be done by storing a soundex value of the fields you want to search on in a MongoDB. In DynamoDB you would need indexes on every searchable field if you want it to be efficient.
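
To make the soundex idea concrete, here is a small sketch (standard American Soundex; in MongoDB you would store this code next to the raw field, index it, and match on it):

```python
# Hedged sketch of phonetic matching via Soundex; no external libraries.
def soundex(word: str) -> str:
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}

    def code(ch: str) -> str:
        return next((d for letters, d in codes.items() if ch in letters), "")

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return "0000"
    out, prev = word[0].upper(), code(word[0])
    for ch in word[1:]:
        d = code(ch)
        if d and d != prev:
            out += d
        if ch not in "hw":  # h/w do not reset the previous code
            prev = d
    return (out + "000")[:4]

# "Robert" and "Rupert" share a code, so a lookup on the stored soundex
# value finds near-matches that plain equality would miss.
assert soundex("Robert") == soundex("Rupert") == "R163"
```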

3 upvotes·38.8K views
Computer Science

I think Elasticsearch should be a great fit for that use case. Using the AWS version will make your life easier. With such a small dataset you may also be able to use an in-process library for searching and possibly remove the overhead of using a database. I don't know if it fits the bill, but you may also want to look into Lucene.

I can tell you that DynamoDB is definitely not a good fit for your use case. There is no fuzzy matching feature, and you would need an index for each field you want to search, or you'd have to convert your data into a more searchable format for storing in DynamoDB, which is something a full-text search tool like Elasticsearch is going to do for you.
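
To make that concrete, here is one way to satisfy the asker's "ignore keys, search all values, fuzzy or strict" requirements: flatten every JSON value into one text field at index time, then query that field. It assumes the 8.x elasticsearch Python client; the index name and flattening scheme are my own:

```python
# Hedged sketch: values-only fuzzy/strict search over nested JSON configs.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def values_only(obj) -> list:
    """Collect leaf values from arbitrarily nested JSON, ignoring keys."""
    if isinstance(obj, dict):
        return [v for child in obj.values() for v in values_only(child)]
    if isinstance(obj, list):
        return [v for child in obj for v in values_only(child)]
    return [str(obj)]

def index_config(doc_id: str, config: dict) -> None:
    es.index(index="configs", id=doc_id,
             document={"raw": config,
                       "haystack": " ".join(values_only(config))})

def search(term: str, strict: bool = False) -> list:
    # "AUTO" gives typo tolerance; "0" is the strict match mode.
    query = {"match": {"haystack": {"query": term,
                                    "fuzziness": "0" if strict else "AUTO"}}}
    return es.search(index="configs", query=query)["hits"]["hits"]
```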

3 upvotes·38.2K views
Needs advice on Golang, Grafana and Logstash

Hi everyone. I'm trying to set up my own personal syslog monitoring.

  1. To get the logs, I'm unsure which way to choose: 1.1 use Logstash as a TCP server, or 1.2 implement a Go TCP server.

  2. To store and plot the data: 2.1 use the Elasticsearch tools, or 2.2 use InfluxDB and Grafana.

I would like to know... which is the cheaper and more scalable solution?

Or even if there is a better way to do it.

10 upvotes·180.1K views
Replies (3)
Recommends Grafana and Loki

Hi Juan

A very simple and cheap (in resource usage) option here would be to use Promtail to send syslog data to Loki, and to visualise Loki with Grafana using the native Grafana Loki data source. I have recently put together this setup; Promtail and Loki are less resource-intensive than Logstash/ES, and it is a simple setup and configuration that works very nicely.

4 upvotes·2 comments·4.4K views
Sunil Chaudhari · October 27th 2021 at 1:23AM

Hi,

Is Promtail available for PCF?

Gary Wilson · October 27th 2021 at 1:38PM

Hi @sunilmchaudhari, I do not know. I assume by PCF you are referring to Pivotal Cloud Foundry, of which I have no knowledge, sorry. Promtail is a Go binary, so if you can add log data to a syslog, then you can process it with Promtail.

Team Lead at XYZ

For syslog, you can certainly use the TCP input. I'm really interested to know what your syslog client is (the one that will ship logs to Logstash). Anyway, you can check whether that client can be configured with multiple Logstash host:ports so that it acts as a load balancer; this will increase throughput.

Also check Logstash's pipeline-to-pipeline communication: https://www.elastic.co/guide/en/logstash/current/pipeline-to-pipeline.html. This helps to implement the distributor pattern, where multiple types of data arrive at the same input and you want to route filtering and processing based on type; it increases parallelism.

About Elasticsearch: it's a native component that fits perfectly with Logstash, so you can use Elasticsearch for storage and search. It is also one of Grafana's data sources.

3 upvotes·2.1K views